CXL Introduction

What is CXL

CXL (Compute Express Link) is an open interconnect standard designed for high-speed communication between processors and high-performance endpoint devices such as GPUs, FPGAs, and other accelerators.

When discussing CXL, it helps to recall the storage hierarchy in computer architecture. There used to be a large gap between HDDs and memory, and the emergence of SSDs and NVMe devices gradually bridged it. Traditional databases have become less sensitive to this gap because the system bottleneck has shifted to the CPU side, which is why recent work has focused on columnar storage, vectorization, and other techniques that reduce memory pressure. For many applications, however, even though NVMe latency is now acceptable, throughput remains a significant bottleneck, so NVMe cannot fully replace memory. Model training and vector data are typical examples.

CXL addresses this problem. By attaching devices to the PCIe bus, CXL establishes an interconnect between the device and the CPU, enabling memory to be disaggregated from compute.

CXL Protocols

CXL comprises three different protocols - CXL.io, CXL.cache, and CXL.mem, each serving a different purpose.

  • CXL.io is built on the physical and link layers of the PCI Express (PCIe) infrastructure, which ensures backward compatibility with the PCIe ecosystem and leverages its wide deployment. It handles input/output operations and provides discovery, configuration, and basic management of devices; when a CXL device is connected to a host, these operations are carried out through CXL.io.

  • CXL.cache provides cache coherency between the host processor cache hierarchy and the memory on CXL devices. This coherency allows the host and device to share resources, thereby reducing latency and improving data access rates. This is crucial for high-performance computing workloads such as big data processing and machine learning, which often require frequent access to large amounts of data.

  • CXL.mem allows the host processor to access a CXL device’s memory at high speed and with low latency. This mechanism allows the host to effectively utilize the device’s memory as a pool of resources, making it highly suitable for applications that require intensive data exchange.

Specifically, CXL mainly defines three types of devices:

  • CXL Type 1 Device: This type of device includes accelerators and smart network cards. They access host memory via the CXL.cache protocol, maintaining a local cache that’s coherent with the host memory.

  • CXL Type 2 Device: This category includes devices such as GPUs and FPGAs, which have their own memory such as DDR and HBM. These devices can access the host memory directly like Type 1 devices. Additionally, they can use the CXL.mem protocol to allow the host to access their local address space.

  • CXL Type 3 Device: These are memory expansion devices that allow the host to access their memory buffers coherently via CXL.mem transactions. Type 3 devices can be used to increase both memory capacity and bandwidth.

Maximizing CXL Efficiency

To fully utilize CXL memory, several crucial factors must be taken into account:

  1. Consider the memory hierarchy fully and use RAM or even Cache as the buffer for CXL.

  2. Push down computations as much as possible to reduce the amount of data that the bus needs to handle.

  3. Take the latency of CXL fully into account and design pipelines or use prefetching techniques to reduce the impact of latency on throughput (see the sketch after this list).

  4. Fully exploit the advantages of large memory to minimize the performance impact brought by data exchange in distributed systems.
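
As a generic illustration of point 3, here is a Go sketch of a two-stage pipeline that overlaps prefetching with computation; the slow slice standing in for a CXL-backed region is purely an assumption for illustration:

package main

// A generic sketch of hiding access latency with a two-stage pipeline:
// one goroutine prefetches chunks from a slow tier (standing in for CXL
// memory) while another processes the previously fetched chunk.

import "fmt"

const chunk = 1 << 10

func main() {
    slow := make([]byte, 1<<20) // stands in for a slow CXL-backed region
    fetched := make(chan []byte, 2)

    // Prefetch stage: copy chunks from the slow tier into RAM buffers.
    go func() {
        for off := 0; off < len(slow); off += chunk {
            buf := make([]byte, chunk)
            copy(buf, slow[off:off+chunk])
            fetched <- buf
        }
        close(fetched)
    }()

    // Compute stage: overlaps with the fetch of the next chunk.
    var sum int
    for buf := range fetched {
        for _, b := range buf {
            sum += int(b)
        }
    }
    fmt.Println("sum:", sum)
}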


MQTT Publish/Subscribe

Publish

Each message must include a topic; the broker uses the topic to deliver the message to clients interested in it. The message body is passed in binary form: MQTT is agnostic to the content, so a client can send data in any format, such as binary, text, XML, or JSON.

Format

The topic is a hierarchical structure composed of strings separated by slashes, for example, “home/bedroom/temperature”.

Quality of Service

Quality of Service determines the delivery guarantee for a message. There are three levels: 0, 1, and 2. QoS 0 means the message is delivered at most once; if delivery fails, there is no retry. QoS 1 means the message is delivered at least once; if the recipient does not explicitly acknowledge it (by returning a PUBACK), the sender keeps retrying. QoS 2 means the message is delivered exactly once.

Retain Flag

The retain flag determines whether the message is retained as the latest message for this topic. When a new client subscribes to this topic, it will receive the latest retained message for this topic. For each topic, there can be at most one retained message, but there may also be none.

Message Payload

The message payload is the specific content of the message. MQTT is unaware of the content of the message, so users can send any message.

Duplicate Field

When the message's quality of service is greater than 0, this flag (the DUP flag) is set on retransmissions of the message.
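
To make these publish-side fields concrete, below is a minimal sketch in Go using the Eclipse Paho MQTT client (github.com/eclipse/paho.mqtt.golang); the broker address, client ID, and payload are illustrative assumptions:

package main

import (
    "fmt"

    mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
    // Illustrative broker address and client ID.
    opts := mqtt.NewClientOptions().
        AddBroker("tcp://localhost:1883").
        SetClientID("publisher-1")
    client := mqtt.NewClient(opts)
    if token := client.Connect(); token.Wait() && token.Error() != nil {
        panic(token.Error())
    }
    defer client.Disconnect(250)

    // Publish to a hierarchical topic with QoS 1 and the retain flag set;
    // the payload is an opaque byte sequence as far as MQTT is concerned.
    token := client.Publish("home/bedroom/temperature", 1, true, []byte("21.5"))
    token.Wait()
    fmt.Println("publish error:", token.Error())
}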

QoS Levels

MQTT supports three Quality of Service (QoS) levels. They are defined as follows:

  1. QoS 0: At most once delivery
    This is the lowest level of service. A message is delivered at most once, and it might not be delivered at all if network disruptions occur. The message is sent from the sender (publisher) to the receiver (subscriber) without any confirmation message. There is no retransmission of the message.

  2. QoS 1: At least once delivery
    In this level of service, a message is assured to be delivered at least once to the receiver. After the sender sends the message, it stores a copy of the message until it receives a PUBACK message from the receiver. If the sender does not receive a PUBACK message within a certain period, it will resend the message.

  3. QoS 2: Exactly once delivery
    This is the highest level of service, where a message is assured to be delivered exactly once. This is achieved using a four-step handshake process:

    • The sender sends the message and keeps a copy of it. The message is marked as “unconfirmed”.
    • The receiver responds with a PUBREC message to acknowledge receipt of the message.
    • The sender receives the PUBREC message, removes the “unconfirmed” mark from the stored message, and responds with a PUBREL message.
    • Finally, the receiver responds with a PUBCOMP message to confirm that it has processed the PUBREL message. The sender can now safely delete the message from its storage.

Each level of service has different trade-offs in terms of network traffic, latency, and complexity. You should choose the appropriate QoS level based on the specific requirements of your application.

Subscribe

If no client subscribes to a topic, any messages published to that topic won’t be received by any client. Clients need to send a subscription request to the broker in order to subscribe to the corresponding topic.

Format

Packet Identifier

This is a unique identifier for each SUBSCRIBE message. Both the broker and client maintain their own Packet Identifier for each ongoing conversation. The identifier doesn’t need to be globally unique, but it does need to be unique within the scope of the client-broker communication session.

Subscription List

A single SUBSCRIBE message can request multiple topic subscriptions. Each subscription request needs to include the topic to be subscribed to and the desired Quality of Service (QoS) level. The topic string in the SUBSCRIBE packet can include wildcard characters. If the same topic is subscribed to with different QoS levels (i.e., overlapping subscriptions), the broker will deliver messages to the client at the highest QoS level that has been granted.

Subscription Acknowledgement

After the client requests to subscribe to a topic, the broker will respond with a SUBACK.

Format

The message includes a Packet Identifier that matches the one in the subscription request, as well as a list of return codes.

Packet Identifier

This Packet Identifier should match the one in the corresponding subscription request.

Return Codes

The return codes correspond one-to-one with the topic/QoS list in the subscription request, confirming the result of each subscription. On success, the granted Quality of Service level (0, 1, or 2) is returned; on failure, the return code is 0x80 (128 in decimal).

After the client initiates a subscription and receives a successful subscription acknowledgement, this client will be able to normally receive any subsequent messages sent to that topic.
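
As a sketch of the subscribe side (again using the Paho Go client, with a placeholder broker address and topic), note the "+" wildcard matching a single topic level:

package main

import (
    "fmt"

    mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
    opts := mqtt.NewClientOptions().
        AddBroker("tcp://localhost:1883"). // placeholder broker address
        SetClientID("subscriber-1")
    client := mqtt.NewClient(opts)
    if token := client.Connect(); token.Wait() && token.Error() != nil {
        panic(token.Error())
    }

    // Subscribe with QoS 1; "+" is a single-level wildcard, so this matches
    // home/bedroom/temperature, home/kitchen/temperature, and so on.
    token := client.Subscribe("home/+/temperature", 1, func(c mqtt.Client, m mqtt.Message) {
        fmt.Printf("topic=%s retained=%v payload=%s\n", m.Topic(), m.Retained(), m.Payload())
    })
    token.Wait() // the SUBACK result is surfaced through the token
    if token.Error() != nil {
        panic(token.Error())
    }
    select {} // block forever so the callback keeps receiving messages
}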

Unsubscribe

The UNSUBSCRIBE packet mainly contains a Packet Identifier and a list of topics to unsubscribe from.

Unsubscribe Acknowledgement

The return for an UNSUBSCRIBE request is an UNSUBACK message that only contains a Packet Identifier matching the one in the UNSUBSCRIBE request. An UNSUBACK is sent regardless of whether the topic was previously subscribed to or not.

Conclusion

MQTT message delivery is implemented through subscribing to specific topics, then publishing messages to those topics.

There’s no need to create and maintain topics before publishing, nor worry about whether there are clients subscribing to specific topics.

The Publish/Subscribe model decouples publishers from subscribers, making it easy to support various messaging patterns such as grouping and broadcasting.

However, the Publish/Subscribe model also brings a challenge: if the publisher wants to know that a subscriber has received a message, this can only be accomplished at the application layer. For example, after receiving a message, the subscriber can publish a confirmation back to the publisher on another topic.

eBPF Introduction

What is eBPF

eBPF (extended Berkeley Packet Filter) is a virtual machine that runs inside the kernel. It allows kernel functionality to be extended safely and efficiently without modifying kernel code or loading additional kernel modules. Users inject BPF programs into the kernel as needed; these programs use the instruction set defined by eBPF, must follow certain rules, and only programs verified as safe are allowed to run.

The use of eBPF is on the rise, with an increasing number of eBPF programs being deployed. For instance, replacing iptables rules with eBPF allows packets sent from one application to be forwarded directly to the recipient's socket, shortening the data path and accelerating the data plane.

eBPF Core Principles


eBPF is divided into two parts: a program running in user space and a program running in kernel space. The user-space program loads BPF bytecode into the in-kernel eBPF virtual machine and, when needed, reads back event information and statistics from the kernel. The in-kernel virtual machine executes the bytecode when the corresponding kernel events fire; if data needs to be passed back, the results are sent to user space through a BPF map or through perf events in the perf buffer. The whole process is as follows:

  1. The written BPF program is compiled into BPF bytecode by tools such as Clang and LLVM (a BPF program is not a regular ELF binary but bytecode that runs in a virtual machine). The eBPF program also declares its configured event sources, which are the hooks it will be attached to.

  2. Before the program runs, the loader loads it into the kernel via the bpf system call. At this point the verifier checks the safety of the bytecode, for example verifying that loops are bounded and must terminate within a limited time. Once verification passes and the hooked event occurs, the bytecode logic is executed in the eBPF virtual machine.

  3. (Optional) Output each event individually, or return statistical data and call stack data through the BPF map, and transmit it to the user space.

eBPF supports a number of probe types, such as static tracing (socket, tracepoint, USDT) and dynamic tracing (kprobe, uprobe, etc.).

Dynamic Tracing

eBPF provides:

  • kprobe/kretprobe for the kernel, where k = kernel
  • uprobe/uretprobe for applications, where u = userland

These are used to detect information at the entry and return (ret) points of functions.

kprobe/kretprobe can probe most kernel functions, but for security reasons, some kernel functions do not allow probe installation, which could lead to failure in tracing.

uprobe/uretprobe are mechanisms to implement dynamic tracing of userland programs. Similar to kprobe/kretprobe, the difference is that the traced functions are in user programs.

Dynamic tracing relies on the symbol tables of the kernel and of applications. Inline or static functions cannot have probes installed directly; they must be reached through an offset. The nm or objdump command can be used to inspect an application's symbol table.

The principle of dynamic tracing is similar to GDB breakpoints. When a probe is installed on a code location, the kernel copies the target instruction and replaces it with an int3 breakpoint. Execution traps into the user-specified probe handler, then the backed-up instruction is executed. If a ret probe is also registered, it is executed on return. Finally, control jumps back to the original instruction sequence.

Next, let’s see how to perform dynamic tracing. First, write a main.go test code:

package main

func main() {
    println(sum(3, 3))
}

func sum(a, b int) int {
    return a + b
}

Next, disable inlining and compile the code with go build -gcflags="-l" ./main.go. If inlining is enabled, the Go compiler will likely eliminate the function call at compile time, so eBPF will not find a probe point for the function.

The next step is to write a bpftrace script main.pt:

BEGIN {
    printf("Hello!\n");
}

uprobe:./main:main.sum { printf("a: %d b: %d\n", reg("ax"), reg("bx")) }
uretprobe:./main:main.sum { printf("retval: %d\n", retval) }

END {
    printf("Bye!\n");
}

Finally, use bpftrace to monitor the function call: run bpftrace main.pt, execute ./main in another terminal, then press Ctrl+C to exit. The output is as follows:

Hello!
a: 3 b: 3
retval: 6
^CBye!

Static Tracing

“Static” means that the probe’s position and name are hardcoded in the code and determined at compile time. Static tracing works like a callback: it executes when activated and costs nothing when deactivated, which makes it more performant than dynamic tracing. Specifically:

  • tracepoint is in the kernel
  • USDT (Userland Statically Defined Tracing) is in the application

Static tracepoints already carry parameter information in the kernel and in applications, so you can access function arguments directly via args->parameter_name. You can inspect a tracepoint's parameters with bpftrace -lv, for example:

bpftrace -lv tracepoint:syscalls:sys_enter_openat
# Output:
# tracepoint:syscalls:sys_enter_openat
#     int __syscall_nr;
#     int dfd;
#     const char * filename;
#     int flags;
#     umode_t mode;

Static tracing accesses the filename parameter of sys_enter_openat through args->filename:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
# Output:
# Attaching 1 probe...
# uwsgi /proc/self/stat
# uwsgi /proc/self/fd
# uwsgi /proc/self/statm
# uwsgi /proc/loadavg
# uwsgi /proc/self/io
# ...

Here, comm is the command name of the process that triggered the event.

gRPC Introduction

gRPC Overview

gRPC is a language-neutral RPC framework developed and open-sourced by Google. Its core implementations are in C, Java, and Go; the C core also underpins bindings for languages such as C++, Node.js, and C#. With gRPC, a client application can call methods on a server running on a different machine as if it were calling a local method.

The simple steps to use gRPC, taking Go as an example, are as follows (a minimal sketch follows the list):

  1. Write the proto file and use protoc to generate the .pb.go file.
  2. On the server side, define a Server, create a Function to implement the interface -> net.Listen -> grpc.NewServer() -> pb.RegisterXXXServer(server, &Server{}) -> server.Serve(listener).
  3. On the client side, grpc.Dial to create a gRPC connection -> pb.NewXXXClient(conn) to create a client -> context.WithTimeout to set a timeout -> call the interface with client.Function -> If it is stream transmission, loop to read data.
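
The following is a minimal sketch of steps 2 and 3, assuming the helloworld proto shown later in this article has been compiled into a package pb (the import path example.com/helloworld is a placeholder):

package main

import (
    "context"
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    pb "example.com/helloworld" // placeholder path for the protoc-generated code
)

// server implements the Greeter service defined in the proto file.
type server struct {
    pb.UnimplementedGreeterServer
}

func (s *server) SayHello(ctx context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
    return &pb.HelloReply{Message: "Hello " + req.GetName()}, nil
}

func main() {
    // Server side: listen, create the gRPC server, register, serve.
    listener, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{})
    go func() {
        if err := s.Serve(listener); err != nil {
            log.Fatalf("serve: %v", err)
        }
    }()

    // Client side: dial (non-blocking by default), create a stub, call with a timeout.
    conn, err := grpc.Dial("localhost:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()
    client := pb.NewGreeterClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    reply, err := client.SayHello(ctx, &pb.HelloRequest{Name: "world"})
    if err != nil {
        log.Fatalf("SayHello: %v", err)
    }
    log.Println(reply.GetMessage())
}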

gRPC Concepts

Stream

Unary RPC

The client sends a request to the server and gets a response from the server, just like a normal function call.

Server streaming RPC

The client sends a request to the server and can get a data stream to read a series of messages. The client reads from the returned data stream until there are no more messages.
The server needs to send messages into the stream, for example:

for n := 0; n < 5; n++ {
    err := server.Send(&pb.StreamResponse{
        StreamValue: req.Data + strconv.Itoa(n),
    })
    if err != nil {
        return err
    }
}

The client obtains a stream object ‘stream’ from the gRPC call and loops to receive data, for example:

for {
    res, err := stream.Recv()
    // io.EOF indicates the message stream has ended
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatalf("ListStr get stream err: %v", err)
    }
    // Print the returned value
    log.Println(res.StreamValue)
}

Client streaming RPC

The client writes and sends a series of messages to the server using a provided data stream. Once the client finishes writing messages, it waits for the server to read these messages and return a response.

The server loops on stream.Recv() to receive the data stream; SendAndClose indicates that the server has finished receiving messages and sends the final response to the client, for example:

for {
    res, err := stream.Recv()
    // Message reception ends: send the result and close
    if err == io.EOF {
        return stream.SendAndClose(&proto.UploadResponse{})
    }
    if err != nil {
        return err
    }
    fmt.Println(res)
}

The client needs to call CloseAndRecv when it has finished sending data, for example:

for i := 1; i <= 10; i++ {
    img := &proto.Image{FileName: "image" + strconv.Itoa(i), File: "file data"}
    images := &proto.StreamImageList{Image: img}
    err := stream.Send(images)
    if err != nil {
        ctx.JSON(map[string]string{
            "err": err.Error(),
        })
        return
    }
}
// Finish sending, close, and get the message returned by the server
resp, err := stream.CloseAndRecv()

Bidirectional streaming RPC

Both sides can separately send a series of messages via a read-write data stream. These two streams operate independently, so the client and server can read and write in any order they wish, for example: the server can wait for all client messages before writing a response, or it can read a message then write a message, or use other combinations of reading and writing.

The server sends messages while receiving messages, for example:

for {
    res, err := stream.Recv()
    if err == io.EOF {
        return nil
    }
    if err != nil {
        return err
    }
    log.Println(res) // use the received message
    err = stream.Send(&proto.StreamSumData{Number: int32(i)})
    if err != nil {
        return err
    }
    i++
}

The client needs to explicitly signal that it has finished sending with CloseSend(); the server does not need an equivalent call, since its side closes implicitly when the handler returns, so exiting the loop is enough to disconnect. For example:

for i := 1; i <= 10; i++ {
    err = stream.Send(&proto.StreamRequest{})
    if err == io.EOF {
        break
    }
    if err != nil {
        return
    }
    res, err := stream.Recv()
    if err == io.EOF {
        break
    }
    if err != nil {
        return
    }
    log.Printf("res number: %d", res.Number)
}
stream.CloseSend()

Synchronous

A Channel represents a connection to a specific gRPC server's host and port. A Stub is created on top of the Channel, and RPC requests are actually issued through the Stub.

Asynchronous based on CQ

  • CQ: Notification queue for completed asynchronous operations
  • StartCall() + Finish(): Create asynchronous tasks
  • CQ.next(): Get completed asynchronous operations
  • Tag: Identifiers marking asynchronous actions

Multiple threads can operate on the same CQ. CQ.next() can receive not only the completion events of the current request being processed but also the events of other requests. Suppose the first request is waiting for its reply data transmission to complete, and a new request arrives. CQ.next() can get the events generated by the new request and start processing the new request in parallel without waiting for the first request’s transmission to complete.

Asynchronous based on Callback

On the client side, when sending a single request, the call takes, in addition to pointers to the Request and Reply, a callback function that receives the Status.

On the server side, the method returns not a status but a ServerUnaryReactor pointer. The reactor is obtained through the CallbackServerContext, and its Finish function is called to deliver the return status.

Context

  • Transfers custom Metadata between the client and server (see the sketch after this list).
  • Similar to HTTP headers, it controls call options such as compression, authentication, and timeouts.
  • Assists observability, for example by carrying a Trace ID.
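
As a sketch in Go of how custom metadata rides on the context (reusing the pb package from the earlier sketch; the trace-id key is an illustrative assumption):

import (
    "context"
    "log"

    "google.golang.org/grpc/metadata"
)

// Client side: attach custom metadata (e.g., a trace ID) to the outgoing context.
func callWithTraceID(ctx context.Context, client pb.GreeterClient) (*pb.HelloReply, error) {
    ctx = metadata.AppendToOutgoingContext(ctx, "trace-id", "abc123") // illustrative key/value
    return client.SayHello(ctx, &pb.HelloRequest{Name: "world"})
}

// Server side: read the metadata back from the incoming context.
func traceIDFrom(ctx context.Context) string {
    if md, ok := metadata.FromIncomingContext(ctx); ok {
        if vals := md.Get("trace-id"); len(vals) > 0 {
            return vals[0]
        }
    }
    return ""
}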

gRPC Communication Protocol

The gRPC communication protocol is based on standard HTTP/2 and supports bidirectional streaming, multiplexing over a single TCP connection (a request can be initiated without waiting for the previous request's result, and multiple requests share the same connection without interfering with each other), header compression, and server push. These features make gRPC economical with both power and network traffic on mobile devices.

gRPC Serialization Mechanism

Introduction to Protocol Buffers

gRPC serialization supports Protocol Buffers. ProtoBuf is a lightweight and efficient data structure serialization method, ensuring the high performance of gRPC calls. Its advantages include:

  • Serialized ProtoBuf payloads are much smaller than JSON or XML, and serialization/deserialization is faster.
  • Supports cross-platform and multi-language.
  • Easy to use because it provides a set of compilation tools that can automatically generate serialization and deserialization boilerplate code.

However, ProtoBuf is a binary protocol, the readability of the encoded binary data stream is poor, and debugging is troublesome.

The scalar value types supported by ProtoBuf are: double, float, int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, bool, string, and bytes.

Why is ProtoBuf fast?

  • Each field is stored contiguously in tag+value form; the tag is a small number that usually occupies only one byte, and the value is the field's value, so there are no redundant characters.
  • In addition, for relatively small integers, ProtoBuf uses Varint variable-length integer encoding, so such values do not need a full 4 bytes of storage (see the sketch below).
  • If the value is a string, its length cannot be determined from the tag, so ProtoBuf inserts a len field between the tag and the value to record the string's length; string scanning is therefore unnecessary and parsing is very fast.
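
A minimal sketch of Varint encoding using Go's standard encoding/binary package, which implements the same base-128 varint scheme ProtoBuf uses:

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    buf := make([]byte, binary.MaxVarintLen64)

    // Small values fit in a single byte.
    n := binary.PutUvarint(buf, 1)
    fmt.Printf("varint(1)   = % x (%d byte)\n", buf[:n], n) // 01 (1 byte)

    // 300 takes two bytes instead of a fixed four.
    n = binary.PutUvarint(buf, 300)
    fmt.Printf("varint(300) = % x (%d bytes)\n", buf[:n], n) // ac 02 (2 bytes)
}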

Definition of IDL file

Define the data structure of RPC requests and responses in the proto file according to the syntax of Protocol Buffers, for example:

syntax = "proto3";

option go_package = "../helloworld";

package helloworld;

service Greeter {
    rpc SayHello (HelloRequest) returns (HelloReply) {}
}

message HelloRequest {
    string name = 1;
}

message HelloReply {
    string message = 1;
}
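
This file can then be compiled to Go with protoc; a sketch assuming the protoc-gen-go and protoc-gen-go-grpc plugins are installed:

protoc --go_out=. --go-grpc_out=. helloworld.proto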

Here, syntax = "proto3" indicates version 3 of Protocol Buffers; there are many syntax changes between v3 and v2, so pay special attention when using them. go_package indicates the package path where the generated code will live. Data structures are defined with the message keyword, and a field is declared as:

Data_type Field_name = Tag

Messages support nesting, i.e., message A can reference message B as one of its fields. This expresses an aggregation relationship between objects: object A aggregates (references) object B.

For some common data structures, such as common Header, you can define the proto file of the common data structure separately, and then import it for use, for example:

import "other_protofile.proto";

Imports are not transitive by default: if a.proto imports b.proto and b.proto imports c.proto, then a.proto can use the messages defined in c.proto only if b.proto declares import public "c.proto".


RDMA Introduction

RDMA (Remote Direct Memory Access) refers to remote direct memory access, which is a method of transferring data in a buffer between two applications over a network.

  • Remote: Data is transferred over a network with remote machines.
  • Direct: The kernel does not participate; all of the work of sending and receiving is offloaded to the network card.
  • Memory: Data is transferred directly between user space virtual memory and the network card without involving the system kernel, with no additional data movement or copying.
  • Access: Operations such as send, receive, read, write, atomic, etc.

RDMA is different from traditional network interfaces because it bypasses the operating system kernel. This gives programs that have implemented RDMA the following characteristics:

  1. The lowest possible latency
  2. The highest possible throughput
  3. The smallest CPU footprint (CPU involvement is minimized)

RDMA Working Principles

During RDMA communication, for send/receive as well as read/write operations, the network card moves data directly to and from memory regions that were registered in advance for data transfer. The process is fast and requires no CPU participation: the RDMA network card takes over work the CPU would otherwise do, freeing it for other computation and services.

The working process of RDMA is as follows:

  1. When an application issues an RDMA read or write request, no data is copied. With no kernel involvement, the request is sent from the user-space application directly to the local network card.
  2. The network card reads the content of the buffer and transmits it to the remote network card over the network.
  3. The RDMA information transmitted over the network includes the virtual memory address of the target machine and the data itself. The completion of the request can be completely handled in user space (by polling the RDMA completion queue in user space). RDMA operations enable applications to read data from or write data to the memory of a remote application.

Therefore, RDMA can be understood simply as using hardware and network technology that lets a network card directly read and write a remote server's memory, ultimately achieving high bandwidth, low latency, and low resource usage. The application does not participate in the data transfer; it only specifies the memory addresses to read or write, starts the transfer, and waits for it to complete.

RDMA Data Transmission

  1. RDMA Send/Recv
    This is similar to TCP/IP’s send/recv, but different in that RDMA is based on a message data transfer protocol (not a byte stream transfer protocol), and all packet assemblies are done on RDMA hardware. This means that the bottom 4 layers of the OSI model (Transport Layer, Network Layer, Data Link Layer, Physical Layer) are all completed on RDMA hardware.

  2. RDMA Read
    The essence of RDMA read operation is a Pull operation, pulling data from remote system memory back to local system memory.

  3. RDMA Write
    The essence of RDMA write operation is a Push operation, pushing data from local system memory to remote system memory.

  4. RDMA Write with Immediate Data (RDMA write operation supporting immediate data)
    RDMA write operation supporting immediate data essentially pushes out-of-band data to the remote system, which is similar to out-of-band data in TCP. Optionally, an Immediate 4-byte value can be sent along with the data buffer. This value is presented as part of the receipt notice to the receiver and is not included in the data buffer.

RDMA Programming Basics

To use RDMA, we need a network card that supports RDMA communication (i.e., one that implements an RDMA engine). We call this card an HCA (Host Channel Adapter). Through the PCIe (Peripheral Component Interconnect Express) bus, the adapter creates a channel from the RDMA engine to the application's memory. A good HCA implements in hardware all the logic needed by the RDMA protocol, including packetization, reassembly, flow control, and reliability; the application then only has to manage its buffers.

In RDMA programming, we use a command channel to ask the kernel-mode driver to establish a data channel, which lets us bypass the kernel entirely when moving data. Once this data channel is established, we can read and write data buffers directly. The API for establishing the data channel is called verbs; the verbs API is maintained by a Linux open-source project, the OpenFabrics Enterprise Distribution (OFED).

Key Concepts

RDMA operation starts with memory registration. Registering memory tells the kernel that this memory range is in use by your application, and tells the HCA to address this range and prepare a channel from the card to the memory. This action is called registering a Memory Region (MR). At registration time you set the region's access permissions (including local write, remote read, remote write, atomic, and bind). The verbs API ibv_reg_mr registers an MR and returns the MR's local and remote keys: the local key is used by the local HCA to access local memory, while the remote key is given to the remote HCA so that it can access this memory. Once an MR is registered, it can be used for any RDMA operation.

RDMA communication is based on a set of three queues: the Send Queue (SQ), the Receive Queue (RQ), and the Completion Queue (CQ). The SQ and RQ are responsible for scheduling work and are always created in pairs, called a Queue Pair (QP). The CQ is used to notify the application when instructions placed on the work queues have completed.

When a user places instructions on the work queue, it means telling the HCA which buffers need to be sent or used to receive data. These instructions are small structures, called Work Requests (WR) or Work Queue Elements (WQE). A WQE mainly contains a pointer to a buffer. A WQE placed in the Send Queue (SQ) contains a pointer to a message to be sent; a pointer in a WQE placed in the Receive Queue points to a buffer, which is used to store the message to be received.

RDMA is an asynchronous transmission mechanism. Therefore, we can place multiple send or receive WQEs in the work queue at once. The HCA will process these WQEs as quickly as possible in order. When a WQE is processed, the data is moved. Once the transmission is completed, the HCA creates a Completion Queue Element (CQE) with a successful status and places it in the Completion Queue (CQ). If the transmission fails for some reason, the HCA also creates a CQE with a failed status and places it in the CQ.

Example (Send/Recv)

Step 1: Both system A and B create their own QPs and CQs, and register the corresponding memory regions (MR) for the upcoming RDMA transfer. System A identifies a buffer, the data of which will be moved to system B. System B allocates an empty buffer to store data sent from system A.

Step 2: System B creates a WQE and places it in its Receive Queue (RQ). This WQE contains a pointer, which points to a memory buffer to store received data. System A also creates a WQE and places it in its Send Queue (SQ), the pointer in the WQE points to a memory buffer, the data of which will be transmitted.

Step 3: The HCA on system A, working entirely in hardware, keeps checking the send queue for WQEs. It consumes the WQE from system A and streams the data in the memory region to system B. When the data stream begins to arrive at system B, the HCA on system B consumes the WQE from its receive queue and places the data in the designated buffer. The data stream travels over the high-speed channel, completely bypassing the operating system kernel.

Note: a WQE contains pointers (addresses) into user-space memory. In receive/send mode, both parties must prepare their own WQEs in the work queue in advance, and the HCA writes a CQE to the completion queue (CQ) upon completion.

Step 4: When the data movement is completed, the HCA creates a CQE. This CQE is placed in the Completion Queue (CQ), indicating that data transmission has been completed. The HCA creates a CQE for each consumed WQE. Therefore, placing a CQE in the completion queue of system A means that the send operation of the corresponding WQE has been completed. Similarly, a CQE will also be placed in the completion queue of system B, indicating that the receive operation of the corresponding WQE has been completed. If an error occurs, the HCA will still create a CQE. The CQE contains a field to record the transmission status.

In IB or RoCE, the total time to transmit data in a small buffer is about 1.3µs. By simultaneously creating a lot of WQEs, data stored in millions of buffers can be transmitted in one second.

RDMA Operation Details

In RDMA transfers, Send/Recv is a bilateral operation: it requires participation from both communicating parties, and the Recv must be posted before the peer's Send so the data has somewhere to land (if the peer will not send anything, no Recv needs to be posted). The process therefore resembles traditional communication; the difference lies in RDMA's zero-copy and kernel bypass, which yield low latency. Send/Recv is often used for transmitting short control messages.

Write/Read is a unilateral operation: as the name suggests, the read or write is executed by one party alone. In practice, Write/Read operations are executed by the client, and the server does not need to do anything. In an RDMA Write, the client pushes data directly from its local buffer into a contiguous block of the remote QP's virtual address space (the underlying physical memory need not be contiguous), so it must know the destination address (remote addr) and access permission (remote key). In an RDMA Read, the client pulls data from a contiguous block of the remote QP's virtual address space into a local destination buffer, so it likewise needs the remote memory address and access permission. Unilateral operations are commonly used for bulk data transfer.

It can be seen that in the unilateral operation process, the client needs to know the remote addr and remote key of the remote QP. These two pieces of information can be exchanged through Send/Recv operations.

RDMA Unilateral Operation (READ/WRITE)

READ and WRITE are unilateral operations: only the local end needs to know the source and destination addresses of the data. The remote application is unaware of the communication; the read or write is completed via RDMA between the remote network card and the remote application's buffer, and the remote network card returns the result to the local end as encapsulated messages.

For unilateral operations, taking storage networking as an example, the READ process is as follows:

  1. First, A and B establish a connection; the QPs have been created and initialized.
  2. The data sits in B's buffer at virtual address VB. Note that VB must be pre-registered with B's network card as a memory region, yielding a remote key, which amounts to permission to operate on this buffer via RDMA.
  3. B packs the data address VB and the key into a dedicated message and sends it to A, effectively handing A the right to operate on the buffer. At the same time, B posts a WR in its WQ to receive the transfer status that A will return.
  4. After A receives VB and the remote key from B, A's network card packages them together with its own local address VA into an RDMA READ request and sends it to B. B's data is then stored at A's virtual address VA without any software participation on either side.
  5. Once A has stored the data, it returns the status of the whole transfer to B.

The WRITE process is similar to READ. This unilateral transfer mode is the biggest difference between RDMA and traditional network transmission: it only requires direct access to a remote virtual address and no participation from the remote application, which makes it well suited to bulk data transmission.

RDMA Bilateral Operation (SEND/RECEIVE)

SEND/RECEIVE in RDMA is a bilateral operation, that is, the remote application must be aware of and participate in the completion of the transmission and reception. In practice, SEND/RECEIVE is often used for connection control messages, while data messages are mostly completed through READ/WRITE.

Taking the bilateral operation as an example, the process of host A sending data to host B (hereinafter referred to as A and B) is as follows:

  1. First of all, A and B must create and initialize their own QPs and CQs.
  2. A and B post WQEs in their own WQs. For A, the WQ is the SQ and the WQE describes data about to be sent; for B, the WQ is the RQ and the WQE describes a buffer for storing incoming data.
  3. A's network card asynchronously picks up A's WQE, sees that it is a SEND message, and sends the data from the buffer directly to B. When the data stream arrives at B's network card, B's WQE is consumed and the data is stored at the location the WQE points to.
  4. When the communication completes, a CQE indicating send completion is generated in A's CQ, and a CQE indicating receive completion is generated in B's CQ. Processing each WQE in a WQ produces one CQE.

Bilateral operation resembles the buffer-pool model of traditional networking: sender and receiver participate in the same way, the difference being zero-copy and kernel bypass. For RDMA this is actually the more complex transmission mode, and it is typically used for short control messages.
