From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees

Background

Bourbon is built on WiscKey, an LSM-tree storage system that is faster than LevelDB and RocksDB. Through an experimental analysis of WiscKey, the paper distills five learning guidelines. Bourbon applies these guidelines, combining WiscKey with learned indexes to speed up lookups.

Design and Implementation

Five learning guidelines

  1. Favor learning files at lower levels (their files live longer)
  2. Wait before learning a file (successive compactions can generate very short-lived new files at each level)
  3. Do not neglect files at higher levels (they serve a large share of internal lookups)
  4. Be workload- and data-aware (lookup patterns vary widely across scenarios)
  5. Do not learn levels for write-heavy workloads (levels change too quickly)

Beneficial regimes

A learned index shortens only the indexing step of a lookup; it does not speed up data access. It therefore pays off mainly in regimes where indexing dominates lookup time, such as when the data resides in memory or on fast storage.

Bourbon Learning

Bourbon models the data with piecewise linear regression (PLR), which has low overhead both for learning and for lookups, as well as a small space footprint. Bourbon can learn individual SSTable files (file learning) or whole levels (level learning). Level learning yields higher gains for read-only workloads, but on mixed workloads it performs worse than file learning.

To train a PLR model, Bourbon uses the Greedy-PLR algorithm, which processes one data point at a time: if a point cannot be added to the current line segment without exceeding the defined error bound, a new segment is created and the point is added to it. In the end, Greedy-PLR produces a set of line segments that represent the data. Its running time is linear in the number of data points.
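
As an illustration, here is a minimal C sketch of a simplified greedy PLR in the spirit of Greedy-PLR (an illustration, not the paper's exact algorithm): each segment is anchored at its first point, and a feasible slope interval is narrowed as points arrive; when the interval becomes empty, the segment is emitted and a new one starts. `DELTA` and the `emit` helper are assumptions.

```c
#include <stdio.h>

#define DELTA 8.0   /* error bound on the predicted position (assumed value) */

struct segment { double x0, y0, slope; };

/* hypothetical consumer of finished segments */
static void emit(struct segment s) {
    printf("segment: start_key=%.0f slope=%.4f\n", s.x0, s.slope);
}

/* xs: keys in sorted order; ys: their positions; n: number of points */
void greedy_plr(const double *xs, const double *ys, int n) {
    int i = 0;
    while (i < n) {
        double x0 = xs[i], y0 = ys[i];   /* anchor the segment at its first point */
        double lo = -1e300, hi = 1e300;  /* feasible slope interval */
        int j;
        for (j = i + 1; j < n; j++) {
            double dx = xs[j] - x0;
            double l = (ys[j] - DELTA - y0) / dx;  /* slope bounds that keep   */
            double h = (ys[j] + DELTA - y0) / dx;  /* point j within +/- DELTA */
            double nlo = l > lo ? l : lo;
            double nhi = h < hi ? h : hi;
            if (nlo > nhi) break;        /* point j cannot join: close the segment */
            lo = nlo;
            hi = nhi;
        }
        struct segment s = { x0, y0, j > i + 1 ? (lo + hi) / 2.0 : 0.0 };
        emit(s);
        i = j;   /* the next segment starts at the first point that did not fit */
    }
}
```

At lookup time, one would binary-search the segments by start key, predict `pos = y0 + slope * (key - x0)`, and then search only within `±DELTA` of the prediction.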

Cost-benefit analyzer (CBA)

The cost-benefit analyzer (CBA) filters out short-lived files that are not worth learning. To do so, it uses statistics from previous files at the same level.

To filter out short-lived files, Bourbon waits for a time threshold $T_{wait}$ before learning a file. Learning a file takes at most about 40 ms, so Bourbon sets $T_{wait}$ to 50 ms. However, learning a long-lived file may not be worthwhile either: the authors' experiments show that files at lower levels usually live longer, yet for some workloads and datasets they serve far fewer lookups than files at higher levels. Hence, besides the cost of building a model, the potential benefit must also be considered: a model is worthwhile if its benefit ($B_{model}$) outweighs the cost of building it ($C_{model}$).

Estimating the cost ($C_{model}$)

If learning is assumed to happen in the background (with plenty of idle cores), $C_{model}$ would be zero. Bourbon instead takes the conservative view that learning threads interfere with the system and slow it down, so the paper defines $C_{model}$ as equal to $T_{build}$, the time to train a PLR model for the file. Since $T_{build}$ scales linearly with the number of data points in the file, it is defined as the number of data points multiplied by the average time to train one data point.

Estimating the benefit ($B_{model}$)

Bourbon defines the benefit of a model as follows:

$$
B_{model} = (T_b - T_m) * N
$$

  • $T_{b}$: the average lookup time on the baseline path
  • $T_{m}$: the average lookup time on the model path
  • $N$: the number of lookups the file serves during its lifetime

Lookups are further split into negative and positive ones, because most negative lookups already terminate at the filter. The final benefit model is:

$$
B_{model} = ((T_{n.b} - T_{n.m}) * N_n) + ((T_{p.b} - T_{p.m}) * N_p)
$$

  • $N_{n}$ & $N_{p}$: the numbers of negative and positive lookups
  • $T_{n.b}$ & $T_{p.b}$: the average times of negative and positive lookups on the baseline path
  • $T_{n.m}$ & $T_{p.m}$: the average times of negative and positive lookups on the model path

To estimate the lookup counts ($N_{n}$ & $N_{p}$) and lookup times ($T_{n.b}$ & $T_{p.b}$), CBA maintains statistics over the files' lifetimes and over the files at the same level (since statistics differ significantly across levels).

These estimates are made during $T_{wait}$:

  • $T_{n.b}$ & $T_{p.b}$: during $T_{wait}$, lookups go through the baseline path, and their times are used to estimate $T_{n.b}$ and $T_{p.b}$
  • $T_{n.m}$ & $T_{p.m}$: estimated from the averages of all other files at the same level
  • $N_{n}$ & $N_{p}$: CBA first takes the average negative and positive lookup counts of the other files at the level, then scales them by a factor $f = s / s'$, where $s$ is the size of the file and $s'$ is the average size of files at that level

If $C_{model} < B_{model}$, training starts for the file. If several files become eligible at the same time, they are placed in a max-priority queue so that the files with the highest benefit are trained first.
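
The decision itself boils down to a small arithmetic comparison. Below is a minimal C sketch under assumed units and constants; the `file_stats` struct, `TRAIN_USEC_PER_KEY`, and all field names are illustrative, not the paper's code.

```c
/* All times are in microseconds. */
struct file_stats {
    double t_neg_base, t_pos_base;    /* T_n.b, T_p.b: measured during T_wait   */
    double t_neg_model, t_pos_model;  /* T_n.m, T_p.m: averages over the level  */
    double n_neg, n_pos;              /* N_n, N_p: level averages scaled by     */
                                      /* f = s / s' (file size / avg file size) */
    long   num_keys;
};

#define TRAIN_USEC_PER_KEY 0.05       /* assumed average per-point training time */

static int should_learn(const struct file_stats *s) {
    double c_model = s->num_keys * TRAIN_USEC_PER_KEY;  /* C_model = T_build */
    double b_model = (s->t_neg_base - s->t_neg_model) * s->n_neg
                   + (s->t_pos_base - s->t_pos_model) * s->n_pos;
    return b_model > c_model;  /* learn only if the benefit outweighs the cost */
}
```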

Possible future improvements include:

  1. More accurate estimates of $N_{n}$, $N_{p}$, $T_{n.m}$, and $T_{p.m}$
  2. A better way to compute the cost $C_{model}$, beyond just $T_{build}$
  3. Ordering the training queue by a function other than $B_{model} - C_{model}$

Bourbon: Putting it All Together

Evaluation

References

  1. From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees
  2. Presentation video at OSDI '20
  3. Presentation slides at OSDI '20

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Overview

Octopus is a distributed file system built on NVM and RDMA. Its main contributions are:

  • A new RDMA-based I/O flow that accesses a shared persistent memory pool directly, without going through layered file system calls, and that actively fetches or pushes data at the client to rebalance server and network loads.
  • A metadata mechanism redesigned around RDMA primitives, including self-identified metadata RPC for low-latency notification and collect-dispatch distributed transactions for low consistency overhead.
  • Effective use of the hardware's performance, significantly outperforming existing RDMA-optimized distributed file systems.

Challenges

  1. In the past, storage media were slow: accessing persistent storage accounted for nearly the entire cost of a file system operation, and optimizing the software stack had negligible effect on overall performance, so the storage layer and the network layer were loosely coupled to keep systems easy to understand and implement. For a file system on NVM, whose access latency approaches that of DRAM, the software stack now accounts for nearly the entire cost of an operation, and reducing software overhead becomes the most important optimization.
  2. Existing file systems use the high bandwidth of the new hardware inefficiently, for four main reasons: (1) data is copied back and forth between application buffers, the file system page cache, NIC buffers, and other areas, adding software overhead; (2) servers must handle a huge number of requests per second, making the server-side CPU a bottleneck; (3) traditional event-driven RPC has relatively high latency; (4) distributed transactions used in distributed file systems need multiple network round trips and fairly complex processing logic, making consistency expensive.

Design

Overall Architecture

Octopus spreads file data across the cluster nodes: each file is placed on one node by consistent hashing of its path name, files are not replicated, and RDMA is tightly coupled with the storage layer.
Each node's NVM is divided into a private part and a shared part. The private part holds the metadata of the node's files and is accessible only within the node; clients reach it via RPC. The shared part holds the node's file data, which clients can read and write directly with one-sided RDMA primitives. Octopus implements its RPC with RDMA write-with-imm.

Data Layout

Each node's NVM in Octopus is divided into six zones, each of which is either shared or private (a layout sketch follows the list). The six zones are:

  • Super Block: the file system's superblock
  • Message Pool: communication buffers for metadata RPC
  • Metadata Index Zone: a hash table indexing the files
  • Metadata Zone: where the actual file metadata is stored
  • Data Zone: where file data is stored
  • Log Zone: transaction logs
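
As a rough illustration, the per-node partitioning could be described by a table of zone offsets like the C sketch below; the struct and its field names are hypothetical, matching only the list above, and the paper does not publish this structure.

```c
#include <stddef.h>

/* Hypothetical view of one node's NVM partitioning: the data zone holds file
 * data that clients read/write with one-sided RDMA, while file metadata is
 * reached through metadata RPC. */
struct octopus_layout {
    size_t super_block_off;   /* Super Block: file-system superblock          */
    size_t message_pool_off;  /* Message Pool: metadata RPC buffers           */
    size_t meta_index_off;    /* Metadata Index Zone: hash table over files   */
    size_t metadata_off;      /* Metadata Zone: per-file metadata             */
    size_t data_off;          /* Data Zone: file data (shared, RDMA-readable) */
    size_t log_off;           /* Log Zone: transaction logs                   */
    size_t total_len;         /* total size of the NVM region                 */
};
```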

High-Throughput Data I/O

Octopus introduces a shared persistent memory pool to reduce data copies and obtain higher bandwidth, and performs I/O actively at the client to rebalance server and network load for higher throughput.

Shared Persistent Memory Pool

As shown in the figure, in GlusterFS the same piece of data is copied seven times along the transfer path. Octopus removes the layered abstractions with the shared persistent memory pool and removes caching via RDMA, reducing the copies to four.

Client-Active Data I/O

Octopus proposes the client-active data I/O access mode, which fully exploits RDMA's strengths and reduces the server-side CPU load.

Traditional data exchange is server-active, as shown in part (a) of the figure: the client sends a data access request, and the server locates the data, reads it, and returns it to the client. Client-active data I/O, shown in part (b), works differently: the client sends read/write requests and accesses metadata using self-identified metadata RPC, and then reads or writes the file data directly with RDMA Read/Write according to the metadata.
Compared with the server-active mode, the server CPU does less work per request, since reading the data is shifted to the client, which lowers the server-side CPU load.

Low-Latency Metadata Access

RDMA offers microsecond-scale latency for remote data access. To carry this advantage through to the file system, Octopus reworks its metadata RPC and distributed transactions around RDMA write and atomic primitives.

Self-identified metadata RPC

RDMA's low latency and high bandwidth can raise RPC throughput. Most prior RDMA RPCs use two-sided RDMA primitives, which have relatively high latency and low throughput, eroding RDMA's advantage. One-sided RDMA primitives, on the other hand, do not notify the CPU on completion, so an RPC built on them would need a dedicated server thread polling the message buffers, increasing CPU load further.
To keep RDMA's low latency while limiting CPU load, Octopus proposes self-identified metadata RPC, which uses the RDMA write_with_imm verb to attach the sender's identity to the request. write_with_imm differs from a plain RDMA Write in two ways: (1) it can carry an immediate value in the message; (2) it notifies the server CPU as soon as the server NIC receives the request, consuming one Recv WR from the server's QP. The server thus learns of RPC requests promptly, with no CPU polling. The immediate field carries the client identifier node_id and the offset of the client's receive buffer. node_id lets the server locate the corresponding message without scanning the whole buffer. After processing the request, the server returns the result with an RDMA Write to offset offset in the receive buffer of client node_id. A sketch of the client side follows.

Collect-Dispatch Transaction

Some file system operations, such as mkdir, mknod, rmnod, and rmdir, run as distributed transactions: they must update metadata atomically across multiple servers. Earlier distributed file systems typically used two-phase commit (2PC), which is expensive because of its distributed logging and its coordination of locks and logs.
Octopus designs a new distributed transaction protocol, the collect-dispatch transaction, with a collect phase and a dispatch phase. The protocol uses RDMA primitives for inter-server interaction, and its key ideas concern two aspects, crash consistency and concurrency control:

  1. Local logging with remote in-place updates for crash consistency. In the collect phase, Octopus collects the read/write sets from the participants, executes the transaction locally at the coordinator, and writes a local log. Participants keep no log records, so there is no complex negotiation to persist logs between coordinator and participants, which reduces protocol overhead. In the dispatch phase, the coordinator distributes the updated write sets to the participants with RDMA write and releases the corresponding locks with RDMA atomic primitives, without involving participant CPUs.
  2. A mix of GCC and RDMA locks for concurrency control. In a collect-dispatch transaction, the coordinator and participants take their local locks with GCC compare-and-swap. To unlock, the coordinator releases its local lock with GCC compare-and-swap and each remote participant's lock with RDMA compare-and-swap, so unlocking never involves the participants' CPUs, which streamlines the unlock step (see the sketch below).
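
Below is a minimal sketch of that unlock path: a GCC builtin compare-and-swap for the coordinator's local lock and an RDMA compare-and-swap for each participant's lock. The lock encoding (1 = held, 0 = free), function signatures, and setup are assumptions.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Coordinator releases its own lock with a GCC builtin CAS: 1 (held) -> 0. */
static void release_local_lock(volatile uint64_t *lock) {
    __sync_bool_compare_and_swap(lock, 1, 0);
}

/* Coordinator releases a participant's lock with RDMA CAS; the participant's
 * CPU is never involved.  `scratch` is a registered 8-byte buffer where the
 * NIC deposits the lock word's previous value. */
static int release_remote_lock(struct ibv_qp *qp, struct ibv_mr *scratch_mr,
                               uint64_t *scratch,
                               uint64_t remote_lock_addr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)scratch,
        .length = sizeof(uint64_t),
        .lkey   = scratch_mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_lock_addr;  /* participant's lock word */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 1;   /* expected: held */
    wr.wr.atomic.swap        = 0;   /* new value: free */
    return ibv_post_send(qp, &wr, &bad);
}
```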

In total, a collect-dispatch transaction needs one RPC (COLLECT-REQ and WRITE-SET), one RDMA write (UPDATE WRITE-SET), and one RDMA atomic (REMOTE UNLOCK), whereas two-phase commit needs two RPCs. Collect-dispatch is cheaper than 2PC because: (1) one RPC has higher latency than one RDMA write/atomic primitive; (2) RDMA write/atomic primitives do not involve the server-side CPU.

Summary

The core idea of Octopus is the tight coupling of RDMA and NVM, with a set of mechanisms for high-throughput data I/O and low-latency metadata access. On the distributed side, however, it has no replication and does not address load balancing; it simply spreads data across nodes by consistent hashing.

References

  1. Lu, Youyou, et al. “Octopus: An RDMA-Enabled Distributed Persistent Memory File System.” USENIX ATC ’17 Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, 2017, pp. 773–785.

RDMA Introduction

RDMA (Remote Direct Memory Access) is a method of transferring data between the buffers of two applications over a network by accessing memory directly.

  • Remote: Data is transferred over a network with remote machines.
  • Direct: The kernel does not participate; all the work related to sending and receiving is offloaded to the network card.
  • Memory: Data is transferred directly between user space virtual memory and the network card without involving the system kernel, with no additional data movement or copying.
  • Access: Operations such as send, receive, read, write, atomic, etc.

RDMA is different from traditional network interfaces because it bypasses the operating system kernel. This gives programs that use RDMA the following characteristics:

  1. The lowest possible latency
  2. The highest possible throughput
  3. The smallest CPU footprint (i.e., CPU involvement is minimized)

RDMA Working Principles

During RDMA communication, for sends and receives as well as read/write operations, the network card transfers data directly to and from memory regions that were registered for data transfer in advance. This path is fast and requires no CPU participation: the RDMA network card takes over work the CPU would otherwise do, freeing resources for other computation and services.

The working process of RDMA is as follows:

  1. When an application issues an RDMA read or write request, no data is copied. With no kernel memory involved, the RDMA request is sent from the application running in user space to the local network card.
  2. The network card reads the content of the buffer and transmits it to the remote network card over the network.
  3. The RDMA message transmitted over the network contains the virtual memory address of the target machine and the data itself. Request completion can be handled entirely in user space, by polling the RDMA completion queue. RDMA operations let an application read data from, or write data to, the memory of a remote application.

Therefore, RDMA can be understood simply as the use of appropriate hardware and network technology to let a network card read and write a remote server's memory directly, achieving high bandwidth, low latency, and low resource utilization. The application does not participate in the data transfer; it only specifies the memory addresses to read or write, starts the transfer, and waits for it to complete.

RDMA Data Transmission

  1. RDMA Send/Recv
    This is similar to TCP/IP send/recv, but differs in that RDMA is a message-based transfer protocol (not a byte-stream protocol), and all packet assembly is done in RDMA hardware. This means the bottom four layers of the OSI model (transport, network, data link, and physical) are completed in RDMA hardware.

  2. RDMA Read
    The essence of RDMA read operation is a Pull operation, pulling data from remote system memory back to local system memory.

  3. RDMA Write
    The essence of RDMA write operation is a Push operation, pushing data from local system memory to remote system memory.

  4. RDMA Write with Immediate Data (RDMA write operation supporting immediate data)
    RDMA write with immediate data essentially pushes out-of-band data to the remote system, similar to out-of-band data in TCP. Optionally, a 4-byte immediate value can be sent along with the data buffer; it is presented to the receiver as part of the receive notification and is not stored in the data buffer.

RDMA Programming Basics

To use RDMA, we need a network card that supports RDMA communication (i.e., one that implements an RDMA engine). We call this card an HCA (Host Channel Adapter). Through the PCIe (Peripheral Component Interconnect Express) bus, the adapter creates a channel from the RDMA engine to the application's memory. A good HCA implements in hardware all the logic needed by the RDMA protocol, including packetization and reassembly as well as flow control and reliability guarantees, so from the application's perspective, it only has to deal with its buffers.

As shown in the figure above, in RDMA programming we use the command channel to call into the kernel-mode driver and establish the data channel, which lets us bypass the kernel entirely when moving data. Once the data channel is established, we can read and write data buffers directly. The API for doing so is called verbs; it is maintained by a Linux open-source project, the OpenFabrics Enterprise Distribution (OFED).

Key Concepts

RDMA operation begins with memory. When you register memory, you tell the kernel that this segment of memory is owned by your application, and you tell the HCA that it may address this segment and prepare to open a channel from the card to it. This action is called registering a memory region (MR). At registration you can set the region's access permissions (local write, remote read, remote write, atomic, and bind). The verbs API ibv_reg_mr registers an MR and returns the MR's local and remote keys: the local key is used by the local HCA to access local memory, while the remote key is given to the remote HCA so it can access this memory. Once the MR is registered, the memory can be used for any RDMA operation. In the figure below, we can see the registered memory region (MR) and the buffers within it used by the communication queues. A registration sketch follows.
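
A minimal sketch of such a registration (the protection domain `pd` comes from ibv_alloc_pd; error handling trimmed):

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len) {
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    /* The access flags grant the permissions listed above to the local
     * and remote HCAs. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_ATOMIC);
    /* mr->lkey is used locally; mr->rkey is handed to the peer. */
    return mr;
}
```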

RDMA communication is based on a set of three queues: the send queue (SQ), the receive queue (RQ), and the completion queue (CQ). The SQ and RQ are responsible for scheduling work; they are always created in pairs and together are called a queue pair (QP). The CQ is used to notify us when instructions placed on the work queues have been completed.

When a user places an instruction on a work queue, it tells the HCA which buffers need to be sent or used to receive data. These instructions are small structures called work requests (WRs) or work queue elements (WQEs). A WQE mainly contains a pointer to a buffer: a WQE placed in the send queue (SQ) points to the message to be sent; a WQE placed in the receive queue points to a buffer where an incoming message will be stored.

RDMA is an asynchronous transport mechanism, so we can place multiple send or receive WQEs on a work queue at once. The HCA processes the WQEs in order, as quickly as possible. When a WQE is processed, the data is moved. Once a transfer completes, the HCA creates a completion queue element (CQE) with a success status and places it in the completion queue (CQ); if a transfer fails, the HCA likewise creates a CQE, with a failure status.
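
A minimal libibverbs sketch of this flow: post one signaled send WQE on the SQ, then busy-poll the CQ until its CQE appears (QP, CQ, and MR setup omitted):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;        /* two-sided send */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* request a CQE on completion */
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    struct ibv_wc wc;                    /* the completion (CQE) */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                /* busy-poll until it appears */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```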

Example (Send/Recv)

Step 1: Both system A and B create their own QPs and CQs, and register the corresponding memory regions (MR) for the upcoming RDMA transfer. System A identifies a buffer, the data of which will be moved to system B. System B allocates an empty buffer to store data sent from system A.

Step 2: System B creates a WQE and places it in its Receive Queue (RQ). This WQE contains a pointer to a memory buffer where the received data will be stored. System A also creates a WQE and places it in its Send Queue (SQ); that WQE's pointer references the memory buffer whose data will be transmitted. (A sketch of posting the receive WQE follows the steps.)

Step 3: The HCA on system A, working in hardware, keeps checking the send queue for WQEs. It consumes the WQE from system A and streams the data from the memory region to system B. When the data begins arriving at system B, the HCA on system B consumes the WQE from system B and places the data into the designated buffer. The data stream travels over the high-speed channel, completely bypassing the operating system kernel.

Note: The arrows on the WQEs represent pointers (addresses) to user-space memory. In receive/send mode, both parties must prepare their own WQEs in their work queues in advance, and the HCA writes a CQE to the CQ upon completion.

Step 4: When the data movement is completed, the HCA creates a CQE. This CQE is placed in the Completion Queue (CQ), indicating that data transmission has been completed. The HCA creates a CQE for each consumed WQE. Therefore, placing a CQE in the completion queue of system A means that the send operation of the corresponding WQE has been completed. Similarly, a CQE will also be placed in the completion queue of system B, indicating that the receive operation of the corresponding WQE has been completed. If an error occurs, the HCA will still create a CQE. The CQE contains a field to record the transmission status.
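
For the receive side described in step 2, a minimal sketch of posting the receive WQE (QP and MR setup omitted; the completion is later reaped from the CQ as in step 4):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* where the incoming message will land */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)buf;    /* echoed back in the matching CQE */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    return ibv_post_recv(qp, &wr, &bad);
}
```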

In IB or RoCE, the total time to transmit data in a small buffer is about 1.3µs. By simultaneously creating a lot of WQEs, data stored in millions of buffers can be transmitted in one second.

RDMA Operation Details

In RDMA transfers, Send/Recv is a two-sided operation: it requires the participation of both communicating parties, and the Recv must be posted before the peer's Send so that the incoming data has somewhere to land (if the peer has no data to send, the Recv can be omitted). The flow is therefore similar to traditional communication; the difference lies in RDMA's zero-copy networking and kernel bypass, which yield low latency, so it is often used to transmit short control messages.

Write/Read is a one-sided operation: as the name suggests, the read or write is executed by one party alone. In practice, Write/Read operations are issued by the client, and the server performs no work. In an RDMA write, the client pushes data from a local buffer directly into a contiguous block of the remote QP's virtual address space (the underlying physical memory need not be contiguous), so it must know the destination address (remote addr) and the access permission (remote key). In an RDMA read, the client pulls data from a contiguous block of the remote QP's virtual address space into a local destination buffer, so it likewise needs the remote memory address and access key. One-sided operations are often used for bulk data transfer.

As this shows, for one-sided operations the client must know the remote addr and remote key of the remote QP; these two pieces of information can be exchanged in advance through Send/Recv operations.
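
A minimal sketch of a one-sided read: pull `len` bytes from the peer's `(remote_addr, rkey)` into a local registered buffer, with no involvement of the remote CPU (the two remote values were exchanged earlier, e.g., via Send/Recv):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
              uint32_t len, uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided pull */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;  /* peer's registered address */
    wr.wr.rdma.rkey        = rkey;         /* peer's access key */
    return ibv_post_send(qp, &wr, &bad);
}
```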

RDMA One-Sided Operations (READ/WRITE)

READ and WRITE are one-sided operations: only the local end needs to know the source and destination addresses of the data. The remote application is unaware of the communication; the reading or writing of data is completed via RDMA between the network card and the application buffer, with the remote network card returning an encapsulated status message to the local end.

Taking storage in a storage network as an example of one-sided operations, the READ process is as follows:

  1. First, A and B establish a connection; QPs have been created and initialized.
  2. The data resides at B's buffer address VB. Note that VB must be registered with B's network card in advance (as a memory region), yielding a remote key, which amounts to the permission to operate on this buffer via RDMA.
  3. B packs the data address VB and the key into a dedicated message and sends it to A, effectively handing A the right to operate on the data buffer. B also posts a WR in its WQ to receive the transfer status that A will return.
  4. After A receives VB and the remote key from B, its network card packages them together with A's own storage address VA into an RDMA READ request and sends the request to B. In this process, B's data is stored to A's virtual address VA without any software involvement on either side.
  5. When A has finished storing the data, it returns the status of the whole transfer to B.

The WRITE process is similar to READ. One-sided transfer is the biggest difference between RDMA and traditional network transports: it only requires direct access to a remote virtual address and no participation of the remote application, which makes it well suited for bulk data transfer.

RDMA Two-Sided Operations (SEND/RECEIVE)

SEND/RECEIVE in RDMA is a two-sided operation: the remote application must be aware of, and participate in, the send and receive. In practice, SEND/RECEIVE is mostly used for connection control messages, while data messages are usually transferred via READ/WRITE.

Taking the two-sided operation as an example, the process of host A sending data to host B (hereafter A and B) is as follows:

  1. First of all, A and B must create and initialize their own QP and CQ.
  2. A and B post WQEs in their own WQs. For A, the WQ is the SQ and the WQE describes the data about to be sent; for B, the WQ is the RQ and the WQE describes a buffer for storing the incoming data.
  3. A's network card asynchronously picks up A's WQE, parses it as a SEND message, and sends the data directly from the buffer to B. When the data stream arrives at B's network card, B's WQE is consumed and the data is stored directly at the location the WQE points to.
  4. When the communication between A and B completes, a completion entry (CQE) indicating that the send finished is generated in A's CQ, and a completion entry indicating that the receive finished is generated in B's CQ. Processing each WQE in a WQ produces one CQE.

Two-sided operation resembles the buffer-pool model underlying traditional networking: sender and receiver participate in the same way, and the difference lies in zero-copy and kernel bypass. For RDMA, this is actually the more complex message-passing mode, and it is typically used for short control messages.

References

  1. https://xie.infoq.cn/article/49103d9cf895fa40a5cd397f8
  2. https://zhuanlan.zhihu.com/p/55142557
  3. https://zhuanlan.zhihu.com/p/55142547