RDMA: An Exploration of First Principles & Bootstrapping


The following article hopes to (i) explain the first principles behind RDMA, a technology producing an order-of-magnitude performance increase in scale-limited systems; (ii) discuss the shift in the governance of distributed machines from isolated messengers to open-addressable functors; and (iii) demonstrate how to quickly bootstrap a cool technology.

Preview of Performance Enhancements in a High-Performance Application

Preface: A Review of Trends

The last decade saw an acceleration in the volume and diversity of I/O devices at the edge of the network, and of the software applications living therein. The symbiotic relationship between the two magnified this trend. As new application layers required novel forms of computing, new hardware was created to adapt to their demands. Likewise, as new hardware evolved, new markets and experiences increased total time spent in the digital world. A tale of two funnels.

On the consumer side, the PC was eclipsed by mobile, wearables, and other interfaces now housing ambient computing resources. This opened markets inside homes, on top of watches, and into virtual reality. On the enterprise side, beyond cost improvements, industrial economies adopted specialized IoT devices, CPU chip-makers added hundreds more cores to machines, and GPUs extended their importance in the application layer. Across these two domains, the rise of the sharing economy, autonomous vehicles, gaming, and social platforms increased society’s reliance on the digital world, and in turn, on the instruments powering it.

The Cambrian explosion in data movement created market needs for networks capable of efficiently moving 100 Gb/s, then 400 Gb/s, and now beyond. If the next decade grows at the same rate as the last, these trends will only intensify as neural nets scale toward trillions of parameters and apps shift digital worlds from two-dimensional planes into three-dimensional, live-streamed experiences.

Scene from “Ready Player One” — a form of the metaverse

Many core applications, from Paxos to DHTs, face scalability challenges due to long-standing principles for how homogeneous, distributed state machines should communicate — through the orifices & access control of a CPU. With the centralization of compute and networks into data centers, batches of trusted hardware form an opportunity for engineers to redefine these governing principles: displace the processing unit altogether with remote-direct-memory-access.

Introduction to RDMA

Traditional networking protocols, specifically TCP/IP, limit the number of active, concurrent connections due to bottlenecks in (i) OS kernels and (ii) processing the protocol itself. They implement a two-sided model: a sender copies and transfers data over the network to some receiver, who acts upon that information. An analysis of the TCP stack, shown below, confirms these bottlenecks by revealing where time is spent in handling 64B and 16KB messages (roughly 80% of the overhead comes from (i) and (ii)).

Throughput experiment courtesy of Intel / Ohio State University, https://mvapich.cse.ohio-state.edu/static/media/publications/slide/balaji_10gige.pdf
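For contrast with what follows, here is the two-sided socket model in miniature: every call below is a system call, meaning a kernel crossing, and each message is copied between kernel and user space. This is a minimal sketch with error handling omitted, not production code.

// Minimal two-sided TCP receiver: the kernel mediates every step
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);    // system call
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8080);
    bind(listener, (sockaddr *) &addr, sizeof(addr));  // system call
    listen(listener, 16);                              // system call

    int conn = accept(listener, nullptr, nullptr);     // blocks in the kernel
    char buffer[128];
    ssize_t n = recv(conn, buffer, sizeof(buffer), 0); // copy: kernel -> user space
    send(conn, buffer, n, 0);                          // copy: user space -> kernel
    close(conn);
    close(listener);
    return 0;
}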

Remote Direct Memory Access (RDMA) bypasses the kernel and exchanges data without involving operating systems; that is, it treats remote storage as if it were local, enabling one to read and write memory in a remote machine without involving its processor. This removes the overheads that come with traditional networking protocols, as pictured in the comparison below, and helps scale concurrent connections.

In a traditional system, threads within a process on a single machine are able to share memory; however, when other processes attempt to interact with that memory, they are denied access: a concept known as isolation. Ordinarily, the kernel mediates each system call that returns a slice of memory, denying or granting access to that region of storage. RDMA enables a remote, trusted machine to access the contents without any function call, filter, or processing unit in between. One mobile device could access the entire storage of another iPhone without asking permission on each access. RDMA therefore deletes prior assumptions about isolation across a topology of machines.
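To see that isolation boundary in miniature: threads share an address space for free, while a second process can only reach the same memory if the kernel explicitly grants it a mapping. Below is a hedged sketch using POSIX shared memory; the region name "/demo_region" is an arbitrary placeholder.

// Threads share memory implicitly; processes must ask the kernel for a mapping
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <thread>
#include <cstdio>

int main() {
    int counter = 0;
    std::thread t([&] { counter = 42; });  // same address space: no syscall needed
    t.join();

    // Another process can only see a region via kernel-mediated system calls:
    int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);  // kernel grants access
    ftruncate(fd, 4096);
    void *region = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    // ... a cooperating process would shm_open("/demo_region") and mmap it too ...
    munmap(region, 4096);
    shm_unlink("/demo_region");
    printf("counter=%d\n", counter);
    return 0;
}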

Why allow agents in a malicious world to connect and read/write your machine’s memory directly? The risks are significant. But what if these are trusted batches of hardware in a closed, trusted environment? The need for strong isolation and other expensive principles can be relaxed, opening a new set of opportunities and challenges within network communication.

Accordingly, data centers have popularized the development and deployment of RDMA. Although previous software-defined technologies, such as DPDK, first introduced kernel bypassing, RDMA is currently the fastest technique, as it deploys a dedicated network stack in hardware. At a high level, the lifecycle of a connection is simple (a condensed code sketch follows the list):

  1. Two machines establish a connection and exchange information akin to a handshake. In more detail, the remote OS pins a memory region for RDMA usage, applying protection and other setup but removing itself from the fast data path. The local OS receives a virtual memory address from the remote machine, storing the location and types of operations permitted.
  2. The local machine can start invoking commands; these include both one-sided operations (read, write, and atomics) as well as classical two-sided send/receive commands. The two-sided send operation transfers data to a queue of pre-allocated buffers in the remote machine; the two-sided receive operation polls this queue for work. According to previous work, although two-sided operations are roughly 2x slower than their one-sided counterparts, they can replace socket-oriented systems without shifting higher-level application logic; a plug-and-play accelerator, unlike one-sided operations.
  3. After finishing, scarce memory resources are reclaimed in the RDMA-specific hardware.
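To make this concrete before the full walkthrough, here is a condensed sketch of that lifecycle from the connecting machine’s perspective, using the Infinity library covered in the Get Started section below. SERVER_IP and PORT are placeholder constants.

// (1) Establish a connection; the handshake hands back a token
//     describing the pinned remote region and its permitted operations
infinity::core::Context *context = new infinity::core::Context();
infinity::queues::QueuePairFactory *factory = new infinity::queues::QueuePairFactory(context);
infinity::queues::QueuePair *qp = factory->connectToRemoteHost(SERVER_IP, PORT);
infinity::memory::RegionToken *remoteToken = (infinity::memory::RegionToken *) qp->getUserData();

// (2) Invoke commands; this one-sided read never interrupts the remote CPU
infinity::memory::Buffer *local = new infinity::memory::Buffer(context, 128 * sizeof(char));
infinity::requests::RequestToken requestToken(context);
qp->read(local, remoteToken, &requestToken);
requestToken.waitUntilCompleted();

// (3) Reclaim scarce RDMA hardware resources when finished
delete local;
delete qp;
delete factory;
delete context;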

The advantages of RDMA are especially pronounced in state-machine replication (SMR), one of the most ubiquitous and important algorithms in modern computing. Here, a common distributed consensus protocol is run on a set of servers in order to replicate data both within a single region and across the globe. SMR algorithms are significant for two reasons: latency (the response time observed by a client, measured in milliseconds) and fault tolerance (the ability to continue serving requests in the presence of errors, measured as availability). Data only moves as fast as the speed of light, so bringing data closer to users around the world reduces the length of travel (i.e., a global cache). Additionally, if a set of servers in Europe shuts down, Google and other mission-critical applications need to continue serving requests by re-routing clients to another location; that is, business continuity is critical.

Traditionally, SMR leveraged TCP/IP and other socket-oriented networking approaches. Since the consensus protocols (often Paxos/Raft) require leaders to communicate with all replicas, the bottlenecks of TCP/IP in this highly concurrent environment are magnified. In the experiment below, RDMA incurred only a 2µs cost for each outbound message, producing exceptional performance advantages over state-of-the-art implementations.

APUS: Fast & Scalable Paxos on RDMA https://i.cs.hku.hk/~heming/papers/socc17-apus.pdf
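As a back-of-envelope illustration of why that 2µs number matters for a leader fanning out to its replicas: the RDMA figure comes from the APUS experiment above, while the TCP per-message cost below is an assumed, illustrative value, not a measurement from the paper.

// Rough fan-out cost of one replication round, leader -> followers
#include <cstdio>

int main() {
    const int followers = 4;       // replicas a Paxos leader must contact
    const double rdmaUs = 2.0;     // ~2µs per outbound message (APUS, above)
    const double tcpUs = 30.0;     // assumed illustrative TCP per-message cost
    printf("RDMA round: %.0f us\n", followers * rdmaUs);  // 8 us
    printf("TCP round:  %.0f us\n", followers * tcpUs);   // 120 us
    return 0;
}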

In his book The Innovator’s Dilemma, Clayton Christensen constructs an important analysis of disruptive technologies: they often start out inferior to prior, more mature infrastructures in performance and/or unit costs. The evolution of RDMA embodied this behavior. And although RDMA continues to permeate many applications, it still has much room for improvement: more abstractions, higher performance per unit of cost, and a stronger ecosystem of tools.

A Note on RDMA Development:

Mellanox was a pioneer in developing RDMA, originally producing the most sophisticated hardware and releasing an extensive implementation spec. As NVIDIA moved into cloud computing, gained market power, and expanded free cash flow, it acquired Mellanox in 2019 in a $6.9 billion, all-cash transaction. Although Mellanox’s revenue is marginal in comparison to the rest of NVIDIA’s business, the deal brought important patents, talented engineers, and core technologies central to NVIDIA’s mission: powering the digital economy with cutting-edge architectures and software libraries. Although the company’s historic competitive advantage has been confined to AI and graphics, GPUs continue to bleed into other application and data layers. This trend follows NVIDIA’s focus on expanding research efforts into the entire data center stack. As CEO Jensen Huang noted when commenting on the Mellanox acquisition, “the data center has become the most important computer in the world, the thing that’s really exciting is that the computer no longer starts and ends at the server. The computer in the future would extend into the network. … And what we want to do is, we want to extend our computing reach from the server, where we are today, out to the entire data center.” Nonetheless, RDMA and the rest of NVIDIA’s product lines represent the need for specialization in an increasingly complex world.

Get Started

Hopefully, this Get Started guide will promote experimentation beyond the small community of software engineers operating cloud services. Distributed systems create significant arbitrage, and accordingly, they are often difficult to approach. They present an opportunity for engineers to act as “shapers,” helicoptering up and down the software stack by designing a system at a high level while optimizing for low-level first principles. RDMA is especially difficult to pick up given its complexity and the lack of online tutorials. Thankfully, Claude Barthels’ Infinity library provides a great starting point by abstracting away details: he packages thousands of lines of code behind an interface that requires only a few commands to accomplish RDMA operations. This helps newcomers focus on simpler abstractions.

  1. Find a cloud provider that houses Linux servers with InfiniBand in their hardware stack (for example, Azure). For academics with access to CloudLab, rent a cluster of “r320” or “c6220” machines.
  2. Install the OFED drivers (if not already present)
apt-get install libmlx4-1 infiniband-diags ibutils ibverbs-utils rdmacm-utils

Note: ibverbs forms the necessary software layer for interacting with RDMA-specific hardware. IBM explains the library in detail here: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/rdma/rdma_pdf.pdf
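Before going further, it is worth sanity-checking the installation; both utilities below ship with the packages installed above and should list your RDMA-capable adapters and their port states.

ibv_devinfo   # lists RDMA devices and port states (from ibverbs-utils)
ibstat        # reports InfiniBand adapter status (from infiniband-diags)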

3. Clone and build the Infinity library

git clone https://github.com/claudebarthels/infinity; cd infinity; make library

4. Start developing

The following pseudocode adapts Barthels’ example for a system with one server and K clients.

Server

// Create new context
infinity::core::Context *context = new infinity::core::Context();

// Create a queue pair factory
infinity::queues::QueuePairFactory *qpFactory = new infinity::queues::QueuePairFactory(context);

printf("Creating buffers for client to read from and write to\n");
infinity::memory::Buffer *bufferToReadWrite = new infinity::memory::Buffer(context, RDMA_MEMORY_REGION);
infinity::memory::RegionToken *bufferToken = bufferToReadWrite->createRegionToken();

printf("Creating buffers to receive two-sided messages\n");
for (int i = 0; i < BUFFER_COUNT; i++) {
  infinity::memory::Buffer *bufferToReceive = new infinity::memory::Buffer(context, 128 * sizeof(char));
  // Must "register" each buffer by posting it onto the receive queue!
  context->postReceiveBuffer(bufferToReceive);
}

// Accept one connection per client (blocking)
qpFactory->bindToPort(PORT_NUMBER);
for (int i = 0; i < NUM_CLIENTS; i++) {
  printf("Setting up connection %d (blocking)\n", i);
  infinity::queues::QueuePair *qp = qpFactory->acceptIncomingConnection(bufferToken, sizeof(infinity::memory::RegionToken));
}

// If we want to service two-sided communications, offload polling to threads
std::vector<std::thread> threads;
for (int i = 0; i < NUM_THREADS; i++) {
  threads.push_back(std::thread(process_queue, context));
}

void process_queue(infinity::core::Context *context) {
  infinity::core::receive_element_t receiveElement;
  while (!context->receive(&receiveElement));
  processData(receiveElement.buffer);
  // Reuse the buffer by posting it back onto the queue
  context->postReceiveBuffer(receiveElement.buffer);
}
  1. The program initializes a “queue pair factory”, which creates and records “queue pairs” into the context object. These queue pairs represent a connection between two machines.
  2. We then register two types of buffers: (1) a memory region (i.e., storage layer) that can be accessed by clients through one-sided operations (read/write), denoted by the “region token” (2) a discrete series of buffers that represent unused memory regions, which will be accessed in a queue by clients for two-sided operations. Later on, the clients will push messages into these buffers, and the server can poll the queue in order to process/recycle them.
  3. Accept incoming connections for however many clients the system desires
  4. If we want to handle two-sided operations, the process can offload this work to any number of threads. The queue is already synchronized, and accordingly, we do not need to worry about concurrency issues when polling for work!

Client

// Create new context and queue pair factory
infinity::core::Context *context = new infinity::core::Context();
infinity::queues::QueuePairFactory *qpFactory = new infinity::queues::QueuePairFactory(context);

// Connect to a remote host; if the connection is invalid, the program aborts
infinity::queues::QueuePair *qp = qpFactory->connectToRemoteHost(SERVER_IP, PORT_NUMBER);

// Create and register a local buffer
infinity::memory::Buffer *localBuffer = new infinity::memory::Buffer(context, BUFFER_SIZE);

// Store the remote token, which maps to valid memory regions
infinity::memory::RegionToken *remoteBufferToken = (infinity::memory::RegionToken *) qp->getUserData();

// Read (one-sided) from a remote buffer and wait for completion
infinity::requests::RequestToken requestToken(context);
qp->read(
  localBuffer,
  0,    // local offset of 0 bytes
  remoteBufferToken,
  1024, // remote offset of 1024 bytes
  128,  // read a length of 128 bytes into the buffer
  infinity::queues::OperationFlags(),
  &requestToken
);
requestToken.waitUntilCompleted();

// Write (one-sided) content of a local buffer to a remote buffer and wait for completion
qp->write(localBuffer, remoteBufferToken, &requestToken);
requestToken.waitUntilCompleted();

// Atomic compare-and-swap operation
infinity::memory::Atomic *atomicOperation = new infinity::memory::Atomic();
qp->compareAndSwap(
  remoteBufferToken,
  atomicOperation,
  0,    // compare value
  1,    // swap value, only written if remote value == compare value
  infinity::queues::OperationFlags(),
  64,   // offset into the remote region
  &requestToken
);
requestToken.waitUntilCompleted();
if (atomicOperation->getValue() == 1) {
  // Remote value was already 1, so the compare failed and no swap occurred
}

// Send (two-sided) content of a local buffer over the queue pair and wait for completion
qp->send(localBuffer, &requestToken);
requestToken.waitUntilCompleted();
  1. Akin to the server, the program instantiates a queue pair factory / global context
  2. Connects to the remote server
  3. Creates a local buffer for storing remote reads and propagating writes
  4. Invokes various types of requests, including one-sided reads/writes, two-sided sends, and atomic operations (compare-and-swap) if and only if the remote memory region allows it

After modifying the provided makefile for the client/server programs above, deploy them onto two servers and watch the magic unfold. Advanced examples that focus on concurrency at the client level can be found on my GitHub at: __deleted__
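For reference, since the makefile needs modifying anyway, a hand-rolled compile line might look like the following. The include and library paths are assumptions about where “make library” places its output in the Infinity checkout, so adjust them to your build; the link against libibverbs is required by the library.

g++ -std=c++11 -O2 server.cpp -o server -Irelease/include -Lrelease -linfinity -libverbs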
