HyperLoop: NIC-Offloading for Faster Transactions in Storage Systems [SIGCOMM 2018]

An interesting and novel application for NIC-offloading

Our goal is to provide an easy, general, high-level overview of this paper’s contributions and our take on its implications. Please refer to the actual paper for more details if this post interests you, and feel free to contact us about errors, changes and suggestions!

HyperLoop is one of the papers that I am personally quite interested in, since it falls in the research field where I spend a lot of my time. The paper proposes a nice use of NICs: offloading a subset of functionality from widely used existing applications to boost their performance. Let’s first cover some background knowledge required for this paper and then talk about its contributions.

Figure 1. Mellanox Bluefield Smart NIC.

Background

A Network Interface Controller (NIC) (Figure 1) is a piece of computer hardware that connects a computer to a computer network. Over time, as with other computer hardware such as the Graphics Processing Unit (GPU), the features and capabilities of NICs have increased significantly. Some examples of cool features that go into modern NICs are as follows.

  • Smart NIC: A NIC that comes with its own small computer (CPU, RAM and storage). Smart NICs are capable of running complex programs that traditionally ran only on actual servers.
  • Single Root (Multi Root) I/O Virtualization (SR-IOV): Provides virtual interfaces from a single NIC, so the computer sees multiple network interfaces backed by a single physical interface.
  • Checksum Offloading: Checks each packet for errors by computing its checksum within the NIC instead of on the host CPU.
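To make checksum offloading a bit more concrete, here is a minimal sketch of the kind of per-packet work a NIC can take over from the host CPU. It assumes the 16-bit one's-complement Internet checksum described in RFC 1071 (used by IP, TCP and UDP); the function names are ours and purely illustrative, and a real NIC does this in hardware rather than in Python.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def is_intact(data: bytes, expected_checksum: int) -> bool:
    """Receiver-side check: recompute and compare with the value the sender sent."""
    return internet_checksum(data) == expected_checksum


packet = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
checksum = internet_checksum(packet)
print(hex(checksum))  # prints 0x220d

corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]  # flip one bit in transit
print(is_intact(packet, checksum), is_intact(corrupted, checksum))  # prints True False
```

Without offloading, the host CPU runs a loop like this for every packet; with offloading, the NIC computes and verifies the value and the CPU only sees the result.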

Given that NICs are specialized for fast packet processing, there have been many attempts to use their features for other tasks, such as offloading security protocols or firewalls. This paper makes a novel attempt to offload transactions in the context of distributed storage systems in a multi-tenant setting. Let’s first go over what this statement means.

Figure 2. Single-tenant vs Multi-Tenant obtained from (https://hackernoon.com/exploring-single-tenant-architectures-57c64e99eece)

A distributed storage system is a system that stores data across multiple machines. Some examples are object stores like Amazon S3, key-value stores like BigTable, and databases like Amazon RDS. These storage systems are assumed to run in a multi-tenant setting (Figure 2), which means, in simple terms, that multiple pieces of software run on a single machine. This implies that the storage systems must share precious resources, such as CPU and RAM, with other software running on the same machine.

A transaction in a database system is a unit of work that must obey properties such as atomicity, consistency, isolation and durability (ACID). For example, a unit of work can include reading an integer data entry, adding one to it and then writing it back to storage. This operation requires that the entry is not stale, that is, that no one else has modified it between the read and the write. In a distributed setting, these transactions must be copied and communicated to all instances of the software running on multiple machines. This is quite a complex task, often requiring complicated algorithms to implement correctly, a lot of resources to run those algorithms, and careful thought about their side effects in a multi-tenant setting. Since faster transactions mean a faster storage system, making transactions correct and as fast as possible is a wide area of research.
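The read, add one, write back example can be sketched in a few lines. This is a toy, single-machine illustration of the staleness requirement, not how any particular database implements transactions: the `VersionedStore` class and its version counter are our own assumptions, used only to show a write being rejected when the entry changed after it was read.

```python
import threading


class VersionedStore:
    """Toy in-memory store; each key maps to a (value, version) pair."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, 0))

    def write_if_unchanged(self, key, new_value, expected_version):
        """Commit only if nobody has written the key since we read it."""
        with self._lock:
            _, version = self._data.get(key, (0, 0))
            if version != expected_version:
                return False  # our earlier read is stale; caller must retry
            self._data[key] = (new_value, version + 1)
            return True


def increment(store, key):
    """Read an integer, add one, write it back; retry whenever the read went stale."""
    while True:
        value, version = store.read(key)
        if store.write_if_unchanged(key, value + 1, version):
            return value + 1


store = VersionedStore()
print(increment(store, "counter"))  # prints 1
```

Even this tiny version needs a retry loop on a single machine. Doing the same thing across several replicas, each with its own copy of the entry, is where the complexity and the CPU cost described above come from.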

Which Problems is this Paper Trying to Solve?

The authors of HyperLoop argue that using only CPUs to perform replicated transactions for distributed storage systems in multi-tenant settings incurs high cost and is often unreliable, resulting in high latencies (time to complete a transaction). The culprit is that CPUs are shared across many applications in multi-tenant settings. Thus, running replicated transactions on CPUs results in unpredictable performance, since the transactions often, and at random, have to wait for other applications to free the CPUs. There are existing solutions that offload some of the operations required in transactions, such as offloading the networking portion (TCP offloading) to the NIC. However, the most critical and heaviest operations still run on CPUs.

How is this Paper Solving these Problems?

The authors propose completely eliminating the use of CPUs in performing transactions. Instead, the tasks that used to run on the CPUs now run on Remote Direct Memory Access (RDMA) NICs, together with Non-Volatile Memory (NVM). RDMA is needed to directly manipulate the host’s memory from the NIC, and NVM is required to store data persistently.

HyperLoop contributes two things. The details of each contribution are specified in the paper.

  1. A method to run generic, pre-defined group-based RDMA operations on the NIC without using the CPU. (Group is a term that defines the collection of nodes participating in replicated transactions.)
  2. Four group-based RDMA primitives to perform group-based replication: Group Write (gWRITE), Group Compare And Swap (gCAS), Group Memory Copy (gMEMCPY), and Group Memory Flush (gFLUSH).
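To give a feel for what the four primitives do, here is a toy, in-process model. Everything in it is an assumption made for illustration: the `Replica` and `Group` classes, the method names and their signatures are ours, not the paper's API, and plain Python loops stand in for what the real system does on RDMA NICs without involving the replicas' CPUs. Please see the paper for the actual semantics.

```python
class Replica:
    """One group member: `memory` models volatile state, `durable` models NVM."""

    def __init__(self, size: int):
        self.memory = bytearray(size)
        self.durable = bytearray(size)


class Group:
    """Hypothetical stand-in for a replication group."""

    WORD = 8  # bytes compared and swapped by g_cas in this toy model

    def __init__(self, n_replicas: int, size: int):
        self.replicas = [Replica(size) for _ in range(n_replicas)]

    def g_write(self, offset: int, data: bytes) -> None:
        """Place the same bytes at the same offset on every replica."""
        for r in self.replicas:
            r.memory[offset:offset + len(data)] = data

    def g_cas(self, offset: int, expected: int, new: int) -> list:
        """Per replica: swap in `new` only if the word at `offset` equals `expected`."""
        results = []
        for r in self.replicas:
            current = int.from_bytes(r.memory[offset:offset + self.WORD], "little")
            swapped = current == expected
            if swapped:
                r.memory[offset:offset + self.WORD] = new.to_bytes(self.WORD, "little")
            results.append(swapped)
        return results

    def g_memcpy(self, src: int, dst: int, size: int) -> None:
        """Copy a region within each replica, e.g. from a log area to a data area."""
        for r in self.replicas:
            r.memory[dst:dst + size] = r.memory[src:src + size]

    def g_flush(self) -> None:
        """Make everything written so far durable on every replica."""
        for r in self.replicas:
            r.durable[:] = r.memory


# One replicated update, using made-up offsets for a lock word, a log and a data area.
LOCK, LOG, DATA = 0, 8, 64
group = Group(n_replicas=3, size=128)
record = b"balance=42"

group.g_write(LOG, record)               # replicate the log record
group.g_flush()                          # the log record is now durable everywhere
locked = group.g_cas(LOCK, 0, 1)         # take the lock on every replica
group.g_memcpy(LOG, DATA, len(record))   # apply the logged update to the data area
group.g_flush()                          # the applied update is now durable
group.g_cas(LOCK, 1, 0)                  # release the lock
```

The sequence at the bottom (log, flush, lock, apply, flush, unlock) is one plausible way to combine the primitives into a replicated transaction. It is meant to show why these four operations are enough to be useful, not to reproduce the paper's protocol.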

What are the Results?

Quite a number of significant results can be obtained via HyperLoop with little effort. First, the authors mention that they modified RocksDB (an open source key-value store forked from Google’s LevelDB) and MongoDB (an open source alternative to Azure CosmosDB and Amazon DynamoDB) to use HyperLoop with under 1000 lines of code. Second, HyperLoop’s group-based RDMA primitives have 800x lower latency, with smaller variance, than traditional RDMA operations. Third, running MongoDB with HyperLoop decreases the average latency of insert/update operations by 79%, while CPU usage on backup nodes goes down from nearly 100% to almost 0%.

My Take on this Paper

The major challenge in running any large scale application is to minimize randomness in the application’s performance while keeping the cost of running it minimal. HyperLoop does both at the same time: it reduces CPU usage on the host machine and also makes your replicated transactions more predictable. That’s quite a significant contribution. Another strength of HyperLoop is that it can simply be applied to existing infrastructure that contains RDMA NICs with NVM. Although I have no data points on the proportion of existing data centers that contain servers with RDMA NICs and NVM, I imagine more data centers have RDMA NICs than Smart NICs. I am quite excited to see whether HyperLoop can still be applied on Smart NICs, as Smart NICs have different architectures. I am also interested in seeing other applications that can use similar group-based primitives.


We are looking for passionate writers from all fields of CS!

I believe that one of the most important habits a grad student must have is to read papers on a regular basis. I thought that being a paper reviewer with a light amount of responsibility would give me an incentive to read more papers. So, I personally started this blog to record what I was reading, and I found it really, really helpful. I hope that more people can join me and have a great learning experience in the process! Feel free to email me at yo2seol@cs.stanford.edu if interested!