Datacenter RPCs can be General and Fast

Frank Wang
Frankly speaking
Published Mar 17, 2019 · 3 min read

This is part of a week(-ish) blog series where I discuss my random thoughts as a recovering academic (mostly about research and tech). I am currently an investor at Dell Technologies Capital in Silicon Valley. You can follow me on Twitter and LinkedIn.

As an announcement, I have started a newsletter. It will include an abridged version of these blog posts along with security tweets, news, and beyond. Here is the signup page.

This week, I will be discussing one of the papers that won a Best Paper award at this year’s NSDI, the premier academic networking and systems conference. It is joint work between CMU and Intel Labs. Here is the full paper.

Problem

Modern datacenter networks are fast: 100 Gbps bandwidth, 2 µs RTT under a single switch, and 300 ns per switch hop. Existing networking options sacrifice either performance or generality. TCP and gRPC are general but slow; DPDK and RDMA make simplifying assumptions that make them fast but specialized.

Solution

They develop eRPC, which provides both speed and generality. There are three main challenges:

  1. Managing packet loss
  2. Low-overhead transport
  3. Easy integration for existing applications

How do they manage packet loss? The problem is that retransmission timeouts are on the order of milliseconds, which is enormous relative to microsecond-scale RPCs.

Hardware solutions use lossless link layers (e.g., PFC, InfiniBand). Although they provide simple, cheap reliability, they are prone to deadlocks and unfairness. eRPC instead relaxes the requirement to tolerating rare loss, which existing networks already support. In low-latency networks, switch buffers prevent most loss: all modern switches have buffers far larger than the bandwidth-delay product (BDP), and a small BDP plus a sufficient switch buffer makes loss rare.
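A quick back-of-the-envelope calculation makes the "small BDP" argument concrete. The 100 Gbps and 2 µs figures come from the numbers quoted above; the 12 MiB switch buffer is my own assumed figure for a typical datacenter switch, not from the paper.

```python
def bdp_bytes(bandwidth_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product: bytes in flight on a fully utilized pipe."""
    return bandwidth_gbps * 1e9 / 8 * rtt_us * 1e-6

bdp = bdp_bytes(100, 2)  # 100 Gbps link, 2 us RTT under one switch
print(f"BDP = {bdp / 1024:.1f} KiB")  # -> BDP = 24.4 KiB

# Assumed shared buffer of a typical datacenter switch (illustrative).
switch_buffer = 12 * 1024 * 1024
print(f"Full-BDP bursts absorbed ~ {switch_buffer / bdp:.0f}")  # -> 503
```

With a ~25 KB BDP against megabytes of switch buffering, many flows can burst simultaneously before any packet is dropped, which is why treating loss as rare is a reasonable design point.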

How do they create a low-overhead transport layer? The idea is to optimize for the common case, e.g. optimized DMA buffer management for rare packet loss or optimized congestion control for uncongested networks. There are many more examples in the paper.

Example: Optimized DMA buffer management for rare packet loss

In the common case, the server’s response serves as an implicit acknowledgment that the request’s DMA has completed, so the request buffer can be reclaimed immediately; the DMA queue is flushed only during rare loss.
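The common-case trick can be sketched as follows. This is an illustrative model with hypothetical names (`post_send`, `flush_tx_queue`), not eRPC’s actual C++ API:

```python
class Client:
    """Toy model of common-case DMA buffer reclamation (not the eRPC API)."""

    def __init__(self, nic):
        self.nic = nic
        self.inflight = {}  # request id -> pinned DMA buffer

    def send_request(self, req_id, buf):
        self.inflight[req_id] = buf
        self.nic.post_send(buf)  # buffer stays pinned while the NIC may read it

    def on_response(self, req_id):
        # Common case: a response proves the NIC finished reading the request,
        # so the buffer is reclaimed with no per-packet DMA completion events.
        self.inflight.pop(req_id)

    def on_timeout(self, req_id):
        # Rare case: flush the TX DMA queue so the NIC holds no stale
        # reference to the buffer, then retransmit.
        self.nic.flush_tx_queue()
        self.nic.post_send(self.inflight[req_id])
```

The payoff is that the hot path never touches NIC completion queues; the expensive flush is paid only on the rare loss path.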

Another example is efficient congestion control in software. Congestion control has overhead, e.g., from rate limiters. eRPC’s solution is to optimize for uncongested networks, since datacenter networks are usually uncongested; the paper’s measurements show that these common-case optimizations matter.
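A hedged sketch of the bypass idea, with hypothetical structure and thresholds (the paper’s actual optimizations differ in detail): when a session is transmitting at full link rate and measured RTTs stay low, both the rate-update computation and the rate limiter can be skipped entirely.

```python
LINK_RATE = 100e9 / 8  # bytes/sec on a 100 Gbps link
RTT_LOW_US = 5         # assumed "uncongested" RTT threshold (illustrative)

class Session:
    """Toy model of common-case congestion-control bypass."""

    def __init__(self):
        self.rate = LINK_RATE
        self.cc_updates = 0  # counts expensive rate-update computations

    def on_rtt_sample(self, rtt_us):
        # Common case: low RTT at full rate -> skip the CC computation.
        if rtt_us < RTT_LOW_US and self.rate == LINK_RATE:
            return
        self.cc_updates += 1
        self.rate = self._update_rate(rtt_us)

    def _update_rate(self, rtt_us):
        # Placeholder for a real Timely/DCQCN-style update rule.
        return max(LINK_RATE * RTT_LOW_US / rtt_us, LINK_RATE * 0.1)

    def can_transmit_now(self):
        # Common case: at full rate, bypass the rate limiter entirely.
        return True if self.rate == LINK_RATE else self._rate_limited()

    def _rate_limited(self):
        return True  # stub: a real limiter would pace packet departures
```

Because most RTT samples in an uncongested datacenter fall below the threshold, the expensive path runs rarely and the per-packet cost stays near zero.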

The result is low overhead transport with congestion control.

How do they easily integrate with existing applications? As a demonstration, the authors run state-machine replication over eRPC: Raft-over-eRPC is fast and imposes no constraints on the network or on object sizes.
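One reason integration is easy is that a consensus layer like Raft only needs a generic request/response transport. The sketch below uses a hypothetical `rpc.call(peer, method, payload)` interface, not eRPC’s actual C++ API, to show the shape of that dependency:

```python
class RaftNode:
    """Toy leader-side replication over any request/response RPC transport."""

    def __init__(self, node_id, peers, rpc):
        self.node_id = node_id
        self.peers = peers
        self.rpc = rpc  # any transport exposing call(peer, method, payload)
        self.log = []

    def replicate(self, entry):
        """Append an entry and replicate it; commit on a majority of acks."""
        self.log.append(entry)
        acks = 1  # the leader counts itself
        for peer in self.peers:
            reply = self.rpc.call(peer, "append_entries", entry)
            if reply.get("success"):
                acks += 1
        return acks > (len(self.peers) + 1) // 2
```

Swapping TCP-based RPC for a faster transport behind this interface changes nothing in the consensus logic, which is the sense in which eRPC’s generality makes porting existing applications straightforward.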

This is a cool project with a lot of core principles about how to provide general and fast datacenter RPCs. Given fast packet I/O, fast networking can be provided entirely in software. If you want to learn more about eRPC, here is the landing page; more details about how they solve each challenge are in the paper.

If you have questions, comments, future topic suggestions, or just want to say hi, please send me a note at frank.y.wang@dell.com.


Investor at Dell Technologies Capital, MIT Ph.D in computer security and Stanford undergrad, @cybersecfactory founder, former @roughdraftvc