On Kernel-Bypass Networking and Programmable Packet Processing
The network stack architecture is undergoing a revolution because the network is becoming faster than the CPU. Since the breakdown of Dennard scaling in 2006, single-threaded CPU performance has stagnated [Rupp, 2018]. Meanwhile, NICs keep getting faster: 10 GbE NICs are a commodity today, and high-end NICs already ship at 200 GbE and continue to improve. The faster the NIC, the smaller the time budget for processing an individual packet. For example, on a 10 GbE NIC, the time between two 1538-byte packets is 1230 ns [Corbet, 2015]; at 200 GbE, the time between such packets is already as low as 61 ns. These high packet rates pose a significant challenge to both the hardware and the OS network stack architecture. In short, the traditional in-kernel network stack design cannot keep up with these packet rates.
In this article, we first look at the deficiencies of the POSIX sockets API and its in-kernel implementations. We then discuss kernel-bypass networking and programmable packet processing, including offloading to SmartNICs, to understand how the network stack is changing to meet the needs of contemporary hardware and workloads.
POSIX sockets
POSIX sockets are the standard programming interface for networking. Having first appeared in 1983 as BSD sockets, they have been adopted by most commodity operating systems from Linux to Windows. In the POSIX socket model, an application creates a socket, which represents a flow, and uses that socket's file descriptor to send and receive data over the network. For example, a server application that wants to accept TCP connections first creates a socket with the socket() operation, binds it to an interface or address with the bind() operation, and starts listening for incoming connections with the listen() operation. When a new connection arrives, the server accepts it with the accept() operation and then sends and receives data over it with the sendmsg() and recvmsg() operations.
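As a rough sketch, the call sequence above translates into C as follows; error handling is omitted and the port number is illustrative:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        /* Create a TCP socket representing the listening endpoint. */
        int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

        /* Bind it to all interfaces on an illustrative port. */
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Start listening for incoming connections. */
        listen(listen_fd, SOMAXCONN);

        for (;;) {
                /* Accept a new connection; the returned descriptor represents the flow. */
                int conn_fd = accept(listen_fd, NULL, NULL);

                /* Receive a message and echo it back over the same connection. */
                char buf[1024];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
                struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
                ssize_t n = recvmsg(conn_fd, &msg, 0);
                if (n > 0) {
                        iov.iov_len = (size_t)n;
                        sendmsg(conn_fd, &msg, 0);
                }
                close(conn_fd);
        }
}

Note that every accept(), recvmsg(), and sendmsg() call in the loop is a system call, which is precisely the per-message overhead discussed next.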
The overheads of in-kernel network stacks are widely known [Han et al., 2012; Pesterev et al., 2012; Rizzo, 2012; Jeong et al., 2014; Lin et al., 2016; Høiland-Jørgensen et al., 2018]. Most in-kernel network stacks implement POSIX socket operations as system calls. That is, for both control plane operations, such as socket(), and data plane operations, such as sendmsg(), applications transfer control to the kernel via a system call. System calls are a problem for network-intensive applications because they carry significant overheads, such as context switches and CPU cache pollution [Soares and Stumm, 2010], and, more recently, the cost of OS mitigations for the Meltdown attack. The POSIX socket interface is also oblivious to multicore CPUs and multi-queue NICs: an application can access a socket from a different CPU core than the one managing the corresponding NIC packet queue, which forces the OS to move packets between CPU cores.
The socket API also pushes the OS toward a design that demands dynamic memory allocation and locking. When a packet arrives on the NIC, the OS first wraps it in a buffer object, called a socket buffer (skb) in Linux and a memory buffer (mbuf) in FreeBSD. Allocating these buffer objects puts significant stress on the OS dynamic memory allocator. Once allocated, the buffer object is passed down the in-kernel network stack for further processing, and it lives until the application consumes all the data it holds with the recvmsg() system call. Because the buffer object can be forwarded between CPU cores and accessed from multiple threads, locks must be used to protect it against concurrent access.
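To make this lifecycle concrete, here is a deliberately simplified sketch of such a buffer object; the field names and layout are illustrative, not the actual skb or mbuf definitions:

#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative packet buffer, loosely modeled on skb/mbuf. */
struct pkt_buf {
        unsigned char  *data;     /* start of packet data */
        size_t          len;      /* number of valid bytes */
        int             refcount; /* buffer may be shared across queues and threads */
        pthread_mutex_t lock;     /* protects against concurrent access */
};

/* One allocation per received packet: this is the pressure on the
 * dynamic memory allocator described above. */
struct pkt_buf *pkt_buf_alloc(size_t len)
{
        struct pkt_buf *buf = malloc(sizeof(*buf) + len);
        if (!buf)
                return NULL;
        buf->data = (unsigned char *)(buf + 1);
        buf->len = len;
        buf->refcount = 1;
        pthread_mutex_init(&buf->lock, NULL);
        return buf;
}

The buffer is freed only once the last reference is dropped, that is, after the application has consumed the data it holds.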
Kernel-bypass networking
Kernel-bypass networking eliminates the overheads of in-kernel network stacks by moving protocol processing to userspace. Packet I/O is handled by the hardware, the OS, or userspace, depending on the specific kernel-bypass architecture in use. For example, RDMA provides interfaces for directly accessing the memory of a remote machine, bypassing the OS for data plane operations altogether: an application receives messages in an RDMA-managed memory region without any interference from the OS. For Ethernet, the OS can either dedicate the NIC to an application (e.g., DPDK), which then programs the device from userspace, or keep managing the NIC while letting applications map NIC queues into their address space (e.g., netmap). Either way, packets flow from the NIC to userspace with minimal interference from the OS.
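As an illustration of the dedicated-NIC model, a DPDK application polls a NIC queue directly from userspace. The sketch below shows only the receive loop and assumes that port 0 and queue 0 have already been configured and started (EAL initialization, memory pool creation, and device setup are omitted):

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll NIC port 0, queue 0 from userspace; no system calls are made on
 * the data path. */
void rx_loop(void)
{
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
                /* Fetch up to BURST_SIZE packets straight from the NIC queue. */
                uint16_t nb_rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */,
                                                  bufs, BURST_SIZE);

                for (uint16_t i = 0; i < nb_rx; i++) {
                        /* Application-specific packet processing goes here. */
                        rte_pktmbuf_free(bufs[i]);
                }
        }
}

Because the loop polls the hardware queue instead of waiting for interrupts, a CPU core is typically dedicated to it; trading CPU time for latency in this way is characteristic of kernel-bypass designs.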
With the OS limiting itself to managing packet I/O, userspace is responsible for implementing the rest of the network stack. In practice, this means that userspace must at least implement the TCP/IP protocol suite and provide interfaces for applications to access the messages carried by those protocols. Various userspace network stacks exist, but none of them has become a standard. Development and testing of the stacks are therefore fragmented, which limits their usefulness. Also, while it is possible to implement the POSIX sockets API as a library [Hruby et al., 2014], most userspace stacks provide their own interfaces, which limits adoption and compatibility.
Programmable packet processing
Programmable packet processors are emerging as another technique to address the limitations of the in-kernel network stack. They allow user-defined code to run either in the OS kernel or on the hardware. XDP is Linux's programmable packet processor [Høiland-Jørgensen et al., 2018]. It allows a user-defined eBPF program to process a packet before it enters the in-kernel network stack. The eBPF program can process the packet in full, perform some preprocessing and hand the packet to the in-kernel stack, or, with AF_XDP, forward the packet to a userspace memory buffer after processing. Some SmartNICs, such as the Netronome Agilio CX, can also run eBPF programs directly on the hardware. Offloading packet processing from the CPU to the NIC can reduce packet processing latency and improve energy efficiency.
Applications can leverage XDP by offloading parts of their request processing into an eBPF program. At a high level, an eBPF program for XDP could look as follows (ethernet_packet_matches() stands in for application-specific match logic; here it simply passes IPv4 frames):
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>    /* SEC() */
#include <bpf/bpf_endian.h>     /* bpf_htons() */

/* Placeholder predicate: here, pass IPv4 frames only. */
static __always_inline int ethernet_packet_matches(struct ethhdr *eth)
{
        return eth->h_proto == bpf_htons(ETH_P_IP);
}

SEC("xdp")
int xdp_program(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;
        int action = XDP_DROP;

        /* The verifier rejects programs that read past data_end. */
        if ((void *)(eth + 1) > data_end)
                return XDP_DROP;

        if (ethernet_packet_matches(eth))
                action = XDP_PASS;

        return action;
}
In the above example, xdp_program() is the entry point for the eBPF program. The XDP subsystem passes it a context (struct xdp_md) that carries the start and end pointers of the packet to process, and the value returned by xdp_program() determines the action to take on the packet. The program drops packets by default with XDP_DROP. The explicit bounds check against data_end is required by the eBPF verifier before the program may read the Ethernet header. If the ethernet_packet_matches() function returns true, the program returns the XDP_PASS action instead, which allows the packet to enter the in-kernel network stack. For a detailed tutorial on XDP, please refer to Andy Gospodarek's and Jesper Dangaard Brouer's talk XDP for the Rest of Us.
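A userspace loader is needed to attach the program to a network interface. The following is a minimal sketch using libbpf; it assumes a 1.x libbpf, that the program above was compiled into an object file named xdp_prog.o, and an illustrative interface name of eth0:

#include <bpf/libbpf.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
        /* Open and load the compiled eBPF object file (illustrative name). */
        struct bpf_object *obj = bpf_object__open_file("xdp_prog.o", NULL);
        if (!obj || bpf_object__load(obj)) {
                fprintf(stderr, "failed to open or load xdp_prog.o\n");
                return 1;
        }

        /* Look up the entry point by its function name. */
        struct bpf_program *prog =
                bpf_object__find_program_by_name(obj, "xdp_program");
        int prog_fd = bpf_program__fd(prog);

        /* Attach the program to the (illustrative) interface eth0. */
        int ifindex = if_nametoindex("eth0");
        if (bpf_xdp_attach(ifindex, prog_fd, 0, NULL)) {
                fprintf(stderr, "failed to attach XDP program\n");
                return 1;
        }
        return 0;
}

From that point on, every packet arriving on eth0 is handed to xdp_program() before the in-kernel network stack sees it.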
Summary
The POSIX sockets API and the traditional in-kernel network stacks are ill-suited to today's hardware. Kernel-bypass networking is a promising technique to address these issues, but kernel-bypass stacks still lack the standard interfaces and protocol implementations that would let applications take advantage of them. Programmable packet processing is an exciting emerging technique for applications that can offload their request processing, in part or in full, to either an in-kernel virtual machine or a SmartNIC.
References
Jonathan Corbet. (2015). Improving Linux networking performance. https://lwn.net/Articles/629155/
Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. (2012). MegaPipe: a new programming interface for scalable network I/O. In Proceedings of OSDI '12.
Tomas Hruby, Teodor Crivat, Herbert Bos, and Andrew S. Tanenbaum. (2014). On sockets and system calls: minimizing context switches for the socket API. In Proceedings of TRIOS '14.
Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, Tom Herbert, David Ahern, and David Miller. (2018). The eXpress data path: fast programmable packet processing in the operating system kernel. In Proceedings of CoNEXT '18.
Eun Young Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. (2014). mTCP: a highly scalable user-level TCP stack for multicore systems. In Proceedings of NSDI '14.
Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. (2016). Scalable Kernel TCP Design and Implementation for Short-Lived Connections. In Proceedings of ASPLOS '16.
Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. (2012). Improving network connection locality on multicore systems. In Proceedings of EuroSys ‘12.
Luigi Rizzo. (2012). netmap: a novel framework for fast packet I/O. In Proceedings of USENIX ATC '12.
Karl Rupp. (2018). 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/
Livio Soares and Michael Stumm. (2010). FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Proceedings of OSDI ‘10.