On Kernel-Bypass Networking and Programmable Packet Processing


The network stack architecture is undergoing a revolution because the network is becoming faster than the CPU. Since the breakdown of Dennard scaling around 2006, single-threaded CPU performance has stagnated [Rupp, 2018]. Meanwhile, NICs keep getting faster: 10 GbE NICs are a commodity today, and high-end NICs have already reached 200 GbE and continue to improve. The faster the NIC, the smaller the time budget for processing an individual packet. For example, on a 10 GbE NIC, the time between two 1538-byte packets is 1230 ns [Corbet, 2015]; at 200 GbE, the time between such packets is as low as 61 ns. These packet rates pose a significant challenge to both the hardware and the OS network stack architecture. In short, the traditional in-kernel network stack design is inadequate to keep up with such high packet rates.

In this article, we first look at the deficiencies of the POSIX sockets API and its in-kernel implementations. We then discuss kernel-bypass networking and programmable packet processing, including offloading to SmartNICs, to understand how the network stack is changing to meet the needs of contemporary hardware and workloads.

POSIX sockets

The overheads of in-kernel network stacks are widely known [Han et al., 2012; Pesterev et al., 2012; Rizzo, 2012; Jeong et al., 2014; Lin et al., 2016; Høiland-Jørgensen et al., 2018]. Most in-kernel network stacks implement POSIX socket operations as system calls. That is, for both control plane operations, such as socket(), and data plane operations, such as sendmsg(), applications transfer control to the kernel using a system call. System calls are a problem for network-intensive applications because of their significant overheads, such as context switches and CPU cache pollution [Soares and Stumm, 2010], which OS mitigations for the Meltdown attack have recently made worse. The POSIX socket interface is also oblivious to multicore CPUs and multi-queue NICs: an application can access a socket from a CPU core other than the one managing the corresponding packet queue, which forces the OS to move packets between cores.

The socket API also pushes the OS toward a design that demands dynamic memory allocation and locking. When a packet arrives on the NIC, the OS first wraps it in a buffer object, called a socket buffer (skb) in Linux and a network memory buffer (mbuf) in FreeBSD. Allocating a buffer object for every packet puts significant stress on the OS's dynamic memory allocator. Once allocated, the OS passes the buffer object down the in-kernel network stack for further processing. The buffer object lives until the application consumes all the data it holds with the recvmsg() system call, and because it can be forwarded between CPU cores and accessed from multiple threads, locks must protect it against concurrent access.

Kernel-bypass networking

In kernel-bypass networking, the OS limits itself to managing packet I/O, and userspace is responsible for implementing the rest of the network stack. In practice, this means that userspace must at least implement the TCP/IP protocol suite and provide interfaces for applications to access the messages carried by the protocols. Various userspace network stacks exist, but none of them has become a standard. Development and testing of the stacks are therefore fragmented, which limits their usefulness. Also, while it is possible to implement the POSIX sockets API as a library [Hruby et al., 2014], most userspace stacks provide their own interfaces, which limits adoption and compatibility.

Programmable packet processing

XDP (eXpress Data Path) takes a different approach: rather than bypassing the kernel, it makes the beginning of the in-kernel receive path programmable by running an eBPF program in the NIC driver, before the kernel allocates a socket buffer [Høiland-Jørgensen et al., 2018]. Applications can leverage XDP by offloading parts of their packet processing to such an eBPF program. At a high level, an eBPF program for XDP could look as follows:

#include <linux/bpf.h>
#include <linux/if_ether.h>

int xdp_program(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    int action = XDP_DROP;
    struct ethhdr *eth = data;

    /* The eBPF verifier rejects the program unless every packet
     * access is bounds-checked against data_end. */
    if (data + sizeof(*eth) > data_end)
        return action;

    if (ethernet_packet_matches(eth))
        action = XDP_PASS;

    return action;
}
In the above example, xdp_program() is the entry point for the eBPF program. The XDP subsystem passes the start and end pointers of a packet to process. The value returned by xdp_program() determines the action to take on the packet. In our example, the program drops packets by default with XDP_DROP. If the ethernet_packet_matches() function returns true, the program returns the XDP_PASS action instead, which allows the packet to enter the in-kernel network stack. For a detailed tutorial on XDP, please refer to Andy Gospodarek's and Jesper Dangaard Brouer's talk XDP for the Rest of Us.
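For completeness, a program like the above is typically compiled with clang's BPF target and attached with iproute2. The file and interface names below are placeholders, and attaching requires root privileges:

```shell
# Compile the restricted C into an eBPF object.
clang -O2 -target bpf -c xdp_program.c -o xdp_program.o

# Attach to an interface. "xdpgeneric" selects the driver-independent
# hook; plain "xdp" lets the kernel pick the fastest available mode.
ip link set dev eth0 xdpgeneric obj xdp_program.o sec .text

# Detach again.
ip link set dev eth0 xdpgeneric off
```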

References

Jonathan Corbet. (2015). Improving Linux networking performance. LWN.net. https://lwn.net/Articles/629155/

Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. (2012). MegaPipe: a new programming interface for scalable network I/O. In Proceedings of OSDI ’12.

Tomas Hruby, Teodor Crivat, Herbert Bos, and Andrew S. Tanenbaum. (2014). On sockets and system calls: minimizing context switches for the socket API. In Proceedings of TRIOS ’14.

Toke Høiland-Jørgensen, Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, Tom Herbert, David Ahern, and David Miller. (2018). The eXpress Data Path: fast programmable packet processing in the operating system kernel. In Proceedings of CoNEXT ’18.

Eun Young Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. (2014). mTCP: a highly scalable user-level TCP stack for multicore systems. In Proceedings of NSDI ’14.

Xiaofeng Lin, Yu Chen, Xiaodong Li, Junjie Mao, Jiaquan He, Wei Xu, and Yuanchun Shi. (2016). Scalable Kernel TCP Design and Implementation for Short-Lived Connections. In Proceedings of ASPLOS ’16.

Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. (2012). Improving network connection locality on multicore systems. In Proceedings of EuroSys ’12.

Luigi Rizzo. (2012). netmap: a novel framework for fast packet I/O. In Proceedings of USENIX ATC ’12.

Karl Rupp. (2018). 42 Years of Microprocessor Trend Data. https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

Livio Soares and Michael Stumm. (2010). FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In Proceedings of OSDI ’10.
