eBPF: Fantastic [Network & I/O] Speeds and Where To Find Them

Thrash Me If You Can
20 min read · Nov 14, 2022


TL;DR: eBPF is a framework that lets users run custom code in the kernel in a sandboxed way. We explore a few recent works, namely Syrup (SOSP 2021), XRP (OSDI 2022) and BMC (NSDI 2021), which use eBPF to make networking and I/O operations faster in a safe manner.

What is eBPF?

Crudely speaking, eBPF (extended Berkeley Packet Filter) does to the Linux kernel what JavaScript does to HTML. Just like JavaScript lets you safely run small programs on events like a button click on your static HTML webpage, eBPF lets you safely run small programs over the static kernel code on events like disk I/O. In reality, eBPF is an execution engine with its own virtual instruction set (a virtual machine), rather than the program itself, just like the V8 engine that runs JavaScript in Chrome is not the JavaScript itself. eBPF is part of the Linux kernel.

A superficial view of the eBPF architecture

We know, we know. We’ve probably left too many questions on your mind by now about the magic that is eBPF. So let’s try and answer them one by one.

How do you write an eBPF program? How is it loaded into the kernel?

Programming directly in eBPF bytecode is insanely difficult, just as coding in V8 bytecode would be. But nobody writes V8 bytecode by hand: they write JavaScript. It's the same with eBPF. People use it and code for it via frameworks.

For example, a framework called BCC allows you to write eBPF code in C, compiles it to eBPF bytecode using a Clang/LLVM compiler and loads it into the kernel. Behind the scenes, BCC does this loading in the kernel using the bpf() system call.
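To make this concrete, here is a minimal sketch of the kernel-side C that such a framework compiles and loads for you, written in the BCC style. The function name and what it traces are our own illustrative assumptions, and the user-space frontend that attaches it (a few lines of BCC's Python API) is omitted.

    #include <uapi/linux/ptrace.h>

    // Kernel-side half of a BCC "hello world": BCC compiles this C to eBPF
    // bytecode with Clang/LLVM and loads it through the bpf() system call.
    // Attached to a kprobe on a syscall entry (done from the frontend),
    // it logs a line every time that syscall fires.
    int trace_call(struct pt_regs *ctx)
    {
        // Debug helper that writes to /sys/kernel/debug/tracing/trace_pipe.
        bpf_trace_printk("syscall observed\n");
        return 0;
    }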

What is stopping me from loading a program in the kernel that runs an infinite loop and ruins my friend’s laptop? :)

The eBPF verifier. Once the program is sent into the kernel via the bpf() call, the verifier traverses the potential paths the program may take when executed in the kernel, making sure it does indeed run to completion without any looping that would cause a kernel lockup. Other checks must also pass, including valid register state, program size limits, and the absence of out-of-bounds jumps.
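As a toy illustration of what this means in practice (not code from any of the papers discussed here): the first function below would be rejected because the verifier cannot prove its loop terminates, while the second passes because every path has a provable bound.

    // Rejected: an unbounded loop, so verification fails and the
    // program never gets loaded into the kernel.
    int bad_prog(void *ctx)
    {
        for (;;) { }
        return 0;
    }

    // Accepted (on kernels that support bounded loops): the trip count
    // is a compile-time constant, so the verifier can prove termination.
    int good_prog(void *ctx)
    {
        int sum = 0;
        for (int i = 0; i < 16; i++)
            sum += i;
        return sum;
    }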

Readers with experience writing kernel modules will recognise how eBPF sets itself apart: it protects the system from the instability that unsafe code loaded into the kernel can cause.

How is the eBPF program made to run on any event I want?

Once verified, the eBPF bytecode is compiled just-in-time (JIT) to instructions in your machine’s instruction set. This is nearly as performant as natively-compiled kernel code.

The translated program is now attached to an eBPF hook. Hooks are locations in the kernel code (or even in user-space code, for that matter) where custom code can be attached to an event, so that at runtime the hooked eBPF program runs before or after that event occurs, as required. Information about the hook is passed in the arguments to the bpf() syscall made earlier. Pre-defined hooks in the kernel include system calls, function entry/exit, kernel tracepoints, and network events, among others. You can also define hooks for virtually any instruction in the kernel code using kprobes.
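For instance, a program aimed at the network-event hook in the driver (XDP), which Syrup and BMC below both build on, is in the libbpf convention just a C function whose section name tells the loader where to attach it. A minimal sketch that lets every packet through:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    // The SEC() annotation records the intended hook ("xdp" is the hook in
    // the network driver); after verification and JIT compilation the
    // loader attaches the program there.
    SEC("xdp")
    int pass_all(struct xdp_md *ctx)
    {
        return XDP_PASS;   // hand the packet to the normal network stack
    }

    char LICENSE[] SEC("license") = "GPL";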

Can I easily communicate with the eBPF program from the user space?

That’s very much possible, and quite efficiently too, thanks to eBPF maps. As implied by the name, maps are collections of key-value pairs. Through maps, eBPF supports a number of data structures such as hash tables, arrays, and tries, and programs send and receive data in maps using helper functions. Again, the framework you use, like BCC, usually carries out the mundane work of setting up eBPF maps and provides an easy-to-use frontend for communicating with the eBPF program through them.
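Here is a rough, libbpf-style sketch of how a map is declared and updated from the kernel side; the map name, sizes and the tracepoint chosen are arbitrary illustrations. User space can then read the same map through the bpf() syscall or through a framework's frontend.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    // A hash map of 64-bit counters, shared between the eBPF program
    // and user space.
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, __u64);
    } counters SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_openat")
    int count_openat(void *ctx)
    {
        __u32 key = 0;              // a single bucket, to keep the sketch short
        __u64 one = 1, *val;

        val = bpf_map_lookup_elem(&counters, &key);
        if (val)
            __sync_fetch_and_add(val, 1);                  // bump in place
        else
            bpf_map_update_elem(&counters, &key, &one, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";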

A detailed view of the eBPF architecture

Hopefully, that clears up most of the doubts that you might have had about eBPF and also gives you a more-or-less clear picture about the control flow and the architecture.

Needless to say, eBPF virtually gives the kernel superpowers. It can safely and efficiently extend the capabilities of the kernel without requiring changes to kernel source code or loading kernel modules. This has led to a wave of eBPF-based projects covering a wide array of use cases, including next-generation networking, observability, and security functionality.

We discuss three recent (and rather interesting) works that apply eBPF in three different domains: network scheduling, storage device I/O and packet forwarding. Before we begin, here is a small cartoon for your journey ahead.

Life before eBPF (Cartoon by Vadim Shchekoldin, Isovalent)
Life after eBPF (Cartoon by Vadim Shchekoldin, Isovalent)

1. Syrup — Are you thirsty for eBPF applications in scheduling?

Syrup is a framework for user-defined scheduling, meant for untrusted application developers who have enough insight into their applications to know which scheduling policy suits them best. For example, workloads whose requests have homogeneous execution times are best served by an FCFS scheduling policy, whereas those with varying execution times work better with policies involving preemption.

Can you guess how many different schedulers are called when a network packet is received?
More than five schedulers are called across different layers of the network stack. One of them is obvious: the kernel process scheduler. The remaining ones schedule network packets and network connections across sockets and NIC queues, and, as you might have correctly guessed, this is where eBPF is used in Syrup.

Firstly, how does Syrup work?
Well, it's simple: Syrup treats scheduling as a matching problem, matching inputs (network packets) to executors (sockets). Developers write code to specify executors for inputs, without having to worry about how the inputs are steered to the executors. This code, called the Syrup policy, is compiled by a system-wide Syrup daemon, syrupd, which deploys it at the appropriate position in the stack based on the executor it is targeting. The Syrup policy can be updated at any time, and policies and applications can communicate details like load and latency statistics using a Map abstraction (more on this later).

Hence, whether it's a stack of pancakes or the network stack, some syrup will always do you good :P

Syrup is important!

Secondly, where does eBPF come in?
To enable user-defined scheduling, we must either safely deploy user code in the kernel and hardware or offload scheduling decisions to user space. eBPF is used to safely deploy policies across the Linux networking stack and programmable NICs. It is also used to implement several important features of Syrup, which will be discussed in detail in the next segment. Kernel thread scheduling is handled by a different framework called ghOSt.

Let’s see in what different ways eBPF helps Syrup.

Application 1: User-defined Scheduling in different layers of the Network Stack

There is a significant difference between network-stack and thread scheduling. In thread scheduling, the policy selects an available input (process) when an executor (core) becomes available. In the case of the network stack, the policy selects an executor (socket) when an input (packet) becomes available. Executors can be software sockets that listen for datagrams/connections on ports in the upper layers of the network stack, or hardware resources of the server such as NIC queues in the lower layers. In both cases, Syrup uses eBPF maps to let users map inputs to these executors.

The map consists of an index (the key) and values. In the case of sockets, the socket and process ID serve as the value, whereas for NIC queues the queue ID is the value. The Syrup policy is the eBPF program, written in C by the user, that implements the scheduling algorithm; it returns the map index (key) for an input, which is then used to obtain the executor (value). There are various types of sockets in the Linux kernel; Syrup allows customised scheduling for AF_XDP, TCP and UDP sockets. The choice of CPU core on which packet processing is done and the thread scheduling of the socket application are also customisable by Syrup, using ghOSt. eBPF is not used for CPU and thread scheduling because there are very few eBPF hooks in the kernel scheduler.
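To visualise what such an input-executor map could look like, here is a hypothetical libbpf-style declaration (assuming the same headers as in the earlier sketches). The struct fields and sizes are our own guesses for illustration, not Syrup's actual definitions.

    // Hypothetical input-executor map: the policy returns a key (index),
    // and the value identifies the executor the input is steered to.
    struct executor {
        __u32 socket_fd;   // socket executor (upper layers) ...
        __u32 pid;         // ... with its owning process, or
        __u32 queue_id;    // NIC queue executor (lower layers)
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 64);             // NUM_EXECUTORS, in the paper's terms
        __type(key, __u32);                  // index returned by the policy
        __type(value, struct executor);
    } input_executor_map SEC(".maps");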

Schedulers called at various levels of the Network Stack

Application 2: Isolation of Syrup Policy code

Injecting user-defined code into the kernel is fundamentally dangerous, and Syrup policies are all user-written code segments.

The policy code loaded by one application should not make unauthorized accesses to memory belonging to other applications or the kernel.

This is ensured by the eBPF verifier mechanism, which performs a number of checks before the code is loaded into the kernel. The verifier simulates the execution of the program one instruction at a time and checks for out-of-bounds jumps and out-of-range data accesses, allowing pointer accesses only after an explicit check for bound violations. Every memory access to the packet must explicitly check that it does not go out of bounds, using a packet_end pointer. Although only bounded loops are allowed, this drawback does not affect scheduling policies, as by nature they do not involve an unbounded number of iterations. Thus, the verifier prevents unwarranted memory accesses.
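In code, the bound check the verifier insists on looks like the following generic XDP-style sketch (not Syrup's code): every pointer into the packet is compared against the end-of-packet pointer before it is dereferenced.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int bounds_checked(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;   // the "packet end" pointer
        struct ethhdr *eth = data;

        // Without this explicit check the verifier rejects the program,
        // because the access below could read past the packet buffer.
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;

        if (eth->h_proto == bpf_htons(ETH_P_IP)) {
            /* safe to parse the IP header next, with its own check */
        }
        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";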

Application 3: Cross-application Isolation of Inputs

Syrup allows the simultaneous deployment of multiple user defined policies for different applications.

A policy loaded by one application should only have access to the inputs belonging to that application.

Things get a little tricky here. Since eBPF was designed as a system tool, its developers expected it to apply system-wide policies. For example, eBPF programs in the lower parts of the network stack have access to every network packet regardless of the application. This issue is dealt with by the syrupd daemon using eBPF maps. An application sends a request to this daemon to deploy a policy, instead of directly loading the eBPF program into the kernel. The request contains the policy file name and the input-executor description. A mapping is maintained between the application's port and the various policy requests it has made. The daemon monitors which port belongs to which application and dynamically deploys the policy, using the map to ensure that each application's program handles only packets directed to its own port.

Application 4: Cross Layer Communication

Runtime communication between the application code and the schedulers deployed across different layers of the stack is often crucial for scheduling performance and functionality. For example, resource allocation decisions using expensive learning-based techniques are often offloaded to user space, outside of the critical path. In Syrup, this communication is implemented with custom per-application eBPF maps, which syrupd pins to a filesystem path derived from the user_id so that different applications run by the same user can access them. In simple terms, these maps act as a common memory location that can be read and updated both from user space and from the kernel-space eBPF code.
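Here is a sketch of the user-space side of that communication, assuming the daemon has pinned a map under the BPF filesystem; the path and the meaning of the key are made up for illustration.

    #include <stdio.h>
    #include <bpf/bpf.h>    // libbpf's thin wrappers around the bpf() syscall

    int main(void)
    {
        // Open a map pinned earlier by the daemon (hypothetical path).
        int map_fd = bpf_obj_get("/sys/fs/bpf/syrup/app_stats");
        if (map_fd < 0) {
            perror("bpf_obj_get");
            return 1;
        }

        __u32 key = 0;      // e.g. a "current load" slot (illustrative)
        __u64 value = 0;

        // Read what the in-kernel policy last wrote...
        if (bpf_map_lookup_elem(map_fd, &key, &value) == 0)
            printf("value written by the policy: %llu\n", (unsigned long long)value);

        // ...and push an updated hint back down to it.
        value = 42;
        bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
        return 0;
    }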

Syrup Network Stack Scheduling Example

Whew, that was long (but interesting, right?). Now it is time to see Syrup in action. Have a look at this short code snippet.

Syrup Policy for Network Scheduling Demo Code

Here, schedule is our Syrup policy, which determines the NIC queue a UDP packet will be assigned to. NUM_EXECUTORS is the total number of NIC queues in our system, and any value between 0 and NUM_EXECUTORS-1 serves as the index (key) of our policy's input-executor map, as discussed earlier, whose value gives the executor (in this case the NIC queue). The pkt_end parameter is used by the eBPF verifier while loading the policy to verify the memory access bounds of the packet. This code can easily be modified for TCP packets by changing struct udphdr to struct tcphdr. The policy is deployed at the appropriate place in the network stack by Syrup (via the request to syrupd). This snippet improves throughput by more than 80% for a high-performance key-value store application.
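Since the snippet above is shown as an image, here is a rough reconstruction of what such a policy looks like, based purely on the description; the exact signature and helpers in Syrup's real code may differ.

    #include <linux/udp.h>
    #include <bpf/bpf_endian.h>

    #define NUM_EXECUTORS 8   // total number of NIC queues (example value)

    // Hypothetical Syrup policy: hash the UDP source port so that packets
    // of the same flow keep landing on the same NIC queue.
    int schedule(struct udphdr *udp, void *pkt_end)
    {
        // The verifier requires this bound check before udp is dereferenced.
        if ((void *)(udp + 1) > pkt_end)
            return 0;

        // A value in [0, NUM_EXECUTORS-1]: the key into the input-executor
        // map, whose value identifies the NIC queue for this packet.
        return bpf_ntohs(udp->source) % NUM_EXECUTORS;
    }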

Right, let’s move to the next neat framework built using eBPF — this time for optimizing storage device I/O.

2. XRP (eXpress Resubmission Path) — Making the kernel catch up as storage device latencies drop

With the emergence of microsecond-scale NVMe storage devices, the Linux kernel storage stack overhead has become significant, almost doubling access times.

Percentage of latency overhead of the kernel vs the hardware with 512 B random reads. With hardware improvements, kernel’s latency overhead has become significant. (Source: XRP Slides from USENIX 2022)

What is the problem with the kernel storage stack?

Any I/O request made to a device has to pass through the syscall layer (user-kernel interface), the filesystem and block I/O layers (file abstraction) and finally the storage device driver layer (physical device abstraction) of the kernel before it reaches the underlying device. That's a lot of steps! The figure below shows the latencies contributed by each layer upon submission of a 512 B read request.

The layers in the kernel storage stack and the latencies they cause for a random 512 B read request. (Source: XRP Slides from USENIX 2022)

Well then, surely someone must have come up with a way to avoid this overhead, right? It seems like a lot.

One popular method is bypassing the kernel entirely and giving applications direct access to the underlying hardware by means of a library like SPDK. However, despite the clear speedup kernel bypass provides, it has several downsides: the application has to bring its own filesystem implementation, has to poll for I/O completion (which wastes CPU cycles, especially when there is little I/O-bound work), and has to forgo the isolation and safety provided by the kernel.

Here’s where XRP offers a neat solution.

TL;DR: XRP is a framework that allows applications to execute user-defined storage functions from an eBPF hook in the NVMe device driver. This significantly reduces latency for operations that require a number of auxiliary I/O requests before obtaining the value(s) of interest, by resubmitting these requests from the driver layer itself and preventing them from crossing the kernel boundary.

The specific problem we wish to tackle is reducing the latency of operations that involve a chain of auxiliary device I/O requests. Allow us to explain what we mean. Let's say you are using a database application that stores data on the storage device in a B+-tree data structure. You want to obtain the data in a certain leaf node. In order to do so, you must first request (read) the root node. The root node is parsed, the pointer to the next child node is obtained, and you make a request (read), again from user space, for this child node. This process gets repeated many times till you read the requisite leaf node (see figure below).
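To see why this hurts, here is roughly what the application-side loop looks like without XRP, with a made-up on-disk node layout; every level of the tree costs a separate pread() that crosses the entire kernel storage stack.

    #include <stdint.h>
    #include <unistd.h>

    #define FANOUT 16

    // Hypothetical on-disk B+-tree node layout (illustration only).
    struct node {
        uint64_t keys[FANOUT];
        uint64_t children[FANOUT];   // disk offsets of child nodes
        uint32_t is_leaf;
    };

    // Baseline lookup: one pread() per tree level, each a full round trip
    // through the syscall, filesystem, block and driver layers.
    uint64_t lookup(int fd, uint64_t root_off, uint64_t key)
    {
        struct node n;
        uint64_t off = root_off;

        for (;;) {
            pread(fd, &n, sizeof(n), off);   // auxiliary I/O request
            if (n.is_leaf)
                return off;                  // reached the leaf we wanted
            int i = 0;
            while (i < FANOUT - 1 && key >= n.keys[i + 1])
                i++;                         // choose the child to descend into
            off = n.children[i];
        }
    }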

Lookup of a leaf node in a B+-tree data structure. (Source: XRP Slides from USENIX 2022)

As the node parsing and next-child-pointer calculation logic lives in the application, each level of lookup necessitates sending data back to the application and the application resubmitting a read request via a system call. Note that the pointers to the intermediate nodes are not useful to us in themselves; they are solely needed to reach the leaf node from the root node. It isn't too hard to see that these repeated read requests are a big overhead. Now, we aim to minimise this overhead with some eBPF magic.

Resubmission made faster by eBPF

Here's how eBPF is used in XRP. Continuing the example from earlier, instead of making a sequence of system calls from user space, each of the intermediate pointer lookups is executed by an eBPF function. First, the application loads this function into the kernel and hooks it at some layer of the storage stack. The function intercepts the read response from the device and parses the B+-tree node to find the pointer to the required child node. This enables the resubmission of the read request to fetch the next node from kernel space itself. Chaining a sequence of such eBPF functions avoids the cost of traversing kernel layers and moving auxiliary data to user space. Finally, when the leaf node is read, the data is sent to user space.
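In spirit, the body of that user-space loop becomes an eBPF function of XRP's new program type. The context structure and field names below are simplified guesses based on the paper's description (reusing the hypothetical node layout from the sketch above), not XRP's actual interface: the function inspects the block just read, and either declares the lookup done or tells the driver which block to fetch next.

    // Simplified sketch of an XRP-style resubmission function.
    struct xrp_ctx {               // hypothetical context, illustrative names
        char     *data;            // buffer holding the node just read
        int       done;            // set to 1 once the leaf is reached
        uint64_t  next_offset;     // disk offset the driver should fetch next
        uint64_t  key;             // the key being looked up
    };

    int btree_step(struct xrp_ctx *ctx)
    {
        struct node *n = (struct node *)ctx->data;

        if (n->is_leaf) {
            ctx->done = 1;                       // result returns to user space
            return 0;
        }

        int i = 0;
        while (i < FANOUT - 1 && ctx->key >= n->keys[i + 1])
            i++;                                 // bounded loop: verifier-friendly
        ctx->next_offset = n->children[i];       // resubmitted from the driver itself
        ctx->done = 0;
        return 0;
    }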

The bpfkv_bpf() function gets loaded into the kernel by eBPF. (Source: XRP Paper)

Generally speaking, this idea can make any I/O operation involving resubmissions much faster: index lookups, range queries, and aggregations such as averages, among others. Have a look at the figure below to solidify your understanding.

XRP lets applications load a custom function to the NVMe driver for implementing logic that only the application has information about; Number of kernel crossings gets reduced. (Source: XRP Slides from USENIX 2022)

Oh, and you must be wondering where this eBPF function is actually hooked in the XRP framework. In order to achieve maximum efficiency (read: bypass most of the kernel storage stack), the function is hooked at the lowest kernel layer, i.e., the NVMe driver. The developers of XRP instrument the NVMe driver with a custom-developed eBPF hook and with the I/O request resubmission logic that makes XRP work as described above.

Safety of kernel code ensured

The eBPF verifier ensures that any code injected into the kernel is safe to run. The developers of XRP modified the verifier to add extra checks, so that the extra functionality of the new BPF_PROG_TYPE_XRP (eBPF maintains a list of program types that can be hooked into the kernel; XRP adds this one as a new entry) maintains the safety guarantees offered by eBPF.

The developers of XRP evaluated their framework on WiredTiger, a popular log-structured merge tree (LSM tree) based storage engine used by MongoDB, and found that it can leverage XRP to significantly improve throughput and latency. They also evaluated the framework on a custom-made key-value store called BPF-KV. They found that XRP reduces CPU contention and thus achieves higher throughput than SPDK (kernel bypass) when the number of threads exceeds the number of cores, while achieving latencies nearly equal to, or at times slightly higher than, SPDK's. Not bad at all, given that it additionally keeps the advantages of being OS-integrated!

We hope we managed to give you a clear enough idea of the potential eBPF has for improving applications that use fast NVMe devices. Beyond fast lookups, XRP could in the future support many other kinds of functions, such as compaction, compression and deduplication, and could be developed into a common interface for other use cases where computation needs to move closer to storage.

Alright then, you have seen an application of eBPF in network scheduling and another in storage device I/O. Here's the final one of this blog: packet forwarding.

3. BMC (BPF Memcached Cache) — Addressing the kernel network stack bottleneck for Memcached

Memcached is an in-memory key-value store often used as a cache layer by database-driven web applications to speed up requests, storing data in RAM and reducing the number of times an external data source must be queried. It runs as a server to which requests can be sent over TCP or UDP in the form of commands, of which GET and SET are the most commonly used. A GET key command retrieves the value associated with the specified key, and a SET key value command stores the key-value pair. It is quite famous; in fact, Facebook leverages Memcached for its own key-value store, which supports billions of requests per second.

The bottlenecks

Given how popular it is and the crucial role it serves (speeding up requests), it must be able to sustain high network loads. However, it has been observed that Memcached's performance suffers because of the OS network stack bottleneck. To put it more simply, the time taken to get a request to the Memcached application exceeds the time the Memcached application takes to process it.

CPU usage of the various network related system calls for a Memcached server. (Source: BMC Paper)

More than 50% of Memcached’s runtime is spent executing system calls. Moreover, the CPU usage of network related system calls increases as more threads are allocated to Memcached. When eight threads are used by Memcached, the total CPU usage of these system calls reaches 80%.

On a side note, the traditional Memcached application uses a single UDP socket, and its queue is shared between the application cores and the cores receiving packets from the NIC. This leads to competition for the queue (lock contention), causing a decrease in throughput. To prevent this, a modified version of Memcached, called MemcachedSR, is used; it relies on the SO_REUSEPORT socket option, which allows multiple UDP sockets to bind to the same port. With this option, a unique UDP socket can be allocated to each Memcached thread, and the Linux kernel distributes received packets uniformly across the socket queues. The modified version scales well with the number of threads.
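The per-thread socket setup behind MemcachedSR boils down to the standard SO_REUSEPORT pattern; here is a generic sketch (error handling omitted), not MemcachedSR's actual code.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    // Each worker thread gets its own UDP socket bound to the same port;
    // with SO_REUSEPORT the kernel spreads incoming datagrams across the
    // per-socket queues, removing the shared-queue lock contention.
    int make_worker_socket(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(port);

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        return fd;
    }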

The difference in throughput between vanilla Memcached and MemcachedSR. (Source: BMC Paper)

However, each packet still needs to go through the complete network stack which is the primary bottleneck.

Enter BMC; goodbye network stack!

BMC acts as a first-level cache for small UDP requests to the Memcached server. Now you might ask: why small, and why only UDP?

The size is kept small so that the BMC code is successfully verified by the eBPF verifier, which is currently restricted to exploring at most 1 million instructions. BMC uses a loop to copy data from a packet to its cache. To ensure that the complexity of the program does not cross the limit, BMC is set to process keys of length up to 250 bytes and values of length up to 1000 bytes. However, this restriction alone is not enough to keep the program under the 1-million-instruction limit; for that, BMC is split into several small programs (more on this soon). Next, only UDP is supported for now because the TCP protocol is much more complex in comparison, and implementing it in eBPF code is a difficult task. Also, Facebook's workloads show a 30:1 ratio of GET (UDP) to SET (TCP) commands. While this may not generalise to all Memcached deployments, if the share of GET (UDP) requests is indeed very high then one can expect significant performance improvements with BMC.

BMC intercepts network packets at the network-driver level to process Memcached requests as soon as possible after they have been received by the NIC.

BMC is a collection of seven different eBPF programs, with the complexity of each program low enough to pass the verifier. There are two primary chains of control. The incoming chain is attached to the XDP hook and is executed whenever a new packet arrives from the network driver. It contains five eBPF programs, namely rx_filter, hash_keys, invalidate_cache, prepare_packet, and write_reply. The outgoing chain is attached to the Traffic Control (TC) hook and is executed whenever a new packet is sent to the network driver. It contains two eBPF programs, namely tx_filter and update_cache. Now let's discuss the functioning of BMC and the contribution of each of these small programs.

The break down of BMC into smaller programs. (Source: BMC Paper)

BMC checks the destination port of all packets received by the network driver so that it only processes Memcached traffic; the rest are simply forwarded to the network stack. This is done by rx_filter. If the packet is intended for Memcached, it sends UDP packets containing GET requests to hash_keys and TCP packets to invalidate_cache. Next, one of the following three things can happen (a minimal sketch of how these small programs can be chained follows the list):

Processing of GET and SET requests by BMC. (Source: BMC Paper)
  1. UDP GET request (cache hit): when a GET request comes in, BMC checks whether the key is in the in-kernel cache using hash_keys. If found, it sends the corresponding value by preparing a response packet with prepare_packet and sending it with write_reply. The packet thus never reaches the network stack or the Memcached application, making this operation very fast.
  2. UDP GET request (cache miss): if hash_keys fails to find the key in the in-kernel cache, the packet is passed to the network stack and eventually reaches the Memcached application. If a hit occurs in the application, we want the BMC cache to be updated. This happens in the outgoing chain: BMC filters outgoing packets with tx_filter, which matches UDP packets coming from the Memcached application and sends them to update_cache. This program updates the BMC cache if the packet contains a Memcached GET response.
  3. TCP SET request: if the invalidate_cache program detects a SET request, it computes the key hash, invalidates the cache entry (if present) corresponding to the key being set, and then forwards the packet via the network stack.
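As promised above, here is a minimal sketch of how small eBPF programs like these can be chained: a program-array map plus the bpf_tail_call helper. The program names mirror BMC's, but the code is our own illustration of the technique, not BMC's implementation.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    // One slot per small program in the chain.
    enum { PROG_HASH_KEYS = 0, PROG_INVALIDATE_CACHE = 1 };

    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 8);
        __type(key, __u32);
        __type(value, __u32);
    } progs SEC(".maps");

    SEC("xdp")
    int rx_filter(struct xdp_md *ctx)
    {
        // ...parse headers and check the Memcached destination port (omitted)...
        int is_udp_get = 0;   // placeholders for the real checks
        int is_tcp_set = 0;

        if (is_udp_get)
            bpf_tail_call(ctx, &progs, PROG_HASH_KEYS);        // jumps, never returns
        else if (is_tcp_set)
            bpf_tail_call(ctx, &progs, PROG_INVALIDATE_CACHE);

        return XDP_PASS;   // not Memcached traffic: up the normal network stack
    }

    char LICENSE[] SEC("license") = "GPL";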

The BMC cache is a hash table shared by all cores. Indexing is done on the Memcached keys, which are hashed with a 32-bit hash function. Each cache entry contains a valid bit and a hash value, which BMC uses to validate a cache hit, and is protected from concurrent access by its own spin lock. This gives better performance than a global locking scheme over the whole table.
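A plausible shape for that cache, using eBPF's per-entry spin-lock support; the sizes and field names are illustrative rather than BMC's exact layout.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_VAL_LEN 1000           // value-size limit quoted earlier

    struct cache_entry {
        struct bpf_spin_lock lock;     // per-entry lock, not a global one
        __u32 hash;                    // 32-bit hash of the Memcached key
        __u32 valid;                   // valid bit used to confirm a hit
        __u32 len;                     // length of the stored value
        char  data[MAX_VAL_LEN];       // the cached Memcached response payload
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);       // shared by all cores
        __uint(max_entries, 65536);             // arbitrary size for the sketch
        __type(key, __u32);                     // hash % max_entries
        __type(value, struct cache_entry);
    } bmc_cache SEC(".maps");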

Remember the throughput comparison between vanilla Memcached and MemcachedSR? Well, that is nothing compared to the throughput offered by BMC. Look at the speedup yourself in the figure below.

Throughput comparison between vanilla Memcached, MemcachedSR and MemcachedSR with BMC. (Source: BMC Paper)

The vanilla Memcached application at best achieves 393K requests per second using 4 cores. MemcachedSR offers better scalability and achieves 1.2M requests per second when using 8 cores. For MemcachedSR with BMC running on 8 cores, the server achieves a throughput of 7.2 million requests per second, 6.3 million of which are processed by BMC and the rest by Memcached. This is 18x better performance than vanilla Memcached and 6x better than MemcachedSR.

We hope you now have some idea of how eBPF can be leveraged. It makes adding new functionality to the kernel comparatively easy, and when attached to the right hooks it can bypass several abstractions of the kernel network or I/O stack and provide much-needed speedups while staying tightly integrated with the kernel.

We are excited about the tools and frameworks that might be developed in the future leveraging eBPF, as more and more functionality gets added to the eBPF subsystem of the Linux kernel with each new version. Having come this far, if you're now eager to know what kinds of eBPF tools have been written so far and want to try some of them out, here's something to start you off. If you want to learn more about eBPF internals, do visit ebpf.io.

The authors of this blog (all having equal contribution to it) are Neha Dalmia, Nisarg Upadhyaya and Rajat Bachhawat. We are all fourth year undergraduate students of the Department of Computer Science and Engineering at Indian Institute of Technology Kharagpur.

References

[1] Kaffes, K., Humphries, J., Mazières, D., & Kozyrakis, C. (2021). Syrup: User-Defined Scheduling Across the Stack. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (pp. 605–620). Association for Computing Machinery.

[2] Yuhong Zhong, Haoyu Li, Yu Jian Wu, Ioannis Zarkadas, Jeffrey Tao, Evan Mesterhazy, Michael Makris, Junfeng Yang, Amy Tai, Ryan Stutsman, & Asaf Cidon (2022). XRP: In-Kernel Storage Functions with eBPF. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (pp. 375–393). USENIX Association.

[3] Yoann Ghigoff, Julien Sopena, Kahina Lazri, Antoine Blin, & Gilles Muller (2021). BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21) (pp. 487–501). USENIX Association.
