Rust and AF_XDP; Another Load Balancing Adventure

Ben Parli
Published in Nerd For Tech · Jun 28, 2021

I’d previously written a general-purpose network load balancer for fun (but not profit) which I dubbed Convey. Though the original intention wasn’t for this to be an ongoing series, it’s become a fun project to continue hacking on, so here we are; see Part 1 and Part 2 for reference.

The previous implementations of passthrough load balancing were reasonably fast, but relied heavily on the Linux kernel for packet forwarding, so there was still plenty of room for improvement. With the increasing popularity of eBPF and XDP, I decided to jump back in and try to squeeze out more throughput. In many cases teams have decided to bypass the kernel altogether and perform all the network functions in user space (DPDK and Netmap, for example). Done right, this is very fast. Of course, the problem then is that our user space programs need to handle everything the kernel network stack usually abstracts away from us. Recent kernels (since 4.8) support a nice balance to this problem: XDP.

Why AF_XDP

How would AF_XDP fit into this load balancer project? Recall that in passthrough load balancing the client’s TCP session does not terminate at the load balancer. Instead, each packet is processed, manipulated and forwarded on to a backend server, where the TCP handshake actually occurs. Internal bookkeeping is needed in the load balancer to track and manage connections as they pass through, but from the client’s perspective all communication is still occurring with the load balancer. This is sometimes called Layer-4 switching.

In previous Convey implementations the packets are actually sniffed at the load balancer (Layer 2), processed, and then forwarded out through a raw socket (with the IP_HDRINCL option set). The user space load balancing program sets the appropriate fields in the TCP and IP headers. To get this to work, however, some iptables settings were needed, which felt a bit hacky. A cleaner and more efficient implementation would bypass the kernel network stack entirely, but only for the traffic we really care about.

Ideally we can filter traffic in kernel space and only pass what we’re interested in to our user space load balancing program. Anything else can still reliably be handled by the kernel.

XDP (eXpress Data Path) and AF_XDP

The idea behind XDP is to install packet processing programs into the kernel; these programs are executed for each arriving packet. Since we’re running in the kernel, the programs are written as eBPF (extended Berkeley Packet Filter) programs; a previously defined (and loaded) eBPF program decides the packet handling (i.e. pass it on to the kernel network stack as normal, drop it, abort, or send it to a user space program). The hook sits in the NIC (network interface card) driver just after interrupt processing, but (ideally) before any memory allocation needed by the network stack itself (namely SKB allocation). It turns out skipping SKB allocation can be really powerful since it’s such an expensive operation performed on every packet. So our eBPF program is executed very early in the packet lifecycle, before the kernel can even operate on the packet itself (so no tcpdump or qdiscs, for example). For more background on eBPF check out the excellent Cilium reference.

XDP/eBPF programs can be attached at a couple of different points. The best option is to hook into the device driver before the kernel allocates an SKB. This is called Native mode, but the catch is that your driver needs to explicitly support it (list of driver support). A fallback option is to run in Generic mode; here the XDP hook is called from netif_receive_skb(), which occurs after the packet’s SKB allocation, so we unfortunately lose most of the performance benefits. But at least we can still develop and test our programs in Generic mode without the need for a special driver.

When an XDP program is finished processing a packet, it should return one of the following XDP actions:

  • XDP_PASS: let the packet continue to the kernel network stack
  • XDP_DROP: silently drop the packet
  • XDP_ABORTED: drop the packet with trace point exception
  • XDP_TX: bounce the packet back to the same NIC it arrived on
  • XDP_REDIRECT: redirect the packet to another NIC or user space socket via the AF_XDP address family

The final piece of the puzzle is AF_XDP. An important distinction is that XDP by itself would not address what we need, since it’s not meant to be a kernel bypass facility. As the fine xdp-tutorial points out, XDP is an in-kernel fast path that operates on raw frames “inline” before they reach the normal Linux kernel network stack; pure XDP was not meant to pass frames to user space. That’s where AF_XDP comes in. AF_XDP is a new address family optimized for high performance packet processing. Whereas the bpf_redirect_map() helper function would typically be used by XDP programs to redirect ingress frames to other XDP-enabled network devices, AF_XDP sockets make it possible for XDP programs to redirect frames (still using the bpf_redirect_map() helper function) to a memory buffer (or UMEM, more on this below) in a user space application (the Convey load balancer in our case). In practice, then, to quickly pass raw frames into user space, XDP bypasses the Linux kernel network stack by XDP_REDIRECT’ing into a special BPF map containing AF_XDP sockets.

Interestingly, one of the fundamental ideas behind AF_XDP dates back to Van Jacobson’s talk about network channels. The talk focused on creating a lock-free channel directly from a driver RX queue into a (AF_XDP) socket. Since AF_XDP bypasses all kernel queues and locks, the sockets instead utilize lock-free Single Producer / Single Consumer rings. The Single Producer ring binds to a specific RX queue id; from there, NAPI softirq ensures only one CPU processes each RX queue id. Likewise, the Single Consumer is a single application. These Single Producer / Single Consumer rings are largely where the performance of AF_XDP comes from. The catch, then, is that any synchronization has to be handled in user space; doing much of that would defeat the purpose and sacrifice performance, so in the spirit of Van Jacobson’s lock-free channels we’ll try to keep user space synchronization to a minimum.

More Speed Please

Streamlining Passthrough Load Balancing with AF_XDP

So it should work in theory, but how will it look in practice? For this to work we need to 1) write and compile the eBPF program, 2) load it into the kernel (from the Convey program running in user space), and 3) set up the AF_XDP socket(s) (again from the Convey program). First, some concepts to better understand the implementation and setup.

More Helpful Concepts

  • The RX and TX descriptor rings of the AF_XDP sockets point to a data buffer in a memory area called a UMEM. UMEM is a region of contiguous (virtual) memory, divided into equal-sized frames. Recall, we are avoiding any per-packet memory allocation: the UMEM area used for packets is pre-allocated, and it’s the responsibility of the user space program to perform that allocation.
  • There are four different kinds of rings to concern ourselves with: FILL, COMPLETION, RX and TX. All rings are Single Producer / Single Consumer, so the user-space application needs explicit coordination if multiple processes/threads are reading/writing to them (but again, best to avoid sharing these).
  • The rings themselves are head (producer) / tail (consumer) based rings. This means a producer writes to the data ring at the index pointed out by the struct xdp_ring producer member and then increments the producer index, while a consumer reads the data ring at the index pointed out by the struct xdp_ring consumer member and then increments the consumer index (a minimal sketch of this pattern follows this list).
  • An AF_XDP socket is linked to a single UMEM (although, helpfully, one UMEM can have multiple AF_XDP sockets). The UMEM has two Single Producer / Single Consumer rings that are used to transfer ownership of UMEM frames between the kernel and the user-space application.
  • Consumers read descriptors from these rings; the descriptors themselves point into the UMEM area.
  • More specifically, the UMEM uses two rings for coordination: FILL and COMPLETION. Each AF_XDP socket associated with the UMEM must have an RX queue, TX queue or both. So if there is a setup with four sockets (all doing TX and RX) then there will be one FILL ring, one COMPLETION ring, four TX rings and four RX rings. In the FILL ring: the application gives the kernel a packet area to RX fill. In the COMPLETION ring, the kernel tells the application that TX is done for a packet area (which then can be reused). This scheme is for transferring ownership of UMEM packet areas between the kernel and the user space application.
  • Whereas many load balancers and applications will typically scale with more CPU, this implementation only scales with additional CPU:RX queue pairs due to the semantics of the Single Producer / Single Consumer rings.
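To make the ring mechanics a little more concrete, here is a minimal sketch in plain Rust. None of this is real AF_XDP code (the real rings live in memory shared with the kernel and are driven through libbpf / the afxdp crate); it just models a Single Producer / Single Consumer ring of frame addresses and the FILL-to-RX ownership hand-off described above. The struct and field names are made up for illustration.

// A toy Single Producer / Single Consumer ring of UMEM frame addresses.
// Purely illustrative -- the real rings are shared with the kernel and
// manipulated through libbpf / afxdp crate helpers.
struct SpscRing {
    entries: Vec<u64>, // frame addresses (offsets into the UMEM area)
    producer: usize,   // only ever written by the producer side
    consumer: usize,   // only ever written by the consumer side
}

impl SpscRing {
    fn new(capacity: usize) -> Self {
        SpscRing { entries: vec![0; capacity], producer: 0, consumer: 0 }
    }

    // The producer writes at the producer index, then advances it.
    fn produce(&mut self, frame_addr: u64) -> bool {
        if self.producer - self.consumer == self.entries.len() {
            return false; // ring is full
        }
        let idx = self.producer % self.entries.len();
        self.entries[idx] = frame_addr;
        self.producer += 1;
        true
    }

    // The consumer reads at the consumer index, then advances it.
    fn consume(&mut self) -> Option<u64> {
        if self.consumer == self.producer {
            return None; // ring is empty
        }
        let idx = self.consumer % self.entries.len();
        let frame_addr = self.entries[idx];
        self.consumer += 1;
        Some(frame_addr)
    }
}

fn main() {
    // Pretend the UMEM is divided into four 2048-byte frames.
    let mut fill = SpscRing::new(4); // app -> kernel: frames available for RX
    let mut rx = SpscRing::new(4);   // kernel -> app: frames holding received packets

    // The application hands every frame to the kernel via the FILL ring.
    for frame in 0..4u64 {
        fill.produce(frame * 2048);
    }

    // The kernel (simulated here) consumes a FILL entry and, once a packet
    // lands in that frame, publishes it on the RX ring.
    if let Some(addr) = fill.consume() {
        rx.produce(addr);
    }

    // The application consumes the RX descriptor and owns that frame again.
    if let Some(addr) = rx.consume() {
        println!("received a packet in the UMEM frame at offset {}", addr);
    }
}

The key property is that each index has exactly one writer; that discipline is what lets the real rings stay lock-free between the kernel and our user space program.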

Phew! With all that in mind, let’s jump into the implementation. Also, see the kernel AF_XDP docs for more of the conceptual details.

The eBPF Program

We first need to compile our own eBPF program for loading into the kernel. Let’s walk through what’s happening:

Set up the map of AF_XDP sockets to redirect packets to. Note this is a special BPF map type, specifically for AF_XDP sockets.

struct bpf_map_def SEC("maps") xsks_map = {
    .type = BPF_MAP_TYPE_XSKMAP,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 64, /* Assume netdev has no more than 64 queues */
};

For the actual eBPF program:

  1. Assign the name of the program so we can load it specifically later from our user space program.
  2. Some of this is boilerplate for XDP. Our function is passed the xdp_md struct as an argument. From it we can derive the block of data we’ll need to parse.
  3. Set up the network headers we’ll want to parse, starting with the Ethernet header. The necessary kernel header files and eBPF helpers have been imported.
  4. From there we parse the Ethernet header to get to the IP header.
  5. Finally we parse down to the TCP header. This is where we can decide whether this is traffic we care about (i.e. is the packet destined for the load balancer port). If so, the eBPF helper functions allow for verifying the corresponding queue id has an active AF_XDP socket bound to it (bpf_map_lookup_elem) and then redirecting the packet to the appropriate AF_XDP socket (via bpf_redirect_map).
SEC("xdp_filter_80")
int xdp_filter_prog(struct xdp_md *ctx)
{
int index = ctx->rx_queue_index;
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth;
struct hdr_cursor nh = { .pos = data };
struct iphdr *iph;
struct tcphdr *tcph;
int eth_type, ip_type;

eth_type = parse_ethhdr(&nh, data_end, &eth);
if (eth_type == bpf_htons(ETH_P_IP)) {
ip_type = parse_iphdr(&nh, data_end, &iph);
if (ip_type == IPPROTO_TCP) {
if (parse_tcphdr(&nh, data_end, &tcph) < 0) {
return XDP_ABORTED;
}
if (bpf_ntohs(tcph->dest)==80 ) {
if (bpf_map_lookup_elem(&xsks_map, &index))
return bpf_redirect_map(&xsks_map, index, 0);
}

// catch ephemeral ports as well.
// Comment this section out for
// runnning in DSR mode
if (bpf_ntohs(tcph->dest)>=33768 ) {
if (bpf_map_lookup_elem(&xsks_map, &index))
return bpf_redirect_map(&xsks_map, index, 0);
}
}
}

return XDP_PASS;
}

Anything else we simply pass on to the kernel network stack (XDP_PASS). This was the original goal: filter out everything we don’t care about and let the kernel continue doing its thing. Now only traffic we’re load balancing is redirected to our user space program on the other side of the AF_XDP socket.

Although this is a Rust project, there is no way around writing the eBPF program in C. To build the object, I simply re-used the tooling from the very helpful xdp-tutorial. Once the sub-modules are set up, simply update the eBPF program, run make, and we have our eBPF object.

AF_XDP Setup

Now that we have our eBPF object with program and XSKS map ready, we can proceed with the setup from user space. We can also go back to writing Rust code. First we need to load the eBPF program and map (using the rebpf crate).

libbpf::bpf_set_link_xdp_fd(&interface, Some(&bpf_fd), xdp_flags)?;
let info = libbpf::bpf_obj_get_info_by_fd(&bpf_fd)?;
info!(
    "Success Loading\n XDP prog name: {}, id {} on device: {}",
    info.name()?,
    info.id(),
    interface.ifindex()
);
let _bpf_map = libbpf::bpf_object__find_map_by_name(&bpf_object, map_name)?;

With that done, we’re now ready for the AF_XDP setup. Although AF_XDP is a relatively recent development, there is already a working crate we can leverage; the afxdp crate takes care of the necessary libbpf bindings as well as many of the AF_XDP details.

We’ll want dedicated workers for receiving, processing and forwarding packets. In this case the number of workers should correspond to the number of CPU:RX queue pairs (for the reasons described above). Use ethtool -l <device_name> to see how many RX (or combined) queues your device driver exposes, and use ethtool -L <device_name> combined <N> to update the count as needed.

ethtool -l eth0

Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4

Before we get to the individual workers, a global memory map area is set up. This will essentially be partitioned among the workers, each of which then sets up its own dedicated UMEM (recall we want to avoid any synchronization overhead between workers). Each worker also sets up its own AF_XDP socket and binds it to a specific RX queue of the load balancer interface (again, only one socket per RX queue). That all happens here. There is also some setup for things like ARP caches and learning network routes, since our user space program is now responsible for those details (remember, we’re on our own here since we are fully bypassing the kernel network stack for load-balancer-directed traffic). We can still leverage the ever-helpful /proc filesystem, however. For reference, most of that happens here or here. Since those details are less interesting I’ll skip any further descriptions.
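For the curious, the flavor of that /proc scraping looks roughly like the following (illustrative only, not Convey’s actual parsing code): read /proc/net/arp, skip the header row, and build an IP-to-MAC mapping the forwarding path can consult.

use std::fs;

// Read the system ARP table from /proc/net/arp and collect (IP, MAC) pairs.
// The columns are: IP address, HW type, Flags, HW address, Mask, Device.
fn read_arp_cache() -> Vec<(String, String)> {
    let contents = fs::read_to_string("/proc/net/arp").unwrap_or_default();
    contents
        .lines()
        .skip(1) // skip the header row
        .filter_map(|line| {
            let fields: Vec<&str> = line.split_whitespace().collect();
            if fields.len() >= 4 {
                Some((fields[0].to_string(), fields[3].to_string()))
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    for (ip, mac) in read_arp_cache() {
        println!("{} is at {}", ip, mac);
    }
}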

Finally, each worker is spawned in a native thread and pins itself to a core. Before the processing loop begins, the UMEM must be pre-allocated. That happens here, with the afxdp crate handling the details (namely, wrapping xsk_ring_prod__reserve(), xsk_ring_prod__fill_addr(), and xsk_ring_prod__submit()).
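Stripped of all the AF_XDP details, the threading model looks roughly like the sketch below. The Worker struct and its methods are stand-ins for Convey’s per-worker state (UMEM slice, rings, socket), and the core pinning is only noted in a comment (the core_affinity crate is one way to do it); this is a sketch of the shape of the code, not the actual implementation.

use std::thread;

// Stand-in for the per-worker AF_XDP state: in the real program this holds
// the worker's UMEM, its FILL/COMPLETION rings and a socket bound to one RX queue.
struct Worker {
    rx_queue_id: usize,
}

impl Worker {
    fn new(rx_queue_id: usize) -> Self {
        // This is where a worker would carve out its share of the global memory
        // map, create its UMEM, pre-fill the FILL ring and bind an AF_XDP socket
        // to `rx_queue_id` on the load balancer interface.
        Worker { rx_queue_id }
    }

    fn run(&mut self) {
        // Placeholder for the RX -> process -> TX loop described below.
        println!("worker for RX queue {} running", self.rx_queue_id);
    }
}

fn main() {
    // One worker per CPU:RX queue pair (ethtool -l told us how many we have).
    let num_queues: usize = 4;

    let handles: Vec<_> = (0..num_queues)
        .map(|q| {
            thread::spawn(move || {
                // Pin this thread to core `q` here (e.g. with the core_affinity
                // crate) so the 1 core : 1 RX queue : 1 socket mapping holds.
                let mut worker = Worker::new(q);
                worker.run();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}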

The Processing Loop

With all the various setup out of the way the load balancing processing loop can now begin:

  • First check the AF_XDP socket’s COMPLETION ring to transfer ownership of UMEM frames from kernel-space to user-space.
  • Next receive that batch of frames on the AF_XDP socket’s RX ring.
  • For any batch of frames received, process according to the load balancer logic. Note this is all done on the frame itself, without any copying (a rough sketch of that in-place rewrite follows this list). That’s happening here (similar to previously).
  • Forward the batch of processed packets back out via the AF_XDP socket’s TX ring.
  • Transfer ownership of the processed packets / buffers back from user space via the AF_XDP socket’s FILL ring.
  • Repeat
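To give a feel for what processing “on the frame itself” means, here is a hedged sketch (not Convey’s actual rewrite logic) that edits an Ethernet + IPv4 frame in place: it swaps in a backend’s MAC and IP and recomputes the IPv4 header checksum over the mutated header. The offsets assume an untagged Ethernet frame, and a real passthrough load balancer also has to fix up the TCP checksum (its pseudo-header covers the destination IP), which is omitted here.

// Byte offsets for an untagged Ethernet + IPv4 frame.
const ETH_DST: usize = 0;                 // 6-byte destination MAC
const ETH_HDR_LEN: usize = 14;            // Ethernet header length
const IP_CSUM: usize = ETH_HDR_LEN + 10;  // 2-byte IPv4 header checksum
const IP_DST: usize = ETH_HDR_LEN + 16;   // 4-byte destination IP

// One's-complement sum of the IPv4 header's 16-bit words, with the
// checksum field itself treated as zero.
fn ipv4_checksum(header: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut i = 0;
    while i + 1 < header.len() {
        let word = if i == 10 {
            0 // the checksum field counts as zero while summing
        } else {
            u16::from_be_bytes([header[i], header[i + 1]]) as u32
        };
        sum += word;
        i += 2;
    }
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16); // fold carries back in
    }
    !(sum as u16)
}

// Rewrite the frame in place to point at the chosen backend.
fn rewrite_frame(frame: &mut [u8], backend_mac: [u8; 6], backend_ip: [u8; 4]) {
    frame[ETH_DST..ETH_DST + 6].copy_from_slice(&backend_mac);
    frame[IP_DST..IP_DST + 4].copy_from_slice(&backend_ip);

    // Recompute the IPv4 header checksum over the mutated header bytes.
    let ihl = (frame[ETH_HDR_LEN] & 0x0f) as usize * 4;
    let csum = ipv4_checksum(&frame[ETH_HDR_LEN..ETH_HDR_LEN + ihl]);
    frame[IP_CSUM..IP_CSUM + 2].copy_from_slice(&csum.to_be_bytes());
}

fn main() {
    // A zeroed 64-byte buffer with IHL=5 stands in for a real frame in UMEM.
    let mut frame = [0u8; 64];
    frame[ETH_HDR_LEN] = 0x45; // IPv4, 20-byte header
    rewrite_frame(&mut frame, [0x02, 0, 0, 0, 0, 0x01], [10, 0, 0, 2]);
    println!("new IPv4 checksum bytes: {:02x?}", &frame[IP_CSUM..IP_CSUM + 2]);
}

In the real code this mutation happens directly on the UMEM frame the RX descriptor pointed at, so nothing is copied before the frame is queued on the TX ring.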

It’s important for the packet processing step in particular to be fast since the user space program is responsible for returning frames back to the UMEM in a timely manner. There is admittedly some locking overhead for sharing things like ephemeral ports and tracking connection state, but in most cases the locks are read locks, so the overhead should be minimal.
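As a toy illustration of that read-mostly locking pattern (the names and types here are mine, not Convey’s actual data structures): lookups of existing connections only take a read lock, and only the first packet of a new connection pays for the write lock.

use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};

// Shared connection-tracking table: client address -> chosen backend.
type ConnTable = Arc<RwLock<HashMap<SocketAddr, SocketAddr>>>;

fn backend_for(table: &ConnTable, client: SocketAddr, fallback: SocketAddr) -> SocketAddr {
    // Fast path: a read lock, shared by all workers concurrently.
    if let Some(backend) = table.read().unwrap().get(&client) {
        return *backend;
    }
    // Slow path: the first packet of a new connection takes the write lock.
    let mut guard = table.write().unwrap();
    *guard.entry(client).or_insert(fallback)
}

fn main() {
    let table: ConnTable = Arc::new(RwLock::new(HashMap::new()));
    let client: SocketAddr = "192.0.2.10:51000".parse().unwrap();
    let backend: SocketAddr = "10.0.0.2:80".parse().unwrap();
    println!("backend: {}", backend_for(&table, client, backend));
}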

Final Notes

I wasn’t able to get my hands on a network driver that natively supports AF_XDP. It appears most 10Gb network cards support it, but I don’t have any of those lying around the house. The best / closest option may be the AWS ena driver, which does support XDP and the bpf_redirect_map function, but not (yet) AF_XDP. Once I’m able to get a full benchmarking environment (a test environment CloudFormation template for reference) I’ll update with the hopefully impressive numbers. I will say even running some quick load tests with an unsupported driver yielded noticeable performance gains and much less variance in latency.

The AF_XDP feature of this project is maintained in a separate branch since it requires kernel 5.4 or above: https://github.com/bparli/convey/tree/feature/xdp

Shout-out again to the afxdp crate; I was really happy to come across that.

I also recommend the xdp tutorial.
