Performance Optimisation using SO_REUSEPORT

Marten Gartner
High Performance Network Programming
5 min read · Mar 17, 2022

Hey folks,

in this article I’ll go through the process of optimising a software router by parallelising it with SO_REUSEPORT. I’ll also cover some pitfalls that need to be kept in mind when using this feature.

SO_REUSEPORT lets multiple sockets bind to the same port, allowing the kernel to distribute traffic among them without a single socket becoming a bottleneck. The Linux kernel spreads incoming traffic across the listening sockets using a hash of the 4-tuple (src/dst IP and src/dst port). The first socket must set the SO_REUSEPORT option so that subsequent sockets are allowed to bind to the same port. Simply using multiple sockets in parallel instead of a single one for network-bound operations sounds like an awesome option, but there are some pitfalls to consider. So let’s dive into an example.

How to configure SO_REUSEPORT

To test SO_REUSEPORT in a real-world example, my goal was to parallelise the SCION Border Router. SCION is a next-generation network architecture that avoids particular issues of the current Internet by design, e.g. trust issues, BGP hijacking or the lack of transparent path control. SCION Border Routers are responsible for forwarding packets between autonomous systems and therefore have to meet specific performance requirements. In the current implementation, there is a struct called DataPlane that is responsible for reading packets from all configured SCION interfaces (connecting local hosts or neighbouring Border Routers) and forwarding them to their destination. My idea was quite simple: instead of running one DataPlane in the router, let’s use SO_REUSEPORT to run multiple instances in parallel. The cool thing about the DataPlane struct is that it does not share any state with other components of the application. Consequently, it can easily be started multiple times. Since the Border Router is implemented in Go, let’s see how we can set this option in Go:
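A minimal sketch could look like the following (using net.ListenConfig together with golang.org/x/sys/unix; the port number and helper name are illustrative, not the actual Border Router code):

package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortControl sets SO_REUSEPORT on the raw socket before it is bound,
// so that further sockets can bind to the same address and port.
func reusePortControl(network, address string, c syscall.RawConn) error {
	var sockErr error
	err := c.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
	})
	if err != nil {
		return err
	}
	return sockErr
}

func main() {
	lc := net.ListenConfig{Control: reusePortControl}
	// Each DataPlane instance can open its own socket on the same port.
	conn, err := lc.ListenPacket(context.Background(), "udp", ":30042")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// ... read and forward packets on conn ...
}

The Control hook runs after the socket is created but before it is bound, which is exactly where the option has to be set.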

Parallelise the application

That’s basically everything needed to set up SO_REUSEPORT. Now we need to run multiple parallel DataPlanes by wrapping the code that runs the DataPlane in a for loop and adding some goroutines and a WaitGroup, so that all instances run concurrently and the application waits for them correctly. Sounds easy, right? There is one pitfall, though: the SCION Border Router contains control logic, called Bidirectional Forwarding Detection (BFD), that ensures links are up and running. This logic sends control packets over the links between Border Routers in both directions. Now that we have SO_REUSEPORT configured, each DataPlane sends these packets. Since the packets are all sent from the same source port, there is no guarantee on which socket they arrive. Consequently, two Border Routers can no longer synchronise with each other using BFD. However, the fix is not that complicated: we share the same BFD session information between all DataPlanes, so any of them can receive BFD packets and pass them to one controller, and only the first DataPlane sends out BFD packets. That’s it!
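As a rough sketch, the parallelisation could look like this (the dataPlane type, its fields and the run method are stand-ins for the real DataPlane code, just to illustrate the loop, goroutines and WaitGroup):

package main

import (
	"log"
	"sync"
)

// dataPlane is a stand-in for the Border Router's DataPlane struct.
type dataPlane struct {
	id      int
	sendBFD bool // only one instance should actively send BFD packets
}

func (d *dataPlane) run() error {
	// In the real router this opens the SO_REUSEPORT sockets and
	// forwards packets until shutdown.
	log.Printf("DataPlane %d running (sendBFD=%v)", d.id, d.sendBFD)
	return nil
}

func main() {
	const numDataPlanes = 10 // e.g. one per CPU core

	var wg sync.WaitGroup
	for i := 0; i < numDataPlanes; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			dp := &dataPlane{id: id, sendBFD: id == 0} // only the first DataPlane emits BFD
			if err := dp.run(); err != nil {
				log.Printf("DataPlane %d stopped: %v", id, err)
			}
		}(i)
	}
	wg.Wait()
}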

Performance benchmarks

To evaluate how the number of concurrent DataPlanes affects the forwarding performance, we run the SCION Border Router with 1–10 DataPlanes (since we have 10 CPU cores) and send around 9 Gbit/s of SCION traffic over a 10 Gbit/s link to it (using 10 different sockets with different port numbers). We measure the forwarding performance of the router with nload on the outgoing network interface. In the next figure, we show the achieved forwarding performance for different packet sizes based on the configured MTU.

We observe that the forwarding performance of the SO_REUSEPORT Border Router increases nearly linearly, starting from around 1.3 Gbit/s for 1 DataPlane (MTU 1500) up to 7.5 Gbit/s with 10 DataPlanes. Using larger frames (MTU 3000), the performance increases in a similar fashion, starting at 3 Gbit/s for 1 DataPlane and reaching up to 9 Gbit/s for 10 DataPlanes. Since 9 Gbit/s is the incoming bandwidth at the router, we conclude that it forwards nearly all incoming packets, which is fantastic for a link close to 10 Gbit/s.

What is the catch?

So you may ask: that was quite easy, where is the catch? The point is that this kind of performance increase through concurrency is only possible if the 4-tuples of the incoming packets differ, so that the traffic is distributed evenly across all listening sockets. If we consider forwarding between two SCION Border Routers, this is not the case, since all traffic between them is exchanged between the same pair of port numbers and therefore always hashes to the same socket. This is where we can use another awesome feature of the Linux kernel: the eXpress Data Path (XDP). With XDP, we can perform operations on a packet before it reaches the Linux networking stack. Even though all packets are then handled in XDP on the same CPU core, we can distribute their processing in the networking stack across different CPU cores. I could think of two ways to achieve a better distribution to sockets for packets sharing the same 4-tuple: 1) redirecting packets to other CPU cores via a CPUMAP (the bpf_redirect_map helper with a BPF_MAP_TYPE_CPUMAP), or 2) source port randomisation to create different 4-tuples for incoming packets that would otherwise share the same one. I tried the first approach in several configurations, but it ended up dropping all packets in every case. Since I wasn’t sure this helper would solve my problem exactly the way I thought it would, I moved to the second approach. In XDP, we simply need to check whether a packet is destined for a router interface and randomise its source port. One small pitfall is to recalculate or zero the UDP header checksum; otherwise packets will be dropped. But this approach works fine, at least for bandwidths that can be handled in XDP by a single core, which is far above 10 Gbit/s if the XDP program is not too complex.

Summary

To sum it up, our adaptation of the SCION Border Router to use SO_REUSEPORT for parallelisation was simple to implement and delivered a great performance improvement. For router-to-router forwarding, we need a simple XDP program to distribute packets fairly across the sockets; otherwise there is no performance increase, because all packets arrive at the same socket. The complete code can be found here. The original article was published here.

As always, please mail me any feedback you have. I really appreciate any kind of comments or additional information, and I will update this article with any helpful input I get.

Cheers,
Marten
