Getting to 40G encrypted container networking with Calico/VPP on commodity hardware
Encryption for container networking is becoming increasingly important for compliance and security reasons. Today, the vast majority of internet traffic is encrypted, and we expect Kubernetes traffic to follow soon. There are two main approaches:
- Application-level encryption, which is becoming very popular with Istio/Envoy. It typically uses TLS between containers.
- Platform-level encryption, where all the data exchanged between Kubernetes nodes is sent through encrypted IPsec or WireGuard tunnels.
Platform-level encryption has the advantage that it is simple to deploy and guarantees the confidentiality and integrity of the data exchanged between containers, regardless of how the containers themselves are configured. However, so far these solutions have shown unacceptable performance characteristics, limiting the throughput between nodes to ~1Gbps.
When we started integrating VPP in Kubernetes with Calico as a management plane, the goal was to bring the performance of VPP and the flexibility of userspace networking to containers. With its unrivalled IPsec performance, encryption was clearly an area where VPP would be able to help. Without further ado, here is the encrypted throughput we achieved between two pods on a 40G network:
This is a recording of a one-minute iperf test between two pods. With a 1500-byte MTU configured in the containers (1448 bytes of TCP payload per packet), and accounting for the IPsec header overhead, this is actually 39.0 Gbit/s of traffic on the wire, which is almost line rate. Read on to find out how we got there…
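The relation between goodput and on-wire rate can be checked with a rough per-packet accounting. The sketch below is not taken from the measurement itself: it assumes an IPv4 outer header without options, standard ESP framing for AES-GCM (8-byte ESP header, 8-byte IV, 16-byte ICV, padding to a 4-byte boundary), and that only the 14-byte Ethernet header is counted on the wire (no FCS, preamble or inter-frame gap). The 36 Gbit/s goodput is the figure quoted later in this post. Under those assumptions, the result lands close to the 39.0 Gbit/s stated above.

```python
# Rough per-packet size accounting for an ESP-protected IPIP packet.
# All sizes are in bytes. Values marked "assumption" are not from the post.
TCP_PAYLOAD = 1448       # from the text: TCP payload per packet
INNER_IP_PACKET = 1500   # from the text: MTU configured in the containers
OUTER_IPV4 = 20          # assumption: outer IPv4 header without options
ESP_HEADER = 8           # SPI (4) + sequence number (4)
ESP_IV = 8               # explicit IV used by AES-GCM in ESP
ESP_TRAILER = 2          # pad length (1) + next header (1)
ESP_ICV = 16             # assumption: full 16-byte GCM authentication tag
ETHERNET_HEADER = 14     # assumption: untagged frame, FCS/preamble/gap ignored


def esp_packet_size(inner_len: int) -> int:
    """Size of the outer IP packet carrying `inner_len` bytes inside ESP."""
    # ESP pads so that payload + padding + trailer is a multiple of 4 bytes.
    padding = -(inner_len + ESP_TRAILER) % 4
    return (OUTER_IPV4 + ESP_HEADER + ESP_IV
            + inner_len + padding + ESP_TRAILER + ESP_ICV)


def wire_rate_gbps(goodput_gbps: float) -> float:
    """Scale TCP goodput by the ratio of frame bytes to TCP payload bytes."""
    frame = ETHERNET_HEADER + esp_packet_size(INNER_IP_PACKET)
    return goodput_gbps * frame / TCP_PAYLOAD


if __name__ == "__main__":
    print(f"outer IP packet: {esp_packet_size(INNER_IP_PACKET)} bytes")
    print(f"on-wire rate for 36 Gbit/s goodput: {wire_rate_gbps(36.0):.2f} Gbit/s")
```

Counting the Ethernet FCS, preamble and inter-frame gap as well would give a somewhat higher on-wire figure, so the exact number depends on which layers are included.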
All of our tests were performed on two Cisco C220M5 UCS servers, equipped with Intel Xeon Gold 6146 CPUs running at 3.2GHz, and Intel XL710 NICs with 2x40G ports. Only one port of each NIC was used for the experiments. The two servers’ NICs were directly connected to each other. All the throughput tests were run manually using iperf3 from a pair of containers, linked to VPP through tap interfaces, with VPP driving a virtual function on the NIC with the native VPP driver for XL710 (aka the AVF driver). The IPsec tunnels used the AES-GCM-256 encryption algorithm.
Fast VPP / Linux integration
Because VPP is a regular userspace process, it has always been possible to run it in a container, as we do in the Calico integration. However, there was room for performance improvement in the way VPP exchanged packets with the kernel.
When it is running as a virtual router for containers in Kubernetes, VPP needs to provide kernel interfaces to the containers so that applications can run unmodified. In Linux, the simplest way for an application to inject packets into a kernel interface and receive packets sent to that interface is a TUN/TAP interface. The easiest way to use one is to open a tun file descriptor, which creates the interface in the kernel, and to use the read()/write() syscalls on this file descriptor to receive/send packets on that interface. However, as this requires a context switch for every packet exchanged, it is extremely slow.
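To make the per-packet cost concrete, here is a minimal sketch of the plain file-descriptor approach described above. It is not VPP code. The interface name `tap-demo` and the helper names are hypothetical; the ioctl number and flag values are the ones defined in the Linux `linux/if_tun.h` header. Creating the interface requires root (or `CAP_NET_ADMIN`), so nothing is opened at import time.

```python
import fcntl
import os
import struct

# Constants from <linux/if_tun.h>
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002      # Ethernet-level (L2) interface
IFF_NO_PI = 0x1000    # no extra packet-information header on each frame


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack a 40-byte struct ifreq: 16-byte name, 16-bit flags, padding."""
    return struct.pack("16sH22x", name.encode(), flags)


def open_tap(name: str = "tap-demo") -> int:
    """Create (or attach to) a tap interface and return its file descriptor."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TAP | IFF_NO_PI))
    return fd


def echo_frames(fd: int, count: int) -> None:
    """Read `count` frames and write each one back unchanged.

    Every frame costs one read() and one write() syscall, which is the
    per-packet overhead the text refers to.
    """
    for _ in range(count):
        frame = os.read(fd, 65535)   # one syscall per received frame
        os.write(fd, frame)          # one syscall per sent frame
```

Calling `echo_frames(open_tap(), 10)` as root would bounce ten frames back to the kernel; the point is only that the syscall count grows linearly with the packet count.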
The first improvement we made was to use the virtio backend for tap interfaces. Initially developed to accelerate virtual machines, this kernel technology allows an application to exchange packets with a tap interface through shared memory regions. The availability of packets is signaled through file descriptors. By removing the context switch and a packet copy in each direction, this backend made tap interfaces much faster. With this, the unencrypted throughput between two containers was around 6Gbit/s. Not too bad, but still very far behind a simple veth pair. At this point, the bottleneck was the kernel vhost thread responsible for sending and receiving packets on the virtio queue.
In order to reduce the per-packet processing cost, the Linux kernel makes extensive use of segmentation offloads. These features work by delaying the segmentation of sent packets, or by coalescing received packets early, so that most (or all, if the interface implements the offload) of the kernel stack is traversed by a single large buffer. This amortizes the per-packet processing cost over a large amount of data, greatly increasing the throughput. VPP now supports GSO, so that Linux can send VPP up to 64kB of data in a single packet over a tap interface. This large packet is segmented into smaller packets just before being sent on an interface that doesn’t support GSO. For instance, if IPsec is enabled, then the packet is segmented before entering the IPsec tunnel. If the outgoing interface supports GSO (for instance if it is another tap, or a physical interface that implements GSO), then the packet is not segmented and is sent as-is. On the receive side, VPP now supports GRO, where sequential TCP packets are coalesced into a single large TCP packet of up to 64kB before being passed to Linux. These two optimizations together brought the throughput to 30Gbit/s for a single flow, a 5x improvement over the previous result.
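The idea behind both offloads can be illustrated with a toy model. This is a simplified sketch, not how VPP or the kernel implement GSO/GRO: it only tracks a TCP sequence number and a payload, and ignores headers, checksums, flags and timers. The `Segment` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes


def gso_segment(seq: int, payload: bytes, mss: int = 1448) -> list[Segment]:
    """Split one large buffer into MSS-sized segments (late segmentation)."""
    return [Segment(seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]


def gro_coalesce(segments: list[Segment], max_size: int = 65536) -> list[Segment]:
    """Merge in-order, contiguous segments into buffers of up to max_size."""
    merged: list[Segment] = []
    for seg in segments:
        last = merged[-1] if merged else None
        contiguous = (last is not None
                      and last.seq + len(last.payload) == seg.seq)
        if contiguous and len(last.payload) + len(seg.payload) <= max_size:
            last.payload += seg.payload          # extend the current buffer
        else:
            merged.append(Segment(seg.seq, seg.payload))  # start a new one
    return merged


if __name__ == "__main__":
    data = bytes(64000)
    wire = gso_segment(1000, data)
    print(len(wire), "segments on the wire")
    print(len(gro_coalesce(wire)), "buffer(s) after coalescing")
```

In this model the stack on each side handles one large buffer instead of dozens of small packets, which is where the amortization comes from.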
The last improvement we made was to add VPP support for multiple queues on tap interfaces. Like the virtio backend, this feature was initially added for virtual machines. It makes it possible to have multiple cores on Linux and VPP working on different flows in parallel. This is extremely useful to scale the performance with the number of concurrent flows and available cores. At this point, using several flows in parallel with iperf, we reached throughputs of 50Gbit/s, and the iperf server became the bottleneck, saturating one CPU core.
Once we got enough throughput with tap interfaces to saturate a 40G link, the next step was to add IPsec encryption. As Calico supports IPIP tunneling between the nodes, and in VPP IPsec tunnels can simply be configured by protecting IPIP tunnels, we added a simple configuration switch to enable IPsec on all the IPIP tunnels required by Calico. With this configuration, the throughput between containers reached 12Gbps. This is 10 times faster than the existing Linux kernel solutions, and enough to saturate a 10G link with a single VPP worker thread, but we wanted to go further than that.
Currently, the VPP IPsec implementation does all the processing for one tunnel in one direction on a single VPP worker thread. This means that configuring one more worker allows us to gain a bit of performance, as it can process the ingress and the egress on different threads. This brings the throughput to 16Gbps, but we couldn’t go further simply by adding more workers.
In order to reach higher throughputs between two nodes, we resorted to configuring several IPsec tunnels between the nodes, and load balancing between them using ECMP. This required configuring additional addresses on the physical interfaces, as two tunnels cannot share the same source and destination addresses. With this setup, using 4 VPP worker threads, 4 tap queues, 4 IPsec tunnels, and iperf running with 8 parallel connections, we reached our goal with the number announced above: 36Gbps of TCP goodput (39Gbps on the wire).
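The sketch below illustrates the two constraints in this setup: each tunnel needs a distinct (source, destination) pair, and ECMP keeps every flow on a single tunnel by hashing its 5-tuple. It is an illustration only. The addresses come from the documentation range, the way pairs are derived from the extra addresses is an assumption, and the hash is a stand-in rather than the function VPP actually uses.

```python
import hashlib
from itertools import product


def tunnel_endpoints(local_addrs, remote_addrs, count):
    """Build `count` tunnels, each with a unique (source, destination) pair."""
    pairs = list(product(local_addrs, remote_addrs))[:count]
    if len(pairs) < count:
        raise ValueError("not enough addresses for the requested tunnel count")
    return pairs


def pick_tunnel(flow, n_tunnels):
    """Map a flow 5-tuple to a tunnel index; the same flow always maps the same way."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_tunnels


if __name__ == "__main__":
    # Hypothetical addressing: two addresses per node give four distinct pairs.
    tunnels = tunnel_endpoints(["192.0.2.1", "192.0.2.2"],
                               ["192.0.2.11", "192.0.2.12"], 4)
    # Eight parallel TCP connections, differing only in source port.
    flows = [("10.0.1.5", "10.0.2.7", 6, 40000 + i, 5201) for i in range(8)]
    for flow in flows:
        print(flow[3], "->", tunnels[pick_tunnel(flow, len(tunnels))])
```

Because the mapping is per flow, a small number of connections may land unevenly on the tunnels, which is one reason to run more parallel connections than there are tunnels.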
VPP is a fast-moving project and we are working on multiple features to improve performance. The most significant upcoming features include support for Linux AF_XDP, to pass to VPP the traffic that it needs to process more seamlessly, and persistent tap interfaces, making it possible to upgrade the entire container networking stack without stopping containers. We are also working on a better NAT implementation to support similar throughputs through service addresses. Finally, VPP 20.05 is introducing asynchronous internal crypto APIs that will be leveraged to increase the maximum throughput achievable over a single IPsec tunnel, removing the need for multiple tunnels. More to come soon…