We removed Shared-Memory by building an eBPF Load-Balancer!
As a leading content delivery network (CDN) service, ArvanCloud processes hundreds of thousands of requests per second for tens of thousands of unique domains. Every request passes through a range of modules, depending on the configuration of its domain. Managing such a high volume of information is no small feat, and ArvanCloud has responded with an architecture and a set of solutions designed to keep request processing as efficient as possible.
To optimize resource usage and achieve high performance, nginx is configured with one worker process per logical CPU core, and these workers operate independently of each other. Each worker has its own independent accept queue and is responsible for processing a request from start to finish: establishing the connection, sending the response, and closing the connection. This approach keeps websites and web services fast, efficient, and able to handle high volumes of traffic.
Nginx workers open separate sockets with the SO_REUSEPORT option, so every packet is directed to one of the workers based on a hash that, by default, the Linux kernel calculates from the source and destination IP addresses and ports of that packet.
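As a rough illustration (a minimal sketch in C, not taken from the nginx source), this is what opening a per-worker listening socket with SO_REUSEPORT looks like; once several workers have bound such sockets to the same port, the kernel spreads new connections among them using its default hash:

```c
// Minimal sketch (not the nginx source): each worker process opens its own
// listening socket on the same port with SO_REUSEPORT, and the kernel then
// spreads incoming connections across these sockets with its default hash.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

int open_reuseport_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); exit(1); }

    int one = 1;
    /* Allow several sockets (one per worker) to bind to the same port. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        perror("setsockopt(SO_REUSEPORT)");
        exit(1);
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
    if (listen(fd, SOMAXCONN) < 0) { perror("listen"); exit(1); }
    return fd;
}
```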
Every module requires specific data to process each request, and this data needs to be shared between all or some of the workers. Much of this data, such as configurations, is read-intensive and can be shared between workers without blocking, using the Read-Copy-Update (RCU) mechanism. However, other types of data, such as rate-limit counters, may change frequently over short periods of time, making them difficult to share efficiently.
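For the read-mostly case, the pattern looks roughly like the sketch below. It uses the userspace RCU library (liburcu) in a single multi-threaded process purely for brevity; nginx workers are separate processes, so the real sharing happens over shared memory, but the read/publish pattern is the same. The struct and field names are assumptions for illustration.

```c
// Minimal sketch of the read-mostly pattern (assumed, not ArvanCloud code),
// using the userspace RCU library (liburcu, link with -lurcu). Readers take
// a lock-free snapshot of the current config; a writer publishes a new
// version and frees the old one only after all readers are done with it.
#include <stdlib.h>
#include <urcu.h>                      /* rcu_read_lock(), synchronize_rcu(), ... */

struct domain_config {
    int rate_limit_rps;                /* example field, assumed */
};

static struct domain_config *current_config;   /* RCU-protected pointer */

/* Reader side: runs on every request and never blocks on the writer.
 * Each reader thread must have called rcu_register_thread() beforehand. */
int read_rate_limit(void)
{
    rcu_read_lock();
    struct domain_config *cfg = rcu_dereference(current_config);
    int rps = cfg ? cfg->rate_limit_rps : 0;
    rcu_read_unlock();
    return rps;
}

/* Writer side (single writer assumed): publish a new config, then reclaim
 * the old one once every pre-existing reader has left its critical section. */
void update_rate_limit(int new_rps)
{
    struct domain_config *fresh = malloc(sizeof(*fresh));
    struct domain_config *old = current_config;

    fresh->rate_limit_rps = new_rps;
    rcu_assign_pointer(current_config, fresh);

    synchronize_rcu();                 /* wait for readers of the old snapshot */
    free(old);
}
```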
Suppose a client with the IP address 192.51.100.1 sends a request. Requests arriving from different source ports (as browsers do when opening multiple connections) produce different hashes and are therefore directed to different workers. Each worker is a separate, independent process that includes all the necessary modules, so when requests from the same client land on different workers, we need a mechanism to share data between them. For instance, in the rate-limit module we have to share the counters that hold the number of requests from a specific IP within a period of time.
Initially, we tried using Redis and similar tools to store this data, but the performance was not satisfactory. To address this, we switched to shared memory, which improved performance, but it still wasn't optimal. In fact, during the early stages of a DDoS attack, this mutable data changes very rapidly, and contention on the shared memory region affects every request, regardless of the shared-memory design (blocking or lock-free), which reduces efficiency.
As mentioned earlier, the kernel's default load-balancing strategy, which hashes the source and destination IP addresses and ports, spreads malicious requests across different workers during a DDoS attack, so all of the workers end up contending for the same shared memory.
The next idea was to change the load-balancing mechanism itself, in order to optimize hardware utilization and eliminate the need for shared memory. This can be done by attaching an eBPF program to the group of reuseport sockets, supported for UDP since kernel 4.5 and for TCP since kernel 4.6, which allowed us to replace the kernel's default load-balancing behavior.
We decided to distribute packets among workers based on a unique fingerprint, so that clients with the same IP address would always connect to the same worker. We also eliminated shared memory from all security modules, and to optimize hardware utilization, we separated processing tasks from logical ones.
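With clients pinned to workers by their IP, each worker can keep its rate-limit counters in plain process-local memory, with no locks or shared memory at all. The following is a minimal sketch of such a worker-local counter table; the bucket count, window length, and collision handling are assumptions, not ArvanCloud's actual module.

```c
// Minimal sketch (assumed, not ArvanCloud's rate-limit module): once every
// client IP is pinned to one worker, per-IP request counters can live in
// ordinary process-local memory, with no locks or shared memory.
#include <stdint.h>
#include <time.h>

#define BUCKETS        65536           /* assumed table size */
#define WINDOW_SECONDS 60              /* assumed counting window */

struct ip_counter {
    uint32_t ip;                       /* client IPv4 address, host order */
    uint32_t count;                    /* requests seen in the current window */
    time_t   window_start;
};

static struct ip_counter counters[BUCKETS];   /* worker-local, never shared */

/* Record one request from `ip` and return how many requests this IP has made
 * in the current window (collision handling is omitted for brevity). */
uint32_t count_request(uint32_t ip, time_t now)
{
    struct ip_counter *c = &counters[ip % BUCKETS];

    if (c->ip != ip || now - c->window_start >= WINDOW_SECONDS) {
        c->ip = ip;
        c->count = 0;
        c->window_start = now;
    }
    return ++c->count;
}
```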
We modified the nginx source code and wrote a patch to attach our eBPF program to its listening sockets.
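The user-space side of such a patch might look like the hypothetical sketch below: load the eBPF object with libbpf, register every worker's listening socket in the map, and attach the program to the reuseport group with SO_ATTACH_REUSEPORT_EBPF. The object file name, program name, and loading flow are assumptions; the actual nginx patch may wire this up differently.

```c
// Hypothetical sketch of the user-space side (not the actual nginx patch):
// load the eBPF object with libbpf, register every worker's listening socket
// in reuse_sockmap, and attach the program to the reuseport group.
#include <stdint.h>
#include <sys/socket.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#ifndef SO_ATTACH_REUSEPORT_EBPF
#define SO_ATTACH_REUSEPORT_EBPF 52    /* from <asm-generic/socket.h> */
#endif

int attach_reuseport_balancer(const int *listen_fds, uint32_t nworkers)
{
    /* Object and program names are assumptions for this sketch. */
    struct bpf_object *obj = bpf_object__open_file("reuseport_lb.bpf.o", NULL);
    if (!obj || bpf_object__load(obj))
        return -1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "select_by_source_ip");
    struct bpf_map *map = bpf_object__find_map_by_name(obj, "reuse_sockmap");
    if (!prog || !map)
        return -1;

    int prog_fd = bpf_program__fd(prog);
    int map_fd = bpf_map__fd(map);

    /* One slot per worker: the eBPF program later selects an index here. */
    for (uint32_t i = 0; i < nworkers; i++) {
        uint64_t sock_fd = listen_fds[i];
        if (bpf_map_update_elem(map_fd, &i, &sock_fd, BPF_ANY))
            return -1;
    }

    /* Attaching to one socket of the reuseport group is enough: from now on
     * the program decides which socket receives each new connection. */
    if (setsockopt(listen_fds[0], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                   &prog_fd, sizeof(prog_fd)))
        return -1;

    return 0;
}
```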
We keep an array of the listening sockets in reuse_sockmap, which is initialized every time nginx starts. During the socket-lookup phase, each incoming connection reaches this eBPF program and is directed to one of the sockets based on our hash function.
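For reference, an SK_REUSEPORT program of this kind can look like the minimal sketch below. The map name reuse_sockmap follows the text, but the hash (client source IP modulo the number of workers) and the worker count are assumptions, not ArvanCloud's actual fingerprint function.

```c
// Minimal sketch of an SK_REUSEPORT eBPF program (assumed, not ArvanCloud's
// code): it hashes only the client's source IPv4 address, so every
// connection from one IP is handed to the same worker socket.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NUM_WORKERS 32                  /* assumed number of nginx workers */

struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, NUM_WORKERS);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u64));
} reuse_sockmap SEC(".maps");

SEC("sk_reuseport")
int select_by_source_ip(struct sk_reuseport_md *reuse_md)
{
    struct iphdr ip;
    __u32 key;

    /* Only handle IPv4 here; anything else falls back to the kernel's
     * default reuseport selection. */
    if (reuse_md->eth_protocol != bpf_htons(ETH_P_IP))
        return SK_PASS;

    /* Read the IP header relative to the network header. */
    if (bpf_skb_load_bytes_relative(reuse_md, 0, &ip, sizeof(ip),
                                    BPF_HDR_START_NET))
        return SK_DROP;

    /* Choose a worker from the source address only, so all connections
     * from the same client IP map to the same socket. */
    key = bpf_ntohl(ip.saddr) % NUM_WORKERS;

    if (bpf_sk_select_reuseport(reuse_md, &reuse_sockmap, &key, 0))
        return SK_DROP;

    return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```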
We benchmarked two cases:
- Using shared memory
- Using our eBPF load balancer and the new load-distribution method
In the second case, there is no noticeable increase in response time as the number of requests grows.
Conclusion
In general, if you provide a service at the scale of a CDN, you should have strategies to distribute the load evenly across CPU cores and monitor that distribution constantly. Long-lived TCP sessions can cause significant load imbalance, and if you modify the load-balancing mechanism or the kernel's polling behavior, you should be doubly concerned about the uniformity of this distribution and have solutions ready for different scenarios. Additionally, these changes should not disrupt other protocols, such as QUIC, in the future.