Scaling for a twentyfold device increase in a month
Just before the start of summer, a prospect in the education sector decided to use our DNS-based solution to protect their students and their devices from undesired or potentially harmful content.
As usual in the education sector, the customer wanted the devices to be ready by the beginning of the school year. The deployment was also large, larger than the entire existing global footprint of our latest DNS architecture. We were facing not only a tight deadline but also an interesting scaling challenge.
We started a project to find potential bottlenecks and to determine suitable scaling steps to meet the increased demand. In this blog post, I will summarize the most crucial steps we took to make sure we were ready for the increased load and to make the deal a success.
A high-level overview of our solution
Our solution is based on a microservice architecture deployed via Kubernetes, with several shared components (mainly databases). Services focused on configuring policies and processing metadata for reports are located in our Core cluster. Services focused on DNS request processing are globally distributed across our Edge clusters. Edge clusters ensure a lightning-fast user experience by keeping devices' DNS request latency as low as possible.
Use case analysis
Based on the customer's use cases, we prepared a visualisation of the components impacted by each use case (device DNS requests, device enrollment, updating device-specific policies, etc.). To detect potential issues caused by the increase in load, we conducted risk-storming sessions. These sessions helped us discuss potential impacts and risks with our SMEs and prioritise our remediation efforts accordingly.
Load testing preparation
Our typical approach to scaling is based on observing metrics for the individual components. This works well for gradual changes in load on our infrastructure, but it is much harder to apply to sudden spikes, as one component can influence the behavior of another more significantly than expected.
Due to the hard deadline, we had limited time to first figure out what needed to be scaled, and to what level, and then to fix any bottlenecks we found. We decided to squeeze the investigation into a single month to leave enough room to make the necessary updates in production in time for the customer rollout. It became clear that we couldn't cover every component involved in the customer's use cases via load testing, so we decided to focus the load testing on the most critical parts of the system and use other means to determine the remaining scaling needs.
DNS request processing was recognised as the part most critical from the customer's perspective and also the most demanding on our infrastructure, so that's where we focused our load testing efforts.
Before preparing the testing environment and starting the load tests, we had to estimate the load we needed to simulate in order to be confident in our solution. We based the estimate on service metrics during peak hours; the result was that we should expect more than 30k requests per second hitting our DNS gateways and the underlying services. This defined the target our load testing tooling had to reach to let us confidently tweak and observe our infrastructure under the expected load.
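Purely for illustration, the sizing logic boils down to something like the snippet below. The input numbers are made up, not the real customer figures; the actual target of more than 30k requests per second came from our peak-hour metrics.

```java
// Back-of-envelope capacity target; all inputs here are illustrative placeholders.
public class LoadTarget {
    public static void main(String[] args) {
        long currentPeakRps   = 1_500; // hypothetical current peak DNS requests/s
        long growthMultiplier = 20;    // the device fleet grows roughly twentyfold
        double headroom       = 1.2;   // safety margin for spikes and retries

        long targetRps = (long) (currentPeakRps * growthMultiplier * headroom);
        System.out.println("Load test target: " + targetRps + " req/s"); // 36,000 req/s
    }
}
```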
Performance testing tool
The load tests we had prior to this project were written using Gatling and executed from a single client. It was obvious we needed a more scalable solution that would allow us to generate a larger number of parallel requests. After evaluating several tools, we decided to stick with Gatling, distributed across multiple instances with custom orchestration of the executions. Details about our experience with Gatling and the lessons learned are a story for another blog post.
Performance testing environment
As explained before, the most critical feature to test was DNS request processing. The components handling it are located in the edge clusters, with some asynchronous reporting to the core cluster (e.g. the last observed device activity). As we are experienced in building edge clusters, we decided to build a new, separate edge cluster dedicated to performance testing, which can be started and scaled on demand.
Due to the strict time limitations mentioned earlier, we decided against building a dedicated core cluster and instead connected the new edge cluster to our existing development core cluster. To prevent interference with other workloads in the core cluster, we used specialized configurations to terminate the processing at the edge cluster level, with an option to fully connect it to the core cluster when needed.
Load testing
While the main focus was on DNS request processing, other load tests were executed as well using the newly created testing environment. We had to define the Gatling load test scenarios carefully to simulate actual device behavior.
The testing was conducted by increasing the generated load in several steps, with fixes and scaling along the way, until we reached the defined target; a sketch of such a stepped scenario is shown below.
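To give an idea of what such a scenario can look like, here is a trimmed-down sketch in Gatling's Java DSL. It assumes the gateways accept DNS-over-HTTPS; the endpoint, the query payload, and the injection numbers are placeholders rather than our production values, and in reality the generated rate is split across many distributed Gatling instances.

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

// Illustrative DNS-over-HTTPS load scenario; names and numbers are placeholders.
public class DnsGatewaySimulation extends Simulation {

    // Minimal wire-format query for "example.com" type A; a real scenario would
    // vary the queried names to better mimic device traffic.
    private static final byte[] QUERY = {
            0x12, 0x34,                   // transaction id
            0x01, 0x00,                   // flags: recursion desired
            0x00, 0x01, 0, 0, 0, 0, 0, 0, // one question, no other records
            7, 'e', 'x', 'a', 'm', 'p', 'l', 'e', 3, 'c', 'o', 'm', 0,
            0, 1, 0, 1                    // QTYPE=A, QCLASS=IN
    };

    HttpProtocolBuilder protocol = http
            .baseUrl("https://dns-gateway.example.com")
            .shareConnections(); // devices keep long-lived connections, so the test does too

    ScenarioBuilder devices = scenario("device DNS requests")
            .exec(http("resolve")
                    .post("/dns-query")
                    .header("Content-Type", "application/dns-message")
                    .body(ByteArrayBody(QUERY)));

    {
        setUp(devices.injectOpen(
                // Step the load up instead of one big spike: hold each level,
                // fix and re-scale what breaks, then move on to the next level.
                incrementUsersPerSec(5_000)
                        .times(6)
                        .eachLevelLasting(300)
                        .startingFrom(5_000)))
                .protocols(protocol);
    }
}
```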
We organized a meeting with everyone participating in the load testing, which enabled us to evaluate results in real time and share ideas for improvements very quickly. As the testing environment was shared, the meeting also helped us synchronize test executions and environment changes, lowering the turnaround time between individual load test runs.
Discovered significant limiting factors & how we addressed them
Load balancer
The load balancer is responsible for routing requests to the appropriate Kubernetes resources and for TLS termination.
Originally we used Traefik behind an AWS Classic Load Balancer, but this setup started to fail even before we managed to reach the desired request rates. Due to the discovered issues, we replaced the Classic Load Balancer with an AWS Network Load Balancer for L4 routing, while the additional L7 load balancing and TLS termination remained with Traefik.
As Traefik would have had to be scaled dramatically for our purposes, we decided it was not actually necessary, since we were able to move the TLS termination directly to the AWS Network Load Balancer (NLB).
Removing Traefik from the DNS request processing pipeline gave us a great performance improvement, but certificate management became rather cumbersome. Moreover, with only an L4 load balancer it became non-trivial to achieve an even distribution of the request load across our DNS gateway instances. We therefore introduced another variant using an AWS Application Load Balancer (ALB), which solved the downsides of the NLB.
We compared the last three solutions: both the ALB and NLB variants improved response times by almost 25% compared to the Traefik one, and CPU usage was significantly reduced, as shown in the picture below. We picked the ALB as it fulfilled our requirements and provided greatly improved performance.
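In a Kubernetes setup this kind of change is usually driven by a controller from annotations rather than by hand-written API calls, but the AWS SDK sketch below illustrates what it amounts to: an HTTPS listener on the ALB that terminates TLS with an ACM certificate and forwards decrypted traffic to the DNS gateway target group. The ARNs and the SSL policy are placeholders, not our actual configuration.

```java
import software.amazon.awssdk.services.elasticloadbalancingv2.ElasticLoadBalancingV2Client;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.Action;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.ActionTypeEnum;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.Certificate;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.CreateListenerRequest;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.ProtocolEnum;

public class AlbTlsTermination {
    public static void main(String[] args) {
        try (ElasticLoadBalancingV2Client elb = ElasticLoadBalancingV2Client.create()) {
            // HTTPS listener on the ALB: TLS is terminated here with an ACM
            // certificate, so Traefik is no longer in the request path.
            elb.createListener(CreateListenerRequest.builder()
                    .loadBalancerArn("arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/edge-alb/<id>")
                    .protocol(ProtocolEnum.HTTPS)
                    .port(443)
                    .sslPolicy("ELBSecurityPolicy-TLS13-1-2-2021-06")
                    .certificates(Certificate.builder()
                            .certificateArn("arn:aws:acm:<region>:<account>:certificate/<id>")
                            .build())
                    // Decrypted requests are forwarded, and evenly balanced,
                    // across the DNS gateway instances' target group.
                    .defaultActions(Action.builder()
                            .type(ActionTypeEnum.FORWARD)
                            .targetGroupArn("arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/dns-gateway/<id>")
                            .build())
                    .build());
        }
    }
}
```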
Device activity reporting
Device activity reporting is a simple periodic propagation of the timestamp of the last observed network activity for each device, used for reporting purposes. It is delivered as an Apache Kafka message sent from the edge cluster to the core cluster and is ultimately stored in our MongoDB. This allows the customers' admins to easily check which of their devices are actively protected by our solution.
While processing these messages, we observed that updating the database immediately after each record caused unnecessary overhead, resulting in high CPU usage and thus slower processing. By switching to Apache Kafka batch processing combined with MongoDB bulk operations, we managed to dramatically reduce CPU usage, as visible in the pictures below, and greatly increased message throughput.
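For illustration, a minimal sketch of the batched approach is shown below. This is not our actual service code: the topic, collection, and field names are hypothetical and the real pipeline has more error handling, but it shows the shape of the change, one MongoDB bulkWrite per polled Kafka batch instead of one update per message.

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.WriteModel;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class DeviceActivitySink {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "device-activity-sink");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
        props.put("max.poll.records", "5000"); // consume in large batches

        MongoCollection<Document> devices = MongoClients.create("mongodb://mongo:27017")
                .getDatabase("reporting").getCollection("device_activity");

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("device-activity"));
            while (true) {
                ConsumerRecords<String, Long> batch = consumer.poll(Duration.ofSeconds(1));
                if (batch.isEmpty()) continue;

                // One bulk write per polled batch instead of one update per record.
                List<WriteModel<Document>> updates = new ArrayList<>();
                for (ConsumerRecord<String, Long> record : batch) {
                    updates.add(new UpdateOneModel<>(
                            new Document("_id", record.key()),
                            // $max keeps the newest timestamp even if messages arrive out of order.
                            new Document("$max", new Document("lastSeen", record.value())),
                            new UpdateOptions().upsert(true)));
                }
                devices.bulkWrite(updates);
                consumer.commitSync();
            }
        }
    }
}
```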
Risk storming
As it wasn't possible to fully cover all the components via load testing, we conducted multiple risk-storming sessions. For each identified risk, we clarified its impact and planned possible remediation actions. An example of the output from one of these sessions can be seen in the picture below.
In one of the risk-storming sessions, we came up with the idea of spreading the actual DNS request load from a specific region across multiple nearby data centers using weighted routing.
Weighted routing allows us to spread the load originating from the customer's region across multiple data centers located close by. This not only spreads the load across multiple edge clusters but also reduces the cost of the failover solution in case of a single data center outage: fewer requests need to be failed over than if all requests from the region ended up in the same data center, as would happen with latency-based routing. Latency-based routing is the standard approach, routing requests to the closest available location from the network latency perspective, but it sends all requests from a region to the same place; weighted routing gives us more configuration options in this regard.
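As a sketch, and assuming an Amazon Route 53-style weighted record set (the hosted zone, record names, targets, and the 70/30 split are all placeholders), the configuration could look roughly like this: two records share the same name, and resolvers are steered to two nearby data centers in a fixed ratio.

```java
import software.amazon.awssdk.services.route53.Route53Client;
import software.amazon.awssdk.services.route53.model.Change;
import software.amazon.awssdk.services.route53.model.ChangeAction;
import software.amazon.awssdk.services.route53.model.ChangeBatch;
import software.amazon.awssdk.services.route53.model.ChangeResourceRecordSetsRequest;
import software.amazon.awssdk.services.route53.model.RRType;
import software.amazon.awssdk.services.route53.model.ResourceRecord;
import software.amazon.awssdk.services.route53.model.ResourceRecordSet;

public class WeightedRouting {
    public static void main(String[] args) {
        try (Route53Client route53 = Route53Client.create()) {
            // Two weighted records for the same name: ~70% of resolutions go to
            // the nearest data center, ~30% to a neighbouring one.
            route53.changeResourceRecordSets(ChangeResourceRecordSetsRequest.builder()
                    .hostedZoneId("Z0000000EXAMPLE")
                    .changeBatch(ChangeBatch.builder().changes(
                            upsert("dns.example.com", "dc-primary", "edge-dc-primary.example.com", 70L),
                            upsert("dns.example.com", "dc-secondary", "edge-dc-secondary.example.com", 30L)
                    ).build())
                    .build());
        }
    }

    private static Change upsert(String name, String id, String target, long weight) {
        return Change.builder()
                .action(ChangeAction.UPSERT)
                .resourceRecordSet(ResourceRecordSet.builder()
                        .name(name)
                        .type(RRType.CNAME)
                        .ttl(60L)
                        .setIdentifier(id) // distinguishes the weighted variants of the same name
                        .weight(weight)
                        .resourceRecords(ResourceRecord.builder().value(target).build())
                        .build())
                .build();
    }
}
```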
Conclusion
As a result of this journey, we have a much better understanding of the components involved in handling DNS requests at scale. Moreover, we have learned how to scale massively within a single region, which gives us great confidence for further scaling globally.
The customer rollout has been a big success so far, heavily influenced by several factors:
- Close cooperation of all the involved teams with clearly defined priorities and responsibilities.
- Having our DNS gateway already rewritten in Go, allowing us to scale more efficiently.
- Spreading the load across multiple data centers using weighted routing.
- Optimizations based on results from the load testing exercises (e.g. the increased use of MongoDB bulk operations).
- The enrollment pace, in the end, wasn't a single spike but was spread over a longer period of time. This gave us additional time to include more improvements, such as horizontal pod autoscaling and removal of the Fluentd sidecar, making our solution even more cost-efficient.
Additionally, we now have an edge performance testing environment which we can start on demand to test other use cases as well.
Thanks to everyone involved in making the deal a big engineering success. Special thanks to Radovan Babic, Martin Pavlik, and Brian Stewart for the blog post reviews.