Scaling to billions of requests on top of AWS EKS

A brief introduction

At FollowAnalytics, we help companies understand how their users behave in their mobile and web apps, and retain them through highly targeted marketing campaigns using push and in-app notifications. Delivery is scheduled or automated based on many levels of granularity, including device information, general app activity, specific actions taken and pages viewed at any point in time, app crashes, and much more.

To gather that level of information and better segment the app audience, we ingest a huge amount of data generated by millions of mobile devices and web browsers in France and several other countries around the world. That way our platform can learn a bit about every single user interacting with our clients' apps and then target the best users for the type of campaign being sent, leading to much higher conversion.

Some numbers

Over the last few months, we handled billions of requests, helping our clients perform better analytics on the data generated by their users as they interact with their mobile apps. According to several pieces of research, analytics is a source of competitive advantage, and our clients know that.

We served more than 20 billion requests over the last 5 months. At night we sometimes handle between 5k and 20k requests per second, with an average of 1 million requests per minute.

Number of requests per second reaching our AWS ALB

To make it even more challenging, some of our clients send millions of push notifications within a period of 1 or 2 minutes. Our infrastructure is able to send up to 100 million pushes within a minute, which is one of the most efficient push delivery rates in the market so far, considering the published numbers. According to studies, on average 5 to 10 percent of the devices receiving a notification will open the app in the next few minutes. As a result, we experience huge spikes on our platform when our clients target millions of devices, and we have to be able to scale the platform quickly to handle the requests without issues. You might know how hard it is to deal with spikes like that.

Some of the spikes that we have to handle on a daily basis.
The number of requests received within 1 minute during the spike.

One of our clients was covering the World Cup this year, which also generated huge traffic, as shown in the graphs below.

Based on the spikes shown in the following graphs, I challenge you to guess which game it was and when. The second graph shows how our infrastructure was able to adapt itself to handle all the incoming traffic.

Spikes received during the final match of the World Cup 2018.
Our CPU usage (number of cores) increasing as the requests arrive. As a consequence, our pods scale up and down automatically to handle the load and prevent errors and timeouts.

Before AWS EKS

Given our numbers, the only way to offer a high-quality service is by investing in a lot of cutting-edge technology. To achieve high availability and be able to scale to several million requests per minute, we decided to run our big data pipeline on top of Kubernetes. Since AWS didn't offer the EKS service in Europe until Sep 5, 2018, we had to host it ourselves. We ran Kubernetes for a year using Rancher, a great tool for managing Docker containers that we still use to manage our distributed databases (Cassandra and Kafka).

It’s not very complicated to set up Kubernetes on top of Rancher; the platform makes it really simple since it manages the installation of all the Kubernetes components like etcd, kubelet, and so on. However, you still have to take care of the Kubernetes control plane nodes yourself, and that part is critical, since all the magic behind scheduling, routing, and so on runs on the control plane. If your control plane goes down, your services may stay up, but you won’t be able to schedule any more pods. If the cluster needs to scale up based on your HPA policy, it won’t. If you need to deploy a hotfix to production, it won’t work either. If you need to remove a service from production, you can’t.
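To give an idea of the kind of HPA policy involved, here is a minimal sketch written with the Terraform Kubernetes provider; the deployment name, replica bounds, and CPU target are hypothetical placeholders, not our actual production values.

```hcl
# Minimal sketch of a Horizontal Pod Autoscaler declared via the
# Terraform Kubernetes provider. Names and thresholds are illustrative only.
resource "kubernetes_horizontal_pod_autoscaler" "api" {
  metadata {
    name = "api"
  }

  spec {
    min_replicas = 3
    max_replicas = 50

    # Which Deployment to scale.
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "api"
    }

    # Add pods when average CPU utilization across pods exceeds 60%.
    target_cpu_utilization_percentage = 60
  }
}
```

If the control plane is unhealthy, a policy like this one simply stops being acted upon, which is exactly why losing it during a traffic spike is so painful.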

To achieve high availability, we had our control plane nodes distributed across 3 different AWS AZs located in Ireland. The etcd nodes, where Kubernetes stores the state of the cluster, were distributed across these availability zones as well. Rancher calls that strategy Resiliency Planes, and it is described in their docs. You also have to take care of etcd backups and restores, so that if you ever have to migrate your Kubernetes cluster somewhere else you can recover all the previous state. We did that by using AWS EFS together with Rancher’s automatic backup for etcd. We also simulated cluster recovery multiple times, so that when things go wrong we know exactly what to do. If you want to be reliable and feel confident about it, you should run these simulations from time to time, either automatically (Chaos Engineering) or manually (for example by practicing GameDays).
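For reference, the EFS side of that backup setup can be expressed in Terraform along these lines; the resource names, the `var.private_subnet_ids` variable, and the security group reference are hypothetical placeholders rather than our actual configuration.

```hcl
# Sketch of an EFS file system used as a target for etcd backups,
# with one mount target per availability zone. Names are illustrative.
resource "aws_efs_file_system" "etcd_backups" {
  creation_token = "etcd-backups"
  encrypted      = true
}

resource "aws_efs_mount_target" "etcd_backups" {
  for_each        = toset(var.private_subnet_ids) # one private subnet per AZ
  file_system_id  = aws_efs_file_system.etcd_backups.id
  subnet_id       = each.value
  security_groups = [aws_security_group.efs.id]
}
```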

We had to scale our etcd cluster twice during the time we ran self-hosted Kubernetes. At some point the load was so high that the cluster became very slow, which is a bit scary when you have all that traffic coming in. If you have ever heard that Site Reliability Engineering (SRE) is about “changing the tires of a race car as it’s going 100 miles per hour” (Andrew Widdowson), you will totally feel it when you’re updating your Kubernetes cluster without a managed service like AWS EKS.

Even with all the challenges of managing our own Kubernetes cluster, we managed to offer a highly available service, keeping our infrastructure availability at 99.99%.

Migrating to AWS EKS

Last year AWS released EKS in general availability, first in the USA and, some months later, in Ireland. Due to our GDPR compliance policy, we waited for the service to be released in Europe. Even though we had been able to handle all the operations of maintaining our own Kubernetes cluster before, it’s a very painful and time-consuming task, and it only gets more complicated as the traffic grows. The smartest decision was to move to EKS. Although we could still have used Rancher 2.0 to provision an EKS cluster, we decided not to for now: we still have Rancher 1.6 running and we didn’t want two different Rancher clusters running at the same time.

To deploy our cluster we chose Terraform, as we were already using it to manage all of our infrastructure. If you’re wondering why, the reason is that having the same tool to provision resources across providers is quite handy, since we also have some workloads running outside AWS. Personally, I also like the way Terraform code is structured with HCL. If it makes sense for you to use AWS CloudFormation, go ahead; it’s definitely not a bad choice, as it’s very powerful and well integrated with AWS services. Pick the one that better fits your needs.
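As a rough illustration, declaring the EKS control plane in Terraform looks something like the sketch below; the cluster name, IAM role, and subnet variable are placeholders, not our real setup.

```hcl
# Minimal sketch of an EKS control plane declared with Terraform.
# The name, IAM role, and subnets are illustrative placeholders.
resource "aws_eks_cluster" "main" {
  name     = "production"
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    # Spread the control plane endpoints across private subnets in several AZs.
    subnet_ids = var.private_subnet_ids
  }
}

# Handy for generating a kubeconfig for the new cluster.
output "cluster_endpoint" {
  value = aws_eks_cluster.main.endpoint
}
```

With this in place, AWS takes care of running and scaling the control plane itself; we only manage the worker nodes and the workloads.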

As we deal with large volumes of data, even a few minutes of downtime are really critical, so we had to migrate from our old cluster to the new one seamlessly. We accomplished that simply by changing our Route 53 alias to point to the new load balancer and waiting for the traffic to be 100% handled by the new cluster before shutting down the old one. The graphs below show the live migration. During the migration, our CI was configured to deploy to both clusters at the same time in case we needed to apply a hotfix. In the end, the migration was completely transparent for our clients, who were able to keep performing their analytics and sending push messages. Not even our developers noticed the transition. The only thing that changed for them was the way to connect to the new cluster, which is now more secure: to access the Kubernetes Dashboard, developers need to generate a token that expires every 15 minutes. Also, IAM Roles are linked to RBAC rules, giving us the ability to limit their access very easily.

Graphs showing the live transition between our self-hosted Kubernetes and EKS.
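The DNS switch itself boils down to updating a single alias record, which in Terraform looks roughly like this sketch; the hosted zone, record name, and load balancer reference are hypothetical.

```hcl
# Sketch of the Route 53 alias record repointed from the old cluster's
# load balancer to the new EKS ingress load balancer. Names are illustrative.
resource "aws_route53_record" "api" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.eks_ingress.dns_name
    zone_id                = aws_lb.eks_ingress.zone_id
    evaluate_target_health = true
  }
}
```

Because both clusters stayed up and kept receiving deployments until the alias change had fully propagated, traffic drained from one to the other without dropping requests.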

Conclusion and next steps

The main advantage of migrating the cluster is not having to take care of the Kubernetes control plane anymore, nor having to think about the scalability and reliability of its components, since that responsibility is now delegated to AWS. On top of that, it’s much less complicated to move our EKS cluster to another region with our Terraform scripts, which improves our platform’s ability to perform Disaster Recovery and, as a consequence, increases our reliability and availability. In terms of cost, we used to have 6 nodes dedicated to the control plane (3 for etcd and 3 for the other components of the cluster). Even though AWS charges us for managing the control plane, paying around 140 dollars for it is much cheaper than deploying 6 machines and maintaining them (which is the hidden cost). We are also able to manage access control based on IAM Roles attached to developers according to their needs and seniority, which is a big win in terms of security. Finally, the cluster itself is now much faster.
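For those wondering how EKS ties IAM to Kubernetes RBAC: the mapping lives in the aws-auth ConfigMap in the kube-system namespace. The sketch below shows the general shape of it; the role ARN, username, and group are hypothetical examples, not our real roles.

```hcl
# Sketch of the aws-auth ConfigMap that maps IAM roles to Kubernetes RBAC groups.
# The role ARN, username, and group are hypothetical examples.
resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    mapRoles = yamlencode([
      {
        rolearn  = aws_iam_role.developer.arn
        username = "developer"
        groups   = ["developers-read-only"] # bound to RBAC rules on the cluster
      }
    ])
  }
}
```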

The next step will be to move our Kafka cluster to AWS MSK and to make Cassandra even more scalable when running on top of EKS. We’re very confident that, with the big data ingestion pipeline we have and the cluster now partially managed by AWS, we’ll be able to handle much more traffic and perform analytics like never before.