Scaling Kubernetes for 25M users

tl;dr: when moving more than 500 containers to k8s we ran into some problems. Scroll and read the big titles below to see what they were and how we solved them.

MEE6 has been growing fast, going from 0 to 25M users in about 18 months with the CEO also serving as the sole developer.
Such a rapid growth meant accumulating technical debt quite fast. So we recently made the choice to rewrite the bot from the ground up to make it easier to fix the most important issues we were facing.
One of them was how the product is deployed: historically we used a single dedicated bare-bone server running a few Docker containers orchestrated using shell scripts.
Although this was great for an MVP, that became very painful over time:

Having so many users relying on you gives you the responsibility to have a good SLA. While most Discord bots suffer from a poor SLA, we are proud to have a 99% SLA.

For all these reasons, we decided to finally use Kubernetes to orchestrate containers over a cluster of servers thinking that maybe if we did this well enough, we could fall asleep without the anxiety of being called in the middle of the night to recover from a crash.

After taking the time to rewrite most of the code and split it into smaller containers, we thought the migration to Kubernetes was going to be painless: k8s is very well documented and creating the yaml config was easy. However, when deploying, we ran into a lot of problems we could only notice at scale. Here are some of the problems we ran into while trying to run our 664 pods in production so maybe it can be helpful when you run into these problems:

1) There is a limit on how many containers you can run per node

Right now, Kubernetes will by default schedule at most 110 pods per node. This is something you can configure if you have a direct access to your kubelet configuration. See documentation here (search for --max-pods).

Also, if you try to schedule more pods, you will quickly run into some inotify limits (namely max_user_instances and max_user_watches). These have to be changed on all hosts, but because adding and removing nodes should be painless, you cannot configure each node manually. The best solution we found was running a privileged Docker container within a DeamonSet that changed these limits. Deamon sets are automatically run on each node, and a privileged container has the right to change these two inotify limits for the whole host. See our example code here.

Although we changed both these things at first and made it work without problems, we finally decided to simply run more nodes with more modest specs, and thus naturally going below the pods-per-node limit. If you can afford to do this, this is probably the best solution. In our case, we went from running 6 nodes with 8 vCPUs to 12 nodes with 4 vCPUs.

2) You should set your CPU and memory requests as soon as possible

While deploying a very small version of the bot in staging went very well, going into production was different: a lot of pods started to be evicted and could not be rescheduled by k8s.
By looking at the actual CPU and memory usage of each host, we figured out that some hosts were running close to 100% of CPU and/or mem while some others were running at like 20% of their capacity.

The thing is: you have to clue k8s in about how much CPU and RAM your pod is likely to be taking. For example, in production, we run 512 containers that are really memory consuming (about 200Mib each) but not CPU consuming (about 5% of a vCPU each). Well, we found out that since k8s had no clue these containers would take up so much space and so little CPU so it scheduled a lot of them on the same nodes, and these nodes quickly ran out of memory forcing k8s to evict some of the pods and pause the scheduling for some others.

The same happened for some pods that were very CPU intensive, but only mildly memory consuming.

Turns out there is a way to tell k8s how much memory and CPU each pod needs so it can schedule them in a much more balanced way and it’s very simple to use. See more documentation here.

This is something you will usually not come across in a staging/pre-prod environment which is usually smaller than production ones, but believe me, they hit you hard when going live.

3) You should isolate critical pods using node affinities

As you’ve understood with 1) and 2), you sometimes feel like the deployment is going to be painless, but it ends up not working as expected and evicting some containers.

In our cases, we have some very critical containers that we can only start 4 times every 24h before being rate-limited by Discord. Every node failure that resulted in some pods going into a crash loop mode made us being rate-limited which was a pain in the arse.

We ended up creating a second node cluster dedicated to running these “critical” pods in their own nodes so even if other pods make nodes fail, the isolated pods would almost never restart and we would not end up being rate-limited.

You can do this by using affinities and anti-affinities on each pod, matching labels on the nodes. You can find more doc here. Doing this resulted in a significantly more stable product during this not-so-easy migration process, and damn does it feel great to go to sleep knowing the most critical part is physically isolated from the rest of your infrastructure.

4) You should set up instrumentation

The biggest help we had when making the shift to k8s was having visibility on everything: CPU/mem used by each host/pods/service. This was sooooo helpful that I can say with confidence that it saved us at least 2 days of work.

We used DataDog with their default Kubernetes/Docker agents and some custom StatsD wizardry and it worked like magic.

And because you’ve already read so much of this article, here is a picture of one of our dashboards.

In the end, we are able to keep the infrastructure very stable while processing 30k events per second.

5) You should review the rolling-update strategy

By default, when updating a deployment, k8s will allow 25% of the deployment’s pods to be unavailable and will allow a surge of 25% meaning you could have 25% more replicas than what you’ve asked for in the yaml.

While these are sensible defaults in many cases when handling a huge production workload, having 25% of your pods unavailable might make your product unresponsive for your users.
Also, when running tight on resources, a 25% surge might bring you so close to your nodes’ CPU/mem limits that you end up with laggy pods.

Make sure you review these settings and change them accordingly to your needs. More info in the documentation here.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store