At Scout24 we have started to use Kubernetes, more specifically Amazon’s managed service EKS, to run our microservices. Kubernetes, at its core, is a dynamic system that can scale workloads in response to customer demand (e.g. CPU utilisation, load balancer requests). But what happens when the cluster needs to run additional pods but is running out of resource capacity?
This is where the cluster-autoscaler comes into play…
The cluster-autoscaler is part of the Kubernetes GitHub organisation, but it is a purely optional tool. You deploy it to your cluster just like a regular deployment and configure it so that it finds and manages your node groups (Auto Scaling groups in the case of AWS). Out of the box, the cluster-autoscaler will do two things:
- add additional nodes to the node group if pods are in pending state
- remove nodes (cordon and drain node, then remove them from the node group) that are underutilised for a certain time period.
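To make this concrete, here is a minimal sketch of what the relevant part of such a deployment might look like on EKS. The cluster name ("my-cluster"), namespace, and image tag are placeholders; the auto-discovery flag tells the autoscaler to find Auto Scaling groups by tag instead of listing them by name:

```yaml
# Hypothetical excerpt of a cluster-autoscaler Deployment for EKS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # Discover node groups via ASG tags rather than hardcoding their names.
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```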
Well, that sounds great! But one of the nice properties of containers is that they start up much faster than VMs, right? With the cluster-autoscaler in its default configuration you get somewhat slow and unpredictable scaling behaviour: it only reacts by adding a node once pods go into pending state. Since it can take several minutes for an EC2 instance to launch and become ready to run pods, this is not ideal if you are running services that are sensitive to such scaling delays. How can we solve this?
While there are alternatives to the cluster-autoscaler, we wanted to stay with the most commonly used solution provided by the Kubernetes community. It turns out that there is quite an elegant way to overprovision your cluster so that it always reserves some extra space, preventing pods from sitting in pending state for several minutes.
Pod Priority and Preemption are Kubernetes features that allow you to assign priorities to pods. In a nutshell, when the cluster is low on resources it can preempt (evict) lower-priority pods in order to make space for higher-priority pods waiting to be scheduled. With pod priority we can run some dummy pods solely to reserve extra space in the cluster, which is freed as soon as "real" pods need to run. The pods used for the reservation should ideally not use any resources but just sit there and do nothing. You might have heard of pause containers, which Kubernetes already uses for a different purpose. One property of the pause container is that it essentially just calls the pause syscall, which means "sleep until a signal is received".
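As a sketch, the reservation can be expressed as a PriorityClass whose value is below the default of 0, so regular pods always outrank the placeholder pods (the name and the exact value are arbitrary choices):

```yaml
# A PriorityClass for placeholder pods. The negative value means any regular
# pod (default priority 0) can preempt pods that use this class.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class for pause pods that reserve spare capacity."
```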
To sum it up, we can run a bunch of pause pods in our cluster with the sole purpose of reserving space, tricking the cluster-autoscaler into adding extra nodes ahead of time. Since these pods have a lower priority than regular pods, they will be evicted from the cluster as soon as resources become scarce. The pause pods then go into pending state, which in turn triggers the cluster-autoscaler to add capacity. Overall, this is quite an elegant way to always have a dynamic buffer in the cluster.
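Putting it together, the buffer itself is just a Deployment of pause containers that request resources under a low-priority class. A sketch, where the replica count, resource requests, and the priority class name (here assumed to be "overprovisioning", with a negative value) are all assumptions to size to your own cluster:

```yaml
# Sketch of an overprovisioning Deployment: pause pods whose only job is to
# hold resource requests. Sizes and names are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      # Assumes a PriorityClass named "overprovisioning" with a negative value.
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "4"
              memory: 4Gi
```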
Why run real pods…?
The idea of running actual pods which don't do anything might seem a bit strange at first, but this approach has one big advantage: we can be sure that we can actually run additional pods; the buffer is not just calculated by some algorithm, as in alternative autoscalers. Besides CPU and memory, there are other resources that can be exhausted, such as IP addresses when using the Amazon VPC CNI plugin like we do. By running actual pods we can be sure that the buffer can really be used by real pods.
Scaling the buffer
Now, this already works quite well, but what about scaling the size of the overprovisioning deployment itself? Today the cluster might consist of 10 worker nodes, but in a few months it could grow to 100 nodes or more. Ideally the buffer would grow proportionally with the cluster size. It turns out there is a solution for exactly that, called cluster-proportional-autoscaler. You point it at your deployment and specify a coresPerReplica value, which means one replica of your deployment is maintained per that number of cores in the whole cluster. To give a concrete example: if each replica of your deployment requests 4 CPU cores and you want to ensure a buffer of 10%, set coresPerReplica to 40. This just means that for every 40 cores of overall cluster size there will be one replica (of 4 cores).
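One way the 10% example could look in practice: the cluster-proportional-autoscaler runs as its own small pod and resizes a target deployment based on the cluster's core count. The target name and image tag below are illustrative:

```yaml
# Illustrative cluster-proportional-autoscaler container spec.
# It watches the cluster size and resizes the target deployment accordingly.
containers:
  - name: autoscaler
    image: registry.k8s.io/cpa/cluster-proportional-autoscaler:v1.8.9
    command:
      - /cluster-proportional-autoscaler
      - --namespace=default
      # The overprovisioning deployment to scale (name is an assumption).
      - --target=deployment/overprovisioning
      # One replica (of 4 cores) per 40 cluster cores, i.e. roughly a 10% buffer.
      - --default-params={"linear":{"coresPerReplica":40,"min":1}}
      - --logtostderr=true
```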
You might have noticed that the cluster-autoscaler itself actually has no notion of overprovisioning at all, yet we can still achieve it through a combination of Kubernetes features in quite a simple and elegant manner. The whole process is also described in the cluster-autoscaler FAQ.