Using Kubernetes Autoscaling to Optimise Availability and Cost

Jeremy Deppen
Untienots
Sep 12, 2022

Here at Untienots, we have leaned heavily into the use of Kubernetes on Google Cloud Platform (GCP). We run almost all of our workloads on it: web applications, data processing applications, and monitoring applications. It allows us to deploy our workloads in a fast, safe, and scalable manner, which above all gives us confidence. I’m not going to go into detail about the general benefits of Kubernetes; what I do want to get into is how we use Kubernetes and GCP to maximise the scalability and availability of our applications at Untienots.

Definition

To start, we should define elasticity and how it relates to cloud providers. The biggest benefit of moving applications to the cloud is the elastic nature of cloud infrastructure: you only pay for what you need (compute servers, in this case), and you can quickly scale your servers up and down as you see fit. This is a basic example, but hopefully it paints the picture.

Autoscaling builds upon this elasticity: it knows when something should be scaled up or down, and it automatically* does the scaling for you. This is a major improvement that lets engineers focus more on their work and less on fine-tuning the hardware their applications run on: memory, CPU, number of servers, etc.

* After many hours of manual work by an engineer to create and optimise this “magic”

We can take these concepts and apply them directly to Kubernetes. At its core, Kubernetes consists of compute power (pods) and networking (services, ingresses).

Untienots’ Use Case

We have a very typical situation that lends itself well to taking advantage of autoscaling with Kubernetes:

  • We experience a lot of traffic spikes at certain times of the month.
  • Our base level of traffic is anywhere from 10 to 30x less than our highest traffic spike.
  • We can’t sacrifice response times and availability of applications to save money on compute costs.

To better visualise this, here are the requests per second on a heavily trafficked backend API over the span of 2 months. Would it really make sense to size this application’s pods to always be able to accommodate 50 req/s?

Those traffic spikes would scare an on-prem server admin

Let’s zoom in a bit on that first spike that happened on May 31, and see how our API’s pods handled it…

As you can see, the number of pods dynamically scaled up from their base number of 2 pods to 8 pods, and then back down after traffic began to return to normal. This is a typical traffic pattern at Untienots, and we greatly benefit from being able to efficiently serve users while using only the compute resources that we need to use.

The benefit here is that we only use the compute power (number of pod instances) that we need at any given time. No need to over- or under-provision.

When our Kubernetes pods scale dynamically, our GCP Compute Instances scale dynamically as well, which in the end saves us money and emphasises the elastic nature of the Cloud → only pay for what you need.

Node autoscaling?

Yes! This is a necessity when dealing with Kubernetes pod autoscaling. Pods run on Kubernetes nodes, which are just VMs, and we need to ensure that our VMs have enough CPU and memory for the pods we need to run. In GCP, VMs are called Compute Instances; in the Kubernetes world, however, we just refer to them as nodes. If pods are autoscaled up in number of replicas to handle traffic, but our nodes don’t have enough CPU or memory to accommodate them, then these new pods won’t be able to run. We need to dynamically scale our nodes as well.

More on this soon…

Pod and Node Scale Up

Pod Scale Up

The diagram below shows a pod scale up: a new pod is added because there is increased traffic (increased CPU usage on the pods, for example).

We use a Kubernetes Horizontal Pod Autoscaler (HPA), which allows us to define thresholds (CPU, memory, or even custom metrics) that warrant a pod scale up. The HPA also defines the minimum and maximum number of pods to scale between.

Below is an example HPA with a minimum of 2 pods and a maximum of 10 pods; it adds pods when the deployment’s CPU usage (average CPU usage across all of its pods, relative to their requests) is above 70%.
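A minimal sketch of what such a manifest could look like, assuming a Deployment named backend-api (all names here are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api          # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when average CPU usage exceeds 70% of requests
```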

Node Scale Up

However, what happens if an additional pod needs to be created, but a node does not have enough CPU or memory? Node autoscaling!

The implementation of node autoscaling will vary depending on your public cloud provider. We use GCP to run Kubernetes (Google Kubernetes Engine, or GKE), which allows us to simply set an option to enable node autoscaling.

The use case of node autoscaling is simple: if a pod is in an unschedulable state due to a lack of node CPU/memory, then a new GCP node will be created.

One downside of this is that it can take ~2 minutes for a new node to be created and ready to run pods. There are a few workarounds to this:

  • Always have an extra node “on standby” that scaled pods can use if there is not enough room on the current nodes. The upside is speed: the new pod is available right away. The downside is that you are paying for an extra node that you won’t use most of the time (see the sketch after this list).
  • Leave a little bit of extra room on your current nodes to account for a few extra pods every now and then. This is what we do at Untienots.
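For the first option, a common pattern is a low-priority “placeholder” deployment that reserves roughly a node’s worth of resources; when real pods need the space they preempt the placeholder, which then triggers a node scale up to reschedule it. Below is a sketch, with all names and sizes as hypothetical placeholders:

```yaml
# A low PriorityClass so that real workloads can preempt the placeholder pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that keep spare node capacity warm"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; it only reserves resources
          resources:
            requests:
              cpu: "1"              # roughly one small node's worth of headroom
              memory: 2Gi
```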

Pod and Node Scale Down

Pod Scale Down

Scaling pods down is just as important as scaling them up; we want to be sure that the traffic spike is finished before scaling back down. We can also define scale down behaviour in the HPA template.

The code below contains the same HPA template that we saw earlier, except it now defines a scaleDown behaviour. This behaviour is actually the default scaleDown strategy (i.e. leaving it empty will do the same thing), but I included it here to better illustrate how scaleDown works.
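Again a sketch, extending the hypothetical backend-api manifest from above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of calm before scaling down
      policies:
        - type: Percent
          value: 100                    # may remove up to 100% of surplus replicas per period
          periodSeconds: 15
```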

The behaviour is as follows: if CPU usage stays below the 70% target for 300 seconds, the HPA scales pods down (removing up to 100% of the surplus replicas) until it reaches minReplicas. Generally, a stabilizationWindowSeconds of 300 is a good starting point that gives you confidence that the spike is really over.

Node Scale Down

Again, this depends on your managed Kubernetes provider’s implementation, but in GKE, nodes are scaled down when the running pods could fit on a smaller subset of nodes: the underutilised node(s) are drained of their pods and then removed.

At Untienots, when we enabled Node autoscaling, we saw our clusters decrease by multiple nodes → we were way over-provisioning!

  • We allowed GCP to accurately tell us how many nodes we actually needed.
  • We cut a good chunk off our cloud costs.

Considerations when Implementing Autoscaling

Use multiple nodepools

In GKE (and most public Kubernetes providers), one cluster can support multiple nodepools. A nodepool is a group of nodes that share the same VM type and configuration, and node autoscaling happens at the nodepool level.

This can be useful for a few reasons:

  • If you run stateful applications that can’t afford interruptions, you should create a separate nodepool for them and leave autoscaling disabled on it: scaling nodes down means draining their pods, which stateful workloads generally can’t tolerate.
  • You can have different node autoscaling settings per nodepool. For example, you may want to run your highly trafficked applications on one nodepool and set its max nodes higher than the other nodepools’ (see the sketch below).
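On GKE, one way to pin a workload to a particular nodepool is a nodeSelector on the pool’s label. A sketch, assuming a hypothetical pool named high-traffic-pool:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api                    # hypothetical Deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: high-traffic-pool   # hypothetical nodepool name
      containers:
        - name: api
          image: backend-api:latest    # placeholder image
```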

Set your pod’s CPU and memory accordingly

If you are using CPU or memory for pod autoscaling, then the HPA will compare the pods’ current CPU/memory usage against their requested CPU/memory.

In the following example, we define our pod’s requests.cpu as 500m (half of a CPU). So if we have an HPA that says to scale up when CPU is 75%, then it means that this pod will scale up when CPU usage is 375m or higher (500m * 0.75 = 375m).
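A sketch of what that could look like in the pod spec (the names and the memory value are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend-api-example        # hypothetical pod, just to illustrate requests
spec:
  containers:
    - name: api
      image: backend-api:latest    # placeholder image
      resources:
        requests:
          cpu: 500m                # half a CPU; the HPA's percentage target is measured against this
          memory: 256Mi            # hypothetical value
```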

A general best practice is to set your CPU and memory requests as close as possible to actual CPU and memory usage. This allows Kubernetes to efficiently schedule your pods on nodes. If you ask for too much CPU, you take up more room than needed on the node and end up needing more nodes. If you ask for too little, your pods will underperform (or, if you are using an HPA, they will autoscale like crazy, which can lead to more app interruptions).

For more information about accurately setting your pod’s requests, go check out this blog post from KubeCost!

In short, it is best practice to accurately set CPU and memory resource requests for your pod, but it is also mandatory if you want to use pod autoscaling via an HPA.

Summing up…

Pod autoscaling (dynamically scaling pod replicas based on certain metrics), and Node autoscaling (dynamically scaling nodes to accommodate scaling pods) are used together to allow tech teams to balance availability and cloud costs. By correctly implementing both forms of autoscaling, we get closer to the true beauty of the Cloud: paying for and using only what you need.

At Untienots, we pride ourselves on hosting most of our workloads in Kubernetes, so we need to accurately tune our Kubernetes resources to optimise availability, performance, and cost. By implementing Kubernetes autoscaling, we let it manage this optimisation for us.

Fitting pods on nodes is similar to a certain retro game that everyone loves

Interested in Kubernetes, GCP, Data Engineering, or Full-Stack engineering? Enjoy the small, startup environment? Check out our open positions at untienots.com!
