EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Autoscaling in Kubernetes: A Primer on Autoscaling

First of a three-part series exploring application autoscaling in Kubernetes

Sasidhar Sekar
Expedia Group Technology



In this article, I will look at some key drivers for autoscaling. By the end, you will be able to create a set of acceptance criteria to evaluate the suitability of any autoscaling solution, including those designed for Kubernetes.


Life without autoscaling

Consider a Kubernetes service with 2 replicas, each capable of handling a maximum of 600 requests/sec. The service distributes the load evenly* across its replicas, so it can handle 1200 requests/sec in total.

* Even load distribution is rarely a given in any system, and Kubernetes is no exception. If you are interested in learning more about the factors that influence load distribution in Kubernetes, read through this excellent article by Vinod Canumalla.

Case 1: peak workload < maximum available capacity

Take the example where the peak workload on the service is 1000 requests/sec.

Maximum available capacity for the service = 2 * 600 requests/sec 
= 1200 requests/sec

With the peak workload < the maximum available capacity, the state of the service can be illustrated as shown below.

A diagram showing a healthy service when the actual load is less than the peak capacity of the service
Peak workload < maximum available capacity

With even load distribution, the load per replica (500 requests/sec) is less than the maximum capacity per replica (600 requests/sec). So, the service is able to handle the entire workload without any distress.

Case 2: peak workload > maximum available capacity

In the same example, let’s assume that a market event leads to an increase in the peak workload from 1000 requests/sec to 1500 requests/sec. In this case:

Peak workload (1500 requests/sec) > Maximum available capacity of the service (1200 requests/sec)

Under these conditions, the following image best illustrates the state of the service.

A diagram showing an unhealthy service when the actual load is greater than the peak capacity of the service
Peak workload > maximum available capacity

As illustrated above, each replica is subject to more load (750 requests/sec) than it can handle (600 requests/sec). This leads to a good portion (20%) of customer requests experiencing delays at best and failing completely at worst — an undesirable outcome.
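
To make the arithmetic concrete, here is a minimal Python sketch (using the numbers from the two cases above) that computes the load per replica and the fraction of requests exceeding the service's maximum capacity:

def service_state(load, replicas, capacity_per_replica):
    # Assumes perfectly even load distribution across replicas
    load_per_replica = load / replicas
    max_capacity = replicas * capacity_per_replica
    failing_fraction = max(0.0, (load - max_capacity) / load)
    return load_per_replica, failing_fraction

# Case 1: 1000 requests/sec over 2 replicas of 600 requests/sec each
print(service_state(1000, 2, 600))  # (500.0, 0.0) -> healthy
# Case 2: 1500 requests/sec over the same 2 replicas
print(service_state(1500, 2, 600))  # (750.0, 0.2) -> 20% of requests at risk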


Mitigating scalability failures

Without an autoscaling solution in place, the traditional approach to mitigating such scalability failures involves:

  1. an alert (on degradation/failures)
  2. intervention by a human operator
  3. root cause analysis
  4. scaling out the number of replicas
A diagram showing how a human operator handles scalability failure by adding additional replicas to the service under stress
Mitigating scalability failures: manual scaling

This approach does work but has the following problems:

  • There is likely a delay between the alert and the intervention. Even if the operator is on call 24x7, it can take some time to intervene — for example, the operator might have to log in to production first, or they might have just stepped away for a cup of coffee.
  • Scalability failure is not the only risk to system reliability. This means a human operator more often than not needs to do a root cause analysis, however brief, to identify and understand the cause of the failures. This further delays any action.
  • And finally, it is unlikely that human operators are observing a single service that they know everything about. It is more likely that they are monitoring everything. So, when a service endures a scalability failure, the human operator needs to get information on the service, like its peak capacity and current load, before calculating the number of replicas required to handle the current load:
Number of replicas required = current load / peak capacity per replica, rounded up
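
As a minimal sketch (using the numbers from the earlier example), the calculation the operator has to perform under pressure looks like this:

import math

current_load = 1500          # requests/sec currently hitting the service
capacity_per_replica = 600   # requests/sec a single replica can handle

# Round up: a fraction of a replica cannot be provisioned
replicas_required = math.ceil(current_load / capacity_per_replica)
print(replicas_required)  # 3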

This process assumes several things: that the capacity of each service is documented, updated regularly, and readily available to the operators handling scalability failures — all of which are possible points of failure in the process, not counting human errors in calculation and/or operation.

A manual scaling approach is not only slow but also error-prone.


How can autoscaling help?

With autoscaling, the role of the human operator is taken over by a (set of) software component(s): the autoscaler.

A diagram showing how an autoscaler can efficiently replace the human operator for handling scalability failures
Mitigating scalability failures: autoscaling

In this case, the autoscaler monitors the current level of usage, calculates additional scalability requirements based on the codified capacity, and finally scales out the number of replicas to handle the additional workload.
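
The sketch below shows a minimal version of such a control loop. The helpers get_current_load and set_replica_count are hypothetical stand-ins for whatever metrics and orchestration APIs the autoscaler is wired to:

import math

CAPACITY_PER_REPLICA = 600  # codified capacity (requests/sec per replica)

def get_current_load():
    # Hypothetical stand-in for a metrics query (e.g. against a monitoring system)
    return 1500.0

def set_replica_count(count):
    # Hypothetical stand-in for the orchestrator's scaling API
    print(f"scaling service to {count} replica(s)")

def autoscale_once():
    # One iteration of the loop: observe, decide, act
    load = get_current_load()
    desired = max(1, math.ceil(load / CAPACITY_PER_REPLICA))
    set_replica_count(desired)

autoscale_once()  # prints: scaling service to 3 replica(s)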

The benefits of this are:

  • The peak capacity of services tends to be codified instead of documented. As with most things in code, this tends to be reviewed regularly and is less likely to be outdated, as compared to stand-alone documentation.
  • Unlike humans, the autoscaler does not need a coffee break. It is expected to be always on the spot to respond to any scalability triggers.
  • The autoscaler concerns itself with one task and one task only: responding to scalability triggers. For example, if the autoscaler is configured to scale on CPU usage and the target utilization exceeds the configured threshold, the autoscaler scales out the number of replicas. Because of this focus on a single concern, there is no root cause analysis to delay the autoscaler’s response.
  • In addition to the above, because of the low operational overhead of the entire process — monitoring, decision making, and scaling — autoscaling is useful not just during unexpected workload increases; it can also help reduce infrastructure cost, even under a steady workload. The figure below illustrates the cost benefits of autoscaling.
With autoscaling, the replica count increases/decreases with the workload. Without autoscaling, the replica count remains the same at all times
Autoscaling: cost benefits

As illustrated in the above figure, even with a steadily increasing/decreasing workload, autoscaling helps the service operate more efficiently by scaling in and out as required. Without autoscaling, the service needs to be provisioned for the expected peak workload. In the above example, 4 pods would be provisioned at all times, even when the actual load could be handled by 1 pod (9:00–10:30 AM, for example).

Note: In the above example, the pods are configured to operate at an average CPU usage of 80%. If the average CPU usage across all pods goes above 80%, additional pod(s) will be spun up by the autoscaler. Hence, at 10:30 AM, even though the CPU usage across pods is <100%, an additional pod comes up because the target usage threshold of 80% has been crossed.
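
This is essentially the scaling rule the Kubernetes Horizontal Pod Autoscaler applies: desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization). Below is a simplified sketch; the real controller also applies tolerance bands, stabilization windows, and min/max replica bounds, and the 95% reading is a hypothetical value standing in for "above the 80% target but below 100%":

import math

def desired_replicas(current_replicas, current_utilization, target_utilization):
    # Simplified form of the HPA scaling rule
    return math.ceil(current_replicas * current_utilization / target_utilization)

# At 10:30 AM in the example: 1 pod at ~95% average CPU against an 80% target
print(desired_replicas(1, 0.95, 0.80))  # 2 -> an additional pod is spun up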

An autoscaling solution can make scaling faster, less error-prone and more cost-efficient


Acceptance criteria for any autoscaling solution

As detailed so far in this blog post, there are several benefits to using autoscaling as a scalability solution, as against over-provisioning or manual scaling. The corollary of this is that a good autoscaling solution should be able to deliver the following benefits:

  • Reliability — Must guarantee scalability
  • Efficiency — Must reduce the infrastructure cost, as against over-provisioning/manual scaling
  • Responsiveness — Must scale out fast enough to successfully handle an increase in workload
  • Resilience — Must protect against malicious traffic (to be elaborated on in subsequent posts)

These can act as the acceptance criteria to drive and evaluate the Kubernetes-based autoscaling solutions that will be the focus of further posts in this series.

Learn more about technology at Expedia Group
