Horizontal Autoscaling in Kubernetes

Aharon Haravon
15 min read · Feb 27, 2024


The Force Awakens — and Nods Off

In this article I will cover horizontal autoscaling in kubernetes. The intended audience is software developers and DevOps/SRE engineers with at least an elementary background in kubernetes who are interested in learning about autoscaling. When I was learning this topic, I didn’t find a single straightforward article that explained all the relevant concepts, so I took up the challenge and rolled one myself.

The content of this article was delivered as a lecture at the Heapcon conference in Belgrade, Serbia, in November 2023. If you prefer to watch, a video recording of the talk is available online.

Introduction

Know your use-case

Software should allow a business to start small and grow later. Furthermore, at any given time, we want to pay only for the resources that are actually providing us some value. In this article, we will learn about the horizontal autoscaling of cloud-native applications and the advantages provided by the open source project KEDA. I will gradually explain the basics, then build on that to cover some more advanced topics, and finish with fine-tunings and troubleshooting hints. While autoscaling beginners will probably learn many things right from the start, patient experts in the area will find valuable insights toward the end.

The pieces of the puzzle

  • HPA controller
  • Keda controller
  • Cluster Autoscaler
  • AWS Auto Scaling Groups / GCP Managed Instance Groups /
    Azure Virtual Machine Scale Sets

Horizontal autoscaling in modern clusters involves the interaction of four different systems. Each of these software components has its role in making the magic work. We will closely examine the HPA controller, then learn about the Keda controller. Later on we will briefly cover some important aspects of the Cluster Autoscaler and mention a few points on AWS Auto Scaling Groups.

Keda helps bring the required metrics into the kubernetes world and creates an HPA object. The HPA controller monitors the metrics according to the definition in the HPA object and reacts to their changes by adding or removing pods in the scaled application. The Cluster Autoscaler in turn detects that nodes are missing or unneeded and asks the cloud auto-scaling infrastructure to adjust the node count. The cloud vendor’s auto-scaling infrastructure then creates new instances that join the cluster as additional nodes, or removes them.

Metrics API types

  • Metrics API
  • Custom metrics API
  • External metrics API

The metric for the HPA controller may come from one of three supported APIs: the metrics API, the custom metrics API or the external metrics API. None of these APIs is available in kubernetes by default. To have one or more of them available in the cluster, you will need to install additional software.

Let’s note here that while metrics from metrics API and custom metrics API are properties of kubernetes objects like containers, pods, nodes, ingresses and such, the metrics in external metrics API usually relate to external entities that live outside the domain of kubernetes.

Metrics API types Overview

Metrics API

The metrics API provides cpu and memory utilization for nodes and pod containers. If your horizontal autoscaling strategy is well defined in those terms, then you do not need to use the Keda controller. All you need in that case is the Kubernetes Metrics Server that implements the metrics API.

In many cases, autoscaling an application based on CPU/memory is inappropriate. For example, you may want your machines to be fully utilized by each unit of workload. In such cases, setting a threshold for cpu or memory will not be suitable.

Custom metrics API

When your application horizontal autoscaling strategy is based on a metric related to a kubernetes resource, for example number of requests per second handled by a pod, you may configure your HPA to use the custom metrics API.

One way to add custom metrics API support to your cluster is to develop your own Custom Metrics Server. The function of this program is to feed kubernetes with the application-specific custom metric values.

The obvious disadvantage of this approach is that we need to create another software component that has to be versioned, maintained, tested and so on. Also, there currently cannot be multiple custom metrics servers registered in a single kubernetes cluster. Therefore, you would have to implement the querying and serving of all the custom metrics within the cluster in this one code base. This may turn out to be infeasible, and it definitely doesn’t sound like much fun.

Another option is for your application to expose your custom metrics on an HTTP endpoint in the OpenMetrics standard format. This information is then collected by and stored in Prometheus. You also need to install and configure the Prometheus Adapter to expose those metrics to kubernetes so they become available in the custom metrics API.
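For a taste of that complexity, here is a sketch of a Prometheus Adapter rule that turns a request counter into a per-second rate metric; the metric and label names are illustrative, and the structure follows the adapter’s documented configuration format:

```yaml
rules:
# Discover counters that carry namespace and pod labels
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    # Map Prometheus labels onto kubernetes resources
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    # Rename foo_total to foo_per_second in the custom metrics API
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  # Convert the cumulative counter into a rate
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Every team whose metrics should drive autoscaling needs a rule like this in the one shared adapter configuration.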

The Prometheus Adapter configuration is somewhat complex. It also requires cooperation between the SRE engineers and the developers: there is a single Prometheus Adapter in the cluster, usually owned by the SRE team, serving many different applications. Therefore developers are not independent in setting up the horizontal autoscaling of their service. Another disadvantage of the custom metrics approach is that Prometheus, an observability stack component, becomes a critical component for your application’s function.

External Metrics API

This is where Keda jumps in. In many cases, your custom application metric is represented as a property of a message broker, a state of a database or similar. In those cases, Keda will simplify your application-specific metrics implementation. With Keda, your application doesn’t need to expose metrics for the purpose of horizontal autoscaling, and its configuration is simple compared to the Prometheus Adapter’s. Another win with Keda is that it streamlines the development process, as the developer remains independent in implementing horizontal autoscaling. I also like that it leaves Prometheus components outside of the critical execution path. Not that there is anything wrong with Prometheus; it just looks to me like a violation of the separation-of-concerns principle in the system architecture. Of course, Keda itself becomes part of the critical execution path, but that is somewhat expected.

Keda does its magic through out-of-the-box integrations with a multitude of brokers, databases and other middleware systems. It exposes the information from them in the external metrics API, where the kubernetes HPA controller can read the metrics for the purpose of autoscaling. These integrations are called Keda scalers.

But, what if your application needs to scale upon your proprietary metrics that can only be served by your custom software? Can we use Keda in that case?

Sure, Keda is still there to help, as it comes with Prometheus and Metrics API scalers. You may use it in combination with Prometheus, avoiding the previously mentioned disadvantages of the Prometheus Adapter. Or you may use the so-called Metrics API scaler, which will directly query your application’s REST endpoint to retrieve the metric values.
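A Keda Prometheus trigger, for example, needs little more than a server address, a query and a threshold. The sketch below assumes an illustrative query and a Prometheus service in the monitoring namespace:

```yaml
triggers:
- type: prometheus
  metadata:
    # Address of the Prometheus service (illustrative)
    serverAddress: http://prometheus.monitoring.svc:9090
    # Any PromQL expression returning a single value
    query: sum(rate(http_requests_total{app="my-app"}[2m]))
    # Scale out when the query result exceeds this value per pod
    threshold: "100"
```

Compare this with the Prometheus Adapter rules above: the query lives next to the application, owned by its developers.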

So, after the overview of the different metrics APIs, and a brief taste of the Keda advantage, it is time to take a look at the HPA definition and explain it a bit.

The HPA object

The HPA object fields

  • minReplicas
  • maxReplicas
  • scaleTargetRef
  • metrics
  • behavior

An HPA is defined by the following fields: minReplicas and maxReplicas, which bound the range within which the HPA controller will grow and shrink the autoscaled application; scaleTargetRef, pointing to the kubernetes deployment, statefulset or custom resource being scaled horizontally; the metrics section, which lists the metrics the autoscaling is based upon; and behavior, which fine-tunes the autoscaling behavior.

Examples

A CPU based scaling HPA specification example is presented below:
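This is a minimal sketch of such a spec; the application name my-app is illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # the Deployment being scaled
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # target 50% of the requested cpu
```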

In this example, the HPA is configured to keep the cpu utilization at 50% on average across all containers in all of the application’s pods.

When using resource metrics for cpu or memory, it is important that your application declares the relevant resource requests. The utilization threshold is calculated relative to them, so autoscaling cannot work without them.

A RabbitMQ queue length based scaling example is presented below:
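Written directly as an HPA, a queue-length metric served through the external metrics API might look like the following sketch; the metric name depends on how it is exposed, and all names here are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-consumer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-consumer
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: rabbitmq-queue-length   # name under which the metric is exposed (illustrative)
      target:
        type: AverageValue
        averageValue: "1"             # aim for one pod per queued message
```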

Here is another HPA example, configured to keep one pod per rabbitmq message. We can see that it has all the required fields: min and max replicas, the scale target reference and the metric specification.

Metrics array

The metrics array may list multiple metric definition objects. The HPA will set the desired replica number to the greatest value derived from all the listed metrics.

Metric type

The metric type defines the kind of the configured metric. There are five types of metrics:

  • ContainerResource and Resource types are used with the metrics API and relate to a container’s or a pod’s cpu or memory,
  • Pods and Object types are used with metrics originating from the custom metrics API, and
  • the External type is used with the external metrics API

Target type

The target type defines the nature of the specified threshold number. For each target type, the HPA controller calculates the required replicas from the metrics value in different ways.

There are three target types:

  • Utilization, expressed as a percentage and used exclusively for ContainerResource and Resource type metrics,
  • Value, and
  • the AverageValue target type

Desired replicas calculation

Let’s now see how the HPA calculates the desired replica count from the metric value. Generally speaking, the number of desired replicas in each HPA evaluation is derived from the ratio of the currently measured metric value to the target amount configured in the HPA: the replica count is adjusted proportionally to that ratio.
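The core formula, as documented by Kubernetes, is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). A minimal Python sketch of this calculation, including the controller’s default tolerance band (the function and parameter names are mine, not taken from the HPA code):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA calculation:
    desired = ceil(current * currentMetric / targetMetric).
    The real controller skips scaling when the ratio is within
    a tolerance band around 1.0 (default 0.1)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    return math.ceil(current_replicas * ratio)
```

For example, 4 replicas measuring 100 requests/s against a target of 50 yields 8 desired replicas, while a measurement within 10% of the target leaves the replica count unchanged.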

As usual, the devil is in the details, and the documentation on how the HPA calculates the desired number of replicas is rather shallow. Therefore, to look deeper into the matter, you may need to check out the HPA implementation code. Here I’ll just point you to the two files in the kubernetes project that encapsulate the relevant logic; they are both in the pkg/controller/podautoscaler folder:

  • replica_calculator.go and
  • horizontal.go

‘Average value’ vs ‘Value’ target type desired replicas calculation

As we are focusing on Keda today I will present some details of how external metrics based HPA calculation works. There are two supported target types in external metrics API: AverageValue and Value type.

  • We use the AverageValue target type when we want the HPA to average the observed metric across the total number of pods, and
  • we use the Value target type when the metric is already averaged, so the HPA will use it as is.

For example, we may use the AverageValue when the metric is queue length, and we may use the Value target type when the metric is the average time a message spends in queue.

Behavior

Now let’s see how the HPA object’s behavior field affects the desired replicas calculation. Here we can configure the policies for scaling up and down. A stabilization window declares a duration within which the maximal desired value takes precedence over any smaller values that subsequent calculations may yield. Under the default behavior policy, all of the replicas may be scaled down within 15 seconds. For scale up there are two policies, of which the one yielding the maximum number of pods is applied: the number of replicas may grow by either 100 percent or 4 pods within a 15-second window, whichever is larger. We only need to declare a behavior when it differs from this default.
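Expressed explicitly, the default behavior is equivalent to the following spec fragment (this mirrors the default documented by Kubernetes, shown here for reference):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # 5-minute downscale stabilization
    policies:
    - type: Percent
      value: 100                      # all replicas may go away...
      periodSeconds: 15               # ...within a 15-second window
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max                 # the policy adding the most pods wins
    policies:
    - type: Percent
      value: 100                      # double the replicas, or...
      periodSeconds: 15
    - type: Pods
      value: 4                        # ...add 4 pods, whichever is larger
      periodSeconds: 15
```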

HPA controller global parameters

The HPA controller runs inside a process called kube-controller-manager along with many other kubernetes controllers. To fine tune the global HPA parameters, we need to change the command line arguments sent to the kube-controller-manager at start time. Here is a small selection of important global HPA parameters:

  • --horizontal-pod-autoscaler-sync-period
  • --horizontal-pod-autoscaler-downscale-stabilization
  • --horizontal-pod-autoscaler-tolerance
  • --concurrent-horizontal-pod-autoscaler-syncs
  • --horizontal-pod-autoscaler-cpu-initialization-period
  • --horizontal-pod-autoscaler-initial-readiness-delay

The HPA sync period defines the polling period of the metrics. Its default is 15 seconds, and you may want to reduce it for greater responsiveness.

The downscale stabilization with default of 5 minutes sets the time window in the past from which the scale down recommendation maximum is selected.

The tolerance sets the metric change sensitivity threshold.

If you have many HPA objects, you may want to increase the concurrent-horizontal-pod-autoscaler-syncs to ensure more responsive reactions.

For cpu based autoscaling, make sure you understand the cpu initialization period and initial readiness delay settings well, as they both delay the inclusion of a ready pod in autoscaling considerations to account for cpu usage peaks during pod initialization.
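Putting it together, a kube-controller-manager invocation that tightens these parameters might look like this sketch; the values are illustrative, not recommendations:

```
kube-controller-manager \
  --horizontal-pod-autoscaler-sync-period=10s \
  --horizontal-pod-autoscaler-downscale-stabilization=1m \
  --horizontal-pod-autoscaler-tolerance=0.1 \
  --concurrent-horizontal-pod-autoscaler-syncs=10 \
  --horizontal-pod-autoscaler-cpu-initialization-period=3m \
  --horizontal-pod-autoscaler-initial-readiness-delay=30s
```

Note that in managed kubernetes offerings the control plane flags are often not directly accessible.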

The ScaledObject

ScaledObject CRD

When using Keda, we declare a ScaledObject custom resource, which in turn automatically creates an HPA object for us and exposes the metrics from an external system of our choice through the external metrics API.

Example

The following Keda object specification is all we need to add to our application in order to have it auto-scaled based on RabbitMQ queue length.
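A sketch of such a ScaledObject, with illustrative deployment and queue names, and the broker connection taken from an environment variable:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-consumer
spec:
  scaleTargetRef:
    name: my-consumer        # the Deployment to scale
  minReplicaCount: 0         # allow scale down to zero
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      queueName: work-items  # illustrative queue name
      mode: QueueLength
      value: "1"             # target one pod per queued message
      hostFromEnv: RABBITMQ_URL   # connection string from the pod's env
```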

The ScaledObject fields

In addition to the fields that we already explained from the HPA object, namely min and max replicas and the scale target reference, we declare here a few more:

  • triggers
  • metricType
  • cooldownPeriod
  • pollingInterval

The main one is the triggers section, where we list one or more metric definitions based on a large choice of supported middleware systems. The trigger definitions vary greatly depending on the scaler of our choice. The default metric target type in Keda is AverageValue, but you may declare a trigger with target type Value by adding a metricType field to the trigger definition. Keda also performs scale-down to zero if minReplicaCount is set to 0; in that case, cooldownPeriod defines how long to wait before scaling down to zero. The pollingInterval sets the period of Keda’s polling of the external metric.
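As a sketch, these fields sit directly under the ScaledObject spec. The Prometheus query below is illustrative, assuming an already-averaged metric for which the Value target type fits:

```yaml
spec:
  pollingInterval: 10   # query the external metric every 10 seconds (Keda default is 30)
  cooldownPeriod: 120   # with minReplicaCount 0, wait 120s before scaling to zero (default 300)
  minReplicaCount: 0
  triggers:
  - type: prometheus
    metricType: Value   # override the default AverageValue target type
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: avg(message_age_seconds)   # illustrative already-averaged metric
      threshold: "30"
```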

Alright. We have now covered the three types of metrics API and seen why Keda and the external metrics API are the way to go. We’ve also learned about the fields of the HPA object and Keda ScaledObjects, and we explained how the HPA calculations work for both AverageValue and Value target types.

Cluster Autoscaler

The Cluster Autoscaler overview

Now we shall see a bit about the Cluster Autoscaler’s function and mention a few of its parameters that we should be aware of when fine-tuning the solution. The Cluster Autoscaler detects that a pod cannot be scheduled, whether due to resource constraints such as insufficient CPU or memory or due to other constraints such as affinity/anti-affinity. In turn, it communicates with the cloud provider autoscaling API to request a new node. Likewise, when a node’s resources are free beyond a configured threshold, the Cluster Autoscaler releases the node.

The Cluster Autoscaler parameters

This list enumerates some of the important Cluster Autoscaler parameters you should consider when troubleshooting or fine tuning your auto-scaling setup:

  • --scan-interval
  • --scale-down-delay-after-add
  • --scale-down-unneeded-time
  • --scale-down-utilization-threshold

The scan-interval sets the frequency of scale up or down evaluations. It defaults to 10 seconds and can be reduced for quicker reactions to changes in the workload. scale-down-delay-after-add defines how long to wait after a scale up before scaling down any nodes; the default of 10 minutes can be reduced to around 10 seconds to minimize costs. scale-down-unneeded-time controls how long a node should stay below the threshold before it is taken out; it also defaults to 10 minutes and can be shortened. scale-down-utilization-threshold defines the utilization level below which a node is considered unneeded, with the default being 0.5 (50 percent). Make sure to adjust these and other Cluster Autoscaler parameters in order to get smooth autoscaling that best suits your use-case.
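As a sketch, a cluster-autoscaler deployment tuned along these lines might pass container args like the following; the values are illustrative:

```yaml
# Container args of a cluster-autoscaler deployment (illustrative values)
command:
- ./cluster-autoscaler
- --scan-interval=5s                     # evaluate more often than the 10s default
- --scale-down-delay-after-add=10s       # allow scale down soon after a scale up
- --scale-down-unneeded-time=2m          # reclaim idle nodes faster than the 10m default
- --scale-down-utilization-threshold=0.5 # node is unneeded below 50% utilization
```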

AWS ASG in manual mode

When Cluster Autoscaler scales the kubernetes cluster, it talks with the cloud vendor autoscaling API. In AWS it is called AWS Autoscaling Groups or AWS ASG. The AWS ASG has advanced Auto-scaling capabilities of its own like scaling policy, scheduled scaling and predictive auto-scaling but when used by Cluster Autoscaler, it operates in so-called manual mode, as Cluster Autoscaler makes all the scaling decisions. Luckily, KEDA provides similar advanced features with its cron and predictkube scalers.

AWS ASG warm-pool

Provisioning a cloud virtual machine instance is not an instant operation. It takes some time to create and boot the machine, and then the machine needs to be bootstrapped with the kubernetes installation and configuration. To reduce the total time required to add a node to the kubernetes cluster in AWS, we may use the AWS ASG warm-pool feature. When applied, the AWS ASG prepares the configured number of nodes by pre-launching them and running the launch template scripts that perform the installations. Then the warm pool instances are shut down; you only pay for the instances’ root volume disks during the time they are shut down. This way you enjoy faster node start-ups and therefore faster scale out of your cluster, meaning a more responsive and cost effective service.

Testing the autoscaling setup

Scaling stages

Make sure to test your auto-scaling setup. When testing it, you may want to simulate load peaks and monitor timings of different stages of pod, ec2 instance and kubernetes node initializations. This list shows the events in their common chronological order:

  • Pod created at
  • Instance boot
  • Node created
  • Node ready
  • Pod scheduled
  • Pod initialized
  • Pod containers ready
  • Pod ready

Performance without any tuning

In this test, I submitted 20 work items requiring 20 pods, each needing its own node. When using regular instance groups and default parameters for all the system components, you may expect results similar to the ones I measured: the pods are created within the first 0 to 50 seconds, and become ready after some 140 to 250 seconds. In my use-case this was not good enough, so I went on and fine-tuned some parameters.

Performance with behaviors setup

I started by configuring the HPA behavior, allowing more than just 4 pods to be added at once. This already drastically improved the results. But it was still not enough for my use-case: I needed pods ready within 60 seconds to avoid keeping so-called headroom, nodes without any load serving as a buffer for a spike in workload demand.

Performance fully tuned

So I went on to set up warm pools, pre-pulling the container images in the launch template scripts. Pods are now ready within 40–60 seconds of the need for them being detected. In my case, this meant significant savings for the company, as no expensive resources need to be kept around only to handle spikes in workload. I hope these hints inspire you to test and fine-tune your system’s auto-scaling parameters.

Closing the story

Further reading

There are many additional aspects of kubernetes related to horizontal auto-scaling that were not explained or even mentioned in this session: for example, the effects of pod priority and preemption, the effect of pod volumes, and other available cluster autoscalers such as Karpenter and commercial offerings. Equipped with the information from this session, you can easily enlarge your knowledge further and build some cool stuff.

Try it all out

To help you try it all out easily, I’ve created a github repo with terraform and kops files to create a kubernetes cluster from scratch in an AWS account. There are also two python apps — a web UI to submit workload and a RabbitMQ consumer service that emulates heavy data processing. Also, there are scripts to export the data required for the charts I presented here so you can easily reproduce them and tweak for your use-cases.

Check it out: https://github.com/aharonh/k8s-autoscaling-demo


If you have reached this point, then thank you so much for your attention. I hope this article will boost your auto-scaling project.

Written by Aharon Haravon

Seasoned software developer, architect, and leader with over twenty years of experience in Israeli high-tech startups.
