EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Autoscaling in Kubernetes: Why doesn’t the Horizontal Pod Autoscaler work for me?

Final part of a series exploring application autoscaling in Kubernetes

Sasidhar Sekar
Expedia Group Technology

--

Man sitting in front of his work desk, thinking
Photo by Jason Strull on Unsplash

If you haven’t read the first two posts, you can find them below.

In this post, we’ll look at a common concern among many application owners: “HPA seems simple enough. I enabled it by following all the documentation. But it does not work for me!”

Is it true that HPA (Horizontal Pod Autoscaler) does not work for certain applications?! Or are application owners doing something wrong that breaks HPA?! Read on to find out.

Before proceeding to the concerns, let’s take a brief look at how HPA is typically configured for an application. This serves as a platform for further analysis.

HPA Basics

As an example, let’s consider a service “busybox-1”. The goal is to get this service to autoscale when CPU usage exceeds 80%.

Because of the various limitations of the Vertical Pod Autoscaler in its current state, we implement autoscaling with HPA instead.

Below is a sample manifest for the HPA resource.

hpa.yaml
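If the embedded manifest does not render here, a minimal sketch of what it could contain is shown below, assuming the autoscaling/v1 API and the replica bounds visible in the kubectl output further down:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: busybox-1
spec:
  scaleTargetRef:            # the Deployment this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: busybox-1
  minReplicas: 3             # never run fewer replicas than this
  maxReplicas: 4             # never scale beyond this many replicas
  targetCPUUtilizationPercentage: 80   # scale out when average CPU usage exceeds 80%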

Note: While Horizontal Pod Autoscaling based on CPU utilization will be used as an example in this post, the concepts apply equally to any metric that can be used for autoscaling.

Configuring the “busybox-1” deployment with this HPA resource can be as simple as running the following command.

$ kubectl create -f hpa.yaml
horizontalpodautoscaler.autoscaling/busybox-1 created

You can view the current status of the HPA resource as shown below.

$ kubectl get hpa
NAME        REFERENCE            TARGET     MINPODS   MAXPODS   REPLICAS   AGE
busybox-1   Deployment/busybox   0% / 80%   3         4         1         11m

What does this mean?

When the average CPU usage across all the pods exceeds 80% (targetCPUUtilizationPercentage), HPA will spin up additional pods. The desired number of replicas is calculated as shown below.

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

The minimum number of replicas that should be running at any point in time is governed by the minReplicas parameter, and the maximum number of replicas the HPA can scale the deployment up to is governed by the maxReplicas parameter in the HPA manifest.
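As a quick worked example with illustrative numbers: if 1 replica is running and the observed average CPU utilization is 85% against a target of 80%, then desiredReplicas = ceil[1 × (85 ÷ 80)] = ceil[1.0625] = 2, so the HPA scales the deployment to 2 replicas.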

I’m sure all of this seems simple enough that you’re now wondering what could possibly go wrong for the HPA to not work. Well, read on!

Target Utilization

The figure below considers the example of a service with a steady workload. There are 3 graphs in the figure. Ordered from top to bottom, they describe the following:

  1. Load on the service, in terms of the Total CPU usage across all Pods. Total CPU usage refers to the CPU capacity required to handle the cumulative workload on the service. For example, a Total CPU usage of 240% indicates that the service requires at least 240% ÷ 100% = 2.4 pods to handle the workload.
    * 100% refers to the capacity of 1 pod
  2. Average CPU usage % across all the pods (= Total CPU usage % ÷ Number of pods)
  3. The number of pods running at any given point in time
Top graph has total load 40%-240%, middle graph has average CPU 40%-85%, bottom graph has replica count 1–4

The service is configured to autoscale with the HPA. As can be seen in the top right corner of the above figure, it is configured to run at the desired CPU usage of 80%, with the minReplicas parameter set to 1.

With these points in mind, let’s look at what happens over time in the above example.

  1. The workload on the service steadily increases from around 9 AM until it reaches the peak, just after midday. Then it steadily tapers off towards the end of the day
  2. Up until the first dropline (black vertical dotted line which extends down into the vertical blue dotted line), the total CPU required to handle the workload is < 80% (< targetCPUUtilizationPercentage). So, the HPA does not scale and the number of pods running = 1
  3. Beyond this point, the total CPU usage required to handle the workload increases above 80%
  4. HPA scales up the deployment to add one more replica, so the total number of pods running = 2
  5. Now, with 2 pods running and a cumulative CPU load of ~85%, the average CPU usage across all pods ~ 43%
  6. All of this is expected behavior. The HPA responds to an increase in workload by adding more replicas and manages to keep the average CPU utilization close to the configured targetCPUUtilizationPercentage

Concerns

Let’s focus on a couple of things in the figure that we haven’t yet described — The vertical blue dotted line and the vertical orange dotted line.

  • The blue dotted line is a marker for the time at which the 80% threshold was breached
  • The orange dotted line is a marker for the time at which an additional replica was up and running

There is a time lag between detection and scaling

As can be seen in the figure, there is a time lag between these two (i.e.) there is a time lag* between when the target CPU usage threshold was breached and when the additional replica was up and running.

* The reasons for this lag will be described later in this post

Autoscaling lag = Time lag between when the target CPU usage threshold was breached and when the additional replica was up and running

Let’s now focus on the red dotted line. This is a marker for the time at which the pod would reach 100% CPU utilization if the service wasn’t scaled. Let’s assume you do not want your pod’s CPU usage to reach this level because you observe plenty of throttling at this level — leading to severe degradation and failures.

Maximum time available to autoscale = Time lag between when the target CPU usage threshold was breached and when the pod would reach 100% CPU utilization

For the autoscaling solution to be effective, a key requirement is to have autoscaling lag < maximum time available to autoscale

Solutions

One way to reduce the risk of autoscaling lag going above the maximum time available might be to reduce the value of the targetCPUUtilizationPercentage parameter.

For example, in the above scenario, if the targetCPUUtilizationPercentage is set to 40% (instead of 80%), the threshold will be breached earlier and it would take roughly 3 times as long to reach 100% CPU utilization. So, this approach reduces the risk of degradation/failure considerably.
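In manifest terms, this is a small change to the HPA spec shown earlier (the values below are illustrative):

spec:
  minReplicas: 1
  maxReplicas: 6                       # more replicas are needed to serve the same load at the lower target
  targetCPUUtilizationPercentage: 40   # scale out at 40% instead of 80%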

But, there is a trade-off here: the lower the target CPU utilization, the higher the number of pods required to handle the same workload (240% in the above example). This tradeoff and the associated costs are illustrated in the table below.

80% target utilization, 3 pods needed, low cost; 40% target utilization, 6 pods needed, higher cost; etc
Lowering target CPU utilization for scaling gives a more responsive service, but a more expensive one too

The number of pods required to handle a given workload at 40% target CPU utilization is twice the number required at 80% target CPU utilization (240% ÷ 40% = 6 pods, versus 240% ÷ 80% = 3 pods).

As explained in this section, the choice of target utilization might appear simple at the outset but application owners will do well to be mindful of the tradeoffs involved when making a decision on this key parameter.

Higher target CPU utilization = Greater risk of degradation/failures

Lower target CPU utilization = More expensive to operate

Lossless Detection

Now let’s consider the example of a service with a spiky workload. The figure below contains 2 graphs. Ordered from top to bottom, they describe the following:

  1. Spiky nature of the workload
  2. Number of running pods
Top graph: utilization mainly low but with intermittent spikes; bottom graph: 1 replica all the time, no response to spikes
Spiky Workload — Lossy Detection

The service is configured to autoscale with the HPA. As can be seen in the top right corner of the above figure, it is configured to run at the desired CPU usage of 80%, with the minReplicas parameter set to 1.

With these points in mind, let’s look at what happens over time in this example.

  1. The workload remains low for a period of time, using < 20% CPU
  2. Then there is a sudden spike, taking the CPU usage > 80% for a few brief seconds
  3. The expectation is that when the CPU usage goes higher than 80%, then HPA should spin up a new pod to handle the increased workload
  4. But, as can be seen in the above figure, HPA doesn’t do this here

HPA can fail to detect workload spikes at times

What causes this behavior?

To understand the root cause of this behavior, let’s take a look at the example Kubernetes cluster illustrated below.

Complex diagram showing kubernetes control plane and how the components interact — the text details the useful parts

The following explains various steps in the above illustration.

  1. HPA does not receive events when there is a spike in the metrics. Rather, HPA polls the metrics-server for metrics every few seconds (configurable via the --horizontal-pod-autoscaler-sync-period flag, 15 seconds in this example)
  2. The metrics server, which HPA polls for metrics by default, itself collects metrics over a period of time, once every few seconds (configurable via the --metric-resolution flag, 30 sec in this example). Both settings are sketched after this list
  3. In this example, HPA is configured to scale based on the targetAverageCPUUtilization metric. Two keywords that warrant attention in the name of this metric are Average and Utilization.
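For reference, neither interval is set on the HPA object itself; they are configured on the control-plane components. A minimal sketch of where they live, using the values from this example (the exact deployment details vary by cluster):

# kube-controller-manager flag: how often the HPA control loop evaluates metrics
--horizontal-pod-autoscaler-sync-period=15s

# metrics-server container args: how often metrics are collected
args:
  - --metric-resolution=30s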

Let’s look at some examples of spikes and what HPA observes during these spikes.

Example 1:

Table showing second by second CPU usage of each replica, with spike as described in the text below

The above example shows CPU usage across the 3 pods of a service over a period of 30 sec (metrics-server resolution). There was a CPU spike at T+1 in one of the pods, pushing the CPU usage of that pod to 90%. This is > 80% targetAverageCPUUtilization parameter configured in the HPA. Yet, HPA does not scale out the pods during this workload spike, because:

  • Even though one of the pods has a CPU spike > 80%, the average CPU utilization across all the pods at T+1 is only 43%
  • Add to this the fact that the metrics server serves metrics aggregated over a period (30 sec, in this example), and the average CPU utilization over this 30-sec interval comes to only 21%, far below the 80% target

Because of these reasons, even though there was a workload spike in one pod leading to > 80% CPU usage on that pod, HPA did not respond by scaling out more replicas.
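To see how quickly a short spike gets diluted, consider an illustrative breakdown (the per-second values here are assumptions, not figures from the table): if the spiking pod runs at 90% for 1 second and at roughly 20% for the remaining 29 seconds, its own 30-second average is about (90 + 29 × 20) ÷ 30 ≈ 22%. Averaged with two other pods sitting near 20%, the value HPA sees comes out close to the 21% mentioned above, nowhere near the 80% target.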

Example 2:

Table showing second by second CPU usage of each replica, with spike as described in the text below

In this case, all the pods experienced a CPU spike > 80% at T+1. Yet, the aggregate of the average CPU usage over the 30 sec period was only 22% — again far below the 80% target. Hence, HPA does not respond by scaling out more replicas.

Example 3:

Table showing second by second CPU usage of each replica, with spike as described in the text below

In this example, the workload spike lasted for a longer period ~ 5 sec. Yet, the average CPU utilization aggregated over 30 sec = 31% < 80% targetAverageCPUUtilization. So, HPA again does not scale out the deployment.

Example 4:

Table showing second by second CPU usage of each replica, with spike as described in the text below

Finally, in this example, the average CPU usage across all the pods is consistently above the targetAverageCPUUtilization value for most of the 30 sec period (~ 26 sec). This results in an aggregated average CPU usage = 81% > 80% targetAverageCPUUtilization. So, HPA scales out the deployment by adding an additional replica.

Summarising the key findings on lossless detection:

  • Because HPA relies on aggregate metrics from the metrics server, brief workload spikes (in seconds) might not be sufficient to move the aggregate value over the HPA target
  • In addition, because the trigger in this case is the average CPU utilization, which is itself averaged over an interval, brief spikes that occur within a second can also be lost from HPA’s perspective if the averaging interval is much longer than the spike period

HPA might not be the best solution to detect brief/short-lived workload spikes

Solutions

Possible solutions to get HPA to scale for such workloads include:

  • Increasing the metrics resolution — If you’re using metrics-server to get the metrics, this might be as simple as setting the --metric-resolution flag to a lower value than the 30 sec mentioned in the above example.
    Note: The higher the resolution of metrics, the greater the overhead on the cluster. So, there is a tradeoff here between lossless detection and cluster overhead/reliability.

Tradeoff: Lossless detection vs Cluster overhead/reliability

  • Alternatively, use a Burstable QoS for the pods expecting such workloads (i.e.) in the case of the example described above, set the value of the limits parameter to more than 4 times the value of the requests parameter, as sketched below.
    So, if the pod requires only 2 CPU cores under normal circumstances, the requests parameter can be set to 2 and the limits parameter can be set to 8 (or more). Under normal circumstances, only 2 cores will be used, but if there is a workload spike, the pod is allowed to use more than the 2 cores requested, up to the configured limits value, 8 (or more) in this example. This approach does not use the HPA for scaling but rather relies on a flexible container resource configuration to absorb spikes.
    Note: Burstable QoS does not guarantee scalability (i.e.) more resources are allocated to the pod only if they are available. If the node where the pod is scheduled is 100% busy, then the pod cannot get additional resources. In addition, pods with a Burstable QoS are more likely (though not always) to be evicted under resource pressure than pods with a Guaranteed QoS. So, the tradeoffs here are lossless detection vs scalability guarantee + availability

Tradeoff: Lossless detection vs Scalability guarantee + Availability
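A minimal sketch of what such a Burstable resource configuration could look like (the container name and CPU values are illustrative):

containers:
  - name: busybox-1
    resources:
      requests:
        cpu: "2"   # what the pod needs under normal load; used for scheduling
      limits:
        cpu: "8"   # headroom to absorb short spikes, used only if the node has spare CPU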

Responsiveness

Let’s consider the example of the steady workload described earlier in the Target Utilization section of this post.

In that section, we covered the time lag between detection (blue dotted line) and scaling (orange dotted line) and how this lag can be managed by tweaking the targetAverageCPUUtilization parameter (albeit at a cost).

In this section, we’ll delve into the root cause of this delay and look at possible ways to reduce it.

Autoscaling lag = HPA detection lag + Application startup lag
HPA detection lag = metrics resolution + HPA sync period
Application startup lag = image download + initialization + readiness checks
Constituents of lag in autoscaling

As illustrated above, the major contributors to autoscaling delay with the HPA are:

  • HPA detection
  • Application startup

HPA Detection

Below is an illustration of the example Kubernetes cluster discussed earlier in this post.

Complex diagram showing kubernetes control plane and how the components interact — the text details the useful parts

As discussed earlier:

  • HPA does not receive events when there is a spike in the metrics. Rather, HPA polls the metrics-server for metrics every few seconds (configurable via the --horizontal-pod-autoscaler-sync-period flag, 15 seconds in this example)
  • The metrics server, which HPA polls for metrics by default, itself collects metrics over a period of time, once every few seconds (configurable via the --metric-resolution flag, 30 sec in this example)

In this example, depending on when the HPA polls, there could be a delay of 30–45 seconds (30 seconds metrics server resolution + 15 seconds HPA polling frequency).

This is one of the contributors to the lag in autoscaling.

Application Startup

The other, and possibly more important, constituent of the autoscaling lag is application startup. At a high level, autoscaling with HPA is a 3-step process:

  1. Detection — HPA detects a breach of the target threshold
  2. Scale — HPA responds by issuing a scale request
  3. Container Ready — New replica(s) starts taking traffic

While step 3 of this process, container readiness, isn’t something the HPA is responsible for, it is essential for the autoscaling to have any impact at all. What is the point in scaling out a new replica if it cannot take a share of the traffic, right?!

When the HPA issues a scale request, the Kubernetes control plane schedules the new pod to run on an appropriate worker node. But, there is a time lag between when the scheduler schedules the pod and when the pod actually starts taking traffic. This lag is caused by:

  1. Image downloads — For the pod to start up, the container images related to the pod need to be available on the worker node. If not, these images need to be downloaded from a repository. This can take some time, particularly if the container images are large (several MBs or more).
  2. Initialization procedures — Many applications rely on initialization procedures at startup to load configurations, warm up the application, and so on. The longer these procedures take, the longer it takes for the pod to move to the Ready state
  3. Readiness checks — Finally, a pod is not marked Ready unless it passes the readiness check, and pods cannot take traffic unless they’re marked Ready! It is not uncommon for application owners to specify a large initialDelaySeconds (the number of seconds after the container has started before liveness or readiness probes are initiated) because the time taken to complete initialization procedures is indeterminate. In such cases, even though the containers are ready, readiness checks might not be executed until the large initialDelaySeconds elapses. This causes further delay before the new pod takes traffic (see the probe sketch after this list).
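As an illustration, the readiness settings live on the container spec; the endpoint and numbers below are hypothetical, but they show where an overly large initialDelaySeconds would sit:

readinessProbe:
  httpGet:
    path: /healthz            # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 60     # a large value here delays traffic even if the app is already up
  periodSeconds: 10           # how often the check is repeated after the initial delay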

Solutions

Possible solutions to help the autoscaler respond faster:

  • Keep container images small. The smaller the image, the faster it is to download from the repo
  • Keep the initialization procedures short — Avoid loading large configurations at startup; attempt to keep warmup routines short.
    For some applications, shorter initialization might mean a compromise on the runtime performance. Be mindful of that tradeoff.

Tradeoff: Startup performance vs Runtime performance

  • Keep the delays between readiness checks (including the initial delay) reasonable. Long delays might lead to a situation where the container is ready but the probe is still waiting for the delay to lapse before it runs.
    The challenge here is to estimate the duration of initialization procedures. It might be difficult to estimate this accurately but a reasonable approximation will do a lot better than a random delay, for the purposes of autoscaling
  • Finally, increase the metrics resolution and/or the HPA polling frequency. This might be as simple as setting the --metric-resolution flag (metrics-server) and the --horizontal-pod-autoscaler-sync-period flag (HPA) to lower values.
    The higher the metrics resolution/polling frequency, the greater the overhead on the cluster. So, there is a tradeoff here between responsiveness and cluster overhead/reliability.

Tradeoff: Responsiveness vs Cluster overhead and reliability

Resilience

HPA is certainly useful for applications dealing with fluctuating workloads. But, at times, in trying to manage capacity, the HPA can scale an application out so far that it hogs all the resources in a Kubernetes cluster, leaving very little for the other applications running on the cluster, something like the illustration below, where Application 1 is hogging all the resources.

Application 1 is taking all the spare resources in the cluster

You might be wondering — isn’t it a good thing that applications are able to scale out according to the workload? If the cluster is stressed out for resources, can the Cluster Autoscaler not be used to handle this?!

Unbounded scaling can be detrimental to the performance/reliability of applications in a Cluster

That does have some truth in it and is certainly a feasible solution. But imagine one of the applications being bombarded by bot traffic at 100 times its usual workload. In such cases, HPA might scale out the application by 100 times. This has the following side effects:

  1. It is very expensive, 100 times more expensive than usual
  2. All that money is spent on bot traffic (i.e.) it's not adding any business value
  3. It puts the cluster under duress. The Cluster Autoscaler will help alleviate some of the stress, but it also needs time to detect and respond (lag), and most infrastructures have limits configured on the number of nodes that can be spun up. So the stress on the cluster and on the other applications running in it might not be avoided completely

Solution

One way to mitigate this is to cap the maximum number of replicas the HPA can scale to, as against unbounded scaling. You can configure that using the maxReplicas parameter in the HPA manifest.

In the below sample manifest, HPA is limited to a maximum of 4 replicas.
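A minimal sketch of such a manifest (the original embed may not render here; the values mirror the earlier example, with the maxReplicas cap being the point of interest):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: busybox-1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: busybox-1
  minReplicas: 1
  maxReplicas: 4                       # hard cap on scale-out
  targetCPUUtilizationPercentage: 80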

For most applications, the workload is predictable. Hence it is not too hard to come up with a value for the maxReplicas parameter. But what do you do when the workload is not predictable?!

For example, assume you run a news website. Demand for news will go up and down depending on the news that is trending. How do you predict the maximum expected load on the system?!

It might not be possible to accurately predict the maximum expected workload. But an approximate estimate that is based on facts and is reasonable will be a lot better than a random one. In the case of the news website example, you can use the following metrics to come up with an estimate (a rough calculation follows the list).

  1. The number of base users (N) — users who visit the website regularly — over time
  2. Workload (W) — in requests/sec — over time
  3. Capacity (C) — in requests/sec — of each replica
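Putting these together, a reasonable first estimate is maxReplicas ≈ ceil(peak W ÷ C), with the peak workload W taken from the busiest period observed. As an illustration (the numbers are hypothetical), if the busiest day peaked at around 700 requests/sec and each replica comfortably handles about 100 requests/sec, that works out to roughly 7 replicas, which lines up with the 30-day observation below.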

The below illustration shows these metrics over a period of 30 days.

Various graphs showing varying loads, users and replica counts. Outcome is described in the text

As can be seen here, the maximum number of pods used in a 30-day period is ~ 7. So, when choosing the maxReplicas parameter, it would be wise to add a buffer but make sure that it is reasonable (i.e.) setting the maxReplicas parameter to 10 would be better in this case than setting it to 100.

The trade-off here is resilience vs unlimited scaling. In the above example, if there is a piece of breaking news that completely throws off any past numbers, leading to 100x the usual workload, HPA will not be able to handle that.

Tradeoff: Resilience vs Unlimited scaling

Summary

  • In Post #1 of this series, we came up with a set of acceptance criteria for any autoscaling solution — Reliability, Efficiency, Responsiveness, and Resilience
  • The challenge for application owners is that each one of these has tradeoffs involved and decisions to make.
  • At times, it is possible that an attempt to help improve one criterion might compromise another.
  • Application owners need to be mindful of the fact that:

Enabling HPA is not the same as having a working autoscaling solution

  • The advice for application owners is to understand the tradeoffs, collect metrics, and make conscious decisions to improve their HPA-based autoscaling solution so that it works for their application

Learn more about technology at Expedia Group
