How to Scale Your K8s Pods The Right Way

Daniel Weiskopf
Nov 21, 2023


Using resource limits to configure your HPA scaling policies.

tl;dr

The Kubernetes HPA, by default, uses the resource requests when calculating CPU and memory utilization. In order to configure your HPA to use resource limits when calculating resource utilization, use the following formula, where desiredPercentage is the average utilization of the resource limit you would like to set:

resource:
  name: cpu/memory
  target:
    type: Utilization
    averageUtilization: (limit / request) * desiredPercentage

Understanding HPA Scaling

The Horizontal Pod Autoscaler (HPA) in Kubernetes allows for the automatic scaling of the number of pods in a deployment based on observed CPU or memory utilization.

By default, when a user defines a target or average utilization, Kubernetes calculates that utilization based on the resource request. This means that if you set a CPU resource request of 100m and a limit of 200m, and set your HPA average utilization to 75%, your deployment will scale when CPU usage reaches 75m, not 150m.
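
For reference, here is a minimal sketch of that default setup as an autoscaling/v2 manifest (the Deployment name my-app and the replica bounds are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa           # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app             # placeholder Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # 75% of the 100m request, i.e. scales at ~75m per pod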

Requests vs Limits Refresher

Before we dive deeper into why you might want to configure your HPA to scale based on your resource limit as opposed to the request, let's quickly recap what resource requests and limits are (a sample container spec follows the two lists below):

Requests:

  • Requests are used to specify the minimum amount of a resource (CPU or memory) that a container needs to run.
  • Scheduling: Kubernetes uses requests to decide which node to place a pod on. A pod is scheduled on a node only if the node has enough available resources to meet the pod’s request.

Limits:

  • Limits define the maximum amount of a resource that a container can use. If a container tries to exceed this limit, it will be throttled (in the case of CPU) or potentially terminated (in the case of memory).
  • Preventing Resource Starvation: Limits are useful for ensuring a container doesn’t use all of a node’s resources, which could starve other containers.
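
Putting the two together, a container spec with both requests and limits might look like this minimal sketch (the pod name, image, and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # illustrative name
spec:
  containers:
    - name: app
      image: nginx           # illustrative image
      resources:
        requests:
          cpu: 100m          # the scheduler reserves 100 millicores for this container
          memory: 128Mi
        limits:
          cpu: 200m          # the container is throttled above 200 millicores
          memory: 256Mi      # the container may be OOM-killed if it exceeds 256Mi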

Why You Should Scale Based on Resource Limits

Scaling Kubernetes pods based on limits rather than requests can be advantageous for ensuring optimal resource utilization and cost efficiency.

Maximizing Resource Utilization: Scaling based on limits allows you to fully utilize the resources of your nodes. Requests often represent the minimum needed resources, which may lead to underutilization of available resources. If you always scale your pods up when the utilization reaches 70% of your minimum resource requirements, you’ll never utilize the optimal amount of resources per pod.

Additionally, by setting a request, you are essentially telling the scheduler to reserve that amount of resources for your pod. But if you scale once you’ve hit 70% of those resources, you aren’t taking full advantage of the last 30% of the resources you reserved.

By scaling based on limits, you ensure that the pods use as much of the node’s resources as possible before triggering a scale-up, leading to more efficient use of your infrastructure.

Cost-Effective Scaling: By scaling based on limits, you are effectively delaying the scale-out action until it’s absolutely necessary (i.e., when the current resource usage is close to the maximum allowed). This can be more cost-effective, as it prevents premature scaling which might result in underutilized resources.

How Can We Set the HPA Rules to Scale Based on Resource Limits?

By default, setting a target or average utilization configures your HPA to scale when the utilization percentage of the resource request reaches a certain value. For instance, if the request is 100m and the utilization percentage is 75, the deployment will scale once the CPU usage reaches 75m.

How can we configure the HPA to calculate the utilization percentage based on the resource limit instead? So that, for instance, if the limit is set to 200m and the desired utilization percentage is 75, the deployment will scale when the CPU usage reaches 150m.

The answer is a simple formula. First I will show you the formula, and then we can walk through how it works.

Formula -> (limit/request) × desiredPercentage

Understanding the Formula

Now, to explain how this works.

limit/request: This ratio gives you the proportion of the limit relative to the request. For example, if the limit is twice the request, this ratio would be 2.

desiredPercentage: This is the target utilization percentage you would like to set for scaling based on the limit. For example, if you set a target CPU utilization of 50% based on the limit, the desiredPercentage is 50.

When you multiply this ratio by the desiredPercentage, you effectively adjust the target utilization to reflect the limit rather than the request. For instance, if the limit is twice the request and your desiredPercentage is 50%, the formula would yield 100%. This means that the HPA will now scale up when the utilization reaches 100% of the request, which corresponds to 50% of the limit.

This formula allows you to keep using HPA, which inherently bases its calculations on requests, but with scaling behavior that considers limits.

Example

Suppose you have a container with a CPU request of 100m (milli-CPU) and a limit of 500m. You would like to set an HPA target CPU utilization of 60% based on the limit.

Applying the formula: (500m/100m) × 60 = 300

This calculation tells the HPA to target CPU utilization at 300% based on the request, which effectively makes it scale based on reaching 60% of the CPU limit.
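
Plugged into an HPA manifest, that works out to a metrics block like the following sketch (only the relevant section is shown):

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 300   # 300% of the 100m request = 300m, i.e. 60% of the 500m limit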

Practical Example using Terraform

Using Terraform variables lets us bake this formula into our modules:

resource "kubernetes_horizontal_pod_autoscaler_v2" "example_hpa" {
metadata {
name = "example-hpa"
namespace = "default"
}

spec {
min_replicas = 1
max_replicas = 10

metrics {
type = "Resource"
resource {
name = "cpu"
target_average_utilization = ceil((var.cpu_limit/var.cpu_request) * var.cpu_utilization)
}
}
}
}

Conclusion

By adjusting the target utilization in this way, you can effectively make the HPA consider the limits for scaling decisions, even though it technically operates on request values.

Configuring our HPA around resource limits can help us ensure optimal resource utilization and cost efficiency. However, it’s important to balance this approach with the risks of potential resource starvation for other pods and the need for careful monitoring to avoid overloading nodes.

Disclaimer: Not all workloads benefit from the use of separate resource limits.

Read this article about CPU limits to understand why CPU limits might be hurting your deployments: https://web.archive.org/web/20220805232857/https://home.robusta.dev/blog/stop-using-cpu-limits/

Read this article on why you might want to set your memory limit equal to your request: https://home.robusta.dev/blog/kubernetes-memory-limit
