Autoscaling in Kubernetes

Riccardo
Softonic Engineering
7 min read · Jun 12, 2019

One of the advantages of a cloud-based infrastructure is its flexibility: you pay for what you use, and you do not pay for it when it’s not needed. Except it’s not exactly like that: deciding when you need more, and how much more, is not easy. Kubernetes helps you with that through its autoscaling features: this article shows how we implemented them at Softonic.

Kubernetes offers three main autoscaling dimensions to leverage:

  • Horizontal Pod Autoscaler: controls the number of replicas in a deployment
  • Vertical Pod Autoscaler: controls the amount of requested resources (CPU and Memory) for a pod
  • Cluster Autoscaler: controls the number of nodes in a cluster

Real-life example

Let’s assume we want to deploy a web app and apply autoscaling to it.

We start with a virtually empty cluster, say 1 node with 4GB of memory and 2 CPUs, but we set up the cluster autoscaler to fire up more nodes of the same type if needed (normally some infrastructure is already deployed on the cluster and occupies some resources, but for the sake of simplicity we will ignore it in this example).
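On GKE, for instance, the cluster autoscaler is enabled per node pool; the cluster and pool names below are hypothetical:

gcloud container clusters update my-cluster \
  --enable-autoscaling --min-nodes 1 --max-nodes 5 \
  --node-pool default-pool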

We deploy a web application on it, using a Deployment object and a pod spec that requests 500Mi of memory and 500m of CPU per replica.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: nginx
        image: myapp:1.0.0
        resources:
          requests:
            cpu: 500m
            memory: 500Mi
          limits:
            cpu: 1
            memory: 600Mi

HPA

To this application, we apply HPA as follows:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
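The same CPU-based HPA can also be created imperatively (note that kubectl autoscale produces an autoscaling/v1 object rather than v2beta2):

kubectl autoscale deployment myapp -n myapp --cpu-percent=50 --min=1 --max=10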

Once this HPA object is deployed, the autoscaler will calculate how many replicas the deployment needs based on CPU utilization, obtained through the metrics endpoint.
Using the metrics endpoint is easy: it is registered by default, so we can just deploy our application, set up an HPA object, and we’re good to go.
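A quick way to check that the metrics endpoint is actually serving data:

# Per-pod CPU/memory as reported by the metrics API
kubectl top pods -n myapp
# Or query the API group directly
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/myapp/pods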

As we do not have any traffic yet, the HPA sets the number of replicas to 1, the minimum, and CPU usage will likely be 0%.

If we receive a burst of traffic, the pod will reach 200% CPU utilization (Node.js is single-threaded, so it will consume at most one whole CPU; since we declared a request of 500m, that is 200% of the requested value).
As soon as the controller detects this, it will fire up another 3 replicas, for a total of 4, so that each pod could potentially sit at 50% CPU. Our total consumption is now 4 * 500m = 2 CPUs and 4 * 500Mi = 2GB of memory, which still fits on one node [1].
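This follows directly from the HPA controller’s scaling formula:

desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)
                = ceil(1 * 200% / 50%)
                = 4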

But we did not take into account that our application can consume at most 1 CPU, and it was actually accumulating requests. Therefore, each pod will consume more than 50% CPU to work through the backlog, and they will settle around 70% each.

[Figure: HPA]

So, in the end the application handled all the traffic, but it took 2 cycles of the HPA controller plus the wait for the cluster autoscaler to fire up a node. Good, but not ideal. How can we improve it?

First, we need to identify what could be improved: the biggest problem seems to be that we are unable to detect accumulating requests. So, what if we could autoscale based on the number of requests?

Custom Metrics

In order to achieve that, we first need to export custom metrics. That can be done by registering the custom metrics endpoint; in Softonic’s case we deployed the prometheus adapter, so that we can expose the metrics we already collect through Prometheus.
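To give an idea of what this looks like, a prometheus-adapter rule exposing a per-pod requests metric could look like the following sketch; the Prometheus series name myapp_http_requests_total is an assumption for illustration, not something from our actual setup:

# Hypothetical prometheus-adapter rule: exposes the Prometheus counter
# myapp_http_requests_total as a per-pod custom metric named "requests".
rules:
- seriesQuery: 'myapp_http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    as: "requests"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'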

We test our application and notice it can handle 20 requests per second on average before its performance degrades, so we declare that in the HPA as follows:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: requests
      target:
        type: AverageValue
        averageValue: "20"

So now, if we receive another burst of traffic while we have one lonely pod running, we can actually count incoming requests per second: if we get 120 req/s, we can scale straight to 6 pods, whereas with CPU alone we would have scaled to 4. We also keep the CPU metric, so that we stay protected against requests that are few in volume but require a lot of CPU.

With 6 pods, the total requested resources are 3GB of memory and 3 CPUs. The first 4 pods will be scheduled normally, but the fifth and sixth will have no space and will be left in a Pending state. The cluster autoscaler will notice that and fire up another node; once the node is up, the remaining two pods will be scheduled.
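You can watch this happen while it unfolds, for example:

# Pods that do not fit on any node stay Pending until the new node joins
kubectl get pods -n myapp -l app=myapp
# The Pending pods' events should show the cluster autoscaler reacting
# (event reason: TriggeredScaleUp)
kubectl describe pods -n myapp -l app=myapp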

So that’s better: we went through only one HPA cycle, but we still had to wait for the cluster autoscaler.

VPA

We have declared resource requests of 500m of CPU and 500Mi of memory, but how much memory and CPU is our application really consuming? We could watch the collected metrics and adjust the requested values accordingly. But there is a better way: the Vertical Pod Autoscaler.

With VPA, we deploy a VPA object to monitor a deployment, as follows:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Auto"

The above object will monitor CPU and memory usage for us and adjust the pods’ requests accordingly. In the mode deployed above it will destroy and recreate pods with the new requirements when it sees fit, but there are other modes:

  • Auto and Recreate will destroy and recreate pods when the autoscaler sees fit
  • Initial will apply new requests only when pods are being recreated by other actions
  • Off will only publish recommendations, without taking any action
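Whatever the mode, the current recommendations can be inspected at any time:

kubectl describe vpa my-app-vpa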

But wait a moment: now our requested CPU value will be lowered, and since actual usage stays more or less the same, the utilization percentage goes up, so eventually our HPA will kick in. Creating this VPA object actually created a race condition between VPA and HPA. However, as we already run custom metrics, this is easily fixed: instead of relative CPU utilization, we set the HPA on absolute CPU usage, which still keeps us protected against CPU-intensive requests.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests
      target:
        type: AverageValue
        averageValue: "20"
  - type: Pods
    pods:
      metric:
        name: cpu_usage
      target:
        type: AverageValue
        averageValue: 250m

Once we have gotten rid of this race condition, we can wait for VPA to collect enough metrics; eventually it will settle on the app’s real usage (given enough traffic, that is).

Let’s imagine VPA calculates our app’s real usage to be 400m of CPU and 350MB of memory: that’s good, we’ve gained a little, but not enough to actually save anything. The biggest advantage, however, is that if our app changes, its requirements will change automatically.

Faster autoscaling: balloon pods technique

If our app is prone to traffic bursts, we might want to get rid of the waiting time for a new node to fire up. One technique that we have experimented with at Softonic is the balloon pods technique.

[Figure: Balloon deployment]

This requires PriorityClass to be implemented: the idea is to set up a useless pod with low priority that occupies some space, sized according to our needs. When the resources are actually needed, the balloon is evicted and replaced with our real app, and the cluster autoscaler immediately fires up another node, because the balloon pod is now in a Pending state.
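A minimal sketch of such a balloon, assuming PriorityClass is available in the cluster (the names here are illustrative, and the balloon is sized to hold roughly one extra replica of myapp):

# Hypothetical low-priority class: anything below the default 0 gets
# preempted by normal workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon
value: -1
globalDefault: false
description: "Placeholder priority for balloon pods"
---
# The balloon itself: a pause container that does nothing except reserve
# the requested resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 1
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        resources:
          requests:
            cpu: 500m
            memory: 500Mi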
Of course, the disadvantage of this strategy is that we constantly occupy some resources without real need, “just in case”. This might be worth it for bigger deployments, but for smaller ones it is probably best to just wait for the node to be ready: GKE has been improving node boot times a lot in its latest versions.

External metrics

Not all apps are web apps, and some might need to autoscale based on metrics that are not their own but come from external objects. For example, a consumer might want to increase its replica count when messages accumulate in a given queue. This can be achieved with external metrics:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myconsumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myconsumer
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages
        selector:
          matchLabels:
            node: rabbit
            queue: myqueue
      target:
        type: AverageValue
        averageValue: "100"

In this case, when more than 200 messages are queued up in the queue myqueue, two replicas will be fired up. As soon as the queue drops back below roughly 100 messages, the deployment will be scaled back down to one replica.

Learnings

In Softonic’s case, implementing HPA gave us a lot of resilience against massive crawling and traffic spikes. It also brought considerable cost savings; VPA, however, is the feature that brought us the most savings overall: we shaved 50% off our requested resources during the first week the feature was rolled out.

[1] Normally there is some overhead on each node, but for the sake of simplicity we treat all declared resources as allocatable.
