Autoscaling in Kubernetes for More CPU Resources with HPA and CA

Kenneth Tang
8 min read · Jun 20, 2022


This is an example showing how HPA and CA work together to do autoscaling. I am going to demonstrate the idea with the CPU metric, and the same approach should work for memory as well.

TL;DR: HPA creates more pods, CA creates more nodes; both rely on ‘CPU requests’. Tune this value so that new pods are forced to be scheduled on new nodes.

I have been working on k8s for almost three years, developing web applications for my clients. The web app in question is a CMS written in PHP and Vue.js, intended for in-house use. We expect a few tens of users to access it simultaneously.

We ended up hosting the web app on Azure Kubernetes Service (AKS). Two dual-core machines were selected as the worker nodes. You may wonder why, for such a small-scale deployment, we chose AKS over traditional VMs with auto-scaling and load balancers, or the more modern ECS on AWS. Yes, I think they would fit better here. However, sometimes (or should I say: as always!?!) stakeholders and business requirements override your technical decisions. 😅

Never mind, we are running on AKS now. As the software grew, users complained about slow response times. The PHP servers consume too much CPU time when there are many concurrent users. Code optimization is something that has to be done, but it takes time. The fastest way is to scale AKS up and out for more CPU power. 💰

I knew an autoscaler comes with k8s, but I had never gone deep into it, so this was the perfect time to drill in. It turned out to take me almost a week of study to make it work. There are many high-level explanations of Horizontal Pod Autoscaling (HPA) and Cluster Autoscaling (CA) separately, but I could barely find a complete example demonstrating how to bring them together. In my case, I need both of them working together! That drove me to jot it down on Medium. I assume you already have basic Kubernetes knowledge like me but got stuck playing with the autoscalers. The way I show things here may not follow best practice; I just want to share a working example so that you can go deeper on your own before giving up.

The implementation is less complex than I thought. I just added a few blocks of settings to my original YAML file to make it work, and all of these settings can easily be found in the official k8s documentation. What bothered me was:

  • What metrics HPA and CA monitor
  • The magic between HPA and CA
  • In what sequence the autoscalers perform scaling
  • What parameters to set, and how to set them

Moreover, most articles focus on scaling out. What about scaling in? In my case, the web app is for in-house use, and the peak time is within office hours. If we can scale down the number of nodes at night, it will be a significant cost cut.

Metrics Server

We need the metrics server running in the cluster to collect resource metrics before we can go any deeper. The metrics server appears to be deployed automatically on recent k8s versions. To check whether the metrics server is running:

# kubectl get pods --all-namespaces

You will see pods starting with “metrics-server-” under the kube-system namespace if the metrics server is running. If not, you have to deploy it yourself; ask Google!

metrics-server should be deployed automatically
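If it is missing, applying the upstream manifest published by the metrics-server project is the usual way to install it (this is the install path documented by the project at the time of writing; verify it suits your cluster version):

# kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml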

Horizontal Pod Autoscaling (HPA)

There is an official explanation, and you can click the link above to see it. In short, HPA adds/removes pods to/from your replica set according to rules you define.

HPA can monitor many metrics to decide what scaling actions to take. Since I am running out of CPU, I will use CPU utilization as the metric.

Consider my case: my CPU-hungry pods are the PHP servers. Initially, I have three replicas running. Now I want to add more replicas, up to a maximum of 8, whenever the average CPU utilization exceeds 60%. We can run the following command to enable HPA for my PHP server:

# kubectl autoscale deployment php-server --min=3 --max=8 --cpu-percent=60

Alternatively, you can apply the equivalent YAML file; both mean the same thing.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-server
  minReplicas: 3
  maxReplicas: 8
  targetCPUUtilizationPercentage: 60
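If you go the YAML route, save the manifest to a file (the name php-server-hpa.yaml below is just an example) and apply it:

# kubectl apply -f php-server-hpa.yaml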

So what does 60% CPU utilization mean exactly? It is defined as the total CPU time (in millicores; 1 vCPU = 1000 millicores, or 1000m for short) consumed by all replicas of the deployment, divided by the total CPU requests specified in your deployment. It is worth noting that without specifying resource requests, HPA will fail to work. You can read more about requests and limits here if you are unfamiliar with these two terms.

CPU and Memory Request and Limit defined in YAML file
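The screenshot is not reproduced here, so here is a minimal sketch of what the resources section of the php-server container spec could look like. The 300m CPU request matches the example later in this post; the limits and memory values are purely illustrative:

# Inside the php-server Deployment, at the container level (illustrative values)
resources:
  requests:
    cpu: 300m        # HPA and the scheduler both work off this value
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi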

We can keep an eye on the HPA with the following command:

# kubectl get hpa
The column “TARGETS” shows the current CPU utilization (x) against your target threshold value (y).

To take an example: I have 3 PHP servers consuming 150, 400, and 100 millicores respectively, and the requested CPU for this deployment is set to 300m per pod, so the utilization is (150+400+100) / (300×3) ≈ 72% > 60%. So the HPA is going to scale out more pods, up to the maximum of 8, to see if that helps bring the utilization down.

Maybe oversimplified, but just think of it this way:

If x > y then add pods; If x < y then remove pods
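For reference, the algorithm documented for HPA is slightly more precise than that: it scales the current replica count by the ratio of current to target utilization and rounds up. With the numbers from the example above:

desiredReplicas = ceil( currentReplicas × currentUtilization / targetUtilization )
                = ceil( 3 × 72% / 60% )
                = ceil( 3.6 ) = 4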

Here are two scenarios:

  1. Your nodes still have plenty of unused CPU before reaching their physical limit, ready to accommodate more pods. In that case all the new pods can be scheduled onto the existing nodes, and performance improves. The story ends, thanks for watching…
  2. Your nodes only have a few cores, and the original three replicas have already consumed all of the actual CPU time. HPA will still create more pods, and even if they can be scheduled and you have more pods serving now, there will be no performance improvement since the actual CPU time is already used up. This is exactly what I was facing.

The real question is, how to get more CPU time?

Pay more money to rent more nodes!

You can always add more nodes to your cluster manually. However, without proper settings in HPA, or with no HPA at all, pods will not be scheduled or rescheduled onto the new nodes. So adding more nodes doesn’t always result in a performance improvement.

We have to find a way to deploy new pods on new nodes in order to increase our processing power. HPA together with CA will do the trick!

First of all, we have to stop new replicas created by HPA from being scheduled on the old nodes, as they are too busy already. The magic here is, again, the resource requests defined in the YAML file. It is good practice to define requests for each pod so that every pod running on a node has its own requests value. Consider a new node with two vCPUs: its total CPU time is 2000m. For a pod to be successfully scheduled and run on a node, its requests must not be greater than the CPU time remaining on that node. For example, our PHP server requests 300m per pod, so a 2000m node should be able to hold up to 6 (2000/300 ≈ 6.7) but not 7 PHP servers, assuming no other pods are scheduled there (in fact, it could be fewer than six, since every node has some k8s-managed pods running in the background that also contribute to the requests). The 7th and any further pods scaled out by HPA will end up with a status of Pending rather than Running. You can check the pod status by:

# kubectl get pods -owide
One of the pods showing Pending

If you describe the pending pod, you will see “Insufficient cpu”.

# kubectl describe pod pendingPodName
The pending pod was due to insufficient CPU
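The exact wording depends on your Kubernetes version, but the Events section at the bottom of the describe output typically contains a FailedScheduling warning along these lines (abridged):

Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.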

To check the total requests from all running pods on a node:

# kubectl describe node yourNodeName
A dual-core machine allocated 98% of the CPU resources to its pods
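That figure comes from the “Allocated resources” section of the describe output, which sums the requests of every pod on the node. It looks roughly like this (values illustrative):

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       1900m (98%)   2600m (135%)
  memory    1200Mi (60%)  2200Mi (110%)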

The Kubernetes scheduler decides whether a node can accept more pods by comparing the total requests of all pods already running on it against the node’s allocatable CPU; HPA only decides how many replicas there should be.

Cluster Autoscaling

To recap our situation at this point: we have a two-core node with 3 PHP pods fully loading the cores. HPA wants to scale out more PHP pods, and we use the ‘requests’ values to prevent them from being scheduled on our existing, fully loaded node. We want them deployed on new nodes.

Cluster autoscaling is the trick that allows your infrastructure to scale out more physical nodes to tackle the situation above (and, of course, to scale in). Let’s enable cluster autoscaling through the Azure CLI first:

# az aks nodepool update -g yourResourceGroup --cluster-name yourAksName -n yourPool --enable-cluster-autoscaler --min-count desiredMinCount --max-count desiredMaxCount

Note: Cluster Autoscaling is cloud-provider dependent; common providers such as Azure, AWS, and Google do support CA. The above command only works for Azure, so please make sure you use the right command for your cloud provider.

It tells your Azure AKS node pool to enable CA with the minimum and maximum node counts you desire.
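To double-check what was applied, you can query the node pool afterwards. The field names in the --query expression below are my reading of the Azure CLI output, so adjust them if your CLI version reports them differently:

# az aks nodepool show -g yourResourceGroup --cluster-name yourAksName -n yourPool --query "{autoscaler:enableAutoScaling, min:minCount, max:maxCount}"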

Once CA is enabled, when it detects pods pending due to insufficient resources such as CPU or memory, it will try to add more nodes until it reaches the maximum node count.

You can get the status of the nodes by:

# kubectl get nodes
One of the nodes is creating but in NotReady status

It takes a couple of minutes for the new node to become ready, depending on the size of machine you have chosen.

When the new nodes are ready, the scheduler takes over again and the pending pods should be placed onto the new, empty nodes.

Okay, problem solved; now we have more pods running on new nodes. In practice, you may want to control which functional pods get deployed to the new nodes so that you can automatically shut some nodes down during off-peak hours without interrupting services. I will share my experience with scale-in and node affinity in the next post.

If you find this useful, give me some applause and please follow me. Thanks for reading!


Kenneth Tang

AWS Certified Solutions Architect Associate; CISSP holder; Heavy Azure user