Autoscaling a Kubernetes cluster created with acs-engine on Azure

This article assumes that you already have a Kubernetes cluster created with acs-engine up and running. If that is not the case, you can read my previous article, which explains how to create one (in the context of machine learning), or read the official documentation.

In this article we will take a look at Kubernetes-acs-engine-autoscaler (a fork of OpenAI's Kubernetes-ec2-autoscaler) and how we can autoscale an acs-engine Kubernetes cluster with it.
Toward the end of this article, we will also see how to use it in conjunction with Kubernetes Horizontal Pod Autoscaling to allow scaling based on some metrics such as CPU utilization.

If your cluster wasn't created with acs-engine and is based on VM scale sets, you should instead use cluster-autoscaler or OpenAI's work in progress.

This autoscaler will run inside the cluster and monitor the different pods that get scheduled. Whenever a pod is pending because of a lack of resources, the autoscaler will create an adequate number of new VMs to support them. Finally, when VMs are idle, the autoscaler will delete them.

That way we can achieve the flexibility we want, while still keeping costs down.

Here is a nice illustration of how all of this works behind the scenes, taken from OpenAI's Kubernetes-ec2-autoscaler:

Prerequisites

In order to work properly, the autoscaler requires that all the pods running on your cluster have specific resource limits defined. This allows the autoscaler to know how many new VMs to create when some pods are pending, and which VM size fits best.
You don't need to provide a value for every type of resource, though, only for what you explicitly need.
For example, if your pod is supposed to train a machine learning model on GPUs, and you don't really care about the number of CPUs, you can simply specify:

resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 2
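
To show where this fragment sits, here is a minimal sketch of a complete Pod manifest. The pod name, container name, and image are placeholders chosen for illustration; only the `resources` block comes from the example above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training                      # hypothetical name
spec:
  restartPolicy: Never                    # run once, like a training job
  containers:
    - name: trainer                       # hypothetical name
      image: "<your-training-image>"      # placeholder, replace with your own image
      resources:
        limits:
          # Only the resource we care about is declared, as described above
          alpha.kubernetes.io/nvidia-gpu: 2
```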

Setting up the autoscaler

Edit: There is now a Helm chart for the autoscaler which makes the process much smoother: stable/acs-engine-autoscaler

In order to communicate with Azure to create or delete VMs, the autoscaler needs some credentials: a Service Principal. 
The easiest way to create a Service Principal is using Azure CLI:

> az ad sp create-for-rbac
{
  "appId": "3c460b40-c0d9-4947-89f5-7ee3aeaf9fc4",
  "displayName": "azure-cli-2017-05-14-13-59-55",
  "name": "http://azure-cli-2017-05-14-13-59-55",
  "password": "629d9e47-9f34-5d39-977e-ae96d9cef8bd",
  "tenant": "72f918bf-87f1-41af-31ab-2d9cd011db47"
}

We also need to grab some private keys from the output generated by acs-engine when creating your cluster: the kubeConfigPrivateKey, clientPrivateKey and caPrivateKey. You will find all three of them in the generated azuredeploy.parameters.json file.

We are then going to make those credentials available inside our cluster using a Kubernetes secret:
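
A minimal sketch of what such a `secret.yaml` could look like is shown here. The secret name and every key name are assumptions made for illustration, not necessarily the names the autoscaler's own template expects, so check the repository documentation before using them. The placeholder values map to the `appId`, `password` and `tenant` fields of the Service Principal and to the three private keys mentioned above.

```yaml
# secret.yaml (sketch): the name and all key names below are hypothetical
apiVersion: v1
kind: Secret
metadata:
  name: autoscaler
type: Opaque
stringData:                               # plain-text input, stored by the API server as base64 "data"
  azure-sp-app-id: "<appId>"
  azure-sp-secret: "<password>"
  azure-sp-tenant-id: "<tenant>"
  kubeconfig-private-key: "<kubeConfigPrivateKey>"
  client-private-key: "<clientPrivateKey>"
  ca-private-key: "<caPrivateKey>"
```

Using `stringData` avoids encoding each value by hand; if you prefer the `data` field instead, every value must be base64-encoded first.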

Then simply create the secret:

$ kubectl create -f secret.yaml

Once this is done, we can create the Kubernetes template that deploys the autoscaler. A few things to note about it:

  • It is a Deployment with only 1 instance. That means that at any given time we have one and only one pod running. We wouldn't want to issue the same scaling command multiple times.
  • The Service Principal credentials and the private keys are passed as environment variables, taken from the secret we created earlier.
  • Finally, the autoscaler is started with the --dry-run flag. When this flag is enabled, the autoscaler will never actually scale in or out, allowing you to make sure everything works as you intended first.
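
The three points above can be sketched as a manifest along these lines. This is an illustration under stated assumptions, not the autoscaler's documented template: the image is a placeholder, and the resource names, labels, environment variable name, and secret key are hypothetical. Only the `--dry-run` flag comes from this article; any other arguments the autoscaler needs are left out, so refer to the repository documentation for them.

```yaml
# scaling-deployment.yaml (sketch): image, names, and env variable names are placeholders
apiVersion: apps/v1                       # older clusters may need extensions/v1beta1
kind: Deployment
metadata:
  name: autoscaler
spec:
  replicas: 1                             # a single autoscaler pod
  selector:
    matchLabels:
      app: autoscaler
  template:
    metadata:
      labels:
        app: autoscaler
    spec:
      containers:
        - name: autoscaler
          image: "<autoscaler-image>"     # placeholder, see the repository documentation
          env:
            - name: AZURE_SP_APP_ID       # hypothetical variable name
              valueFrom:
                secretKeyRef:
                  name: autoscaler        # must match the name of the secret created earlier
                  key: azure-sp-app-id    # hypothetical key, must exist in that secret
            # Repeat the same pattern for the remaining credentials and private keys
          args:
            - "--dry-run"                 # log scaling decisions without acting on them
```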

Create the controller with kubectl create -f scaling-deployment.yaml.

If you look at the logs of the pod once it has been created, you should see something similar to this:

autoscaler.cluster - INFO - ++++++ Running Scaling Loop ++++++
autoscaler.cluster - INFO - Pods to schedule: 0
autoscaler.cluster - INFO - ++++++ Scaling Up Begins ++++++
autoscaler.cluster - INFO - Nodes: 2
autoscaler.cluster - INFO - To schedule: 0
autoscaler.cluster - INFO - Pending pods: 0
autoscaler.cluster - INFO - ++++++ Scaling Up Ends ++++++
autoscaler.cluster - INFO - ++++++ Maintenance Begins ++++++
autoscaler.cluster - INFO - ++++++ Maintaining Nodes ++++++
autoscaler.cluster - INFO - node: k8s-agentpool1-14254244-0 state: spare-agent
autoscaler.cluster - INFO - node: k8s-agentpool2-14254244-0 state: spare-agent
autoscaler.cluster - INFO - ++++++ Maintenance Ends ++++++

My cluster has 2 agent pools, each with a single agent, and a single job running. Your output will be different based on your cluster architecture and what pods are currently running.

If you have unused VMs, the autoscaler will warn you that it would delete them if the --dry-run flag were removed.
Once you have made sure that the autoscaler is not going to wreak havoc in your cluster, you can redeploy without this flag.

💥 Don't use it on a production cluster until you really understand its behavior and what the different parameters do. 💥

The job I am running is using 1 GPU. In my case, only agentpool2 has VMs with GPU.
So if I now schedule a second similar job on my cluster, we will see agentpool2 scaling up to meet the demand:

autoscaler.cluster - INFO - ++++++ Running Scaling Loop ++++++
autoscaler.cluster - INFO - Pods to schedule: 1
autoscaler.cluster - INFO - ++++++ Scaling Up Begins ++++++
autoscaler.cluster - INFO - Nodes: 2
autoscaler.cluster - INFO - To schedule: 1
autoscaler.cluster - INFO - Pending pods: 1
autoscaler.cluster - INFO - ========= Scaling for 1 pods ========
[...]
autoscaler.cluster - INFO - New capacity requested for pool agentpool2: 2 agents (current capacity: 1 agents)
autoscaler.deployments - INFO - Deployment started

After a few minutes, the new VM is up, and our second job starts running.

Once the jobs are completed, our cluster is idle. 
The autoscaler will notice this and adjust the cluster size accordingly.

First, idle VMs will be cordoned and drained:

autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   state: under-utilized-drainable
autoscaler.kube - INFO - cordoned k8s-agentpool1-32238962-1
autoscaler.kube - INFO - Deleting Pod kube-system/kube-proxy-ghr3z
autoscaler.kube - INFO - drained k8s-agentpool1-32238962-4

And after some time, the cordoned node will get deleted:

autoscaler.cluster - INFO - node: k8s-agentpool1-32238962-1                                                   state: idle-unschedulable
autoscaler.container_service - INFO - deleting node k8s-agentpool1-32238962-1
autoscaler.container_service - INFO - Deleting VM
autoscaler.container_service - INFO - Deleting NIC
autoscaler.container_service - INFO - Deleting OS disk

Horizontal Pod Autoscaling

So far so good. But you still need to schedule new pods manually. This is fine for some scenarios (such as training jobs in machine learning), but what if you want to scale up based on some metrics?

This is where Kubernetes Horizontal Pod Autoscaling (HPA) makes its entrance.

HPA allows us to specify a metric to track on a deployment, CPU usage for example, and a target, let's say 50%.
Whenever the average CPU usage across your deployment's pods goes over 50%, HPA will increase the number of replicas in your deployment to spread the load.

Eventually, the existing VMs in your cluster will not be able to support more replicas, and new pods created by HPA will start hanging in a Pending state.

At this point, the autoscaler will notice these pending pods and create new VMs to support them.

To understand how to configure Horizontal Pod Autoscaling (spoiler: it's very easy), check out the official documentation.
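
As a rough sketch of the 50% CPU example described above, an HPA using the `autoscaling/v1` API could look like this. The deployment name `my-app` and the replica bounds are assumptions chosen for illustration.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                            # hypothetical name
spec:
  scaleTargetRef:                         # the deployment whose replica count HPA adjusts
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                          # hypothetical deployment
  minReplicas: 1                          # illustrative bounds
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50      # target average CPU utilization across pods
```

CPU utilization is measured relative to each pod's CPU request, so the pods of the target deployment need a CPU request defined, and the cluster needs a metrics source for HPA to read from.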

Advanced Scenarios

Kubernetes-acs-engine-autoscaler has a number of parameters that we didn't touch on in this article to accommodate advanced scenarios.

Check out the repository documentation to learn more.

If you see any mistakes in this post, or have any questions, feel free to open an issue on the dedicated GitHub repo.