Scale your Kubernetes cluster to (almost) zero with GKE autoscaler

Alfonso Palacios
Google Cloud - Community
Dec 12, 2019

Introduction

Although Kubernetes is best known for orchestrating permanent workloads (deployments) that adapt to demand (via the horizontal pod autoscaler), it is increasingly being used for ephemeral batch processes. In some cases, these processes require specialized hardware that is expensive or scarce. This is the case for machine learning training jobs, which often require large instance types with GPUs. In these cases, you will want to make sure your cluster releases all of those resources as soon as the job is complete.

Consider a machine learning training job with the following requirements (true story):

  • The job must run on 4 nodes in parallel.
  • Each node is a n1-highmem-96 machine with 8 NVIDIA Tesla V100 GPUs (32 in total).
  • The training job takes approximately 12 hours to complete.

At the current rates, with no particular discounts applied, this setup costs over $100 per hour, that is, more than $1,200 for a single 12-hour training run. We want to release those nodes as soon as our job has ended to save some money.

Cluster autoscaler to the rescue

One of the most interesting features you get when you use GKE is the cluster autoscaler. As described in the GKE documentation, cluster autoscaler allows you to:

automatically resize your GKE cluster’s node pools based on the demands of your workloads. When demand is high, cluster autoscaler adds nodes to the node pool. When demand is low, cluster autoscaler scales back down to a minimum size that you designate. This can increase the availability of your workloads when you need it, while controlling costs.

However, the cluster autoscaler cannot scale an entire cluster all the way down to zero: at least one node must always be available to run system pods. So you need to keep at least one node, but that doesn’t mean it has to be an expensive node sitting idle.

Another very interesting GKE feature is node pools. A node pool is a group of nodes within a cluster that all have the same configuration. Every cluster has at least one default node pool, but you can add other node pools as needed.

So, for our ML training needs, we will create a cluster with two node pools:

  1. A default node pool with a fixed size of one node, using a small instance type (e.g. g1-small).
  2. A second node pool (we’ll call it the burst pool) with the instance type we need for our ML training job (n1-highmem-96 machines with 8 NVIDIA Tesla V100 GPUs each). We’ll enable cluster autoscaling on this node pool with a minimum of 0 nodes and a maximum of 4.

Making sure your pods run in the burst node pool

Now that we have a GKE cluster configured to fit our autoscaling needs, we need to make sure that our ML training workload runs on the burst pool. Specifically, we want the following:

  1. We want our ML training jobs to be run on the burst node pool, since this is where highmem instances with GPUs will be created.
  2. Our ML training job is designed to take all the available resources in a node, and expects to have one single training pod running in each node.
  3. The burst node pool must be dedicated exclusively for the ML training jobs. We do not want to allow any other workload to run on these nodes, since we’ll want to release them as soon as the job is done.

We do not need any special GKE features to meet these requirements. The following standard Kubernetes features will give us the workload distribution we need:

  1. We’ll use a node selector in our pods to make sure they run in the burst node pool. For this, we’ll add a label to the nodes in that pool.
  2. We’ll use a pod anti-affinity rule to ensure that two of our training pods cannot be scheduled on the same node.
  3. We’ll add a taint to the burst pool nodes to prevent other workloads from running in the burst node pool, and the corresponding toleration to our ML training pods so that they (and only they) are allowed to run on those nodes.

Putting it all together

Putting this into practice, we’ll start by creating a cluster with a default node pool that contains only one small node:

PROJECT_ID="apszaz-kube-playground"
GCP_ZONE="europe-west1-b"
GKE_CLUSTER_NAME="burstable-cluster"
GKE_BURST_POOL="burst-zone"

gcloud container clusters create ${GKE_CLUSTER_NAME} \
--machine-type=g1-small \
--num-nodes=1 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
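
This is only a sanity check, not a required step: at this point you can confirm that the cluster came up with a single small node pool by listing its node pools (same variables as above):

gcloud container node-pools list \
--cluster=${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}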

Now, we’ll add the burst node pool using the following parameters:

  • --machine-type=n1-highmem-96: the instance type we want for our ML training job, as opposed to the default pool, which contains a single instance of type g1-small.
  • --accelerator=nvidia-tesla-v100,8: we want 8 NVIDIA TESLA V100 GPUs in each node. These GPUs are not available in all regions and zones, so we will need to find a zone with enough capacity.
  • --node-labels=gpu=tesla-v100: we add a label to the nodes in the burst pool to allow selecting them in our ML training workload using a node selector.
  • --node-taints=reserved-pool=true:NoSchedule: we add a taint to the nodes to prevent any other workload from accidentally being scheduled in this node pool.

The rest of the options refer to the autoscaling and are self-explanatory. The full command will look like this:

gcloud container node-pools create ${GKE_BURST_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--machine-type=n1-highmem-96 \
--accelerator=nvidia-tesla-v100,8 \
--node-labels=gpu=tesla-v100 \
--node-taints=reserved-pool=true:NoSchedule \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=4 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
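
If you want to double-check the pool’s configuration (autoscaling limits, labels and taints) before moving on, you can describe it. Again, this is optional:

# Inspect the burst pool's autoscaling, label and taint settings
gcloud container node-pools describe ${GKE_BURST_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}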

To test the configuration, we will create a job that runs 4 pods in parallel, each of which simply sleeps for 5 minutes. The pods in our workload will need to have the following elements:

  • A nodeSelector matching the label we have added to our burst node pool: gpu=tesla-v100.
  • A podAntiAffinity rule indicating that we do not want two pods with the label app=greedy-app running on the same node. For this we add that label to our pods and set the topologyKey to the hostname, so the rule applies at the node level (no two such pods on the same node).
  • Finally, we need a toleration for the taint we attached to the nodes, so these pods are allowed to be scheduled on them.

The full job YAML file (let’s call it greedy_job.yaml) will look like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: greedy-job
spec:
  parallelism: 4
  template:
    metadata:
      name: greedy-job
      labels:
        app: greedy-app
    spec:
      containers:
      - name: busybox
        image: busybox
        args:
        - sleep
        - "300"
      nodeSelector:
        gpu: tesla-v100
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - greedy-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: reserved-pool
        operator: Equal
        value: "true"
        effect: NoSchedule
      restartPolicy: OnFailure
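
Note that this test job only exercises the scheduling behaviour; busybox does not actually use the GPUs. A real training pod would typically also request the GPUs explicitly, so that Kubernetes and the autoscaler account for them. A minimal sketch of what the containers section of the pod template might look like (the image name is hypothetical):

      containers:
      - name: trainer
        image: gcr.io/my-project/my-trainer:latest   # hypothetical training image
        resources:
          limits:
            nvidia.com/gpu: 8   # request all 8 V100 GPUs on the node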

Verifying it works

First thing, you will need to get the cluster credentials to be able to run kubectl commands on this cluster:

gcloud container clusters get-credentials ${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
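
A quick (and purely optional) way to confirm that kubectl is now pointing at the new cluster:

kubectl config current-context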

We see that, initially, GKE has started the cluster with three nodes in the burst pool (new node pools are created with three nodes by default, even when the autoscaler minimum is zero) and one node in the default pool:

~ $ kubectl get nodes
NAME                                               STATUS   ROLES    AGE     VERSION
gke-burstable-cluster-burst-zone-183c2d4b-2vw9     Ready    <none>   9m7s    v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-jkzt     Ready    <none>   9m10s   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-p2w8     Ready    <none>   9m7s    v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   12m     v1.13.11-gke.14

Let’s wait for the cluster to cool down and for the autoscaler to remove the idle nodes in the burst pool. After a few minutes, we see all the burst pool nodes have been removed:

NAME                                               STATUS   ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   24m   v1.13.11-gke.14

Now that we have our cluster in standby mode (with no nodes in the burst pool), we can start running our test. We will use the job we defined in the previous section (greedy_job.yaml): four pods running in parallel, each completing after a 5-minute sleep.

Initially we have no pods running in the default namespace:

~ $ kubectl get pods
No resources found.

and, as we saw earlier, only the dummy node from the default node pool:

~ $ kubectl get nodes
NAME                                               STATUS   ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   26m   v1.13.11-gke.14

If we apply our job:

~ $ kubectl apply -f greedy_job.yaml
job.batch/greedy-job created

We see that the pods are created, but are pending for a little while:

~ $ kubectl get pod
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-9wlb8   0/1     Pending   0          8s
greedy-job-hr2tc   0/1     Pending   0          8s
greedy-job-lqshk   0/1     Pending   0          8s
greedy-job-mcbmm   0/1     Pending   0          8s

If you look at the events in one of the pods, you will see it has triggered a cluster scale up event:

~ $ kubectl describe pod greedy-job-9wlb8 
Name: greedy-job-9wlb8
Namespace: default
...
Events:
Type     Reason             Age   From                 Message
----     ------             ----  ----                 -------
Warning  FailedScheduling   26s   default-scheduler    0/1 nodes are available: 1 node(s) didn't match node selector.
Normal   TriggeredScaleUp   20s   cluster-autoscaler   pod triggered scale-up: [{https://content.googleapis.com/... 0->1 (max: 4)}]
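
If you would rather follow the scale-up from the cluster side instead of describing individual pods, listing recent events sorted by time should show the cluster autoscaler’s TriggeredScaleUp events as well:

kubectl get events --sort-by=.lastTimestamp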

We see that, little by little, the pods are starting to run:

~ $ kubectl get pod -o wide
NAME               READY   STATUS              RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   1/1     Running             0          2m47s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   1/1     Running             0          2m47s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   0/1     Pending             0          2m47s   <none>      <none>                                           <none>           <none>
greedy-job-mcbmm   0/1     ContainerCreating   0          2m47s   <none>      gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

If we check again a couple of minutes later:

~ $ kubectl get pod -o wide
NAME               READY   STATUS    RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   1/1     Running   0          4m27s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   1/1     Running   0          4m27s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   1/1     Running   0          4m27s   10.16.4.2   gke-burstable-cluster-burst-zone-183c2d4b-kbw2   <none>           <none>
greedy-job-mcbmm   1/1     Running   0          4m27s   10.16.3.2   gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

And, once each pod’s 5-minute sleep has elapsed, the pods terminate:

~ $ kubectl get pod -o wide
NAME               READY   STATUS      RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   0/1     Completed   0          7m58s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   0/1     Completed   0          7m58s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   1/1     Running     0          7m58s   10.16.4.2   gke-burstable-cluster-burst-zone-183c2d4b-kbw2   <none>           <none>
greedy-job-mcbmm   0/1     Completed   0          7m58s   10.16.3.2   gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

If we keep watching our nodes, we see that, roughly 10 minutes after each pod completes, its node’s status becomes NotReady (it is being drained), and it finally disappears. You can use this command to list the nodes every 60 seconds:

while true; do kubectl get nodes ; sleep 60; done
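
If you prefer not to poll, kubectl can also stream node changes as they happen:

kubectl get nodes --watch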

The output of the polling loop looks something like this:

NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     Ready      <none>   14m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   13m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   45m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   14m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   46m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   18m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     NotReady   <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   47m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   19m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   48m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     NotReady   <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   20m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   49m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     NotReady   <none>   21m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   50m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     NotReady   <none>   22m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   51m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   52m   v1.13.11-gke.14

After a few minutes, all the burst nodes have been removed, and only the g1-small node from the default pool remains.

Conclusion

We managed to create a cluster that scales down to one very small (and inexpensive) node just by combining a few standard features:

  • A separate burst node pool for the expensive machines, with cluster autoscaling enabled (minimum 0 nodes, maximum 4).
  • A node label plus a nodeSelector to steer the training pods onto that pool.
  • A taint on the burst pool and a matching toleration on the training pods, so no other workload lands on (and holds up) those nodes.
  • A pod anti-affinity rule so that each node runs a single training pod.

If you do not care that much about the actual node specs, you can also use node auto-provisioning. With node auto-provisioning, new node pools can be created and deleted automatically based on the specifications of unschedulable Pods.
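
A rough sketch of what that could look like for this cluster (the resource ceilings are illustrative, and you should check the gcloud documentation for the exact auto-provisioning flags available in your version):

# Sketch: enable node auto-provisioning with CPU, memory and GPU ceilings
gcloud container clusters update ${GKE_CLUSTER_NAME} \
--enable-autoprovisioning \
--min-cpu=0 --max-cpu=400 \
--min-memory=0 --max-memory=2600 \
--max-accelerator=type=nvidia-tesla-v100,count=32 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}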
