Scale your Kubernetes cluster to (almost) zero with GKE autoscaler

Alfonso Palacios
Google Cloud - Community
Dec 12, 2019

Introduction

Although Kubernetes is best known for orchestrating permanent workloads (deployments) that adapt to demand (via the horizontal pod autoscaler), it is increasingly being used for ephemeral batch processes. In some cases, these processes require specialized hardware that is expensive or scarce. This is the case for machine learning training jobs, which often require large instance types with GPUs. In these cases, you will want to make sure your cluster releases all of those resources as soon as the job is complete.

Consider a machine learning training job with the following requirements (true story):

  • The job must run on 4 nodes in parallel.
  • Each node is a n1-highmem-96 machine with 8 NVIDIA Tesla V100 GPUs (32 in total).
  • The training job takes approximately 12 hours to complete.

At the current rates, with no particular discounts applied, this setup costs over $100 per hour, that is, more than $1,200 for a single 12-hour training run. We want to release those nodes as soon as our job has ended to save some money.

Cluster autoscaler to the rescue

One of the most interesting features you get when you use GKE is the cluster autoscaler. As described in the GKE documentation, cluster autoscaler allows you to:

automatically resize your GKE cluster’s node pools based on the demands of your workloads. When demand is high, cluster autoscaler adds nodes to the node pool. When demand is low, cluster autoscaler scales back down to a minimum size that you designate. This can increase the availability of your workloads when you need it, while controlling costs.

However, the cluster autoscaler cannot scale an entire cluster all the way down to zero: at least one node must always be available to run system pods. So you need to keep at least one node, but that doesn’t mean it has to be an expensive node sitting idle.

Another very interesting GKE feature is node pools. A node pool is a group of nodes within a cluster that all have the same configuration. Every cluster has at least one default node pool, but you can add other node pools as needed.

So, for our ML training needs, we will create a cluster with two node pools:

  1. A default node pool with a fixed size of one node, using a small instance type (e.g. g1-small).
  2. A second node pool (we’ll call it the burst pool) with the instance type we need for our ML training job (n1-highmem-96 machines with 8 NVIDIA Tesla V100 GPUs each). We’ll enable cluster autoscaling on this node pool with a minimum of 0 nodes and a maximum of 4.

Making sure your pods run in the burst node pool

Now that we have a GKE cluster configured to fit our autoscaling needs, we need to make sure that our ML training workload runs on the burst pool. Specifically, we want the following:

  1. We want our ML training jobs to be run on the burst node pool, since this is where highmem instances with GPUs will be created.
  2. Our ML training job is designed to take all the available resources in a node, and expects to have one single training pod running in each node.
  3. The burst node pool must be dedicated exclusively for the ML training jobs. We do not want to allow any other workload to run on these nodes, since we’ll want to release them as soon as the job is done.

We do not need any special GKE features to meet these requirements. The following standard Kubernetes features will give us the workload distribution we need:

  1. We’ll use a node selector in our pods to make sure they run in the burst node pool. For this, we’ll add a label to the nodes in that pool.
  2. We’ll use a pod anti-affinity rule to ensure that two of our training pods cannot be scheduled on the same node.
  3. We’ll add a taint to the burst pool nodes to prevent other workloads from running in the burst node pool, and the corresponding toleration to our ML training pods so that they (and only they) are allowed to run on those nodes.

Putting it all together

Putting this into practice, we’ll start by creating a cluster with a default node pool that contains only one small node:

PROJECT_ID="apszaz-kube-playground"
GCP_ZONE="europe-west1-b"
GKE_CLUSTER_NAME="burstable-cluster"
GKE_BURST_POOL="burst-zone"

gcloud container clusters create ${GKE_CLUSTER_NAME} \
--machine-type=g1-small \
--num-nodes=1 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
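
This is only a sanity check, not a required step: at this point you can confirm that the cluster came up with a single small node pool by listing its node pools (same variables as above):

gcloud container node-pools list \
--cluster=${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}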

Now, we’ll add the burst node pool using the following parameters:

  • --machine-type=n1-highmem-96: the instance type we want for our ML training job, as opposed to the default pool, which contains a single instance of type g1-small.
  • --accelerator=nvidia-tesla-v100,8: we want 8 NVIDIA TESLA V100 GPUs in each node. These GPUs are not available in all regions and zones, so we will need to find a zone with enough capacity.
  • --node-labels=gpu=tesla-v100: we add a label to the nodes in the burst pool to allow selecting them in our ML training workload using a node selector.
  • --node-taints=reserved-pool=true:NoSchedule: we add a taint to the nodes to prevent any other workload from accidentally being scheduled in this node pool.

The rest of the options refer to the autoscaling and are self-explanatory. The full command will look like this:

gcloud container node-pools create ${GKE_BURST_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--machine-type=n1-highmem-96 \
--accelerator=nvidia-tesla-v100,8 \
--node-labels=gpu=tesla-v100 \
--node-taints=reserved-pool=true:NoSchedule \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=4 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
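
If you want to double-check the pool’s configuration (autoscaling limits, labels and taints) before moving on, you can describe it. Again, this is optional:

# Inspect the burst pool's autoscaling, label and taint settings
gcloud container node-pools describe ${GKE_BURST_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}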

To test the configuration, we will create a job that runs 4 pods in parallel, each of which simply sleeps for 5 minutes. The pods in our workload will need to have the following elements:

  • A nodeSelector matching the label we have added to our burst node pool: gpu=tesla-v100.
  • A podAntiAffinity rule indicating that we do not want two pods with the label app=greedy-app running on the same node. For this we add that label to our pods and set the topologyKey to the hostname, so the rule applies at the node level (no two such pods on the same node).
  • Finally, we need a toleration for the taint we attached to the nodes, so these pods are allowed to be scheduled on them.

The full job YAML file (let’s call it greedy_job.yaml) will look like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: greedy-job
spec:
  parallelism: 4
  template:
    metadata:
      name: greedy-job
      labels:
        app: greedy-app
    spec:
      containers:
      - name: busybox
        image: busybox
        args:
        - sleep
        - "300"
      nodeSelector:
        gpu: tesla-v100
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - greedy-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: reserved-pool
        operator: Equal
        value: "true"
        effect: NoSchedule
      restartPolicy: OnFailure
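
Note that this test job only exercises the scheduling behaviour; busybox does not actually use the GPUs. A real training pod would typically also request the GPUs explicitly, so that Kubernetes and the autoscaler account for them. A minimal sketch of what the containers section of the pod template might look like (the image name is hypothetical):

      containers:
      - name: trainer
        image: gcr.io/my-project/my-trainer:latest   # hypothetical training image
        resources:
          limits:
            nvidia.com/gpu: 8   # request all 8 V100 GPUs on the node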

Verifying it works

First thing, you will need to get the cluster credentials to be able to run kubectl commands on this cluster:

gcloud container clusters get-credentials ${GKE_CLUSTER_NAME} \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
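
A quick (and purely optional) way to confirm that kubectl is now pointing at the new cluster:

kubectl config current-context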

We see that, initially, GKE has started the cluster with three nodes in the burst pool (new node pools are created with three nodes by default, even when the autoscaler minimum is zero) and one node in the default pool:

~ $ kubectl get nodes
NAME                                               STATUS   ROLES    AGE     VERSION
gke-burstable-cluster-burst-zone-183c2d4b-2vw9     Ready    <none>   9m7s    v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-jkzt     Ready    <none>   9m10s   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-p2w8     Ready    <none>   9m7s    v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   12m     v1.13.11-gke.14

Let’s wait for the cluster to cool down and for the autoscaler to remove the idle nodes in the burst pool. After a few minutes, we see all the burst pool nodes have been removed:

NAME                                               STATUS   ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   24m   v1.13.11-gke.14

Now that we have our cluster in standby mode (with no nodes in the burst pool), we can start running our test. We will use the job we defined in the previous section (greedy_job.yaml): four pods running in parallel, each completing after a 5-minute sleep.

Initially we have no pods running in the default namespace:

~ $ kubectl get pods
No resources found.

and, as we saw earlier, only the dummy node from the default node pool:

~ $ kubectl get nodes
NAME                                               STATUS   ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready    <none>   26m   v1.13.11-gke.14

If we apply our job:

~ $ kubectl apply -f greedy_job.yaml
job.batch/greedy-job created

We see that the pods are created, but are pending for a little while:

~ $ kubectl get pod
NAME               READY   STATUS    RESTARTS   AGE
greedy-job-9wlb8   0/1     Pending   0          8s
greedy-job-hr2tc   0/1     Pending   0          8s
greedy-job-lqshk   0/1     Pending   0          8s
greedy-job-mcbmm   0/1     Pending   0          8s

If you look at the events in one of the pods, you will see it has triggered a cluster scale up event:

~ $ kubectl describe pod greedy-job-9wlb8 
Name: greedy-job-9wlb8
Namespace: default
...
Events:
Type     Reason             Age   From                 Message
----     ------             ----  ----                 -------
Warning  FailedScheduling   26s   default-scheduler    0/1 nodes are available: 1 node(s) didn't match node selector.
Normal   TriggeredScaleUp   20s   cluster-autoscaler   pod triggered scale-up: [{https://content.googleapis.com/... 0->1 (max: 4)}]
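
If you would rather follow the scale-up from the cluster side instead of describing individual pods, listing recent events sorted by time should show the cluster autoscaler’s TriggeredScaleUp events as well:

kubectl get events --sort-by=.lastTimestamp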

We see that, little by little, the pods are starting to run:

~ $ kubectl get pod -o wide
NAME               READY   STATUS              RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   1/1     Running             0          2m47s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   1/1     Running             0          2m47s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   0/1     Pending             0          2m47s   <none>      <none>                                           <none>           <none>
greedy-job-mcbmm   0/1     ContainerCreating   0          2m47s   <none>      gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

If we check again a couple of minutes later:

~ $ kubectl get pod -o wide
NAME               READY   STATUS    RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   1/1     Running   0          4m27s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   1/1     Running   0          4m27s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   1/1     Running   0          4m27s   10.16.4.2   gke-burstable-cluster-burst-zone-183c2d4b-kbw2   <none>           <none>
greedy-job-mcbmm   1/1     Running   0          4m27s   10.16.3.2   gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

And, once each pod’s 5-minute sleep has elapsed, the pods terminate:

~ $ kubectl get pod -o wide
NAME               READY   STATUS      RESTARTS   AGE     IP          NODE                                             NOMINATED NODE   READINESS GATES
greedy-job-9wlb8   0/1     Completed   0          7m58s   10.16.1.2   gke-burstable-cluster-burst-zone-183c2d4b-n1f3   <none>           <none>
greedy-job-hr2tc   0/1     Completed   0          7m58s   10.16.2.2   gke-burstable-cluster-burst-zone-183c2d4b-sf5r   <none>           <none>
greedy-job-lqshk   1/1     Running     0          7m58s   10.16.4.2   gke-burstable-cluster-burst-zone-183c2d4b-kbw2   <none>           <none>
greedy-job-mcbmm   0/1     Completed   0          7m58s   10.16.3.2   gke-burstable-cluster-burst-zone-183c2d4b-jm49   <none>           <none>

If we keep watching our nodes, we see that, roughly 10 minutes after each pod completes, its node’s status becomes NotReady (it is being drained), and it finally disappears. You can use this command to list the nodes every 60 seconds:

while true; do kubectl get nodes ; sleep 60; done
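
If you prefer not to poll, kubectl can also stream node changes as they happen:

kubectl get nodes --watch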

The output of the polling loop looks something like this:

NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     Ready      <none>   14m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   13m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   45m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   14m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   46m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-jm49     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     Ready      <none>   15m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   18m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-sf5r     NotReady   <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   47m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     NotReady   <none>   16m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   19m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   48m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-kbw2     NotReady   <none>   17m   v1.13.11-gke.14
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     Ready      <none>   20m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   49m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     NotReady   <none>   21m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   50m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-burst-zone-183c2d4b-n1f3     NotReady   <none>   22m   v1.13.11-gke.14
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   51m   v1.13.11-gke.14
NAME                                               STATUS     ROLES    AGE   VERSION
gke-burstable-cluster-default-pool-794fe9e9-jdk3   Ready      <none>   52m   v1.13.11-gke.14

After a few minutes, all the burst nodes have been removed, and only the g1-small node from the default pool remains.

Conclusion

We managed to create a cluster that scales down to one very small (and inexpensive) node just by combining a few standard features:

  • A separate burst node pool for the expensive machines, with cluster autoscaling enabled (minimum 0 nodes, maximum 4).
  • A node label plus a nodeSelector to steer the training pods onto that pool.
  • A taint on the burst pool and a matching toleration on the training pods, so no other workload lands on (and holds up) those nodes.
  • A pod anti-affinity rule so that each node runs a single training pod.

If you do not care that much about the actual node specs, you can also use node auto-provisioning. With node auto-provisioning, new node pools can be created and deleted automatically based on the specifications of unschedulable Pods.
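
A rough sketch of what that could look like for this cluster (the resource ceilings are illustrative, and you should check the gcloud documentation for the exact auto-provisioning flags available in your version):

# Sketch: enable node auto-provisioning with CPU, memory and GPU ceilings
gcloud container clusters update ${GKE_CLUSTER_NAME} \
--enable-autoprovisioning \
--min-cpu=0 --max-cpu=400 \
--min-memory=0 --max-memory=2600 \
--max-accelerator=type=nvidia-tesla-v100,count=32 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}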
