Reduce costs with spot pods on GKE Autopilot

Krisztian Sala
5 min read · Aug 21, 2022


In this article we introduce the concept of spot instances and show how to use them in Kubernetes, specifically on GKE Autopilot, for running development and production workloads.


Spot instances

If you are already familiar with the concept, feel free to skip this section.

In cloud providers, the term spot refers to using their spare compute capacity in a given region ("on the spot") that they can take away from you at any time. This comes with a noticeable price benefit, making it ideal for short-lived or stateless workloads that are quick to start again.

All the big cloud providers have a similar offering, with differences in the discount percentage (usually between 60–90%) and the notice period, which is the time you have to safely exit your processes before the machine is taken away (2 minutes on AWS and 30 seconds on GCP).

Apart from saving a lot of operational costs, spot instances have a passive benefit as well. They improve your application's design by forcing you to build it in a way that tolerates unplanned stops, boots up quickly and is generally stateless. This is especially useful in microservice architectures, where the application typically consists of multiple containers that scale up and down elastically and benefit a lot from the above-mentioned properties.

Spot Pods in GKE Autopilot

Spot VMs are available for standard GKE clusters as well, but this article focuses on Autopilot, a newer, more managed version of GKE that handles the automatic scaling of the underlying nodes (and includes built-in security and networking best practices).

Because we can't access the nodes in Autopilot, Google came up with the concept of Spot Pods, so users can still benefit from the cost reduction of spot compute instances. In the most basic use case, it is enough to set a simple nodeSelector on the Pod you wish to run on a spot instance, and Autopilot will create a new worker node with taints that ensure only spot pods can run on it. The node will also have the following label, which the nodeSelector references and which causes the pod to be deployed only to spot VMs: cloud.google.com/gke-spot: "true" (check the documentation for a complete example).
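
As a rough sketch (the pod name, container image and resource requests are placeholders of my own), such a Pod spec could look like this:

apiVersion: v1
kind: Pod
metadata:
  name: spot-example                    # placeholder name
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # schedule only on spot nodes
  containers:
  - name: app
    image: nginx:1.25                   # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"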

It is also worth mentioning that when the spot compute instance is taken away, the pod has exactly 25 seconds to exit gracefully, otherwise it is killed. The remaining 5 seconds of the notice period are reserved for the system pods to exit.
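
If your container needs to flush state on shutdown, it makes sense to keep its termination grace period within that window. A minimal sketch (the preStop command is only a placeholder for your own cleanup logic):

spec:
  terminationGracePeriodSeconds: 25    # stay within the 25-second spot window
  containers:
  - name: app
    image: nginx:1.25                  # placeholder image
    lifecycle:
      preStop:
        exec:
          # placeholder: replace with your own drain/flush logic
          command: ["/bin/sh", "-c", "sleep 5"]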

Mixing normal and spot pods

The above use case works well for deployments that can run purely on spot instances. These can be non-production environments that can handle occasional downtime, or short-lived jobs. However, it is possible to use spot pods for some production workloads as well, depending on the nature of the deployment (how stateless it is and how fast it can start) and its mission criticality. This is done by using a weighted nodeAffinity like so:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      preference:
        matchExpressions:
        - key: cloud.google.com/gke-spot
          operator: In
          values:
          - "true"

This aims to run roughly half of the replicas as spot pods and the other half as normal ones. It will not always succeed, depending on the free resources available on the existing nodes. So in situations when, let's say, you have 2 replicas, it can happen that both of them run as spot pods or both as normal ones. Autopilot will try to schedule pods first on the available spot VMs, then on the normal ones. If none of them have enough free capacity, it will create new spot nodes, and as a last resort, new normal nodes.

The requiredDuringSchedulingIgnoredDuringExecution affinity will not work when you are trying to mix spot pods with normal ones, because we cannot define a weight there. Also, the normal compute instances do not have the cloud.google.com/gke-spot label, which is why we cannot use the newer Topology Spread Constraints approach (as presented nicely in this article) either.

Finally, there is another way to make sure that roughly equal numbers of replicas run as spot and normal pods: creating 2 separate deployments, as sketched below. A Google engineer suggested it, but I didn't implement it, because it would be cumbersome to change multiple (tens of) microservices and error-prone to always change the values for both deployments and monitor them separately (even though they represent the same application).
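
For completeness, a sketch of that approach could look like the following (names, images and the capacity label are my own placeholders; the capacity label only keeps the two Deployments' selectors distinct, while the shared app label lets one Service target both):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-spot              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
      capacity: spot
  template:
    metadata:
      labels:
        app: my-app
        capacity: spot
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      containers:
      - name: my-app
        image: nginx:1.25        # placeholder image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-standard          # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
      capacity: standard
  template:
    metadata:
      labels:
        app: my-app
        capacity: standard
    spec:
      containers:
      - name: my-app
        image: nginx:1.25        # placeholder image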

Make it more reliable

Now that we have our production deployments running "half" as spot instances, we might be wondering how to ensure that when a spot VM is taken away, it doesn't make our application unavailable for a while (in case all the replicas accidentally end up as spot pods on the same node). This can be achieved using a podAntiAffinity rule like so:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - {{ .Values.deploymentName }}
      topologyKey: kubernetes.io/hostname

This will spread out all the pods belonging to the same deployment to separate nodes. We can't use required here because of another Autopilot caveat: if the pod requests less than 0.5 CPU, it won't get scheduled when the available nodes are all full, because Autopilot only creates new nodes when the requested CPU amount is at least 0.5 cores. If you don't have such a scenario and all your pods need more than that, feel free to use requiredDuringSchedulingIgnoredDuringExecution, thus making sure that the pods are assigned to separate nodes in all cases. From what I've seen though, the above works just fine 99% of the time.

Observations

If you would like to always have some spare capacity available to schedule on, use balloon pods combined with the above-described podAntiAffinity to keep nodes available.
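
As a rough sketch (the priority value, names and pause image tag are assumptions of mine), a balloon pod setup is a low-priority Deployment whose pods merely reserve capacity and get preempted as soon as real workloads need the room:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority          # hypothetical name
value: -10                        # lower than the default 0, so real pods preempt these
preemptionPolicy: Never           # balloon pods never evict anything themselves
globalDefault: false
description: "Placeholder pods that reserve spare capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 2                     # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing, just holds the requests
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"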

DaemonSets, used for example by monitoring solutions such as Prometheus or Datadog, don't have a toleration for spot nodes by default, so they won't be scheduled on them. You can add a toleration to these resources like this:

tolerations:
- key: cloud.google.com/gke-spot
  operator: Equal
  value: "true"
  effect: NoSchedule

Let me know in the comments if you have better ways of using spot pods on Autopilot.

If you liked this article and would like to read similar ones in the future, consider giving a clap and following me. It makes me happy and keeps me motivated if I see that readers enjoy my work :)

Thank you for reading! Here’s a picture of our newest kittens in the spotlight.
