Why and how we run Kubernetes on Spot instances

Preply Engineering Blog
9 min read · Oct 30, 2018

My name is Yurii Rochniak, and I’m a Platform Engineer @ Preply. Today I want to share our experience of running a Kubernetes cluster on AWS Spot instances, and the benefits and drawbacks of this approach.

tl;dr: The “The Whole Configuration” section contains some ready-to-go code snippets.

The Why

When AWS announced that their Elastic Kubernetes Service (EKS) was publicly available, one of the biggest questions on Reddit was: “How much would it cost?” The answer didn’t attract much enthusiasm. Unlike GKE (Google) or AKS (Microsoft), AWS charges its users for the Kubernetes control plane, which is approximately $150 per month. Of course, they do promise high availability and fault tolerance with a managed, multi-AZ control plane. AWS Spot instances, in turn, can cost about 30% of the price of on-demand machines of the same size because they are sold in an auction-like manner. Sure, you can use Spot instances for the EKS worker nodes, but you’d still need to pay for the control plane, which is a bit expensive for those who just want to play with EKS. However, if you spin up a cluster on your own for research and development purposes, you can use a smaller instance type for the master and worker nodes.

When EKS became available, we were considering Kubernetes as a generic platform solution for all our services. Hence, we required a “sandbox” to play around with k8s. Also, we had a business goal to provide dedicated staging environments for each new feature. At the same time we didn’t want to pay $150 per month just for our internal tests. So, we had to find another way.

AWS Spot Instances are unused EC2 machines that are available for less than the On-Demand price. Therefore, you can save significantly on your infrastructure costs. It does come with a price, though. AWS can take away your Spot instance at any time, when other clients claim that compute capacity at the on-demand price. Also, you must bid for a Spot instance, so when the price increases, your bid may simply not be enough. (UPD: according to the new AWS pricing model for Spot instances, you don’t have to monitor the market situation continuously and the prices are more predictable. AWS can still take away your instances in case of unavailable Spot capacity or a low maximum bid, though.)

There is, however, a set of tools around Kubernetes that helps you manage these interruptions.

How

We deploy our clusters with Kops. This is a tool that helps you create and operate Kubernetes clusters in public clouds. Both master and worker nodes belong to instance groups. In this example, we focus on the worker nodes and get back to the master node later. If you explicitly declare a bid price (maxPrice) in the instance group config, Kops is smart enough to use Spot Instances. You can specify the size of your autoscaling group there as well.

...
machineType: m4.xlarge
maxPrice: "0.07"
maxSize: 5
minSize: 1
...

Cool! Now you have Kubernetes nodes running on Spot instances! However, there’s more to come.
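To get a feel for what a bid like this means for the bill, here is a rough back-of-the-envelope sketch in Python. It only uses the maxPrice, minSize, and maxSize values from the snippet above. The 730 hours per month figure and the function name are my own assumptions, and the actual Spot price is usually below your maximum, so treat the result as an upper bound rather than a forecast.

```python
# Rough upper bound on the monthly cost of a Spot instance group.
# Assumption: an average month has about 730 hours.
HOURS_PER_MONTH = 730


def max_monthly_cost(max_price_per_hour: float, node_count: int) -> float:
    """Worst case: every node runs all month at the maximum bid price."""
    return round(max_price_per_hour * node_count * HOURS_PER_MONTH, 2)


# Values taken from the instance group snippet above.
MAX_PRICE = 0.07
MIN_SIZE = 1
MAX_SIZE = 5

print("at minSize:", max_monthly_cost(MAX_PRICE, MIN_SIZE))
print("at maxSize:", max_monthly_cost(MAX_PRICE, MAX_SIZE))
```

Comparing the two numbers with the roughly $150 per month for a managed control plane shows where the money goes in a small sandbox cluster.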

Cluster Autoscaler

Kops creates the nodes as part of an AWS autoscaling group, but it doesn’t manage scaling policies. However, there is an instrument called Cluster Autoscaler, which can do that for you. It’s a tool that continuously monitors the load of your cluster and adjusts the number of nodes accordingly for optimal performance.

You can easily install it with Helm:

helm install stable/cluster-autoscaler --name autoscaler -f your-custom-values.yaml

Where your-custom-values.yaml looks like this:

autoDiscovery:
  # Name of the cluster for auto discovery
  clusterName: my-cluster
awsRegion: eu-west-1
cloudProvider: aws
extraArgs:
  # Detect similar node groups and balance the number of nodes between them.
  # Since we have similar IG per AZ, it's good to switch on this option
  balance-similar-node-groups: true
  # Strategy the autoscaler uses to pick a node group when scaling up
  expander: random
  # Since the kops configuration allows kube-system pods to go to all nodes, we need this option in order to downscale the cluster
  skip-nodes-with-system-pods: "false"
rbac:
  create: true
  pspEnabled: true
scale-down-delay: 5m
# Verbosity
v: 2

And you’ll see that cluster autoscaler requires special labels in order to work:

spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/my-cluster: ""

Important notice: You can assign these labels to the existing cluster. They are applied to the existing autoscaling group and to all of the new instances within that group. However, kops won’t apply them automatically to the existing nodes. Therefore, you need to either re-create the nodes or add these labels manually via AWS Console or CLI.

Also, by default Cluster Autoscaler doesn’t remove nodes that run pods from the kube-system namespace. Kops, in turn, doesn’t manage any scheduling rules for the resources within the kube-system namespace. Thus, such pods may land on any node. This means that Cluster Autoscaler may not downscale your cluster, and you will be left with unused computing resources.

There are several solutions to this:

  • Schedule the kube-system pods to the Master nodes, which is not flexible
  • Create a dedicated instance group for the kube-system resources, which requires additional operational work
  • Adjust the Cluster Autoscaler settings, which could be done with a single setting

We’ve chosen the third way. Cluster Autoscaler has an option, skip-nodes-with-system-pods, which is true by default. If you set this option to false, it can drain and therefore scale down all the nodes, even if pods from the kube-system namespace are running there.

There are a few things you should keep in mind though:

  • The Helm chart behaves ambiguously here. If you pass the argument as a string, e.g. skip-nodes-with-system-pods: "false", it works well. However, if you pass this parameter without quotes (as a boolean), the argument will still be set to true. This happens because the Helm template uses an if $value check, and a boolean false fails that check
  • You’d better have several replicas for pods in the kube-system namespace if you want to leverage this approach because now Cluster Autoscaler can remove these pods from the worker nodes
  • You’ll need to add labels to the existing nodes manually
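The quoting pitfall from the first point is easier to see in a tiny simulation. The Python sketch below is not the chart's code. It assumes the template renders "--key=value" when the value passes the if check and a bare "--key" otherwise, which is one way to end up with the behaviour described above, and it uses Python truthiness as a stand-in for the template's notion of an empty value. The function name is hypothetical.

```python
# Simulation of the quoting pitfall; this is NOT the chart's actual template.
# Assumption: the template does roughly
#   if $value -> "--key=value", else -> "--key"
# and treats false, 0, nil and empty strings as "empty".


def render_extra_args(extra_args: dict) -> list:
    """Render a values-style extraArgs mapping into command-line flags."""
    flags = []
    for key, value in extra_args.items():
        if value:  # stands in for the template's `if $value`
            # Booleans are rendered lowercase, as a Go template would print them.
            text = str(value).lower() if isinstance(value, bool) else str(value)
            flags.append(f"--{key}={text}")
        else:
            # A bare boolean flag is parsed as true by Go's flag handling.
            flags.append(f"--{key}")
    return flags


# Quoted: the non-empty string "false" passes the check, so the value is kept.
print(render_extra_args({"skip-nodes-with-system-pods": "false"}))
# Unquoted: boolean False fails the check, so only the bare flag is rendered.
print(render_extra_args({"skip-nodes-with-system-pods": False}))
```

The second call yields a bare flag, which the autoscaler would read as true, so the unquoted false silently turns into its opposite.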

A Spot Termination Notice Handler

Another useful tool for clusters on Spot instances is the Spot Termination Notice Handler. It’s a DaemonSet that polls for EC2 Spot Instance Termination Notices and drains the node before AWS takes it away. Hence, your applications are gracefully re-scheduled to the other nodes in the cluster. If there isn’t sufficient capacity on the remaining nodes, Cluster Autoscaler can add nodes for you.

You can install the Spot Termination Notice Handler with Helm as well:

helm install incubator/kube-spot-termination-notice-handler 

The Helm chart has some interesting options. For example, you can configure Slack notifications that are sent once a termination notice for your node appears.
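If you are curious what "poll the termination notice and drain the node" boils down to, here is a minimal Python sketch of the idea. It is not the handler's actual implementation. It assumes the instance metadata endpoint is reachable without a session token, and the function names are hypothetical. The metadata path returns a 404 until a termination notice has been issued.

```python
import urllib.error
import urllib.request

# EC2 instance metadata path that returns 404 until a Spot termination
# notice has been issued, and a timestamp afterwards.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url: str = TERMINATION_URL, timeout: float = 2.0):
    """Return the termination timestamp as text, or None if there is no notice."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip() or None
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None
        raise


def drain_command(node_name: str) -> list:
    """Build a kubectl command that evicts pods from the given node."""
    return ["kubectl", "drain", node_name, "--ignore-daemonsets", "--force"]


def handle_once(node_name: str, fetch=fetch_termination_time):
    """One polling step: return the drain command if a notice is present."""
    if fetch() is None:
        return None
    return drain_command(node_name)
```

A real loop would call handle_once every few seconds and pass the returned command to subprocess.run. The fetch parameter is injectable so the logic can be tried outside EC2.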

Instance Types

This is some general advice for running your workloads on Spot Instances.

The AWS Spot capacity varies from one availability zone to another and from one instance type to another. This means there might not be enough Spot machines of your chosen type in a given zone. From our experience, it’s better not to use brand new instance types because there might be very few of them in the AWS datacenter. For example, we tried to use the m5d.xlarge instance type for one of our workloads (not the Kubernetes cluster) and got a “no sufficient capacity” error very often. The issue was fixed by switching to the previous generation type: m4.xlarge. AWS has plenty of them in different locations.

If you decide to stick with on-demand (or reserved) instances, I recommend using the newest instance types. However, this may not hold for Spot instances: they are the spare capacity of the AWS datacenter, and there may not be many of the newest types available because other companies are using them.

What About The Volumes?

As you may know, EBS volumes in AWS are bound to an availability zone. Hence, you cannot attach an EBS volume from zone A to a node in zone B. This may be an issue in a multi-AZ Kubernetes installation, where your pods may be re-scheduled at any time.

For Spot instances, a multi-AZ setup is valuable in any case. The Spot price may vary between zones, and it could happen that your bid price is lower than the current Spot price in one zone but entirely okay in another. Therefore, you may want to spread your cluster across different AZs. Be aware, though, that this brings some problems with volumes.
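The point about per-zone prices can be illustrated with a few lines of Python. This is only a sketch: the prices in the example are made up for illustration, and the function name is hypothetical.

```python
def zones_within_bid(spot_prices: dict, max_price: float) -> list:
    """Return the zones whose current Spot price does not exceed the bid."""
    return sorted(zone for zone, price in spot_prices.items() if price <= max_price)


# Hypothetical prices, for illustration only.
example_prices = {"eu-west-1a": 0.065, "eu-west-1b": 0.081, "eu-west-1c": 0.070}
print(zones_within_bid(example_prices, max_price=0.07))
```

With a single-zone cluster in the "wrong" zone you would get no capacity at all, while a multi-AZ cluster can keep running in the zones that are still within the bid.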

If the Kubernetes scheduler moves a pod with a persistent volume claim to a node in a different AZ, the volume won’t be able to mount. Therefore, your app won’t be able to start.

We have a few suggestions for this:

  • Try to use persistent volumes only if they are required. We are using spots for our Dev and QA environments. Hence, we don’t care about the data there because we can re-create an entire environment from scratch with some staging data if required. If this is an option for you, consider not using persistent volumes even for stateful apps
  • Use Affinity Rules and Tolerations to bind pods with PVCs to the specific AZ. If you choose this approach, the next point will be interesting for you as well
  • You can also use Amazon EFS (the managed NFS service in AWS), which is availability zone agnostic. We haven’t tested this setup, and I’ll be very happy if you share some insights from your experience with it
  • Configure separate Instance Groups in kops for each availability zone. It doesn’t resolve the given issue entirely, but it eliminates some pain in the node scheduling. For example, you can tell kops to have at least one node per AZ. Then, you’ll be sure that you have room for pods in each AZ.
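The affinity suggestion from the list above can be sketched as follows. The Python snippet builds the affinity stanza of a pod spec that pins a pod to nodes in one zone and prints it as JSON, which Kubernetes accepts just like YAML. The zone label is the one used by Kubernetes releases of that era (newer clusters use topology.kubernetes.io/zone), and the function name and the example zone are my own choices, so adjust them to your cluster.

```python
import json

# Zone label used by Kubernetes releases of that era; newer clusters use
# "topology.kubernetes.io/zone" instead.
ZONE_LABEL = "failure-domain.beta.kubernetes.io/zone"


def zone_affinity(zone: str) -> dict:
    """Build a pod-spec `affinity` stanza that requires nodes in one zone."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": ZONE_LABEL, "operator": "In", "values": [zone]}
                        ]
                    }
                ]
            }
        }
    }


# Example zone; it should match the zone of the pod's EBS volume.
print(json.dumps({"affinity": zone_affinity("eu-west-1a")}, indent=2))
```

The printed stanza goes under the pod template's spec. Keep in mind that a pod pinned this way stays pending if no node is available in that zone, which is why a minimum of one node per AZ helps.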

The Whole Configuration

This is what our configuration looks like.

Here is a kops config for a single-AZ instance group:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: myCluster
  name: nodes-eu-west-1a
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/myCluster: ""
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.xlarge
  maxPrice: "0.07"
  maxSize: 5
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-eu-west-1a
  role: Node
  subnets:
  - eu-west-1a

We are using Debian Stretch images because they can work with the 5th generation of AWS instances. However, as I previously mentioned, it’s better not to use the latest generation instances on Spot.

Here are the cluster Autoscaler Helm values:

autoDiscovery:
  clusterName: myCluster
awsRegion: eu-west-1
cloudProvider: aws
extraArgs:
  balance-similar-node-groups: true
  expander: random
  skip-nodes-with-system-pods: "false"
rbac:
  create: true
  pspEnabled: true
scale-down-delay: 5m
v: 2

And Spot Termination Notice Handler Helm values:

clusterName: myCluster
slackUrl: <SECRET>

Wrapping Up

Although a managed Kubernetes service reduces the amount of operational work, it may be pricey. Spot instances may reduce the cost significantly, which is especially cool for Dev and QA environments. Thankfully, there are ready-to-use tools, such as Kops, Cluster Autoscaler, and Spot Termination Notice Handler, which can help you create a frictionless yet still secure and reliable Kubernetes installation in AWS.

You can even use Spot instances in production if you’re brave enough. Just make sure that you have a fallback to on-demand machines.

There is also a service called Spotinst.com, which provides scheduling of the different services (including Kubernetes) on AWS Spot Instances, Google Preemptible VMs, and Azure Low-Priority VMs. However, we don’t use it.

**Liked the article? Give it a round of applause, or share your own experiences in the comments!**
