Reduce Kubernetes Infrastructure cost with EC2 Spot Instances — Part 1

Jaison Netto · Published in upday devs
6 min read · Jan 22, 2021
(Photo from my camera roll)

This post is part one of a two-part series about using Amazon EC2 Spot instances as Kubernetes worker nodes.

  • Part 1 is more focused on our journey of moving to Spot instance worker nodes.
  • Part 2 will dive into technical aspects and best practices of using Spot instances in Kubernetes.

To begin, let us understand what a Spot instance is.

Whenever you create an EC2 instance in AWS, you are most likely creating an on-demand instance: you request an EC2 instance, AWS provisions it, and you are charged for it. To serve requests like that, AWS keeps a large pool of physical machines. Spot instances exist so that AWS can make money from this capacity while it is not being used, yet still reclaim it whenever on-demand demand arrives. The result is that you can get the same computing capacity at less than half the price, but the instance might be taken away (a Spot interruption) with only a small notification window (two minutes). So, if your workload is either not affected by this kind of interruption or you have a way to handle the interruptions, Spot instances could be a good option for you.

Initial Analysis:

We broke down our initial analysis into the following steps:

  1. Investigating Spot instance pricing, interruption history, and optimal instance types
  2. Implementing Spot instances as EKS worker nodes
  3. Handling Spot interruptions

Initial analysis on instance types:

Investigating Spot instance pricing, interruption history, and optimal instance types is a manual process: we had to look at the pricing and interruption history for each candidate. We also had to look at which instance types we were currently using and which Spot instance options could replace them.

For example, if you are using an m5.xlarge instance for a worker group, you can consider all of the options below.

m5.xlarge, m5a.xlarge, m4.xlarge, m5n.xlarge, m5d.xlarge, m5dn.xlarge

You can also consider mixing in other instance families like c5/c4 or r5/r4 if they fit your use case and capacity requirements.

The point of offering multiple instance types is to keep the pool of options large and thereby improve the chances of getting a Spot instance whenever one is required. If you limit yourself to a single instance type in the Spot request and that exact type is not available at that moment, no new instance will be provisioned. To avoid this situation, it is better to give the Spot request a variety of instance types, chosen based on the capacity you need, Spot price history, and so on.
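To get a quick feel for how the candidate types compare, you can query the current Spot price per type. The minimal sketch below does this with Terraform's aws_ec2_spot_price data source; it only returns the current price, so price history and interruption rates still have to be checked manually (for example via the Spot Instance Advisor). The availability zone and the candidate list are assumptions you would adjust to your own setup.

```hcl
# Sketch: compare current Spot prices for the candidate instance types.
# Assumes Terraform >= 0.13 and AWS provider >= 2.55 (aws_ec2_spot_price data source);
# provider configuration omitted for brevity.

locals {
  candidate_types = ["m5.xlarge", "m5a.xlarge", "m4.xlarge", "m5n.xlarge", "m5d.xlarge", "m5dn.xlarge"]
}

data "aws_ec2_spot_price" "candidates" {
  for_each          = toset(local.candidate_types)
  instance_type     = each.value
  availability_zone = "eu-central-1a" # assumption: adjust to your region/AZ

  filter {
    name   = "product-description"
    values = ["Linux/UNIX"]
  }
}

output "current_spot_prices" {
  value = { for t, d in data.aws_ec2_spot_price.candidates : t => d.spot_price }
}
```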

Initial analysis on implementation:

There are multiple ways to implement this.

  1. Launch Configuration
  2. Launch Template

A Launch Configuration is a template holding the necessary configuration such as instance type, security groups, and so on. An Auto Scaling Group uses the Launch Configuration to launch a new instance whenever its scaling requirements demand one.

Within a Launch Configuration, you can choose to purchase Spot instances instead of on-demand instances, but not a combination of both. Since you cannot mix Spot and on-demand within the same worker group, you have to look at your current worker configuration, decide how much of it can run on Spot, and split the worker group into two with separate Launch Configurations. You also have to make the associated changes in the Kubernetes manifests (for example, node selectors) so that each pod or deployment lands on Spot or on-demand nodes as intended.

A Launch Template, the newer of the two, solves this problem in a better way: it lets you allocate Spot and on-demand instances under the same worker group. This is the method to go for if you need the flexibility of mixing Spot and on-demand within a single worker group.
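To make this concrete, here is a minimal sketch (not our exact configuration) of what the Launch Template approach looks like at the raw AWS level: a Launch Template plus an Auto Scaling Group with a mixed instances policy that sets an on-demand base, an on-demand/Spot split above that base, and a list of instance-type overrides. The AMI and subnet variables and the numbers are placeholders; a real EKS worker group also needs bootstrap user data, an IAM instance profile, and security groups, which are omitted here.

```hcl
variable "worker_ami_id"      { type = string }       # EKS-optimized AMI for your cluster version
variable "private_subnet_ids" { type = list(string) }

resource "aws_launch_template" "workers" {
  name_prefix   = "eks-spot-workers-"
  image_id      = var.worker_ami_id
  instance_type = "m5.xlarge" # default; overridden per type below
  # user_data, iam_instance_profile, security groups etc. omitted for brevity
}

resource "aws_autoscaling_group" "workers" {
  name_prefix         = "eks-spot-workers-"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1  # always keep at least one on-demand node
      on_demand_percentage_above_base_capacity = 25 # above the base: 25% on-demand, 75% Spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.workers.id
        version            = "$Latest"
      }

      # The wider this list, the more Spot capacity pools the ASG can draw from.
      override { instance_type = "m5.xlarge" }
      override { instance_type = "m5a.xlarge" }
      override { instance_type = "m4.xlarge" }
      override { instance_type = "m5n.xlarge" }
    }
  }
}
```

In practice we did not write these resources by hand; the Terraform EKS module described in the Implementation section below wraps them for us.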

Handling Spot Interruptions:

Demand for Spot instances can vary significantly from moment to moment, and so can their availability, depending on how much unused EC2 capacity exists. It is always possible that your Spot instance gets interrupted, so handling these interruptions without interrupting our service delivery is a particularly important part. The AWS Node Termination Handler (aws-node-termination-handler) is the key to solving this problem. It runs a small pod on each host that monitors the EC2 instance metadata; when it detects a Spot interruption notice there, it uses the Kubernetes API to cordon the node so that no new work is scheduled on it, and then drains it, evicting the existing workloads.

Implementation:

We used the data from our initial analysis to settle on the set of instance types to use and the on-demand-to-Spot ratio we needed. We also knew we wanted the more flexible implementation, and hence decided to go with the Launch Template option.

We use the Terraform EKS module to create our EKS infrastructure and had been using Launch Configurations. Since the module supports Launch Templates as well, we had to make some associated changes in Terraform to make it use a Launch Template instead of the Launch Configuration.
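As an illustration of what that change looks like, here is a trimmed-down sketch of a worker group defined through the module's worker_groups_launch_template input, with mixed on-demand/Spot settings and a node label that our manifests and the termination handler can select on. The cluster details, numbers, and module version are placeholders, and newer releases of the module have since reorganized these inputs, so treat the key names as indicative rather than exact.

```hcl
variable "vpc_id"             { type = string }
variable "private_subnet_ids" { type = list(string) }

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 13.0" # a version current at the time of writing; newer versions differ

  cluster_name    = "my-cluster"
  cluster_version = "1.18"
  vpc_id          = var.vpc_id
  subnets         = var.private_subnet_ids

  worker_groups_launch_template = [
    {
      name                                     = "spot-mixed"
      override_instance_types                  = ["m5.xlarge", "m5a.xlarge", "m4.xlarge", "m5n.xlarge"]
      asg_min_size                             = 2
      asg_desired_capacity                     = 2
      asg_max_size                             = 10
      on_demand_base_capacity                  = 1
      on_demand_percentage_above_base_capacity = 25
      spot_allocation_strategy                 = "capacity-optimized"
      # Label Spot nodes so workloads and the termination handler can target them.
      kubelet_extra_args                       = "--node-labels=node.kubernetes.io/lifecycle=spot"
    }
  ]
}
```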

The node termination handler can be deployed to the cluster using Helm or directly using manifests. Helm is the easier approach and gives much more control over the configuration. We configured the values so that the node termination handler runs only on Spot instances and sends a notification to our Slack channel whenever a node gets terminated.
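Here is a sketch of that setup, staying in Terraform via the Helm provider rather than a plain helm install: the chart comes from the eks-charts repository, the node selector matches the Spot label from the worker-group sketch above, and webhookURL points at a Slack incoming webhook. Value names are taken from chart versions available at the time; check the chart's values.yaml for the version you deploy.

```hcl
variable "slack_webhook_url" { type = string } # Slack incoming webhook URL (placeholder)

# Assumes a helm provider already configured to point at the cluster.
resource "helm_release" "aws_node_termination_handler" {
  name       = "aws-node-termination-handler"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-node-termination-handler"
  namespace  = "kube-system"

  # Run the DaemonSet only on Spot nodes (label set via kubelet_extra_args above).
  set {
    name  = "nodeSelector.node\\.kubernetes\\.io/lifecycle"
    value = "spot"
  }

  # Cordon and drain on a Spot interruption notice from the instance metadata.
  set {
    name  = "enableSpotInterruptionDraining"
    value = "true"
  }

  # Post interruption/drain events to Slack.
  set {
    name  = "webhookURL"
    value = var.slack_webhook_url
  }
}
```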

Availability of our environment was critical to us, so we wanted to avoid any issues that could arise from the Spot implementation. One issue we could foresee was an Auto Scaling Group finding no Spot instance available at all (even though we had given it plenty of instance-type choices). Such a situation can put application performance at risk and will affect scaling capability, at least for some time. To identify it as early as possible and handle it better, we created an EventBridge rule in AWS that catches these events and triggers a Lambda function, which notifies us through Slack and email.
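A sketch of that wiring is below, with the Lambda itself omitted; aws_lambda_function.spot_alert is a hypothetical stand-in for whatever function posts to Slack and email. It assumes the condition surfaces as the Auto Scaling "EC2 Instance Launch Unsuccessful" event, which an ASG emits when it cannot fulfil a launch request; the exact event pattern we matched is not spelled out here.

```hcl
# Catch failed ASG launches (e.g. no Spot capacity for any allowed instance type)
# and forward them to a notification Lambda.
resource "aws_cloudwatch_event_rule" "asg_launch_failed" {
  name        = "asg-spot-launch-unsuccessful"
  description = "ASG could not launch an instance (e.g. no Spot capacity available)"

  event_pattern = jsonencode({
    source        = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance Launch Unsuccessful"]
  })
}

resource "aws_cloudwatch_event_target" "notify" {
  rule = aws_cloudwatch_event_rule.asg_launch_failed.name
  arn  = aws_lambda_function.spot_alert.arn # hypothetical Lambda that posts to Slack/email
}

resource "aws_lambda_permission" "allow_eventbridge" {
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spot_alert.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.asg_launch_failed.arn
}
```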

How much can we save?

Here is the interesting part; let us go through some numbers. AWS says you can save close to 60–70% by using an equivalent Spot instance instead of the on-demand instance you are using right now. Allowing for variation in Spot pricing and availability, you should still be able to save at least 40–50% if you move your entire workload from on-demand to Spot. Depending on your infrastructure requirements, you might run everything on Spot or only part of it, so your actual savings will vary. We went 100% Spot on some of our non-production environments, where cost saving was the only consideration. In our production environments, however, we run on-demand and Spot hand in hand to get the best balance of cost savings and capacity.

To conclude: with this implementation, all our Kubernetes worker groups can utilize a mix of on-demand and Spot instances, and we cut our Kubernetes infrastructure cost by 30%.

Part 2 continues with the technical details and best practices.


DevOps Engineer at upday, an Axel Springer SE company