Reduce Kubernetes Infrastructure cost with EC2 Spot Instances — Part 2

Jaison Netto · upday devs · Apr 30, 2021 · 8 min read

Photo by Michael Longmire

This post is part two of the series about using Amazon EC2 Spot instances as Kubernetes worker nodes. You can read Part 1, about our journey to Spot instances, here.

This part assumes technical expertise along with some working knowledge of AWS and Kubernetes.

The following topics are covered in this post:

  1. EC2 Instance types
  2. Launch Template
  3. Terraform
  4. Node Termination Handler
  5. Notification/Alerts

Choosing EC2 Instance Types

Choosing the right instance types is very important when we start using Spot instances. The following points will help you choose the right set of instance types for your use case.

  1. Understand the CPU and memory requirements of a Worker Group and list all the instance types that can provide similar resources (see the Terraform sketch after this list).
  2. Add at least 5 to 6 instance types if you decide to use 100% Spot for that Worker Group.
  3. Look at the pricing history of the last 3 months and check whether it makes sense to use those types.
  4. Look for the interruption history of all the instances.
  5. Group instances based on type and generation. For example, you can use m5.large, m5a.large, m5ad.large, m5d.large, c5.large, c5a.large, m4.large, c4.large, etc. If you prefer to use only CPU-optimized instances for a workflow, consider all the possible C instance types.
  6. The cheapest on-demand instance might not be the cheapest Spot instance.
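For the first point, one way to shortlist candidate types is to ask EC2 for instance types that match your CPU and memory profile. The snippet below is a minimal sketch, assuming a recent AWS Terraform provider that ships the aws_ec2_instance_types data source; the 4 vCPU / 16 GiB figures are placeholders for your Worker Group's requirements.

data "aws_ec2_instance_types" "worker_candidates" {
  # Placeholder requirements: 4 vCPUs and 16 GiB (16384 MiB) of memory.
  # Filter names follow the EC2 DescribeInstanceTypes API.
  filter {
    name   = "vcpu-info.default-vcpus"
    values = ["4"]
  }

  filter {
    name   = "memory-info.size-in-mib"
    values = ["16384"]
  }

  # Drop this filter if you also want previous generations such as m4/c4.
  filter {
    name   = "current-generation"
    values = ["true"]
  }
}

output "candidate_instance_types" {
  value = data.aws_ec2_instance_types.worker_candidates.instance_types
}

Cross-check the resulting shortlist against the Spot pricing and interruption history (for example via the Spot Instance Advisor) before adding the types to a Worker Group.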

Launch Template

The purchase options and instance types section in the Launch Template setup gives us the options to use Spot instances. A few terms are used there to set up your Worker Group with Spot instances. Let’s have a look at those.

On-demand base capacity

This option defines how much on-demand capacity you want alongside Spot instances. It accepts any number, including zero. Zero means you don’t need any on-demand instances; if you set it to 2, the first two nodes in the ASG will be on-demand, and the rest can be on-demand or Spot depending on the other parameters.

On-demand percentage above base

This option defines the split between Spot and on-demand when the ASG scales up beyond the base capacity. The value is a percentage. For example, if you set it to 25%, then out of 4 nodes created during a scale-up, 1 will be on-demand and the other 3 will be Spot.

Spot allocation strategy

This option defines the strategy used to allocate Spot instances. AWS supports two strategies: lowest-price and capacity-optimized.

  • lowest-price

This strategy creates multiple pools of instances based on the instance types we provide and Spot Instances are provisioned from the Spot capacity pool with the lowest price.

  • capacity-optimized

Instance types are chosen based on real-time capacity data and predictions of available capacity, so that interruptions are minimal.

Capacity Rebalancing

You can configure Spot Fleet to launch a replacement Spot Instance when Amazon EC2 emits a rebalance recommendation to notify you that a Spot Instance is at an elevated risk of interruption. Capacity Rebalancing helps you maintain workload availability by proactively augmenting your fleet with a new Spot Instance before a running instance is interrupted by Amazon EC2.
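To make these purchase options concrete, here is a minimal sketch of how they map onto a plain aws_autoscaling_group with a mixed instances policy. This is illustrative only: the launch template reference, subnet variable, and instance types are placeholders, and the EKS Terraform module used in the next section wraps all of this for you.

resource "aws_autoscaling_group" "spot_workers" {
  name                = "spot-workers"            # placeholder name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids    # placeholder subnet IDs
  capacity_rebalance  = true                      # launch replacements on rebalance recommendations

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2    # first two nodes are on-demand
      on_demand_percentage_above_base_capacity = 25   # 1 on-demand per 3 Spot when scaling up
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.workers.id  # placeholder launch template
        version            = "$Latest"
      }

      override { instance_type = "m5.xlarge" }
      override { instance_type = "m5a.xlarge" }
      override { instance_type = "m4.xlarge" }
    }
  }
}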

Terraform

Let’s look at the Terraform snippets for creating the Worker Groups with the Launch Template. They use all the parameters we discussed under the Launch Template. The snippets below define two Worker Groups.

Worker Group 1:
------
{
  name                     = "test_fullspot"
  override_instance_types  = ["t3a.xlarge", "t3.xlarge", "t2.xlarge", "m5a.xlarge", "m5.xlarge", "m4.xlarge", "m5n.xlarge"]
  spot_instance_pools      = 6
  spot_allocation_strategy = "lowest-price"
  kubelet_extra_args       = "--node-labels=cluster=mytestcluster,purpose=mypurpose --node-labels=node.kubernetes.io/lifecycle=`curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle`"
  asg_desired_capacity     = 2
  asg_min_size             = 2
  asg_max_size             = 25
  root_volume_size         = 50
  root_volume_type         = "gp3"
}

Worker Group 2:
------
{
  name                                     = "test_spot_and_on-demand"
  override_instance_types                  = ["t3a.xlarge", "t3.xlarge", "t2.xlarge", "m5a.xlarge", "m5.xlarge", "m4.xlarge", "m5n.xlarge"]
  spot_instance_pools                      = 0
  on_demand_base_capacity                  = 2
  on_demand_percentage_above_base_capacity = 25
  spot_allocation_strategy                 = "capacity-optimized"
  kubelet_extra_args                       = "--node-labels=cluster=mytestcluster,purpose=mypurpose --node-labels=node.kubernetes.io/lifecycle=`curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle`"
  asg_desired_capacity                     = 2
  asg_min_size                             = 2
  asg_max_size                             = 10
  root_volume_size                         = 50
  root_volume_type                         = "gp3"
}

If you look at the first Worker Group (test_fullspot), you can observe the following:

  1. on_demand_base_capacity is not mentioned. This means on-demand instances won’t be used for the base capacity.
  2. on_demand_percentage_above_base_capacity is not mentioned, which means on-demand instances won’t be used while scaling up either.
  3. spot_instance_pools is set to 6. The ASG spreads its Spot requests evenly across the six lowest-priced Spot capacity pools (a pool is a combination of instance type and Availability Zone) drawn from the instance types we listed, so losing one pool does not take out all the nodes at once.
  4. spot_allocation_strategy is set to lowest-price. This means the ASG will pick the cheapest pools regardless of availability/interruption history or any other factors.

Let’s look at the 2nd Worker Group (test_spot_and_on-demand) now.

  1. on_demand_base_capacity is set to 2. This config ensures that the first two nodes will be on-demand instances.
  2. on_demand_percentage_above_base_capacity is set to 25, which translates to 1 on-demand instance per 3 Spot instances when scaling up.
  3. spot_allocation_strategy is set to capacity-optimized. The ASG will not blindly chase the lowest price here; instead it launches into the Spot pools with the most spare capacity, so the group is less likely to be hit by simultaneous Spot terminations.
  4. spot_instance_pools is set to zero because the capacity-optimized strategy does not use pools.

Both Worker Groups are good in their own way if we use them for the appropriate workloads. The first Worker Group (100% Spot) fits any non-critical use case, whereas the second can be used for general workloads and still save a lot.
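The parameter names in these definitions match the worker_groups_launch_template input of the community terraform-aws-modules/eks module, which is what we assume here. A shortened sketch of how the two Worker Groups could be wired into that module (the cluster version, VPC and subnet variables are placeholders, and the remaining keys are the ones shown above):

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "mytestcluster"
  cluster_version = "1.19"                  # placeholder Kubernetes version
  vpc_id          = var.vpc_id              # placeholder
  subnets         = var.private_subnet_ids  # placeholder

  worker_groups_launch_template = [
    {
      # Worker Group 1 (100% Spot); remaining keys as shown above
      name                     = "test_fullspot"
      override_instance_types  = ["t3a.xlarge", "t3.xlarge", "m5a.xlarge", "m5.xlarge"]
      spot_instance_pools      = 6
      spot_allocation_strategy = "lowest-price"
    },
    {
      # Worker Group 2 (on-demand base + Spot); remaining keys as shown above
      name                                     = "test_spot_and_on-demand"
      override_instance_types                  = ["t3a.xlarge", "t3.xlarge", "m5a.xlarge", "m5.xlarge"]
      spot_instance_pools                      = 0
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 25
      spot_allocation_strategy                 = "capacity-optimized"
    },
  ]
}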

AWS Node Termination Handler

Let’s look at the AWS Node Termination Handler setup. The Node Termination Handler should run as a Kubernetes DaemonSet. It can be installed with Helm in one of two ways:

  1. Add the eks-charts Helm repo and install the Node Termination Handler using the helm install command with all the necessary parameters.
  2. Download/clone the chart, update the configuration in values.yaml, and then install the chart.

Here are some parameters we need to consider in the Node Termination Handler configuration:

  • Node selector

The node selector gives you control over whether the Node Termination Handler runs on all the nodes in your cluster or only on the Spot instances. Ideally, run it only on Spot instances, as the other nodes rarely get interrupted or stopped (barring hardware maintenance or failures). You can use the snippet below to restrict it to Spot instances. If you look at our Terraform code, you can see that we set this label on all Spot instances using the EC2 instance metadata.

nodeSelector:
  node.kubernetes.io/lifecycle: spot
  • Webhook URL

This is an optional parameter that lets the handler notify you about the events it receives (Spot interruptions). You can provide any webhook URL here, and the Node Termination Handler will post the events to it. We use a Slack webhook so that we get notified whenever an interruption is about to happen.

webhookURL: "https://hooks.slack.com/services/xxxx/ssssess"
  • Webhook Template

You can customize the notification message that’s sent to the webhook URL using the webhook template.

webhookTemplate: "{\"text\":\":rotating_light:*INSTANCE INTERRUPTION NOTICE*:rotating_light:\n*_EventID:_* `{{ .EventID }}`\n*_Environment:_* `<env_name>`\n*_InstanceId:_* `{{ .InstanceID }}`\n*_InstanceType:_* `{{ .InstanceType }}`\n*_Start Time:_* `{{ .StartTime }}`\n*_Description:_* {{ .Description }}\"}"

Once the configuration is made, we can install the helm chart using the helm install command.
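Since the rest of our setup already lives in Terraform, the same installation can also be expressed through the Helm provider. The snippet below is a minimal sketch, assuming the Helm provider is already configured against the cluster; var.slack_webhook_url is a hypothetical variable holding the Slack webhook URL, and webhookTemplate can be added in the same way.

resource "helm_release" "aws_node_termination_handler" {
  name       = "aws-node-termination-handler"
  namespace  = "kube-system"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-node-termination-handler"

  # Chart values discussed above, rendered as an inline values file.
  values = [yamlencode({
    nodeSelector = {
      "node.kubernetes.io/lifecycle" = "spot"
    }
    webhookURL = var.slack_webhook_url # hypothetical variable with the Slack webhook URL
  })]
}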

Let’s see how the Node Termination Handler works when one of the Spot instances gets a termination notice.

  1. AWS pushes the termination notice to the instance through the instance metadata, two minutes before the interruption.
  2. The Node Termination Handler DaemonSet pod running on that node picks up the notice from the EC2 instance metadata.
  3. Once it gets the metadata, it uses Kubernetes API to cordon the node to ensure no new work is scheduled there.
  4. It drains the node using Kubernetes API.
  5. Sends a notification to the webhook URL about the Spot interruption.
  6. The Spot instance which got the termination notice will get terminated when the notice window expires.
  7. The cluster-autoscaler detects the need for a new instance and adds another node (Spot) to the ASG.

Let’s look at a real-time example in which a Spot instance in the Staging environment got an interruption notice. Below are the logs and the notification details.

2020/10/15 13:16:46 Got interruption event from channel {InstanceID:i-xx13xx015f7xx
InstanceType:m4.xlarge PublicHostname: PublicIP: LocalHostname:ip-10-xx-96-yy9.yy-west-1.compute.internal
LocalIP:10.41.96.249 AvailabilityZone:xx-west-1c}
{EventID:spot-itn-xxx62f74d4dda8a12XXXXXXXX3dfb1e5b1caafedf059fc5a1e
Kind:SPOT_ITN
Description:Spot ITN received. Instance will be interrupted at 2020-10-15T13:18:46.155Z
State: NodeName:ip-10-xx-yy-xx9.xx-west-1.compute.internal
StartTime:2021-02-15 13:18:46.155 +0000 UTC EndTime:0001-01-01 00:00:00 +0000 UTC Drained:false PreDrainTask:0x113c8a0 PostDrainTask:<nil>}
2020/10/15 13:16:46 Node "ip-10-xx-yy-x9.xx-west-1.compute.internal" successfully cordoned and drained.
2020/10/15 13:16:46 Webhook Success: Notification Sent!

You can see steps 1 to 5 in the logs.

Notification for Spot Interruption

Notification about Spot Unavailability

After setting up all of the above, is this really required? Yes. Even though we have given a wide range of instance types to all our Worker Groups, there can be a situation where none of those instance types has Spot capacity available when the ASG tries to provision a new node (scale-up).

The Launch Template tries to create an instance with any of the instance types we have mentioned, and if it can't get one within a specified time, it retries. This can repeat for as long as Spot capacity remains unavailable, and your application/environment might become degraded because it isn't getting the resources it needs to scale up.

To handle such a rare situation, we can set up a small workflow.

  1. Create an Event Rule under Amazon EventBridge.
  2. Write a pattern to filter the EC2 Instance Launch Unsuccessful events.
  3. Create a Lambda function to process/filter the events, perform the required actions, and notify the stakeholders.
  4. Set the created Lambda as the target of the Event Rule (see the Terraform sketch below).
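A minimal Terraform sketch of that wiring is shown below. The notifier Lambda is assumed to already exist and is only looked up here; its name and the rule name are placeholders.

# Pre-existing notifier Lambda (hypothetical name)
data "aws_lambda_function" "spot_failure_notifier" {
  function_name = "spot-failure-notifier"
}

# EventBridge rule matching the Auto Scaling "EC2 Instance Launch Unsuccessful" events
resource "aws_cloudwatch_event_rule" "spot_launch_failures" {
  name        = "ec2-spot-launch-unsuccessful"
  description = "Scale-up attempts that failed to get capacity"

  event_pattern = jsonencode({
    "source"      = ["aws.autoscaling"]
    "detail-type" = ["EC2 Instance Launch Unsuccessful"]
  })
}

# Send matching events to the notifier Lambda
resource "aws_cloudwatch_event_target" "notify_lambda" {
  rule = aws_cloudwatch_event_rule.spot_launch_failures.name
  arn  = data.aws_lambda_function.spot_failure_notifier.arn
}

# Allow EventBridge to invoke the Lambda
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = data.aws_lambda_function.spot_failure_notifier.function_name
  principal    = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.spot_launch_failures.arn
}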

A sample notification sent using the above workflow is shown below.

Notification about instance Launch Failure

What’s Next

One addition we want to make to this implementation is to accumulate all the Spot provisioning and interruption events. We can then analyze and plot this data, so that a dashboard gives us a better understanding of what's happening.

Conclusion

Our non-production and production environments have been running on this new setup for over 4 months now, and so far it has worked well, with significant cost savings and no compromise in productivity or performance. Our non-production environments run on 100% Spot capacity, and our production environment uses both Spot and on-demand capacity.

Jaison Netto is a DevOps Engineer at upday, an Axel Springer SE company.