Lights Off, Spot On!

How Personio runs Kubernetes workloads on AWS EC2 Spot instances.

Johannes Brück
Inside Personio
8 min read · Feb 18, 2021


Since 2018, Personio has run fully on Amazon Web Services (AWS) cloud infrastructure in the Frankfurt region, making heavy use of managed services like VPC, EC2, S3, RDS, and EKS.

Using these managed services allows us to focus on the important things, the things that help our customers, which include:

  • Ensuring that we reach our high security and data protection standards.
  • Spending time on building features that our customers need.
  • Working on improving the performance and efficiency of our systems, instead of replacing broken hardware or fixing network issues.

Of course, everything comes at a cost. And, rightfully so, AWS charges us for using their services. While many services come with low or even no additional cost, there are a few that are responsible for the largest chunk of our monthly bill.

One of these services is, unsurprisingly, EC2. We use it in conjunction with EKS, which allows us to deploy managed node groups for Kubernetes. These node groups are then deployed as EC2 Auto Scaling groups.

Now, like every other company, we want to keep our costs low, especially since Personio is growing quickly: as we add more and more load to our systems, capacity requirements increase and, ultimately, so do our EC2 costs.

Therefore, we looked into options to keep our costs under control. One of them is to improve the efficiency of our applications and reduce their resource footprint. On top of that, we also looked at EC2 Spot instances, as they provide significant saving opportunities of up to 90% (according to AWS).

Spot Primer

Spot Instances are a purchasing option for EC2. These instances are much cheaper because they are essentially unused capacity within AWS. Technically, they are equal to regular EC2 instances: they have the same CPU, memory, and disk specifications, and thus provide exactly the same performance.

So why doesn’t everyone just use Spot instances? Well, there is one big downside: they can be interrupted and reclaimed with a notice period of only two minutes. Compared to a regular EC2 instance, which can run for months uninterrupted, Spot instances are much less reliable.

If you are running an application that does not handle interruptions very well, or that even runs on a single host, Spot instances are not a good choice.

So, why did we look into using them? Truly, it is only about the money, and because we knew that most of the applications running on our Kubernetes cluster handle restarts pretty well (as they are usually stateless).

Also, we redeploy applications frequently anyway and, on top of that, we autoscale our Kubernetes cluster. Therefore, the pods are used to being restarted somewhere else multiple times per day.

What are the expected savings when using Spot instances? Let’s take a c5.4xlarge as an example instance type. The regular on-demand price in Frankfurt is $0.776 per hour. With reserved instances, we pay ~70% of that, so $0.5432. The current Spot price for this instance type is $0.2706 per hour.
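
Working through those numbers:

    on-demand:  $0.7760 per hour
    reserved:   $0.7760 × 0.70         = $0.5432 per hour
    spot:       $0.2706 per hour

    vs. on-demand: 1 − 0.2706 / 0.7760 ≈ 0.65  →  ~65% saved
    vs. reserved:  1 − 0.2706 / 0.5432 ≈ 0.50  →  ~50% saved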

That is a 65% savings compared to on-demand instances, and still a 50% savings compared to reserved instances!

The Setup

Before we implemented Spot instances, our EKS clusters looked a little something like the image below: we deployed three managed node groups (one per AZ) and used cluster-autoscaler to scale them out and in.

As EKS did not natively support Spot instances at the time, we had to deploy a separate Auto Scaling group with a purchase option of 100% Spot instances. To avoid capacity shortages for a single instance type, we chose instance types from different families but with the same CPU and memory specs, so as not to confuse cluster-autoscaler. In the userdata script, we use /etc/eks/bootstrap.sh to join the cluster manually.
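
As an illustration, the relevant part of that userdata could look roughly like this (the cluster name and node label are placeholders; the label lets later components tell Spot and on-demand capacity apart):

    #!/bin/bash
    # Join the EKS cluster and mark the node as Spot capacity (illustrative values).
    /etc/eks/bootstrap.sh my-eks-cluster \
      --kubelet-extra-args '--node-labels=lifecycle=spot'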

Additionally, to support a graceful shutdown during a rolling update of all cluster nodes, we use an ASG lifecycle hook with LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING. Every time an instance is terminated, the hook sends a notification to an SQS queue. This queue is watched by a component called node-drainer, which drains the respective node when the signal is received.
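
Such a hook can be created with the AWS CLI roughly as follows (the hook name, queue ARN, and role ARN are placeholders):

    aws autoscaling put-lifecycle-hook \
      --lifecycle-hook-name drain-node-on-terminate \
      --auto-scaling-group-name eks-spot \
      --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
      --notification-target-arn arn:aws:sqs:eu-central-1:123456789012:node-drainer \
      --role-arn arn:aws:iam::123456789012:role/asg-lifecycle-hook \
      --heartbeat-timeout 300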

Resilience

As mentioned at the beginning of this article, the savings from Spot instances do not come for free: the instances are less reliable. Let's now have a look at the different scenarios that can happen when using Spot instances and how to keep the system up and running.

Interruption of Spot instances

The most common situation is that a Spot instance is interrupted. AWS sends an interruption notice via the EC2 instance metadata service or AWS EventBridge (formerly CloudWatch Events), which looks like this:
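
(The payload below is illustrative; a real notice carries your account ID, instance ID, and a timestamp roughly two minutes before termination.)

    {
      "version": "0",
      "id": "12345678-1234-1234-1234-123456789012",
      "detail-type": "EC2 Spot Instance Interruption Warning",
      "source": "aws.ec2",
      "account": "123456789012",
      "time": "2021-02-18T09:00:00Z",
      "region": "eu-central-1",
      "resources": ["arn:aws:ec2:eu-central-1a:instance/i-0123456789abcdef0"],
      "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate"
      }
    }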

After receiving this notification, we have to drain the affected Kubernetes node. For this, we installed the aws-node-termination-handler. It watches for these notifications, then cordons and drains the affected node, which gives the pods more time to terminate gracefully than just relying on the SIGTERM they receive when the node shuts down. This situation happens quite frequently, as you can see in the graph below, and the short loss of capacity is quickly compensated by the ASG.

Spot interruptions
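
For reference, one common way to install the aws-node-termination-handler is via the Helm chart from AWS's eks-charts repository (default values shown):

    helm repo add eks https://aws.github.io/eks-charts
    helm install aws-node-termination-handler eks/aws-node-termination-handler \
      --namespace kube-system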

No Spot Capacity Available

The second scenario that can happen with Spot instances is a sudden surge in the Spot price, causing more than one Spot instance to be terminated at the same time. On top of that, replacement instances are not coming back, either because there is no capacity or because the price is higher than your bid price.

In this case, the Spot ASG can no longer scale out or replace the lost instances. This is especially problematic because the instances being terminated also need to be drained, and their pods need to move somewhere else.

Therefore, we need another source of capacity quickly. We cannot rely on cluster-autoscaler in this case, as it will only scale out once the Spot instance is gone and pods are already waiting to be scheduled.

To close this gap, we created a Lambda function that listens for the Spot interruption notice in EventBridge and immediately scales up the on-demand ASG in the same availability zone as the terminated Spot instance. This allows us to quickly get on-demand capacity into the cluster when a Spot node is terminated.
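
A minimal sketch of such a function, assuming Python with boto3 and a hypothetical mapping from availability zone to on-demand ASG name (a real implementation would add error handling and an upper bound on the desired capacity):

    import boto3

    ec2 = boto3.client("ec2")
    autoscaling = boto3.client("autoscaling")

    # Hypothetical mapping from availability zone to the on-demand ASG in that zone;
    # in a real setup this could come from tags or environment variables.
    ONDEMAND_ASGS = {
        "eu-central-1a": "eks-ondemand-1a",
        "eu-central-1b": "eks-ondemand-1b",
        "eu-central-1c": "eks-ondemand-1c",
    }

    def handler(event, context):
        # The "EC2 Spot Instance Interruption Warning" event carries the instance ID.
        instance_id = event["detail"]["instance-id"]

        # Find out in which availability zone the interrupted instance runs.
        reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
        az = reservation["Instances"][0]["Placement"]["AvailabilityZone"]

        # Scale the on-demand ASG in the same AZ up by one instance, so replacement
        # capacity arrives before the Spot node disappears.
        asg_name = ONDEMAND_ASGS[az]
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name]
        )["AutoScalingGroups"][0]
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=group["DesiredCapacity"] + 1,
            HonorCooldown=False,
        )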

If everything goes well and another Spot node is added as a replacement, the on-demand node is no longer needed and can be terminated again.

If no replacement Spot capacity is available, the on-demand node stays and compensates for the loss of Spot capacity, with the trade-off of a higher price, of course.

In the end, the AWS infrastructure looks like this:

Optimizing Spot Usage

Now that we have a Spot ASG, how do we tell our cluster to prefer this cheaper capacity?

Cluster Autoscaler Priority Expander

We first need to ensure that there is enough Spot capacity in the cluster to be used.

Cluster-autoscaler takes care of adding nodes when needed, so this component needs to be configured to prefer scaling up the Spot ASG.

This can be done using the priority expander with the following configuration:
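
An illustrative version of that configuration (the priority values and name patterns are examples; higher numbers win) is a ConfigMap read by cluster-autoscaler in the kube-system namespace:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-autoscaler-priority-expander
      namespace: kube-system
    data:
      priorities: |-
        50:
          - .*eks-spot.*   # our dedicated Spot ASG
        10:
          - .*             # catch-all: the dynamically named managed on-demand node groups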

The highest priority wins. In this case, it is our ASG with the name eks-spot. If it does not exist for some reason, or it is not able to scale, the second option will be used: the dynamically named ASG of the managed on-demand node group.

Tainting the On-Demand Nodes

What if there is free capacity on the on-demand nodes? In this case, nothing prevents Kubernetes from scheduling a pod there.

This has a downside: it fills up the on-demand nodes and lowers the chances that cluster-autoscaler will scale in one of the on-demand ASGs.

In order to prevent pods from being scheduled on on-demand nodes while there is free Spot capacity, we can taint our on-demand nodes with --register-with-taints=lifecycle=ondemand:PreferNoSchedule. We set this taint on boot via the --kubelet-extra-args parameter of /etc/eks/bootstrap.sh.
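
On the on-demand node groups, the bootstrap call could look roughly like this (the cluster name is a placeholder; the lifecycle label mirrors the one on the Spot nodes):

    #!/bin/bash
    # Join the EKS cluster, taint the node as on-demand capacity, and label it accordingly.
    /etc/eks/bootstrap.sh my-eks-cluster \
      --kubelet-extra-args '--register-with-taints=lifecycle=ondemand:PreferNoSchedule --node-labels=lifecycle=ondemand'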

Now all the pods that don’t tolerate the lifecycle=ondemand taint will preferably be scheduled on Spot nodes.

Install the Spot-rescheduler

Lastly, what about already-scheduled pods? To move even more workloads to Spot nodes, we use the spot-rescheduler project. This component actively looks for pods that can be moved to Spot nodes, based on the lifecycle label of the nodes.

If it finds a new home for all pods of an on-demand node, it drains this node, and the pods move to Spot nodes thanks to the taint on the on-demand nodes.

Unfortunately, this project is now unmaintained and lacks support for Kubernetes versions newer than 1.15. Since we upgraded to Kubernetes 1.18 some weeks ago, we are actively looking into a replacement solution.

Conclusion and Outlook

By implementing a setup based on a custom EKS node group, using a combination of Spot instances and Reserved Instances for the on-demand nodes, we cut our EC2 costs by a factor of four.

We have implemented resiliency to handle situations like Spot interruptions and price surges, using aws-node-termination-handler and a custom Lambda function.

Furthermore, we optimized our Spot usage by preferring Spot capacity with the cluster-autoscaler priority expander, tainting the on-demand nodes, and running the spot-rescheduler.

To improve the setup further and reduce the operational effort, we will look into the recently released Spot support for managed EKS node groups, as well as moving workloads to Fargate for more efficient pay-per-use billing.

In case you are excited to work on challenges like these, check out our open positions and join us at Personio — the most valuable HR tech company in Europe!
