Spot Instances in production: is it worth it?

Published in mercos-engineering · 4 min read · Feb 2, 2018

For a long time, we at Mercos struggled with CPU credits on T2 instances, but switching to non-T2 instances would have hurt our pockets. So we decided to explore the Spot world! Follow our experience below.

Briefing

Today, in production at Mercos, we use AWS Auto-Scaling groups based on AMIs generated by Packer. This makes our lives much easier because we can trust our Auto-Scaling groups to perform scale-in/scale-out actions, roll out new deployments, self-heal from failed instances, and so on.

Thanks to the elasticity AWS gives us and to our system's memory consumption, we can run Mercos on small instances. During periods of increased demand, scale-out takes over and launches as many new instances as needed to run everything without running short of resources. At night, as expected for a system used mainly during business hours, our traffic drops, and for almost two thirds of the day CPU usage is low, so paying for expensive instances would also be a waste of money.

Given that scenario, we decided to use T2.Medium instances (cheaper), expecting that with the auto-scaling CPU trigger set to around 35% we wouldn't need to spend CPU credits. It didn't work as expected: every deploy to production launched new instances, and their warm-up period consumed CPU credits. After a busy day we would run out of CPU credits, and AWS then applied a throttling rule (image below) that limited our CPU to only 40%. With that came chaos: slow response times and even unavailability. It also meant we couldn't use the whole instance capacity we were paying for (burning money).
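To make the credit arithmetic concrete, here is a minimal sketch of how a credit balance evolves hour by hour. It assumes the figures AWS publishes for t2.medium (2 vCPUs, 24 credits earned per hour, a cap of 576 credits) and that one credit equals one vCPU at 100% for one minute; please verify these against the current AWS documentation. The function name and the utilisation inputs are invented for illustration.

```python
def simulate_credits(hourly_cpu_percent, start_balance=0.0,
                     earn_per_hour=24.0, vcpus=2, max_balance=576.0):
    """Return the CPU credit balance after each simulated hour.

    hourly_cpu_percent: average utilisation per hour (0-100), averaged
    over all vCPUs. The default rates are assumed t2.medium figures.
    """
    balance = start_balance
    history = []
    for cpu in hourly_cpu_percent:
        # One credit = one vCPU-minute at 100%, so an hour at `cpu` percent
        # on `vcpus` vCPUs spends cpu/100 * vcpus * 60 credits.
        spent = cpu / 100.0 * vcpus * 60
        # The balance can neither go negative nor exceed the cap.
        balance = min(max_balance, max(0.0, balance + earn_per_hour - spent))
        history.append(balance)
    return history


if __name__ == "__main__":
    # Three hours at a steady 35% average, starting from 30 credits.
    print(simulate_credits([35, 35, 35], start_balance=30))
```

Under these assumptions the balance only holds steady at about 20% average utilisation across both vCPUs (40% of a single vCPU), so an instance that averages 35% loses credits every hour, and a warm-up burst after each deploy drains it even faster.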

Image: a T2.Medium instance being throttled by AWS ("Busy Other" means steal CPU)

So, to cut a long story short, we left T2 instances behind and started using C4.Large Spot Instances, which let us make proper use of the available CPU while paying less than the equivalent on-demand instances, and even less than the T2 instances.

So, what are Spot Instances, really?

According to the AWS documentation, Spot Instances are essentially spare compute capacity in the AWS cloud, available to you at steep discounts compared to On-Demand prices.

We saw that the discounts can reach up to 90% of the on-demand price, which is a win-win for us: we save money and we use instances without CPU throttling.

Then we studied Spot Instances further and reached a few conclusions:

The price varies with the demand and the availability of spare hardware
We need to set a bid price for the instance, and when the current price rises above it, we lose the instance
AWS can take the instance back if there is not enough unused EC2 capacity to meet demand
If we bid a high price, we could end up paying more than on-demand
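The conclusions above can be sketched as a small decision function. This is only an illustration of the bid model as described here: you pay the current Spot price (not your bid) for as long as that price stays at or below the bid. The function name and all prices are hypothetical.

```python
def spot_outcome(bid, spot_price, on_demand_price):
    """Describe what happens to a Spot Instance for one pricing period.

    All prices are hourly prices in the same currency (hypothetical values).
    """
    if spot_price > bid:
        # The market price exceeded our bid: AWS reclaims the instance.
        return {"running": False, "hourly_cost": 0.0, "above_on_demand": False}
    # We keep the instance and pay the market price, not the bid.
    return {
        "running": True,
        "hourly_cost": spot_price,
        # A bid above on-demand exposes us to paying more than on-demand.
        "above_on_demand": spot_price > on_demand_price,
    }


if __name__ == "__main__":
    on_demand = 0.10  # hypothetical on-demand price
    for price in (0.03, 0.12, 0.20):
        print(price, spot_outcome(bid=0.15, spot_price=price,
                                  on_demand_price=on_demand))
```

A bid above the on-demand price keeps the instance alive through price spikes, at the cost of occasionally paying more than on-demand while the spike lasts.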

What we did with all those facts

We started this journey by configuring our Auto-Scaling groups to use C4.Large Spot Instances instead of T2.Medium on-demand instances. For this, we only had to set a bid price in our Launch Configuration; the Auto-Scaling group then launches Spot Instances automatically. So far so good. However, there is no free lunch: as said before, AWS can take the instances from us if the price goes up or if there is a shortage of on-demand capacity (Black Friday, holidays, etc.). To deal with that, we created a Lambda function that monitors the Spot Instance price and alerts us if it goes higher than we are able to pay. The CloudWatch metric graph is below.
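The article does not show the Lambda function itself, so the following is only a sketch of what such a monitor could look like, not the actual Mercos code. It uses the boto3 calls `describe_spot_price_history` and `put_metric_data`; the namespace `Custom/Spot`, the metric name `SpotPrice`, and the helper names are assumptions made for this example, and pagination of the price history is ignored for brevity.

```python
from datetime import datetime, timezone


def latest_spot_prices(ec2, instance_type="c4.large", product="Linux/UNIX"):
    """Return the most recent Spot price per Availability Zone."""
    response = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=[product],
        StartTime=datetime.now(timezone.utc),  # ask for prices effective now
    )
    latest = {}
    for item in response["SpotPriceHistory"]:
        zone = item["AvailabilityZone"]
        # Keep only the newest entry seen for each zone.
        if zone not in latest or item["Timestamp"] > latest[zone][0]:
            latest[zone] = (item["Timestamp"], float(item["SpotPrice"]))
    return {zone: price for zone, (_, price) in latest.items()}


def publish_prices(cloudwatch, prices, instance_type="c4.large",
                   namespace="Custom/Spot"):
    """Publish one custom CloudWatch data point per Availability Zone."""
    if not prices:
        return  # nothing to publish
    cloudwatch.put_metric_data(
        Namespace=namespace,  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "SpotPrice",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "InstanceType", "Value": instance_type},
                    {"Name": "AvailabilityZone", "Value": zone},
                ],
                "Value": price,
                "Unit": "None",
            }
            for zone, price in sorted(prices.items())
        ],
    )


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    prices = latest_spot_prices(boto3.client("ec2"))
    publish_prices(boto3.client("cloudwatch"), prices)
    return prices
```

Run on a schedule, a function like this would feed a metric on which a CloudWatch alarm can be set, with the threshold being the highest price we are willing to pay.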

With this metric and alarm in place, we decided to set a higher bid price to secure our instances; if the alarm triggers, we can switch our instances to on-demand before AWS takes them back (no downtime). This project is still in progress, though, and the next step is to create a separate Auto-Scaling group with on-demand instances that scales out based on a CloudWatch metric monitoring the size of our Spot Instance group. That way we are protected against an unexpected AWS action that takes all our instances, while we keep saving money.
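Since this part is still planned rather than built, the following is only a sketch of the sizing rule such a fallback group could follow: the on-demand group covers whatever the Spot group is missing, up to a ceiling. The function and parameter names are hypothetical.

```python
def on_demand_desired(target_capacity, spot_in_service, max_on_demand):
    """Return how many on-demand instances the fallback group should run.

    target_capacity: instances needed to serve current traffic.
    spot_in_service: healthy instances currently in the Spot group.
    max_on_demand:   upper bound for the on-demand group (cost guard).
    """
    # Cover only the gap left by missing Spot Instances, never less than zero.
    shortfall = max(0, target_capacity - spot_in_service)
    return min(shortfall, max_on_demand)


if __name__ == "__main__":
    # Spot group healthy: no on-demand instances needed.
    print(on_demand_desired(target_capacity=6, spot_in_service=6, max_on_demand=10))
    # AWS reclaimed four Spot Instances: on-demand fills the gap.
    print(on_demand_desired(target_capacity=6, spot_in_service=2, max_on_demand=10))
```

In practice the Spot group size could come from the Auto-Scaling group metrics in CloudWatch, and the result could be applied as the desired capacity of the on-demand group; both of those wiring choices are assumptions here, not something the project has settled.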

Conclusion

Simplifying as much as possible, we saved around 75% of our EC2 costs just by using an AWS feature available to everyone. Most of this saving comes from our API servers, which run basically 24/7. We also had the opportunity to review our Auto-Scaling configurations, which gave us much more confidence in our environment. And we no longer depend on instances with CPU credits, which makes our auto-scaling activities smoother and safer. So, Spot Instances are for sure a very good way to save money on your infrastructure.

Written by Cleber Benjamin Warmling