AWS EC2 Spot useful Tips

Amol Kokje
Published in The Startup
6 min read · Sep 25, 2019
EC2 Spot is an AWS feature that is widely popular in the community right now because, used the right way, it can deliver huge cost savings without compromising the scalability and durability of your cloud deployment. There are already many blogs that dive deep into how to use the feature, so I will not cover that here. In this post, I will focus on summarizing and answering the most common questions users (including me!) have when they hear about Spot. I will point to some AWS blogs that contain great examples to use as reference.

Let me start with clarifying the common terms that come up in every discussion.

  • Spot Instances: Unused EC2 capacity that is available at a lower cost than On-demand instances.
  • Spot Fleet: Allows you to launch multiple Spot and On-demand instances in a single request to meet a target capacity, expressed in instances or in custom units. You can use automatic scaling to scale out/in in response to scaling policies. When using this, you need to be aware of the instance pool availability in your AZ, adjust the bidding price, etc. to ensure you have instance capacity to service your workloads.
  • EC2 Fleet: Enables you to launch fleets of On-demand/Reserved/Spot instances in a single request. As opposed to Spot Fleet, you don’t need any custom code to manage bids and capacity. The fleet uses the specified target On-demand/Spot capacity to scale. There is no automatic scaling yet, but AWS has plans to integrate with EC2 Auto Scaling groups.
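To make the "target On-demand/Spot capacity" idea concrete, here is a minimal sketch of an EC2 Fleet request built for boto3's `create_fleet` call. The helper names, the launch template ID, the instance types and the capacity numbers are all placeholders I made up for illustration; treat it as a starting point, not a recommended configuration.

```python
def build_fleet_request(template_id, instance_types, on_demand, spot):
    """Build keyword arguments for boto3's ec2.create_fleet call."""
    return {
        "Type": "maintain",  # the fleet tries to keep the target capacity filled
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": on_demand + spot,
            "OnDemandTargetCapacity": on_demand,
            "SpotTargetCapacity": spot,
            "DefaultTargetCapacityType": "spot",
        },
        # Spread Spot capacity across the listed instance pools.
        "SpotOptions": {"AllocationStrategy": "diversified"},
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": template_id,  # placeholder ID
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            }
        ],
    }


def launch_fleet(request):
    """Send the request to AWS (needs boto3 and credentials; not run here)."""
    import boto3  # third-party dependency, imported only when actually used

    return boto3.client("ec2").create_fleet(**request)["FleetId"]
```

For example, `build_fleet_request("lt-example", ["m5.large", "c5.large"], 2, 8)` describes a fleet of ten instances, two of them On-demand and eight Spot, drawn from two instance types.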

Spot instances are very cheap, but most developers raise two objections: Spot termination and Spot availability. These are not really limitations, since they are exactly what makes Spot so cheap, and there are many ways to work around them. Applications that use Spot should be loosely coupled, and the services running on the instances should be responsible for relatively small operations that don’t need to maintain state information for a very long time. Below are some great use cases.

Use Spot Fleet with ELB:

As you know, when using an ELB (Elastic Load Balancer) with EC2, instances register and deregister with the ELB based on health check status, which the ELB uses to route traffic. But when using EC2 Spot, instances do not deregister automatically when there is a termination notice, so some traffic will be dropped if this scenario is not handled properly.

The way to go here is to use the interruption notice as a trigger to deregister your instance, so that no more traffic is routed to it.

To achieve that, you need a Lambda function that is triggered by the CloudWatch instance termination event. The job of this Lambda function is to get the instance ID from the event and deregister the instance from the ELB. An AWS blog post covers this technique in more detail.
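As a rough illustration, the Lambda function could look like the sketch below. It assumes the instances sit behind an ALB/NLB target group (for a Classic ELB the equivalent boto3 call is `deregister_instances_from_load_balancer`), and `TARGET_GROUP_ARN` is a hypothetical environment variable I introduced for this example. The optional `elbv2` argument exists only so the handler can be exercised without AWS access.

```python
import os


def extract_instance_id(event):
    """Return the instance ID from a Spot interruption warning, else None."""
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    return event.get("detail", {}).get("instance-id")


def lambda_handler(event, context, elbv2=None):
    """Deregister the interrupted instance so it stops receiving traffic."""
    instance_id = extract_instance_id(event)
    if instance_id is None:
        return {"deregistered": None}  # not an interruption warning; ignore
    if elbv2 is None:
        import boto3  # provided by the AWS Lambda Python runtime

        elbv2 = boto3.client("elbv2")
    # TARGET_GROUP_ARN is a hypothetical setting chosen for this sketch.
    elbv2.deregister_targets(
        TargetGroupArn=os.environ["TARGET_GROUP_ARN"],
        Targets=[{"Id": instance_id}],
    )
    return {"deregistered": instance_id}
```

The function would be wired to a CloudWatch Events rule that matches the `EC2 Spot Instance Interruption Warning` detail type from the `aws.ec2` source.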

AWS EC2 Spot: Using Spot with ELB. Courtesy: AWS blog

Build scalable and highly-available applications:

Are Spot instances only meant for short-lived tasks? The answer is no. You can also use Spot instances for large-scale applications. Of course, ideally you should not store persistent data on the instances, and you should use a reliable backup mechanism that can maintain the state. You can definitely use Spot to scale out at peak traffic conditions.

Since Spot instances terminate, how do we ensure that the application is highly available? An AWS blog post describes a unique and proven methodology used by Appnext for its production deployment, where the base capacity is fulfilled by always-running On-demand capacity (reserved, to save costs :-)), and Spot instances are used to service peak workloads based on CloudWatch metrics. If Spot Fleet is unable to fulfill the target capacity and scale out due to lack of capacity in the selected Spot capacity pools, On-demand instances are started instead. For example, see below:

  • Spot fleet: This is the main supplier of Spot instances that are automatically joined as targets behind the ELB on launch. The Spot fleet is configured with an automatic scaling policy. It only adds Spot instances if the average CPU utilization of the On-demand instances passes the 75% threshold for 10 minutes. It terminates instances if the CPU utilization goes below a 55% CPU threshold.
  • Auto Scaling group for On-Demand instances: This is the secondary supplier of EC2 instances for the ELB. The auto scaling group is configured with 1–2 static On-demand instances. It scales out when the average CPU utilization of the instances in the auto scaling group passes the 90% threshold for 10 minutes. It terminates instances when the CPU utilization goes below 75%.
AWS EC2 Spot: Auto Scaling. Courtesy: AWS blog
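To see how the two sets of thresholds interact, here is a small, simplified model of the decision logic. It is only a sketch: in the real setup these are CloudWatch alarms and scaling policies rather than code, and the per-minute sampling and function names are my own assumptions.

```python
def scaling_action(cpu_samples, scale_out_at, scale_in_below):
    """Decide a scaling action from per-minute average CPU percentages."""
    window = cpu_samples[-10:]  # the last 10 minutes
    # Scale out only if CPU stayed above the threshold for the whole window.
    if len(window) == 10 and all(s > scale_out_at for s in window):
        return "scale_out"
    # Scale in as soon as the latest sample drops below the lower threshold.
    if window and window[-1] < scale_in_below:
        return "scale_in"
    return "hold"


def plan(cpu_samples):
    """Apply the thresholds quoted above to both instance suppliers."""
    return {
        "spot_fleet": scaling_action(cpu_samples, 75, 55),
        "on_demand_asg": scaling_action(cpu_samples, 90, 75),
    }
```

With a steady 80% CPU, only the Spot fleet scales out. The On-demand group joins in only when CPU stays above 90%, which is what you would expect to see if Spot capacity could not be fulfilled.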

Loosely-couple application components:

What if the instance terminates while it’s executing a workflow component? There are various ways to handle this case. An AWS blog post has a great batch processing example that shows how to loosely couple your application components.

Say you have a pipeline with requests coming in. The request metadata can be queued in SQS, with the data stored in S3 and DynamoDB persisting the state of each request, which may be ‘Pending’ to start with. A Spot fleet can service the requests by pulling from the queue. When a Spot worker finishes servicing a job, it marks the job ‘Done’ in DynamoDB and moves the data out of the S3 bucket. If the instance is terminated before the job is complete, the worker adds the job back to the SQS queue and does not update the state in DynamoDB. This ensures that all jobs in the queue are processed and that Spot interruptions are taken care of.

AWS EC2 Spot: Batch Processing Example. Courtesy: AWS blog
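A worker for this pattern might be sketched as follows. The message format (a JSON body with a `job_id` field), the `status` attribute, and the `run_job` and `interrupted` callables are assumptions made for this example. `sqs` is expected to behave like a boto3 SQS client and `table` like a boto3 DynamoDB `Table` resource. Note that SQS also makes an undeleted message visible again once its visibility timeout expires, so a job is not lost even if the worker dies without handing it back.

```python
import json


def process_next_job(sqs, table, queue_url, run_job, interrupted):
    """Handle at most one queued job; return its ID, or None if none finished."""
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        handle = message["ReceiptHandle"]
        if interrupted():
            # Hand the job back right away and leave its state untouched.
            sqs.change_message_visibility(
                QueueUrl=queue_url, ReceiptHandle=handle, VisibilityTimeout=0
            )
            return None
        run_job(job)
        # "status" is a DynamoDB reserved word, hence the #s alias.
        table.update_item(
            Key={"job_id": job["job_id"]},
            UpdateExpression="SET #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "Done"},
        )
        # Delete only after the state is saved, so a crash cannot drop the job.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=handle)
        return job["job_id"]
    return None
```

For brevity the sketch checks for an interruption only before starting a job; a longer-running job would want to check periodically while it works.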

How to persist state across spot instances?

How will your application/service running on a Spot instance know about the instance termination notice? This is important because if the app/service needs to take some action, say saving state or uploading a data backup, it can then use the two-minute window effectively. There are multiple ways to use that notification:

  • EC2 will send a notification to CloudWatch Events, which can be used to trigger a Lambda function that takes backups and makes necessary environment changes such as re-assigning an Elastic IP, deregistering from the ELB, backing up data to S3, updating a Route53 endpoint, or any other action of your choice.
  • On the instance, you can check for a termination notice by monitoring the metadata located at “http://169.254.169.254/latest/meta-data/spot/termination-time” using a thread. This thread can raise an event to invoke callbacks that take the appropriate action. A simple example in bash shell can be:
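#!/bin/bash
# Poll the instance metadata every 5 seconds for a Spot termination notice.
URL="http://169.254.169.254/latest/meta-data/spot/termination-time"
while true
do
    # A UTC timestamp (e.g. 2019-09-25T12:00:00Z) appears only once the instance is marked.
    if curl -s "$URL" | grep -q '.*T.*Z'; then
        /env/bin/runterminationscripts.sh
        break   # Run the termination scripts once, then stop polling.
    else
        # Spot instance not yet marked for termination.
        sleep 5
    fi
done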
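If your service is written in Python, the same polling can run in a background thread that invokes callbacks, as described above. This is a minimal sketch under a few assumptions: the function names and the callback signature are my own, and the plain GET assumes the instance metadata service accepts unauthenticated requests (instances that enforce IMDSv2 need a session token first).

```python
import re
import threading
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the termination timestamp, or None if the instance is not marked."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace").strip()
    except OSError:  # covers HTTP 404 (not marked yet) and network errors
        return None
    return body if TIMESTAMP.fullmatch(body) else None


def watch_for_termination(callbacks, fetch=fetch_termination_time, interval=5, stop=None):
    """Start a daemon thread that runs every callback once a notice appears."""
    stop = stop or threading.Event()

    def poll():
        while not stop.is_set():
            when = fetch()
            if when is not None:
                for callback in callbacks:
                    callback(when)  # e.g. save state, upload a backup
                return
            stop.wait(interval)  # sleep, but wake early if stop is set

    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread
```

A service would call `watch_for_termination([save_state, upload_backup])` once at startup, where the two callbacks are whatever cleanup functions your application defines; each receives the termination timestamp as its only argument.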

Besides the applications described here, there are many other use cases for Spot, like Kubernetes, EMR, etc. These examples just help to zoom in on the common concerns and their solutions. Personally, I don’t see much benefit in using a Spot instance by itself; using it with Spot Fleet or EC2 Fleet is what makes it so useful! If you find any more great tricks and tips for using Spot in native AWS, I would love to know more!
