EMR Series: How to leverage Spot Instances in Data Pipelines on AWS
by Akshay Tambe
At Integral Ad Science (IAS), we measure over 100 billion data events daily, giving our customers unmatched scale, coverage, and accuracy. We process this data with hundreds of big data processing and data science pipelines. As we’ve continued to scale globally, IAS migrated to a cloud-based infrastructure hosted on Amazon Web Services (AWS), resulting in cost savings and increased performance. One great strategy to control and reduce AWS costs is to leverage spot instances.
Spot Instances are spare EC2 instances in the AWS Cloud that are offered at up to 90% off the On-Demand price. These savings come with a caveat: AWS can reclaim the instances with a two-minute warning. This creates challenges when including them in production workflows, but those challenges can be addressed with the right design considerations.
This blog highlights our experiences building cost-effective big data pipelines on AWS. We will cover:
- Spot Instances in EMR
- Spot Instance provisioning best practices and cost-saving tips
- Ensuring fault tolerance and monitoring while using Spot Instances
- Scenarios where it did not make sense to leverage Spot Instances
Spot Instances in EMR
The EMR Service provides tooling to leverage Spot Instances easily. Workloads using Spot Instances must be fault-tolerant and flexible. For more information about getting started with spot provisioning, see the official AWS Tutorials.
At IAS, we provision EMR clusters programmatically using the CDK in our data pipelines. This lets us create transient (disposable) EMR clusters with custom configurations on the fly.
Spot Instance Provisioning Best Practices
When using Spot Instance provisioning, we also need a way to handle Spot interruptions in our workflows. We can achieve this by:
- Using the right provisioning strategies
- Having fault-tolerance in the application
- Having monitoring and alert mechanisms
Using the Right Provisioning Strategy:
Since Spot Instances are excess capacity, there is no guarantee that they will be available when your data pipeline requests them. The spot best practices guide describes multiple Spot Instance provisioning strategies that you can try out.
After doing multiple experiments, we came up with useful measures to deal with Spot Instances and cost-saving tips:
1. Use Spot Fleet Instances instead of Single Spot Instance Type:
Relying on a single instance type is less reliable, as it can lead to instance shortages. Therefore, when you want to fulfill capacity targets with Spot Instances, use a fleet configuration. Using Spot Fleet, you can define up to five EC2 instance types (15 types per task instance fleet, if using the allocation strategy option), which widens your search and gives higher availability. This increases the chance that suitable Spot capacity is available and reduces the impact of a specific instance type being reclaimed.
Consider using AWS Spot Advisor to determine instance types with the least chance of interruption. Along with this tool, use the EC2 Instances Info tool to compare the configuration of instances required for workloads.
2. Mix On-Demand and Spot Instances:
To ensure more reliability in time-sensitive pipelines, we can have a mix of on-demand and spot instances. For Master nodes, make use of On-Demand instances only to avoid spot-interruptions that may take down your entire cluster.
Start with 100% On-Demand while you develop the pipeline. Once development is complete, gradually reduce the On-Demand share and increase the Spot share, for example:
- 70% On-Demand and 30% Spot
- 50% On-Demand and 50% Spot
- 40% On-Demand and 60% Spot
- and so on …
While trying different combinations, observe the total runtime of the EMR workflow and the Spot interruptions reported in the EMR console event logs. Select the combination with the fewest interruptions and a runtime similar to that of the all-On-Demand run.
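The ramp above can be expressed as a small helper that turns a total capacity and an On-Demand percentage into fleet targets. This is a minimal sketch: the function name `split_capacity` and the total of 20 units are hypothetical, and the output keys mirror the `TargetOnDemandCapacity` and `TargetSpotCapacity` fields of an EMR instance fleet.

```python
def split_capacity(total_units: int, on_demand_percent: int) -> dict:
    """Split a fleet's target capacity between On-Demand and Spot units."""
    if total_units < 0:
        raise ValueError("total_units must be non-negative")
    if not 0 <= on_demand_percent <= 100:
        raise ValueError("on_demand_percent must be between 0 and 100")
    # Integer ceiling division: round the On-Demand share up so the
    # guaranteed capacity never falls below the requested percentage.
    on_demand = (total_units * on_demand_percent + 99) // 100
    return {
        "TargetOnDemandCapacity": on_demand,
        "TargetSpotCapacity": total_units - on_demand,
    }


# The ramp from the text: all On-Demand first, then shift toward Spot.
RAMP_PERCENTAGES = [100, 70, 50, 40]
plans = [split_capacity(20, pct) for pct in RAMP_PERCENTAGES]
```

Each entry in `plans` can be merged into a fleet definition for one experiment run, so the only thing that changes between runs is the split.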
3. Use Spot Block Instances for minimizing interruptions:
Spot Block allows you to request Amazon EC2 Spot Instances for a defined time block (1 to 6 hours) during which the instances are designed not to be reclaimed by AWS, so your job can run continuously. This strategy is ideal for data pipelines that take a finite time to complete.
In rare situations, Spot Blocks may be interrupted with a two-minute warning due to Amazon EC2 capacity needs, but AWS does not charge for those terminated instances even if you used them.
Also, please note that the Spot Block pricing model differs slightly from that of Spot Fleet (typically 30% to 45% below On-Demand pricing).
So, if you want to build a pipeline that takes less than 6 hours while minimizing Spot interruptions and improving reliability, go for the Spot Block provisioning approach. You can specify the block duration in the Spot specification of the EMR configuration.
4. Fallback Mechanism:
If you can’t find spot instances, make sure you have a fallback mechanism ready. One example is to make use of the Spot Provisioning strategy along with timeout thresholds and switching to On Demand.
We used a fallback mechanism that switches to On-Demand if Spot Instances cannot be found within 10 minutes.
Spot Block Example:
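A minimal sketch of such a configuration, written as a Python dictionary in the shape that boto3's `run_job_flow` accepts for an instance fleet. Treat it as an illustration rather than the exact CDK code used in our pipelines: the fleet name, instance type, and capacity of 4 are placeholders, the helper `spot_block_core_fleet` is hypothetical, and it assumes Spot Blocks are available to your account.

```python
def spot_block_core_fleet(block_minutes: int = 120, timeout_minutes: int = 10) -> dict:
    """Build a core instance fleet that requests Spot Block capacity."""
    # Block durations are whole hours, from 1 to 6 hours.
    if block_minutes not in range(60, 361, 60):
        raise ValueError("block_minutes must be 60, 120, 180, 240, 300 or 360")
    return {
        "Name": "core-spot-block",          # placeholder name
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": 4,            # placeholder capacity
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},  # placeholder type
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Time block during which the instances should not be reclaimed.
                "BlockDurationMinutes": block_minutes,
                # Give up on Spot after this long and fall back to On-Demand.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


core_fleet = spot_block_core_fleet(block_minutes=120, timeout_minutes=10)
```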
Spot Fleet Example:
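A minimal sketch of a diversified task fleet with the 10-minute fallback, again as a Python dictionary in the shape boto3's `run_job_flow` accepts. The helper `diversified_task_fleet`, the fleet name, the instance types, and the capacity are illustrative assumptions, not our production values.

```python
def diversified_task_fleet(instance_types, spot_units, timeout_minutes=10):
    """Build a task fleet that spreads a Spot request over several instance types."""
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "Name": "task-spot-fleet",  # placeholder name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": spot_units,
        # One entry per candidate type; more types widen the search for capacity.
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": 1} for itype in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                # Fall back to On-Demand if Spot is not fulfilled in time.
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


# Placeholder instance types of similar size (4 vCPUs each).
task_fleet = diversified_task_fleet(
    ["m5.xlarge", "m5a.xlarge", "m4.xlarge", "r5.xlarge", "c5.xlarge"],
    spot_units=10,
)
```

Keeping the candidate types close in vCPU and memory lets each one carry the same weight, which keeps the capacity arithmetic simple.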
5. EMR Auto/Managed Scaling with Task Nodes for long-running workloads:
If you have a long-running workload on EMR (running for more than 2 hours), consider adding task nodes with Spot capacity and enabling the EMR auto/managed scaling feature.
Using task nodes makes your job resilient to Spot interruptions, as running jobs won’t fail when a task node is reclaimed. Also, with the scaling feature enabled, your cluster is automatically resized for the best performance at the lowest possible cost. This feature is extremely responsive and reacts to changes in usage in less than a minute.
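As a rough illustration, a managed scaling policy can cap the On-Demand and core capacity so that any growth beyond those caps comes from Spot task nodes. The sketch below builds the policy dictionary in the shape boto3's `put_managed_scaling_policy` expects. The helper name and all numbers are assumptions, and `InstanceFleetUnits` assumes the cluster uses instance fleets.

```python
def managed_scaling_policy(min_units, max_units, max_on_demand_units, max_core_units):
    """Build a managed scaling policy that scales out on Spot task capacity."""
    if not 0 < min_units <= max_units:
        raise ValueError("require 0 < min_units <= max_units")
    if max_on_demand_units > max_units or max_core_units > max_units:
        raise ValueError("On-Demand and core caps cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Capacity above this cap is requested as Spot.
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
            # Capacity above this cap is added as task nodes, not core nodes.
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


policy = managed_scaling_policy(
    min_units=4, max_units=40, max_on_demand_units=4, max_core_units=4
)
# Applying it to a cluster (cluster ID is a placeholder) would look like:
# boto3.client("emr").put_managed_scaling_policy(
#     ClusterId="j-XXXXXXXX", ManagedScalingPolicy=policy)
```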
Ensuring Fault Tolerance and Monitoring while using Spot
At IAS, we went with Airflow as an orchestration tool for our big data workloads since it has many advantages for ensuring reliability and fault tolerance. There are several ways to achieve fault tolerance:
1. Restart the execution run in case of spot-interruptions:
With Airflow’s retry capabilities, we restart the execution run by creating a new EMR cluster, either with the same configuration or switched to On-Demand. This strategy worked for us for workloads with short execution times.
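In Airflow itself, retries are configured on the operator through its `retries` and `retry_delay` arguments. The plain-Python sketch below only illustrates the retry-then-fall-back idea outside of Airflow; `SpotInterrupted`, `run_with_fallback`, and the `use_spot` flag are hypothetical names introduced for this example.

```python
class SpotInterrupted(Exception):
    """Hypothetical error: the run died because Spot capacity was reclaimed."""


def run_with_fallback(run_pipeline, max_spot_attempts=2):
    """Retry a run on Spot a few times, then rerun it once on On-Demand.

    run_pipeline is any callable that accepts use_spot=True/False, creates a
    fresh cluster accordingly, and raises SpotInterrupted on reclaim.
    """
    for _ in range(max_spot_attempts):
        try:
            return run_pipeline(use_spot=True)
        except SpotInterrupted:
            # Each retry starts over on a newly created Spot cluster.
            continue
    # Spot kept failing: switch to On-Demand for the final attempt.
    return run_pipeline(use_spot=False)
```

Because every retry starts from scratch, this pattern fits short workloads best, which matches our experience above.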
2. Usage of Spot Task Nodes:
Task Nodes do data processing but don’t hold persistent data in HDFS. Hence, in case of spot-interruptions, no data is lost, and the effect on your cluster is minimal. The EMR Cluster recovers by automatically adding new task nodes.
3. Checkpointing:
For some data pipelines, interruptions can be costly because the pipeline might have to restart from the first step, wasting resources as the steps before the point of failure are re-run. To avoid such redundant data-processing costs, we can implement checkpointing within the application. One way to do this is to save the progress of steps externally in a data store (say, S3). Then, if your pipeline is interrupted, it can be restarted from where it left off.
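A minimal sketch of step-level checkpointing is shown below. An in-memory dictionary stands in for the external store, where a real pipeline might write marker objects to S3. The function `run_steps` and the step names are hypothetical.

```python
def run_steps(steps, store):
    """Run (name, callable) steps in order, skipping those already completed.

    store is a dict-like checkpoint store mapping step names to "done".
    Returns the names of the steps that were executed in this run.
    """
    executed = []
    for name, step in steps:
        if store.get(name) == "done":
            continue  # finished in an earlier run, do not redo the work
        step()
        # Record progress only after the step has succeeded.
        store[name] = "done"
        executed.append(name)
    return executed


checkpoints = {}
pipeline = [
    ("extract", lambda: None),
    ("transform", lambda: None),
    ("load", lambda: None),
]
first_run = run_steps(pipeline, checkpoints)   # runs all three steps
second_run = run_steps(pipeline, checkpoints)  # nothing left to run
```

If a step raises partway through, the store keeps the markers of the steps before it, so the next run resumes at the failed step.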
4. Monitoring and Alert Systems:
Plan for unexpectedly idle EMR clusters. They will appear because of edge cases that were not considered: bugs, AWS outages, and similar scenarios.
It is essential to monitor these idle instances and terminate them as soon as possible, since they contribute to unnecessary operating costs. For example, say we started an EMR cluster with 100 “m5.xlarge” nodes costing $20/hour and forgot to turn it off after using it. We would keep incurring the $20/hour operating cost until we turned it off.
To avoid such situations, EMR publishes a set of CloudWatch metrics that can be used to monitor idle and under-used clusters. One example is the “IsIdle” metric, which indicates that a cluster is no longer performing work but is still alive and accruing charges. We used such metrics to auto-detect and terminate idle EMR clusters, saving some operating costs.
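The decision logic behind such an auto-termination check can be sketched as follows. It assumes the idle metric has already been fetched from CloudWatch as a list of values, oldest first, with one datapoint per five-minute period. The function name and the 30-minute threshold are illustrative choices, not our production settings.

```python
def should_terminate(idle_datapoints, period_minutes=5, idle_threshold_minutes=30):
    """Return True if the cluster has been idle for the whole threshold window.

    idle_datapoints holds idle-metric values (1 = idle, 0 = busy), oldest first.
    """
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    # Number of consecutive datapoints that must all report idle (rounded up).
    needed = (idle_threshold_minutes + period_minutes - 1) // period_minutes
    if len(idle_datapoints) < needed:
        return False  # not enough history yet to decide
    return all(value >= 1 for value in idle_datapoints[-needed:])


idle_for_30_min = should_terminate([0, 1, 1, 1, 1, 1, 1])
recently_busy = should_terminate([1, 1, 1, 0, 1, 1])
```

Requiring a full window of idle datapoints avoids terminating a cluster that is only pausing briefly between steps.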
Adding to the above measures and tips that help to reduce AWS operating cost, these general guidelines on selecting EMR Instance configurations per application scenario type came in handy:
Scenarios where it did not make sense to leverage spot instances
1. Zero Time Tolerance/Fault Tolerance:
If your data pipeline is time-critical (cannot tolerate even 2–3 minutes of delay) and has no fault tolerance, Spot Instances are not useful. In that case, use an On-Demand fleet configuration to ensure the target capacity is fulfilled. We identified a few time-critical pipelines where minutes of delay can cause further delays downstream and hence decided to run those workloads with On-Demand capacity.
2. Requesting a large number of Spot Instances:
We ran an experiment in which we tried to obtain around 5000 Spot Instances. In this experiment, we experienced more reclaim interruptions by AWS than when working with smaller workloads (<200 instances). In addition, the cluster took longer on average to provision.
3. Managed Scaling with Spot for permanent clusters:
We have a few permanent EMR clusters that are used for ad hoc analysis.
Managed scaling automatically scales the cluster up or down based on job requirements. Since such a cluster receives queries at random times, it always had to scale up while searching for Spot Instances, which resulted in longer execution times. Once a job finished, managed scaling immediately scaled the cluster down again, so every newly submitted job had to wait for another scale-up.
We experimented with On-Demand only versus the Spot provisioning strategy on workloads that took less than 2 hours to complete. To ensure these experiments would not fail, we:
- set a 10-minute Spot Block provisioning timeout with a fallback to On-Demand if the target capacity was not fulfilled.
- used a mixture of On-Demand Fleet (50%) and Spot Block Fleet (50%).
- added idle EMR monitoring.
After running these experiments for over a month, we observed total savings of around 47% compared to On-Demand.
Spot Instances can be a great strategy to reduce the costs of your AWS data pipelines. To not compromise on speed and stability you will need to consider fault-tolerant designs. We have provided ideas on implementing the right provisioning strategy and how to handle spot instances being reclaimed. We also stressed monitoring your EMR clusters to avoid paying for idle capacity.
While Spot Instances are a great strategy for reducing costs, there are scenarios where our team decided the savings came with too much risk. Examples include highly time-sensitive pipelines, provisioning very large clusters, and managed scaling for permanent EMR clusters.
Did you enjoy reading this? Don’t forget to like and share. Stay tuned to @ias-tech-blog for more articles on data engineering!