Going Fully Spot In Production
Introduction
Spot instances are a great way to rack up savings when it comes to compute-based processing in the Cloud. I’ve talked about this in depth for EMR cost optimizations and have previously rolled out a balance of half Spot, half on-demand instances for our clusters that are created via Airflow. Can we do better and go all the way?
Changes
I saw that this was a legitimate possibility when looking at the Spot instance price history for the instance types we use on our clusters (mainly the R series as we’re generally focused on memory optimization). These prices have never come close to the on-demand price over the previous 3 months across all the availability zones in the us-east-1 region. So, why not make the whole cluster run on Spot?
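As a rough illustration of that check, here is a minimal sketch that computes the worst-case ratio of Spot price to on-demand price per instance type. It assumes the price records are shaped like the entries EC2 returns for Spot price history (an `InstanceType`, an `AvailabilityZone`, and a `SpotPrice` string). The function name, the sample records, and the on-demand prices are all made up for illustration and are not our real numbers.

```python
from collections import defaultdict


def max_spot_ratio(price_history, on_demand_prices):
    """Return the highest observed Spot/on-demand price ratio per instance type."""
    worst = defaultdict(float)
    for record in price_history:
        instance_type = record["InstanceType"]
        on_demand = on_demand_prices[instance_type]
        # SpotPrice is a string in the EC2 price history records.
        ratio = float(record["SpotPrice"]) / on_demand
        worst[instance_type] = max(worst[instance_type], ratio)
    return dict(worst)


# Hypothetical records and prices, for illustration only.
history = [
    {"InstanceType": "r6g.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.10"},
    {"InstanceType": "r6g.xlarge", "AvailabilityZone": "us-east-1b", "SpotPrice": "0.12"},
    {"InstanceType": "r5.xlarge", "AvailabilityZone": "us-east-1a", "SpotPrice": "0.15"},
]
on_demand = {"r6g.xlarge": 0.20, "r5.xlarge": 0.25}

ratios = max_spot_ratio(history, on_demand)
for instance_type, ratio in sorted(ratios.items()):
    print(f"{instance_type}: peak Spot price was {ratio:.0%} of on-demand")
```

A ratio that stays well below 1.0 across every availability zone for the whole window is the kind of signal I was looking for.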
The only change we needed to make for this to work was to modify our instance fleet configurations (which we define in YAML files that Airflow parses when launching clusters). Previously, I had kept the core nodes on on-demand and only the task nodes on Spot. I simply copied the parts of the task configuration that make it run on Spot over to the core node configuration. Easy enough, let’s put it to the test!
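To make the change concrete, here is a minimal sketch of it in Python rather than our YAML. It assumes the fleets are dictionaries using the field names of the EMR instance fleet API (`TargetOnDemandCapacity`, `TargetSpotCapacity`, `LaunchSpecifications`). The helper name, instance types, capacities, and timeout values are hypothetical and are not our actual configuration.

```python
import copy


def make_fleet_all_spot(core_fleet, task_fleet):
    """Return a copy of core_fleet that requests only Spot capacity,
    reusing the Spot launch settings already defined on task_fleet."""
    fleet = copy.deepcopy(core_fleet)
    # Move whatever capacity was requested as on-demand over to Spot.
    total = fleet.get("TargetOnDemandCapacity", 0) + fleet.get("TargetSpotCapacity", 0)
    fleet["TargetOnDemandCapacity"] = 0
    fleet["TargetSpotCapacity"] = total
    # Copy the Spot-specific launch settings from the task fleet, if any.
    spot_spec = task_fleet.get("LaunchSpecifications", {}).get("SpotSpecification")
    if spot_spec is not None:
        fleet.setdefault("LaunchSpecifications", {})["SpotSpecification"] = copy.deepcopy(spot_spec)
    return fleet


# Hypothetical fleets, for illustration only.
task_fleet = {
    "Name": "task",
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 8,
    "InstanceTypeConfigs": [{"InstanceType": "r6g.2xlarge", "WeightedCapacity": 1}],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 10,
            "TimeoutAction": "TERMINATE_CLUSTER",
            "AllocationStrategy": "capacity-optimized",
        }
    },
}
core_fleet = {
    "Name": "core",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 4,
    "InstanceTypeConfigs": [{"InstanceType": "r6g.2xlarge", "WeightedCapacity": 1}],
}

spot_core = make_fleet_all_spot(core_fleet, task_fleet)
```

The helper returns a new dictionary and leaves the original core fleet untouched, so the old on-demand configuration stays available if a rollback is needed.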
Tweaks
Not all of our daily jobs run in our Dev and QA environments, as there’s no reason to unless we’re actively testing something. Despite having run the new configuration without issue in those environments for a few days, I didn’t see the full scale of these changes until they hit production. One caveat I quickly saw was the impact of running all those jobs on the EC2 service limit for Spot instance requests. As some of these jobs try to make those requests at the same time, we would hit occasional errors of “The requested number of spot instances exceeds your limit.” After talking to AWS support and getting that limit increased, we were good to go.
Another change I made to hopefully reduce the number of times we see that error (although our limit is quite high now) was adding more instance types to the instance fleet configuration. Giving EMR more options lets it fall back to other instance types when its first choice is constrained, which should help keep our requests within the vCPU limit.
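Widening the fleet amounts to appending entries to its list of instance types. Here is a minimal sketch, assuming the fleet is a dictionary with an `InstanceTypeConfigs` list as in the EMR instance fleet API. The helper name, instance types, and weights are hypothetical.

```python
def add_instance_types(fleet, extra_types):
    """Return a copy of the fleet with extra (instance_type, weight) pairs
    appended to InstanceTypeConfigs, skipping types already present."""
    configs = list(fleet.get("InstanceTypeConfigs", []))
    existing = {cfg["InstanceType"] for cfg in configs}
    for instance_type, weight in extra_types:
        if instance_type not in existing:
            configs.append({"InstanceType": instance_type, "WeightedCapacity": weight})
            existing.add(instance_type)
    return {**fleet, "InstanceTypeConfigs": configs}


# Hypothetical fleet and candidate types, for illustration only.
fleet = {
    "Name": "core",
    "InstanceFleetType": "CORE",
    "TargetSpotCapacity": 4,
    "InstanceTypeConfigs": [{"InstanceType": "r6g.2xlarge", "WeightedCapacity": 1}],
}
wider_fleet = add_instance_types(
    fleet,
    [("r6g.2xlarge", 1), ("r6gd.2xlarge", 1), ("r5.2xlarge", 1)],
)
print([cfg["InstanceType"] for cfg in wider_fleet["InstanceTypeConfigs"]])
```

All the types here share the same size and weight so that one unit of capacity means the same thing whichever type EMR picks.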
Impact
We’ve already put a good number of cost optimizations in place for our EMR clusters: using the autotuner from Sync Computing to figure out the ideal cluster configuration for a job, using managed scaling so a cluster can grow or shrink as needed, and choosing Graviton instances over older instance families. These have all contributed significantly to lowering our overall compute costs. Going full Spot was a huge cost savings for our team as well. As these instances can be anywhere from 30%-50% cheaper than on-demand (which is generally what I’ve seen for the R series), that greatly reduces the overall spend, allowing us to save an additional $85k/year if Spot prices today were to hold constant (which they of course won’t, but I’ll assume so for the sake of simplicity).
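For a sense of scale, here is a back-of-the-envelope sketch of the arithmetic behind a number like that. It assumes one flat discount applies to all the capacity moved from on-demand to Spot, which is a simplification and not how the $85k figure was actually derived. The function names are hypothetical.

```python
def annual_spot_savings(on_demand_spend, discount):
    """Savings from moving on_demand_spend (per year) to Spot at a flat discount."""
    return on_demand_spend * discount


def implied_on_demand_spend(annual_savings, discount):
    """On-demand spend per year that would yield annual_savings at a flat discount."""
    return annual_savings / discount


# The 30%-50% discount range and the $85k/year figure come from the paragraph above.
low = implied_on_demand_spend(85_000, 0.50)
high = implied_on_demand_spend(85_000, 0.30)
print(f"Implied on-demand spend moved to Spot: ${low:,.0f} to ${high:,.0f} per year")
```

Under that assumption, the savings correspond to somewhere between roughly $170k and $283k a year of core-node capacity that used to be on-demand.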
Conclusion
It can definitely seem like EMR cost optimization is more of an art than a science. I have invested a good amount of time in this work, but considering EMR is one of the core services we use in our day-to-day processing, it’s worth it. With a combination of Graviton, managed scaling, instance fleets for higher availability, and a full Spot implementation in place, our clusters are much cheaper to run than they were just a few months ago, while maintaining optimal performance and high availability.
I definitely want to see if there’s a way to fill clusters with on-demand nodes when the Spot price gets too high and instances do get terminated (which doesn’t happen often enough to worry about, but still could happen). I know the autospotter was once a solution to this (I put that in place at Nielsen years ago), but I’m wondering if AWS offers anything to do this on their end. Is managed scaling the answer or would something else, like the autospotter, be necessary under the hood? With that mechanism in place, we would essentially have the perfect configuration for how to use EMR, from a cost and performance standpoint.