EMR Cost Optimization

Matt Weingarten
3 min read · Apr 25, 2022

Don’t break the bank!

Introduction

Every AWS service offers ways to save some money here and there. In data engineering, S3 (which we already discussed with retention rules previously) and EMR are two of the biggest services, so finding ways to optimize their cost is very impactful.

Spot Nodes

Unless otherwise specified, EMR nodes are on-demand, which means that you pay a fixed rate for them depending on the region and instance type. However, there is also the concept of spot nodes. Spot nodes represent AWS’s unused compute, which AWS is willing to sell off for generally less than the on-demand price just so that it gets used. Therefore, using spot nodes presents a great savings opportunity.

The one issue that might come up with spot nodes is that if the spot price rises above the maximum you’re willing to pay (the recommendation is to just use the on-demand price), or if AWS needs the capacity back, the node gets terminated, which could lead to data loss. Generally, this doesn’t present itself as an issue with the instance types used in big data, but it’s not ideal nonetheless.

Spot nodes are best used for core and task nodes. Task nodes are the safer fit of the two, since they don’t store any data in HDFS. The master node should remain on-demand, because losing it while an application is running would be catastrophic.
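As a rough illustration of that split, here is a minimal sketch that builds an instance group layout with an on-demand master and spot core/task nodes. The dictionary keys follow the shape boto3’s EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The function name, the `m5.xlarge` default, and the group names are my own placeholders, not anything prescribed by EMR.

```python
from typing import Any, Dict, List, Optional


def build_instance_groups(
    core_count: int,
    task_count: int,
    instance_type: str = "m5.xlarge",  # placeholder instance type (assumption)
    max_spot_price: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build an InstanceGroups list: on-demand master, spot core and task."""

    def group(name: str, role: str, market: str, count: int) -> Dict[str, Any]:
        cfg: Dict[str, Any] = {
            "Name": name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }
        # Only set an explicit maximum when one is given. Leaving BidPrice
        # out caps the spot price at the on-demand price.
        if market == "SPOT" and max_spot_price is not None:
            cfg["BidPrice"] = max_spot_price
        return cfg

    groups = [
        group("master", "MASTER", "ON_DEMAND", 1),  # never put this on spot
        group("core", "CORE", "SPOT", core_count),
    ]
    if task_count > 0:
        groups.append(group("task", "TASK", "SPOT", task_count))
    return groups


if __name__ == "__main__":
    for g in build_instance_groups(core_count=2, task_count=4):
        print(g["InstanceRole"], g["Market"], g["InstanceCount"])
```

The resulting list would be passed as `Instances={"InstanceGroups": groups, ...}` when creating the cluster. Nothing here talks to AWS, so it is safe to run locally and inspect.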

Autoscaling

Autoscaling allows your cluster to increase or decrease in size depending on how much memory it is actually utilizing. If available memory is running low, nodes can be added; if the cluster has more capacity than it needs, nodes can be shed. Previously, setting up autoscaling was complex because you needed to specify all sorts of rules for scaling up and down (which certainly didn’t look nice when utilizing IaC).
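To make the rule-based idea concrete, here is a toy sketch of the kind of decision a pair of scale-out/scale-in rules encodes, expressed in terms of the percentage of YARN memory still available (EMR publishes this to CloudWatch as `YARNMemoryAvailablePercentage`). The thresholds and the function name are made up for illustration; real rules also carry evaluation periods and cooldowns, which is part of why they got so verbose.

```python
def scaling_decision(
    yarn_memory_available_pct: float,
    scale_out_below: float = 15.0,  # illustrative threshold (assumption)
    scale_in_above: float = 75.0,   # illustrative threshold (assumption)
) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for one metric reading."""
    if not 0.0 <= yarn_memory_available_pct <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if yarn_memory_available_pct < scale_out_below:
        # Little memory left: the cluster is under pressure, add nodes.
        return "scale_out"
    if yarn_memory_available_pct > scale_in_above:
        # Most memory is idle: the cluster is oversized, shed nodes.
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    for pct in (5.0, 50.0, 90.0):
        print(pct, scaling_decision(pct))
```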

Managed scaling is a much easier form of autoscaling to use. You don’t need to specify nearly as many options when setting up the clusters (essentially just the minimum and maximum size the cluster may reach), which I would have to guess is where the name comes from.
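For a sense of how little there is to configure, here is a minimal sketch that builds a managed scaling policy in the shape boto3’s EMR client accepts as `ManagedScalingPolicy`. The helper name and the example limits are my own assumptions; the keys under `ComputeLimits` are the ones the EMR API defines.

```python
from typing import Any, Dict, Optional


def build_managed_scaling_policy(
    min_units: int,
    max_units: int,
    max_on_demand_units: Optional[int] = None,
    max_core_units: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a ManagedScalingPolicy dict measured in instances."""
    if min_units < 1 or max_units < min_units:
        raise ValueError("need 1 <= min_units <= max_units")
    limits: Dict[str, Any] = {
        "UnitType": "Instances",
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    # Capping on-demand capacity pushes any growth beyond it onto spot.
    if max_on_demand_units is not None:
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    # Capping core capacity pushes any growth beyond it onto task nodes.
    if max_core_units is not None:
        limits["MaximumCoreCapacityUnits"] = max_core_units
    return {"ComputeLimits": limits}


if __name__ == "__main__":
    print(build_managed_scaling_policy(2, 10, max_on_demand_units=2))
```

The result can be supplied as the `ManagedScalingPolicy` argument at cluster creation, or attached to a running cluster with `put_managed_scaling_policy`. Compare that with a rule-per-direction setup and the appeal is pretty clear.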

For handling spot node interruptions, autospotting is an option. With autospotting, you can handle those edge cases more safely by putting the spot nodes behind an autoscaling group, so that a new instance is ready to go as soon as the previous one is terminated. I remember implementing this a few years back and it worked for our use case, but I’m not sure how relevant it still is in the EMR picture.

Instance Fleets

The one disadvantage of instance groups in IaC is that your availability zone is fixed, meaning that if you have multiple clusters running in the same subnet, you might run into a capacity shortage for a certain instance type. Instance fleets let you supply a set of subnets (and therefore availability zones) for the cluster to be placed in, which makes it less likely that you encounter an issue during cluster creation. Moving from an instance group to an instance fleet is not much of a change in your automation: it comes down to tweaking a few parameters.
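To show roughly which parameters change, here is a sketch of the `Instances` section for a fleet-based cluster in the shape boto3’s `run_job_flow` accepts: `Ec2SubnetIds` (plural) takes the place of a single subnet, and `InstanceFleets` takes the place of `InstanceGroups`. The function name, the subnet IDs, the instance types, and the ten-minute timeout are all placeholders I picked for the example.

```python
from typing import Any, Dict, List


def build_fleet_instances(
    subnet_ids: List[str],
    core_instance_types: List[str],
    core_spot_units: int,
) -> Dict[str, Any]:
    """Build the Instances section for a cluster that uses instance fleets."""
    if not subnet_ids or not core_instance_types:
        raise ValueError("need at least one subnet and one instance type")
    return {
        # Several subnets give the cluster more than one zone to land in.
        "Ec2SubnetIds": list(subnet_ids),
        "InstanceFleets": [
            {
                "Name": "master",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,  # master stays on-demand
                "InstanceTypeConfigs": [
                    {"InstanceType": core_instance_types[0]}
                ],
            },
            {
                "Name": "core",
                "InstanceFleetType": "CORE",
                "TargetSpotCapacity": core_spot_units,
                # Several candidate types lower the odds of a shortage.
                "InstanceTypeConfigs": [
                    {"InstanceType": t, "WeightedCapacity": 1}
                    for t in core_instance_types
                ],
                "LaunchSpecifications": {
                    "SpotSpecification": {
                        "TimeoutDurationMinutes": 10,  # placeholder value
                        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    cfg = build_fleet_instances(
        ["subnet-aaaa1111", "subnet-bbbb2222"],  # hypothetical subnet IDs
        ["m5.xlarge", "m5a.xlarge", "m4.xlarge"],
        core_spot_units=4,
    )
    print([f["InstanceFleetType"] for f in cfg["InstanceFleets"]])
```

The `SWITCH_TO_ON_DEMAND` timeout action is one way to keep cluster creation from stalling when spot capacity cannot be found in time; whether that trade-off suits you depends on how much the on-demand fallback would cost.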

Serverless EMR

I will definitely write about this more when it’s no longer in preview, but the option of serverless EMR is super appealing. Not having to go through cluster configuration, plus the ability to scale on demand, is such a time and money saver over the existing EMR options. More to come here, hopefully.

Conclusion

The list above covers just some of the ways in which you can bring down your EMR costs. I’ve also talked about the autotuner with Sync Computing before, which is another great way to help control your cluster configuration. But the core of what the autotuner offers can also be achieved with the options mentioned here. Happy saving!

Matt Weingarten

Currently a Data Engineer at Samsara. Previously at Disney, Meta, and Nielsen. Bridge player and sports fan. Thoughts are my own.