Allocation Strategies In EMR Instance Fleets

Jan 23, 2023

EMR clusters. Star clusters. Same thing.

Introduction

I’ve previously written about the merits of instance fleets with EMR clusters. Instance groups limit you to a single instance type and a single availability zone per node type, so instance fleets give you more breathing room and make it more likely that your clusters can get the capacity they need in the long run.

Under the surface, though, how does an instance fleet put it all together, and is there still any room for improvement in this approach?

On-Demand Instances

For reference, you’ll be able to find a lot of what I’m talking about here in the official AWS documentation.

If you’re choosing to go with on-demand instances, you’re not risking the potential interruptions that can come along with using a cheaper option, such as Spot. So when specifying multiple instance types in an on-demand instance fleet, you want the fleet to choose the cheapest one that is actually available, since an on-demand instance won’t be reclaimed once you have it. Therefore, on-demand fleets use a lowest-price allocation strategy, so that your costs stay as low as they can while you’re paying for the more expensive on-demand option.

Generally, you want to use an on-demand instance for the master node, as you can’t afford to lose a Spark driver while processing is running (there’s no recovering from that). For mission-critical production workloads, it’s generally recommended to use on-demand instances as well, so if you’re using instance fleets, the lowest-price allocation strategy will at least keep the costs as low as possible.
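To make the on-demand case concrete, here’s a minimal sketch of a master fleet definition as a plain Python dict, in the shape that boto3’s `run_job_flow` accepts under `Instances["InstanceFleets"]`. The helper function, the fleet name, and the choice of instance types are illustrative assumptions rather than a real configuration, so check the field names against the current EMR API reference before relying on it.

```python
def on_demand_master_fleet(instance_types):
    """Build an on-demand master fleet definition (hypothetical helper)."""
    return {
        "Name": "master-fleet",  # illustrative name
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,  # a single master node
        # Every type listed here is a candidate; the fleet picks among them.
        "InstanceTypeConfigs": [{"InstanceType": t} for t in instance_types],
        "LaunchSpecifications": {
            # Cheapest candidate that is actually available wins.
            "OnDemandSpecification": {"AllocationStrategy": "lowest-price"}
        },
    }


# Example instance types only; substitute whatever suits your workload.
master_fleet = on_demand_master_fleet(["r6g.2xlarge", "r5.2xlarge"])
```

Nothing here talks to AWS. The dict would be one entry in the `InstanceFleets` list when the cluster is created.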

Spot Instances

If you’re choosing to use Spot instances, you’ve accepted the risk that your instance may disappear at any time, but you still want to have as highly-available a cluster as possible. As a result, when specifying multiple instance types in a Spot instance fleet, the fleet will focus on the instance type that is most likely to remain available. This is referred to as a capacity-optimized allocation strategy.

You can see the interruption rate of various EC2 instance types in the Spot Instance Advisor. To state the obvious, using less-common instance types helps reduce the likelihood of losing instances: avoid the small and medium sizes, and avoid the most common instance families (the t and m series).
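For comparison, a Spot task fleet might be sketched like this, again as a plain dict in the shape boto3’s `run_job_flow` expects. The helper name, fleet name, instance types, capacity, and timeout settings are all assumptions picked for illustration; the timeout values in particular are placeholders you’d tune yourself.

```python
def spot_task_fleet(instance_types, target_capacity):
    """Build a Spot task fleet definition (hypothetical helper)."""
    return {
        "Name": "task-fleet",  # illustrative name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_capacity,
        "InstanceTypeConfigs": [
            # Equal weights: each instance counts as one unit of capacity.
            {"InstanceType": t, "WeightedCapacity": 1}
            for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Favor the candidate most likely to stay available.
                "AllocationStrategy": "capacity-optimized",
                # Placeholder values: how long to wait for Spot capacity,
                # and what to do if it never arrives.
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


# Example instance types and capacity only.
task_fleet = spot_task_fleet(["r6g.2xlarge", "r5.2xlarge"], target_capacity=4)
```

Listing several instance types is what gives the capacity-optimized strategy something to choose between.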

Why Not Both?

As I’m sure can be discerned from my recent posts, I’m on a FinOps mission this year. I’ve approached this in multiple ways with EMR clusters, and one of them is Graviton instances. When I first implemented instance fleets for us some time back, I found that the capacity for Graviton instances was not as high as for normal instance types (later confirmed with AWS support). This has since been improved, but as a result, I’ve generally included the corresponding legacy instance type along with the Graviton instance type (r6g.2xlarge and r5.2xlarge as an example) in my instance fleet configuration as a safeguard.

As I’m using Spot instance fleets for the majority of my core and task nodes (yes, even in production!), they were using the capacity-optimized allocation strategy. Legacy instance types are almost always more available than their Graviton equivalents, which means that my Graviton instances weren’t actually being picked, even when both were available. I found in Cost Explorer that Graviton was only being used in on-demand configurations, since those use the lowest-price strategy and Graviton is always cheaper than its legacy equivalent. Therefore, I wasn’t actually reaping the benefits of my work.
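A toy model shows why the two strategies disagree on the same pair of instance types. This is not how AWS actually implements either strategy (those internals aren’t something I can see); it’s just a sketch where each candidate has a price and a spare-capacity score, and every number is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    instance_type: str
    hourly_price: float  # made-up number, not a real AWS price
    spare_capacity: int  # made-up stand-in for how deep the pool is


def lowest_price(candidates):
    """Pick the cheapest candidate that has any capacity at all."""
    available = [c for c in candidates if c.spare_capacity > 0]
    return min(available, key=lambda c: c.hourly_price).instance_type


def capacity_optimized(candidates):
    """Pick the candidate with the deepest pool, ignoring price."""
    available = [c for c in candidates if c.spare_capacity > 0]
    return max(available, key=lambda c: c.spare_capacity).instance_type


pool = [
    Candidate("r6g.2xlarge", hourly_price=0.40, spare_capacity=50),
    Candidate("r5.2xlarge", hourly_price=0.50, spare_capacity=200),
]

print(lowest_price(pool))        # r6g.2xlarge
print(capacity_optimized(pool))  # r5.2xlarge
```

As long as the legacy pool is deeper, the capacity-based pick never lands on Graviton, no matter how much cheaper it is.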

I’ve since fixed this by removing the legacy instance types where feasible in the instance fleet configurations. So far, I’ve only done this with the smaller configs, as I want to test the larger configurations before making sweeping changes to those (no guarantees with Graviton). All in all, though, it’s a start.
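If your fleet definitions live in code, trimming the legacy types could look something like the sketch below. The helper names are hypothetical, and the regex is a naming heuristic I’m assuming (Graviton families carry a "g" right after the generation number, as in r6g or m6gd), not an official lookup, so treat it as a starting point.

```python
import copy
import re

# Heuristic assumption: family letters, generation digits, then "g",
# e.g. r6g, m6gd, c7g. GPU families such as g4dn do not match.
_GRAVITON = re.compile(r"^[a-z]+\d+g[a-z]*\.")


def is_graviton(instance_type):
    """Guess from the name whether an instance type is Graviton-based."""
    return bool(_GRAVITON.match(instance_type))


def drop_legacy_types(fleet):
    """Return a copy of a fleet dict keeping only Graviton instance types."""
    trimmed = copy.deepcopy(fleet)  # leave the caller's dict untouched
    graviton_only = [
        c for c in trimmed["InstanceTypeConfigs"]
        if is_graviton(c["InstanceType"])
    ]
    if not graviton_only:
        # Refuse to produce a fleet with nothing left to launch.
        raise ValueError("fleet has no Graviton instance types to keep")
    trimmed["InstanceTypeConfigs"] = graviton_only
    return trimmed


# Illustrative fleet fragment; a real definition has more fields.
fleet = {
    "InstanceFleetType": "TASK",
    "InstanceTypeConfigs": [
        {"InstanceType": "r6g.2xlarge"},
        {"InstanceType": "r5.2xlarge"},
    ],
}
graviton_fleet = drop_legacy_types(fleet)
```

Raising on an empty result is a deliberate choice here: a fleet with no instance types left is worse than one that still carries its legacy fallback.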

AWS has actually released a price-capacity-optimized allocation strategy in the last few months that will diversify amongst multiple low-priced Spot instance types that are all available. They’ve even recommended this for Spot workloads, but it doesn’t yet appear to be available for EMR (c’mon AWS!). Once it is, I’ll definitely switch (as everyone should), but until then, I’m hampered by the capacity-optimized allocation strategy.

Conclusion

Right now, the only possible allocation strategies for EMR instance fleets are lowest-price (for on-demand instances) and capacity-optimized (for Spot instances). The price-capacity-optimized strategy is only available for EC2 itself so far, and not for services that depend on EC2, such as EMR. Once it does become available, it’s recommended to switch to get the best of both worlds and keep your jobs available AND cheap.