Best Practices For Saving On AWS Costs

Matt Weingarten
Jan 5, 2023 · 4 min read


Mo money, less problems?

Introduction

I’ve mentioned before that I’m a lurker on Reddit’s wonderful data engineering community. I usually just read, but there was a thread yesterday that I couldn’t resist commenting on, dealing with cutting Cloud costs in data engineering work. I think by now, readers will know that’s an area I’m very passionate about.

I’ve written a lot in the past about saving costs on EMR, but I realized I hadn’t really dived into all the different services on AWS. Therefore, I’d like to spend some time to do that (using points I’m already working on for an internal KT session — communication rules!).

S3

By default, S3 keeps every object in the Standard storage class, which doesn’t look like much on the surface but quickly adds up as data grows. You’ll definitely want lifecycle policies in place to transition data into cheaper storage classes (such as Standard-IA, or even Glacier if the data is “cold” enough). In lower environments, you may even be able to delete data over time (in production, deletion is usually a touchier subject, especially with data regulations), which saves even more than transitioning.
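As a rough illustration, here’s what such a policy could look like with boto3. The bucket name, prefix, and day thresholds below are made up; tune them to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix: tier raw data down over time, then expire it.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold storage
                ],
                # Expiration makes the most sense in lower environments.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```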

In addition to lifecycle policies, you’ll want to enable S3 Bucket Keys on your KMS-encrypted buckets. Bucket Keys cut down the number of requests S3 has to make to KMS, which can also go a long way as data sizes grow.
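A Bucket Key is just a flag on the bucket’s default encryption settings. A minimal sketch, again with a hypothetical bucket and key alias:

```python
import boto3

s3 = boto3.client("s3")

# Enable a Bucket Key so S3 makes far fewer calls to KMS for this bucket.
s3.put_bucket_encryption(
    Bucket="my-data-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/my-data-key",  # placeholder alias
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```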

EMR

As I’ve written about EMR cost optimizations in detail before, I’ll quickly summarize some of the points from those previous posts (a short code sketch pulling several of them together follows the list):

  • Spot nodes: Yes, Spot isn’t a 100% guarantee, but it’s generally reliable enough that the risk of using it is minimal, even in production environments. The cost savings are definitely worth it.
  • Graviton processors: Graviton2 (and eventually Graviton3) offers significantly better price-performance than comparable x86-based EC2 instances. You’ll want to use Graviton instance types wherever possible in your clusters.
  • Managed scaling: Managed scaling makes sure you’re not paying for resources when they’re not needed. It can be a bit finicky if you’re not experienced with setting up appropriate scaling policies, so use it at your discretion.
  • Instance fleets: Instance fleets offer more reliability than instance groups because you’re not locked into a specific Availability Zone when launching a cluster. EMR can look across Availability Zones before launching, which helps you avoid insufficient-capacity errors. Higher reliability means lower costs over time.
  • Ephemeral clusters: The easiest way to drive up Cloud costs is failing to turn off resources you no longer need. With EMR, this can get expensive very quickly. Therefore, make sure that your clusters are only active as long as they’re needed. Use automation to turn off whatever’s unused to avoid paying a heftier bill than normal.
  • gp3 EBS volumes: This was one I only recently figured out. gp3 volumes offer solid cost savings over the legacy gp2 volumes, and the switch should be relatively seamless, so it’s recommended (one caveat: a large enough gp2 volume can have higher baseline performance than a default gp3 volume, so check before migrating).
  • Serverless EMR: Serverless EMR is still in its infancy, but I expect it to gain more traction this coming year. We’ll definitely be comparing it to our Databricks costs to see what makes more sense going forward.
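To make the list above more concrete, here’s a rough run_job_flow sketch that combines several of these points: Spot capacity, Graviton instance types, instance fleets, managed scaling, gp3 volumes, and an ephemeral cluster that terminates when its steps finish. The names, sizes, and roles are placeholders, not recommendations for your workload.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="ephemeral-etl-cluster",  # placeholder name
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceFleets": [
            {
                "Name": "primary",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,
                "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge"}],  # Graviton2
            },
            {
                "Name": "core",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 1,
                "TargetSpotCapacity": 4,  # most core capacity on Spot
                "InstanceTypeConfigs": [
                    {
                        "InstanceType": "m6g.2xlarge",
                        "EbsConfiguration": {
                            "EbsBlockDeviceConfigs": [
                                {"VolumeSpecification": {"VolumeType": "gp3", "SizeInGB": 100}}
                            ]
                        },
                    },
                    {"InstanceType": "r6g.2xlarge"},  # second type improves Spot odds
                ],
            },
        ],
        # Ephemeral: the cluster shuts down once its steps are done.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 10,
        }
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```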

Miscellaneous

There are too many services in AWS to cover in a post like this, so I’ll summarize the best practices I’ve seen for the ones we use on a regular basis:

  • ARM architecture: In addition to EC2/EMR, you can extend Graviton savings to Lambda as well as Fargate. If you’re using Docker images to power these services, make sure your images are built for multiple architectures and not just x86_64, the default.
  • Fargate Spot: We all know Spot instances are a good thing, and you can do the same with Fargate to achieve up to 70% savings compared to on-demand pricing. To maintain reliability, you can use a capacity provider strategy that only puts a certain ratio of tasks on Spot (sketched after this list). Note that Fargate Spot and Graviton aren’t yet compatible with each other, so forget about Graviton on Fargate for now if you’re going with Spot tasks.
  • Glue partition indexes: I’ve talked about Glue partition indexes previously, but you’ll definitely want them in place on larger tables to cut your compute costs. There does seem to be a breaking point in how useful they are, though, so at some scale it may be worth moving to Delta Lake or another format that handles big queries better than plain Parquet.
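For the Fargate Spot point, the Spot/on-demand ratio is expressed through a capacity provider strategy. A sketch, assuming the ECS cluster already has the FARGATE and FARGATE_SPOT capacity providers attached; the cluster, task definition, and subnet names are hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

# Weighted strategy: roughly 3 of every 4 tasks land on FARGATE_SPOT,
# with base=1 guaranteeing at least one task on regular Fargate.
ecs.run_task(
    cluster="data-pipelines",
    taskDefinition="nightly-batch:1",
    count=4,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```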

Tips

Hopefully, the above provides a good overview of some practices you can put into place on key services if you haven’t already. How can you bring it all together?

  • Common tagging standard: To group everything together in a proper dashboard, you’ll want to follow a common tagging standard for all your resources so it’s easy to discover all your infrastructure. I’d recommend using tags such as environment, name, team name, etc. to track everything in the ecosystem.
  • Establish budgets and alerts: Make sure you’re tracking costs against a properly established budget. You’ll also want alerts in place so you’re notified whenever your costs exceed the threshold you’ve set (see the sketch after this list).
  • Refine, refine, refine: AWS and other cloud providers push out new cost-saving features all the time. Stay in the know on what’s new, and integrate non-breaking changes whenever they’ll make a meaningful impact on your costs.
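Budgets and alerts can be set up programmatically as well. Here’s a sketch with boto3 that alerts at 80% of a monthly limit and filters on a team tag so it lines up with the tagging standard above; the budget name, amount, tag, and email address are all made up.

```python
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "data-eng-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Assumes a cost-allocation tag named "team"; the filter format is "user:<key>$<value>".
        "CostFilters": {"TagKeyValue": ["user:team$data-engineering"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "data-eng-alerts@example.com"}
            ],
        }
    ],
)
```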

Conclusion

FinOps is firmly at the top of everyone’s list of “we needed to get this done years ago but we’re doing it now.” Following some practices like the above should be a quick and hopefully easy way to see some savings right away.

