Cost-Savings On Databricks: Part II

Matt Weingarten
3 min read · May 26, 2023


Introduction

In a previous post, I shared some tips for saving costs in Databricks. Since then, I've been working with quite a few teams in our organization to optimize those costs wherever possible, and I thought it'd be worth building on those points and adding a few more.

Autoscaling

When I wrote about cost-savings in EMR in the past, I mentioned using autoscaling so that you'd only pay for what you needed at any given point in time. While that's certainly a reasonable approach, autoscaling has its drawbacks. Depending on your application, scaling events can occur frequently throughout processing. Each one takes time to complete, and in some cases scaling won't complete at all (especially given the recent rise in Spot prices).

You can see how many times your cluster scales by looking at the event log for a job run. Unless you're running on a single-node cluster, chances are it happens a few times. For most batch processing, it's actually preferable to forgo autoscaling and just stick with a fixed number of instances throughout the job run. Your application won't waste time acquiring and removing instances, and can instead focus on processing. Whatever you might pay extra for the additional instances is generally offset by shorter job runs.
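To make this concrete, here's a minimal sketch (not my exact setup) of what a fixed-size job cluster might look like when submitting a one-off run through the Databricks Jobs API. The notebook path, runtime version, instance type, and worker count are all illustrative placeholders:

```python
# Sketch of a fixed-size job cluster for the Databricks Jobs API
# (runs/submit). All names and values below are placeholders.
fixed_size_run = {
    "run_name": "nightly-batch-fixed-size",
    "tasks": [
        {
            "task_key": "process",
            "notebook_task": {"notebook_path": "/Jobs/nightly_batch"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m5.2xlarge",
                # A fixed worker count instead of an "autoscale" block:
                # no time lost acquiring/releasing instances mid-run.
                "num_workers": 8,
            },
        }
    ],
}
```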

How do you know how many instances to use? That’s where the next section will be helpful.

Ganglia

The benefit of autoscaling is that you don't have to figure out how many instances to use, which we all know can be the big data equivalent of searching for a needle in a haystack. Databricks has Ganglia metrics built in (for clusters that aren't using containers), so you can use those to see how your cluster is behaving. Of course, figuring out what these numbers are trying to tell you can be a challenge in and of itself, but you really just want to focus on memory/CPU utilization and find a sweet spot where it's neither too low nor too high.

Perhaps the easiest way to find that ideal configuration is to try a few different combinations, checking the run time and Ganglia metrics afterwards to see how each performed. Once you've arrived at a good mix of performance and run time (while staying within SLAs), you should be good to go. Of course, you'll want to revisit this over time to see whether increased data loads have any impact on overall performance.
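One lightweight way to run that experiment is sketched below: submit the same task with a few candidate worker counts via the Jobs runs/submit endpoint, poll until each run finishes, and compare wall-clock times (then cross-check Ganglia for each run). The workspace host, token handling, notebook path, and candidate sizes are all assumptions for illustration:

```python
import time
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "..."  # personal access token; placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def timed_run(num_workers: int) -> float:
    """Submit a one-off run with the given worker count; return duration in seconds."""
    payload = {
        "run_name": f"sizing-test-{num_workers}w",
        "tasks": [{
            "task_key": "process",
            "notebook_task": {"notebook_path": "/Jobs/nightly_batch"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m5.2xlarge",
                "num_workers": num_workers,
            },
        }],
    }
    run_id = requests.post(f"{HOST}/api/2.1/jobs/runs/submit",
                           headers=HEADERS, json=payload).json()["run_id"]
    start = time.time()
    while True:
        state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                             headers=HEADERS,
                             params={"run_id": run_id}).json()
        if state["state"]["life_cycle_state"] in ("TERMINATED", "SKIPPED",
                                                  "INTERNAL_ERROR"):
            # Wall-clock time includes cluster spin-up, which is what you pay for.
            return time.time() - start
        time.sleep(60)

# Try a few candidate sizes and compare.
for workers in (4, 8, 16):
    print(workers, "workers ->", timed_run(workers), "seconds")
```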

Instance Fleets

The main advantage of instance fleets in EMR was higher availability than what you'd get from instance groups. Of course, you should also be taking advantage of the auto availability zone option, so that your application runs in the best availability zone. However, one longstanding drawback in Databricks was that you couldn't choose from multiple instance types for your driver or worker nodes like you could in EMR.

That has recently changed, with Databricks introducing fleet clusters for AWS. The model is slightly different: you choose a general class of instance type for your workers/driver, and Databricks locates the best specific instance for you based on the price-capacity-optimized strategy (an improvement over EMR, which doesn't support that strategy yet). The one drawback is that fleet clusters don't currently support Graviton instances, but that will be changing soon. Once it does, I'll be switching over to fleet clusters exclusively for the best availability and stability.
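As a rough illustration, a fleet-based cluster definition might look like the following, with a fleet node type standing in for a single instance type and the availability zone left to Databricks. The specific node type names and Spot settings here are assumptions, not a definitive configuration:

```python
# Sketch of a fleet cluster definition for Databricks on AWS.
# A fleet node type like "m-fleet.2xlarge" requests a general-purpose
# instance of that size class; Databricks then picks the specific type
# via the price-capacity-optimized strategy. Values are illustrative.
fleet_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m-fleet.2xlarge",        # worker fleet class
    "driver_node_type_id": "m-fleet.xlarge",  # driver fleet class
    "num_workers": 8,
    "aws_attributes": {
        "zone_id": "auto",  # let Databricks pick the best AZ
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
    },
}
```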

Conclusion

With these concepts and the ones we discussed previously, you can get the most out of Databricks while keeping it affordable. Happy savings!
