Databricks Cost-Saving Tips

Matt Weingarten
4 min read · Jan 9, 2024


[Image: me paying my Databricks bill]

Note: This is a post written in collaboration with the people at Hevo Data. Definitely check out their platform if you’re interested!

Introduction

In my first post of this series with Hevo, I focused on the role that data engineering plays in the FinOps journey. I started by examining various ways to bring down costs in AWS, so it’s only natural to move on to another DE tool that’s in heavy usage: Databricks.

A lot of the points below will reference previous posts (what can I say, I like Databricks!), but they’re definitely worth revisiting and updating as necessary.

Job Clusters

A key distinction that needs to be made early on in your Databricks journey is the difference between interactive and job compute. Interactive clusters are what you’ll use by default when running a notebook and performing ad-hoc analysis. They’re meant to be shared and should be easy to spin up, so that anyone can use them as needed.

Job clusters are meant for jobs within Databricks (I guess the name is pretty clear). These clusters aren’t used for anything else and only exist for the duration of the job run. Because the compute is dedicated to a single job run, your job gets the full power of the cluster, which helps minimize the number of nodes needed (more on that later).

To quickly summarize this discussion:

  • Use job clusters for any recurring workflows
  • Use interactive clusters for everything else
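
To make the distinction concrete, here’s a minimal sketch of defining a workflow that runs on its own job cluster via the Databricks Jobs API (2.1). The workspace URL, token, notebook path, and instance type are all placeholders, not recommendations:

```python
# Minimal sketch: create a workflow that runs on a job cluster via the
# Databricks Jobs API (2.1). Workspace URL, token, notebook path, and
# instance type are placeholders to swap for your own values.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "nightly-aggregation",
    "tasks": [
        {
            "task_key": "aggregate",
            "notebook_task": {"notebook_path": "/Jobs/nightly_aggregation"},
            # new_cluster = a job cluster: created for this run and
            # terminated as soon as the run finishes
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "m6g.xlarge",
                "num_workers": 4,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}
```

Anything defined this way pays for compute only while the run is in flight, rather than keeping an always-on interactive cluster warm.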

Compute Policies

You can achieve a lot of wins for your organization through a proper implementation of compute policies. These policies act as a safeguard on what users can do with their compute in Databricks, which is otherwise unrestricted by default.

What should you be focusing on in these policies? Some ideas include:

  • Autotermination time: Interactive clusters only terminate after a certain amount of inactivity. Keep this on the lower side (but not so low that people can’t use the clusters effectively) so things aren’t running 24/7. This is another advantage of job clusters: they terminate as soon as the job completes. I’ve found 30–45 minutes to usually be a reasonable window.
  • Instance types: Restricting which instance types can be used helps ensure that clusters aren’t running with unnecessarily large instances (teams can always provide justification for larger instance types if they have a proper use case). You can also use these restrictions to keep your instances on the latest families, such as Graviton.
  • Number of workers: Chances are you don’t want an interactive cluster running with 300 workers. Having a max on the number of workers keeps this in check. You should also cap the number of on-demand instances that can be in use, so that you’re taking advantage of Spot compute.

This is only the tip of the iceberg when it comes to having effective compute policies in place. We also use them to enforce tags (necessary for cost-tracking) and to set availability zones (so you’re taking advantage of all of them). We even go a step further and maintain separate policies for interactive clusters and job clusters, since they’re generally used for different purposes. All in all, policies are a great way to stifle spend you may not be able to sniff out otherwise.
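
As a concrete illustration, here’s a sketch of what an interactive-cluster policy covering these knobs might look like, written in the cluster policy definition format. The specific limits, instance types, and tag values are illustrative assumptions, not recommendations:

```python
# Sketch of an interactive-cluster policy definition. The limits, instance
# types, and tag values are illustrative; tune them to your organization.
import json

interactive_policy = {
    # Cap the inactivity timeout so clusters can't idle 24/7
    "autotermination_minutes": {"type": "range", "maxValue": 45, "defaultValue": 30},
    # Only allow current-generation (Graviton) instance families
    "node_type_id": {
        "type": "allowlist",
        "values": ["m6g.xlarge", "m6g.2xlarge", "r6g.xlarge"],
    },
    # Cap cluster size
    "num_workers": {"type": "range", "maxValue": 20},
    # Driver on-demand, workers on Spot with on-demand fallback
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    # Enforce a cost-tracking tag on every cluster
    "custom_tags.CostCenter": {"type": "fixed", "value": "data-engineering"},
}

print(json.dumps(interactive_policy, indent=2))
```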

Miscellaneous

Since I mentioned Graviton, it’s definitely worth touching on in more detail. A previous post of mine discussed the migration process to Graviton, which is pretty simple but has a few restrictions (you can’t use the ML runtime or Docker images). I’d also note that Graviton capacity can be hard to come by, so expect a few bumps along the way if you have a wider-scale operation in place.

Availability is where fleet clusters can help. With a wider range of instance types to choose from, you won’t be stuck when a certain instance type is maxed out. Databricks still has a few more fleet types to roll out (which is on their roadmap), but it’s been a relatively stable feature for us so far.
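
Using a fleet is just a different node_type_id in the cluster spec. Here’s a sketch; “m-fleet.xlarge” is one of the fleet types Databricks offers on AWS, and the other values are illustrative:

```python
# Sketch of a cluster spec using an AWS fleet node type, letting Databricks
# draw from several equivalent instance types when one is capacity-constrained.
# All values besides node_type_id are illustrative.
fleet_cluster_spec = {
    "cluster_name": "adhoc-analysis",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m-fleet.xlarge",  # fleet type instead of a single instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}
```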

Databricks has some great ways to monitor your clusters and see whether resources are being utilized reasonably. Event logs can alert you to all the autoscaling events within a cluster (if you’re scaling constantly, it might be better to settle on a fixed number of nodes and stick with it), while the Ganglia metrics tab can show you the overall memory and CPU usage for a cluster. These serve as a good baseline for identifying further optimization opportunities for your clusters.
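
For example, here’s a rough sketch of pulling autoscaling events from a cluster’s event log via the Clusters API; the host, token, and cluster ID are placeholders:

```python
# Sketch: count recent autoscaling events for a cluster via the Clusters API.
# A steady stream of RESIZING events is a hint that a fixed node count might
# serve the workload better. Host, token, and cluster ID are placeholders.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "<cluster-id>",  # placeholder
        "event_types": ["RESIZING", "UPSIZE_COMPLETED"],
        "limit": 100,
    },
)
resp.raise_for_status()
events = resp.json().get("events", [])
print(f"{len(events)} autoscaling events in the most recent page")
```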

Conclusion

Databricks is a great platform, but if it’s not used effectively, costs can get out of hand quickly. Adopting some of the practices above should help get everything into a more stable state. Remember: it’s a marathon, not a sprint.

Another point worth noting is that new features are released in Databricks all the time, and it’s worth staying up to date on what’s new. In 2024, for example, we will finally see the arrival of serverless job compute, which could be a game-changer for how teams run their jobs.

Since we’ve focused so much on tagging in this series (as we should, since it’s a critical component of proper cost-tracking), it’s worth noting that you can apply similar tagging constructs in Databricks. Take advantage of this and tag all of your clusters following the same standards you use for everything else in AWS.
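
As a sketch, this is just the custom_tags block of a cluster (or policy) spec; the keys and values here are illustrative. Databricks propagates custom tags to the underlying EC2 instances, so they show up alongside the rest of your AWS cost data:

```python
# Sketch: the custom_tags block of a cluster (or policy) spec. Keys and
# values are illustrative; reuse whatever tagging standard you apply in AWS.
cluster_tags = {
    "custom_tags": {
        "CostCenter": "data-engineering",
        "Team": "analytics",
        "Environment": "prod",
    }
}
```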
