Cost-Savings On Databricks (And Other Tips)

Matt Weingarten
3 min read · Jan 4, 2023


Thick as a brick

Introduction

Is it time for another Databricks-focused post? I’ve come across various new best practices (including a few centered on cost savings) and wanted to share them so others can take advantage if they weren’t already.

SLAs

When all of our jobs ran through Airflow, we discussed how we used the sla argument to track the runtimes of our tasks/DAGs. One caveat we’ve encountered in moving more of our jobs to Databricks is that there’s no way to track SLAs in the Databricks UI. Sure, we can see the matrix view and a breakdown of each job/task runtime, but there’s no built-in alerting when an SLA is actually exceeded for a particular job run.

The way we’ve decided to handle this, and the only feasible approach I see at this point, is to use Airflow to call the Databricks job in question and orchestrate it there rather than in Databricks itself (meaning the schedule is turned off in Databricks and handled through Airflow). That way, you can still use the sla argument to get proper alerting when the runtime is exceeded. If you think this is a bit convoluted, I don’t disagree; it’s only worth doing for the jobs you feel genuinely need an SLA, not for every job. It’d be nice to see this capability supported in a future release of the Databricks platform.
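In Airflow, the sla argument is just a timedelta on the task that triggers the Databricks run. If you ever need to replicate the check yourself (say, when polling the Databricks Jobs API for run durations), the logic is a simple timestamp comparison. A minimal sketch, with all names my own:

```python
from datetime import datetime, timedelta
from typing import Optional

def sla_breached(run_start: datetime, sla: timedelta,
                 now: Optional[datetime] = None) -> bool:
    """Return True if a run that began at run_start has exceeded its SLA."""
    now = now or datetime.utcnow()
    return now - run_start > sla

# A run that started at 06:00 against a 2-hour SLA, checked at 09:00
start = datetime(2023, 1, 4, 6, 0)
print(sla_breached(start, timedelta(hours=2), now=datetime(2023, 1, 4, 9, 0)))  # True
```

The same comparison is what Airflow effectively does for you when an sla is attached to the task.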

Notifications

In addition to SLAs, it’s always useful to get notifications for job failures, so that you don’t have to sit in Databricks watching a job’s progress. This is easy with the notifications feature, which lets you specify an email address or a system endpoint (Slack or a generic webhook) for notifications about job failure/completion. Currently, we’re unable to take advantage of the system notifications (they need to be enabled at the workspace level), so we’re using a PagerDuty email address to forward our notifications to Slack (all our PagerDuty notifications were already coming into a Slack channel).
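On the Databricks side, these email notifications live in the job settings. A sketch of the relevant fragment of a job payload (the forwarding address is made up; in practice you’d set this through the jobs UI or when creating the job via the API):

```python
# Fragment of a Databricks job settings payload; the PagerDuty
# forwarding address below is hypothetical.
job_settings = {
    "name": "nightly_etl",
    "email_notifications": {
        "on_failure": ["databricks-alerts@your-team.pagerduty.com"],
        "no_alert_for_skipped_runs": True,
    },
}

print(job_settings["email_notifications"]["on_failure"])
```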

Of course, it’s a bit redundant to have Databricks notifications if you already have them enabled in Airflow through the on_failure_callback. You’d ideally only want them in one place (assuming your DAG is just a call to Databricks and nothing else) to avoid too many alerts.
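If you keep the alerting on the Airflow side instead, on_failure_callback is just a function that receives the task context. A minimal sketch (the Slack delivery itself is elided; in practice you’d POST to a webhook):

```python
def notify_failure(context):
    """Airflow on_failure_callback: Airflow supplies the context at runtime."""
    ti = context["task_instance"]
    message = f"Task {ti.task_id} failed on {context['ds']}"
    # In practice, POST this message to a Slack webhook here.
    return message

# Simulated context, just to show the shape of the resulting message
class FakeTaskInstance:
    task_id = "run_nightly_etl"

print(notify_failure({"task_instance": FakeTaskInstance(), "ds": "2023-01-04"}))
# Task run_nightly_etl failed on 2023-01-04
```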

Optimizing Dependencies

Similar to Airflow, there’s no restriction in Databricks that tasks be linear, so definitely take advantage of parallelism where possible and arrange your tasks accordingly. For best performance, you may even want more than one shared job cluster, so that some tasks don’t have to share their compute with all the other tasks running at the same time.
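As an illustration, here’s roughly what a multi-task job payload looks like when two transforms fan out from an extract step, with the heavier one given its own shared job cluster. Task and cluster names are hypothetical, and the cluster specs are trimmed to a single field:

```python
# Sketch of a multi-task Databricks job: transform_users and
# transform_orders both depend only on extract, so they run in parallel,
# and the heavier transform gets its own job cluster.
job = {
    "job_clusters": [
        {"job_cluster_key": "shared_main", "new_cluster": {"num_workers": 4}},
        {"job_cluster_key": "shared_heavy", "new_cluster": {"num_workers": 8}},
    ],
    "tasks": [
        {"task_key": "extract", "job_cluster_key": "shared_main"},
        {"task_key": "transform_users",
         "depends_on": [{"task_key": "extract"}],
         "job_cluster_key": "shared_main"},
        {"task_key": "transform_orders",
         "depends_on": [{"task_key": "extract"}],
         "job_cluster_key": "shared_heavy"},
    ],
}

parallel = [t["task_key"] for t in job["tasks"] if "depends_on" in t]
print(parallel)  # ['transform_users', 'transform_orders']
```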

Cost-Saving Tips

It’s pretty easy to run up an expensive bill in Databricks if you’re not careful. Here are some ways to save on cost:

  • Job clusters: Using job clusters where possible will ensure that those resources are only alive as long as they need to be. You won’t be paying for compute you’re not using in that case.
  • Graviton instances: For those running Databricks on AWS, using Graviton-enabled clusters gives you the added bonus of cost-optimized EC2 instances. Performance will most likely hold steady, if not improve.
  • gp3 EBS: Depending on how much control you have over your Databricks clusters, you’ll want to take advantage of gp3 EBS volumes rather than gp2. This can be enabled in the workspace settings for SSD storage. Once it’s switched on, all clusters will use the optimized EBS volumes for even lower costs.
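Putting the first two tips together, a job cluster definition on AWS might look something like this sketch (the Databricks Runtime version and instance size are assumptions; m6g is AWS’s Graviton instance family). gp3 EBS doesn’t appear here, since it’s the workspace-level SSD storage setting mentioned above rather than a cluster field:

```python
# Hypothetical job cluster spec: short-lived compute on Graviton instances
new_cluster = {
    "spark_version": "11.3.x-scala2.12",  # assumed DBR version
    "node_type_id": "m6g.xlarge",         # Graviton (ARM-based) instance type
    "num_workers": 4,
}

print(new_cluster["node_type_id"])  # m6g.xlarge
```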

Conclusion

I’m always refining my Databricks best practices, considering how much work I do there on a regular basis. I’ll likely put out another post (maybe it’s time to summarize it all) soon enough.


Matt Weingarten

Currently a Data Engineer at Samsara. Previously at Disney, Meta and Nielsen. Bridge player and sports fan. Thoughts are my own.