Databricks Cost Reduction Cheat Sheet

No, Databricks is not super expensive

Florent Moiny
4 min read · Jan 22, 2023
Photo by Jp Valery on Unsplash

Here is a simple and straight-to-the-point 101 cheat sheet to dramatically increase your ROI on Databricks.

Streaming

Do you need 24/7 streams?

Does the business need 24/7 streams? Real-time data ingestion? Near real-time predictions?

Yes? Have your clusters run 24/7.

No? Don’t have your clusters run 24/7. How? Use incremental batch processing via trigger(once=True) or trigger(availableNow=True).

Databricks blog post: Running Streaming Jobs Once a Day For 10x Cost Savings.
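
To make this concrete, here is a minimal sketch of the availableNow approach (paths, checkpoint location, and target table name are placeholders; spark is the ambient SparkSession in a Databricks notebook or job):

# Process everything that has arrived since the last run, then stop.
# Paths and table names below are placeholders.
query = (
    spark.readStream
    .format("delta")
    .load("/mnt/raw/events")                                   # placeholder source
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")   # placeholder
    .trigger(availableNow=True)                                # incremental batch
    .toTable("silver.events")                                  # placeholder target
)
query.awaitTermination()   # returns once the backlog is processed

Schedule this as a regular job instead of keeping a cluster up around the clock.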

Tune the node count and size properly

Especially for streaming clusters running 24/7. I've seen a customer running four streams on slightly more, and slightly bigger, nodes than necessary. After two hours of basic tuning, I had reduced the bill by $200k annually.

Databricks documentation: Best practices: Cluster configuration.

Don’t use autoscaling with standard streaming workloads

With standard Structured Streaming workloads, an autoscaling cluster will scale up to the maximum number of nodes and stay there for the duration of the job. Instead, read what's next.

Use Delta Live Tables with enhanced autoscaling

To have proper autoscaling with your streams on Databricks, use DLT with the cluster mode set to enhanced autoscaling.
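
As a rough sketch, the cluster section of the pipeline settings looks something like this (expressed here as a Python dict mirroring the pipeline settings JSON; worker counts are placeholders, so double-check the field names against the DLT documentation):

# Sketch of the cluster section of a DLT pipeline's settings.
dlt_cluster_settings = {
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,    # placeholder
                "max_workers": 5,    # placeholder
                "mode": "ENHANCED",  # enhanced autoscaling instead of standard
            },
        }
    ]
}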

What’s your trigger interval?

Before DBR 8.0, the default trigger interval for streaming queries was 0 ms, which could quickly get very expensive when pinging the underlying cloud storage. DBR 8.0 set the default to 500 ms, which can still be costly in some cases.

Pick a trigger interval that makes sense from the business perspective.

NB: that's another reason to adopt new DBR versions as soon as possible.
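
For example, a minimal sketch with an explicit one-minute trigger (source, sink, and the interval itself are placeholders to adapt to your latency requirements):

query = (
    spark.readStream
    .format("delta")
    .load("/mnt/raw/events")                                        # placeholder source
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events_1min")   # placeholder
    .trigger(processingTime="1 minute")                             # explicit, business-driven interval
    .toTable("silver.events")                                       # placeholder target
)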

Autoloader: Be careful with directory listing

With many files in cloud storage, directory listing mode can be costly.

Recommendation: use directory listing when developing/testing the application or in production when files are generated with lexical ordering. Use file notification in other cases.
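
A minimal Auto Loader sketch in file notification mode (paths, file format, and schema location are placeholders):

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                             # placeholder format
    .option("cloudFiles.useNotifications", "true")                   # file notification instead of directory listing
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/landing")  # placeholder
    .load("/mnt/landing/events")                                     # placeholder source path
)

Note that file notification mode needs permissions to set up the cloud notification services (e.g., queues and event subscriptions), so involve your cloud admins.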

Clusters

All-purpose vs Job clusters

All-purpose compute costs significantly more than job compute. See the pricing pages for AWS, Azure, and GCP.

Use all-purpose clusters when performing ad-hoc analysis, data exploration, or development. Use job clusters for all production jobs.

Use aggressive auto-termination with all-purpose clusters

When using an all-purpose cluster, the default auto-termination is 120 minutes. Be more aggressive: 15 or 30 minutes should be fine. Note that the current minimum is 10 minutes.
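
If you create clusters through the API or infrastructure-as-code, this is the relevant fragment of a cluster spec, sketched as a Python dict (field names per the Clusters API as I understand it; all values are placeholders):

cluster_spec = {
    "cluster_name": "ad-hoc-analysis",        # placeholder
    "spark_version": "13.3.x-scala2.12",      # placeholder DBR version
    "node_type_id": "i3.xlarge",              # placeholder node type
    "num_workers": 2,                         # placeholder
    "autotermination_minutes": 20,            # aggressive auto-termination
}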

Use aggressive auto-termination with Personal Compute

Personal Compute is great, but auto-termination is set at 72 hours. That doesn’t make any sense. Be aggressive and use 15–30 minutes.

Use Databricks SQL Serverless

The serverless version provides instant compute and is more cost-efficient.

Use cluster policies

Less of a low-hanging fruit, but provides guardrails. See the documentation.
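
As a sketch, a policy definition that locks down auto-termination and restricts node types might look like this (constraint syntax written from memory, so verify against the policy definition reference; values are placeholders):

policy_definition = {
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
}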

Runtimes

Upgrade runtimes

Take advantage of newer features, bug fixes, and security patches. Some features will drastically improve your workloads, e.g., when switching to DBR 7.x and Spark 3 (and Adaptive Query Execution, see below). Most of the time, you don’t have any code changes to make.

At the very least, use the latest LTS version.

Use Photon

Using Photon raises the number of DBUs consumed per hour, but it also accelerates workloads dramatically (most of the time), reducing your total cost overall.

Data

Use Delta

Just use Delta for your tables. It’s superior in pretty much all aspects.
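
Writing Delta is a one-liner in Spark, and Delta is already the default table format on recent DBR versions anyway (the table name below is a placeholder and df is any DataFrame):

(
    df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("analytics.orders")   # placeholder table name
)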

Compact your tables

Having thousands or even millions of small files can kill your performance (the famous small files problem). Use OPTIMIZE.
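
A minimal sketch, with a placeholder table name; run it on a schedule rather than never:

spark.sql("OPTIMIZE analytics.orders")   # compact small files into larger ones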

Use Z-ordering to improve data skipping

Less of a low-hanging fruit, but essential, especially for needle-in-the-haystack queries. Associated concept: partitioning.
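
Z-ordering plugs into the same OPTIMIZE command; pick the columns you filter on most often (table and column names below are placeholders):

spark.sql("OPTIMIZE analytics.orders ZORDER BY (customer_id, order_date)")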

Vacuum your tables

I’ve seen terabytes and terabytes of data that should have been vacuumed months ago. Don’t pay extra for something that you don’t need.
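
A minimal sketch with a placeholder table name, keeping the default 7-day retention:

spark.sql("VACUUM analytics.orders RETAIN 168 HOURS")   # drop unreferenced files older than 7 days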

Use Low Shuffle Merge

Migrate to DBR 10.4+ (remember, use the latest DBR LTS version available) and enjoy Low Shuffle Merge being enabled by default.

Set the number of shuffle partitions to auto

Use Adaptive Query Execution (enabled by default since DBR 7.3 LTS) and set spark.conf.set("spark.sql.shuffle.partitions", "auto"). AQE will then handle the number of shuffle partitions at every step of the pipeline.
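
In code (spark is the ambient SparkSession; the first line is a no-op on recent runtimes since AQE is already on by default):

spark.conf.set("spark.sql.adaptive.enabled", "true")     # AQE, default since DBR 7.3 LTS
spark.conf.set("spark.sql.shuffle.partitions", "auto")   # let AQE size shuffle partitions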

Cloud

Use spot instances

When it makes sense. If you’re stressed about VM availability and workload duration, you might not want to use them.

If you want more safety at a potential extra cost, use the spot with fallback to on-demand feature.
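
On AWS, the relevant fragment of the cluster spec looks roughly like this (sketched as a Python dict; field names per aws_attributes in the Clusters API, values are placeholders, and Azure/GCP have analogous availability settings):

aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",   # fall back to on-demand if spot capacity runs out
    "first_on_demand": 1,                   # keep the driver on an on-demand node
    "spot_bid_price_percent": 100,          # placeholder bid
}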

Use reserved instances and instance pools

If you know what your VM consumption will look like over the next year(s), using reserved instances (AWS, Azure) in conjunction with instance pools can be worthwhile.

Pre-purchase Azure Databricks commit units

The same idea as reserved instances but applied to DBUs. See the documentation.

Be paranoid about networking

Be very careful about cross-region data transfer.

Use regional VPC endpoints on AWS or Private Link/Service endpoints on Azure.

Recommendation: if you don't know what you're doing when it comes to networking, or you're not sure, then 1. there's no shame in that, this is genuinely hard, and 2. ask your infra team about all those magic keywords.

Further reading

Definitely read Databricks’ blog post Best practices for cost management on Databricks by Tomasz Bacewicz and Greg Wood. I shamelessly took some of their points.

