In Plain English: How Does Databricks Pricing Work?

Mendelsohn Neil Chan
Aug 29, 2022


In Plain English is a series of articles that aim to break down technical ideas in simple language. Complex language may be great for academia, but it’s not suitable for the typical data practitioner who just needs to understand the concept and apply it in real life!

1. Let’s start with a practical analogy

Databricks’ pricing model is like paying for your electricity bill. You pay for the amount of energy that you consume.

This is a practical analogy I use to explain how Databricks’ consumption-based pricing model works in layman’s terms. When we use electricity for lighting, heating, and running our appliances, we consume energy that is measured in a unit called the kilowatt-hour (kWh).

In a similar vein, when we use Databricks to run our ETL pipelines or train ML models, we consume computation power that gets measured by a Databricks Unit (or DBU for short).

A DBU is similar to a kWh in that both are units of measurement for consumption based on time

2. Calculating cost

To calculate the cost of using Databricks, simply multiply the number of DBUs you consumed by the corresponding $ rate.

For example, suppose you subscribed to the Premium tier, deployed Databricks in AWS US-East, and ran a data pipeline that consumed 100 DBUs. Multiply this by the Jobs Compute rate of $0.15/DBU to arrive at a cost of $15 (100 x 0.15).
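To make the arithmetic concrete, here is a minimal Python sketch of that formula, reusing the example figures above (the function name is just for illustration):

```python
# Cost paid to Databricks = DBUs consumed x $ rate per DBU.
# The $0.15/DBU figure is the Premium-tier Jobs Compute example rate
# quoted above, not an official price quote.
def databricks_cost(dbus_consumed: float, rate_per_dbu: float) -> float:
    return dbus_consumed * rate_per_dbu

print(databricks_cost(100, 0.15))  # 15.0 -> $15 for 100 DBUs at $0.15/DBU
```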

There are a couple of considerations and nuances to bear in mind which I will explain later, but for now, just remember that the fundamental mechanics of the cost formula above don’t really change.

3. What about data storage and hardware?

So far, our discussion of pricing has revolved around the compute processing power needed to run data workloads on Databricks. But what about data storage and the underlying resources in the cloud?

The three main components that make up the cost of using Databricks

From a Total Cost of Ownership (TCO) perspective, yes, you will still need to account for the cost of storing data in cloud storage (e.g. S3, ADLS) and the cost of provisioning compute IaaS resources (e.g. EC2, Azure VMs) with your Cloud Service Provider. But what you pay to Databricks is based simply on the compute time you use. It’s as simple as that.

Below is a detailed explanation of the two main components that make up the total cost of running Databricks:

1. Storage: Storage costs are paid directly to your Cloud Service Provider (e.g. AWS S3). Because of Databricks’ Lakehouse architecture, you don’t have to duplicate your data and pay for storage twice (most other platforms require you to dump data in S3 first and then load it into their data warehouse, which incurs duplicate storage costs).

In a nutshell, think of Databricks as a “layer of abstraction” that sits on top of cloud object storage. You interact with your data using a front-end web application that offers you a Notebook and SQL Editor UI to choose from, but under the hood it’s just pointing to your actual data residing in cloud storage.

2. Compute: Compute costs can be broken down into two sub-components.

The first part is the cloud infrastructure you need to perform the compute processing (e.g. EC2 instances). This cost is paid directly to your Cloud Service Provider for as long as that infrastructure is running. The second part is where Databricks comes in: you consume DBUs based on how long the compute cluster runs.

To illustrate, imagine a data pipeline that ran for 1 hour and consumed 2 DBUs on an i3.xlarge cluster. You pay AWS directly for the compute capacity used by that instance type during the hour, and then pay an additional DBU cost to Databricks.
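As a rough sketch of that split, the snippet below separates the two line items. The EC2 price of $0.31/hour is a placeholder rather than a quoted rate; check current AWS on-demand pricing for your region.

```python
# Two-part compute cost: infrastructure paid to the cloud provider,
# DBUs paid to Databricks.
def compute_cost(num_instances, runtime_hours, dbu_per_instance_hour,
                 ec2_hourly_price, dbu_rate):
    infra_cost = num_instances * runtime_hours * ec2_hourly_price                 # paid to AWS
    dbu_cost = num_instances * runtime_hours * dbu_per_instance_hour * dbu_rate   # paid to Databricks
    return infra_cost, dbu_cost

# The 1-hour, 2-DBU example above implies 2 instance-hours of i3.xlarge
# (rated at 1 DBU per instance-hour); $0.31/hr is an assumed EC2 price.
aws_bill, dbx_bill = compute_cost(num_instances=2, runtime_hours=1,
                                  dbu_per_instance_hour=1,
                                  ec2_hourly_price=0.31, dbu_rate=0.15)
print(f"AWS: ~${aws_bill:.2f}, Databricks: ${dbx_bill:.2f}")
```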

In simpler terms, think of Databricks as the “software” that sits on top of the “hardware”

Essentially, when the job kickstarts, Databricks switches on those EC2 instances behind the scenes, completes the task at hand, and then switches them off after a specified time interval of inactivity to save costs. This is achieved via Databricks’ auto-terminate feature.

Configuring your cluster in Databricks with auto-terminate
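If you would rather set this up programmatically than through the UI shown above, the cluster spec accepts an auto-termination setting. Below is a minimal sketch against the Databricks Clusters REST API; the workspace URL, token, cluster name, and runtime version are placeholders you would substitute with your own values:

```python
import requests

# Create a cluster that auto-terminates after 30 idle minutes.
payload = {
    "cluster_name": "nightly-etl",          # hypothetical name
    "spark_version": "11.3.x-scala2.12",    # substitute a supported runtime
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "autotermination_minutes": 30,          # switch off after 30 minutes of inactivity
}

resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())  # returns the new cluster_id on success
```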

4. Factors that affect DBU consumption

The amount of DBUs consumed is fundamentally driven by the time it takes for the cluster to complete a data workload (a DBU is a unit of processing capability per hour, billed in per-second increments). This workload may come in the form of running an ETL pipeline, powering a BI tool, or training a Machine Learning model.
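Because billing is per-second, a short-lived cluster only consumes a fraction of its hourly DBU rating. A quick back-of-the-envelope sketch:

```python
# Per-second proration: a cluster rated at 2 DBU/hour that runs for
# 90 seconds consumes 2 * (90 / 3600) = 0.05 DBUs, not a full 2 DBUs.
def dbus_consumed(dbu_per_hour: float, runtime_seconds: float) -> float:
    return dbu_per_hour * runtime_seconds / 3600

print(dbus_consumed(dbu_per_hour=2, runtime_seconds=90))  # 0.05
```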

In terms of what influences computation time, below are three main factors to consider:

1. Data Volume: for obvious reasons, processing a 10TB data set will take more time than processing a 1TB data set (all other things being equal)

2. Data Velocity: running a streaming ETL pipeline that runs 24/7 will certainly drive more DBU consumption than a batch ETL pipeline where data only gets loaded and transformed once or twice a day

3. Data Complexity: complex data transformations such as table upserts, RegEx string matching, and deduplication will take more time to complete than simple data aggregations, thus driving more DBU usage

5. Factors that affect the $ Rate per DBU

Moving on to the second part of the cost equation, the $ rate per DBU is influenced by the factors below (a small illustrative rate lookup follows the list):

Example pricing matrix illustrating how the CSP, subscription tier, and compute type influence the $ Rate per DBU (Source: https://www.databricks.com/product/aws-pricing)

1. Cloud Service Provider & Region - your choice of AWS, Azure, or GCP, and the cloud region you deploy Databricks in

2. Subscription Tier - as of this writing, there are three subscription tiers to choose from — Standard, Premium, Enterprise

3. Compute Type - Databricks offers different compute types that are optimized for specific workloads. The three most common ones are:

• Jobs Compute: used to run data engineering pipelines

• SQL Compute: used for SQL queries and BI reporting

• All-Purpose Compute: used for data science and machine learning
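To see how these factors combine into a rate, here is an illustrative lookup keyed by subscription tier and compute type. Only the Premium / Jobs Compute figure matches the example used earlier in this article; the other numbers are placeholders to show the structure, so always check the official pricing page for your cloud and region:

```python
# Illustrative $/DBU rates for a single cloud/region (AWS US-East).
# Only the Premium + Jobs Compute value comes from the example above;
# the rest are placeholder values, not an official price list.
RATES = {
    ("Standard", "Jobs Compute"): 0.10,        # placeholder
    ("Premium", "Jobs Compute"): 0.15,         # from the earlier example
    ("Premium", "SQL Compute"): 0.22,          # placeholder
    ("Premium", "All-Purpose Compute"): 0.55,  # placeholder
}

def dollar_rate(tier: str, compute_type: str) -> float:
    return RATES[(tier, compute_type)]

print(dollar_rate("Premium", "Jobs Compute"))  # 0.15
```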

6. Bringing it all together with a case study

Databricks Pricing Calculator (Source: https://www.databricks.com/product/aws-pricing/instance-types)

To wrap up, let’s calculate the monthly Databricks DBU cost with a practical scenario:

  • An organization decided to deploy Databricks on AWS and subscribe to the Premium tier
  • It needs to run an hourly data pipeline from 9:00am to 5:00pm each day (a total of 8 hours per day)
  • On average, 5 instances will make up the compute cluster to run the job
  • The chosen AWS instance type is i3.xlarge which consumes 1 DBU per hour, at a rate of $0.15 per DBU

Based on the information given above, the mathematical formula can be derived as follows:

Monthly DBU cost = 5 instances x 1 DBU per instance-hour x 8 hours per day x 30 days x $0.15 per DBU = 1,200 DBUs x $0.15 = $180 per month (assuming a 30-day month).
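The same calculation as a short Python sketch (the 30-day month is an assumption; adjust for your billing period):

```python
# Case-study inputs taken from the bullets above.
instances = 5               # nodes in the cluster
dbu_per_instance_hour = 1   # i3.xlarge rating
hours_per_day = 8           # 9:00am to 5:00pm
rate_per_dbu = 0.15         # Premium tier, Jobs Compute on AWS
days_per_month = 30         # assumption

monthly_dbus = instances * dbu_per_instance_hour * hours_per_day * days_per_month  # 1,200 DBUs
monthly_cost = monthly_dbus * rate_per_dbu                                          # $180.00
print(f"{monthly_dbus} DBUs per month -> ${monthly_cost:.2f}")
```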

Summary

To summarize, below are several key takeaways to remember:

  1. Similar to the kilowatt-hour (kWh), a Databricks Unit (DBU) is a unit of processing capability per hour, billed on per-second usage.
  2. The components that make up the Total Cost of Ownership of running Databricks are 1) Storage and 2) Compute.
  3. Storage costs are paid directly to your Cloud Service Provider for storing data in cloud object storage (e.g. S3, ADLS).
  4. Compute costs are broken down into two sub-components: the first part is the actual compute infrastructure (e.g. EC2, Azure VMs) paid directly to your CSP. The second part is where you consume DBUs based on how long the compute cluster is running for.
  5. Databricks comes built-in with cost-saving features such as auto-terminate, which switches off the underlying compute infrastructure when it’s idle for a certain period of time.
  6. Factors that affect DBU consumption (or “thinking time” in layman’s terms) are: data volume, data velocity, and data complexity.
  7. Factors that affect the $ rate per DBU are: cloud service provider & region, subscription tier, and compute type.

So there you have it, folks: Databricks pricing in a nutshell. If you like what you’ve read, please subscribe to my Medium account and stay tuned for more of my “In Plain English” articles in the future!

Disclaimer: The opinions expressed within this article are solely the author’s and do not reflect the opinions and beliefs of Medium, the author’s employer, or any other affiliates.

