Azure Databricks Compute Types — Data Analytics, Data Engineering and Data Engineering Light Clusters

Inderjit Rana
Microsoft Azure
Oct 21, 2020

Objective

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. When getting started with Azure Databricks, I have observed a bit of struggle in grasping some of the concepts around the capability matrix, the associated pricing, and how they translate to implementation. The objective of my post here is to provide a clear picture in your mind so that you understand the details and can make good choices for your workloads.

What you will learn

You will learn about the following concepts and how they relate to each other; I like to think of them as different dimensions of an Azure Databricks environment:

  • Pricing Tiers — Standard vs Premium
  • Data Analytics, Data Engineering and Data Engineering Light Cluster Types (the pricing page refers to these as Workload Types) — You will learn how to create clusters of each of these types (creation of a Data Engineering Light cluster is a little more hidden and not very obvious)
  • High Concurrency vs Standard Cluster Modes — Only applicable to Data Analytics Clusters

The pricing page lists the capability matrix for pricing tiers as well as cluster type features, as shown in the following screenshot (please see the pricing page for the most up-to-date pricing information).

Update March 30, 2021 — Azure Databricks Cluster Types have been renamed: Data Analytics is now referred to as All-Purpose Compute, Data Engineering is Jobs Compute, and Data Engineering Light is Jobs Light Compute, so please interpret this article accordingly. I don't think anything else has changed, and I expect this blog post to still be helpful.

Update May 6, 2022 — SQL Compute is another kind of compute, available in the Premium Tier.

Azure Databricks Capability Matrix on Pricing Page

A Data Analytics Cluster costs more than a Data Engineering Cluster, which in turn is more expensive than Data Engineering Light.

Overall Picture

Before I get into deeper details around the individual concepts, I want to show the following diagram, which should be helpful in understanding how the different dimensions relate to each other.

Azure Databricks — Relationship between various dimensions

Standard or Premium Pricing Tier — Databricks Workspace Level

The Databricks Workspace is at the highest level and forms the environment for accessing all your Azure Databricks assets (you can have multiple clusters of different types within a single Workspace). When getting started with Azure Databricks, the first thing you will do is create an Azure Databricks Workspace, and this is where you will choose the pricing tier, Standard or Premium. All cluster types are available in either pricing tier, but the features vary between tiers.

Data Analytics Clusters

The pricing page uses the term Data Analytics workload for interactive clusters, so both terms are equivalent; at times you will also see these clusters referred to as All-Purpose Clusters. You can create clusters using the UI or the REST API (the CLI works pretty much the same way as the REST API). These are the cluster types typically used for interactively running notebooks.
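As an illustration, here is a minimal sketch of creating such a cluster through the Clusters REST API (POST /api/2.0/clusters/create). The environment variables, cluster name, and node type are my own assumptions for the example, not values prescribed by this post:

    # Minimal sketch: create a Data Analytics (interactive/All-Purpose) cluster
    # via the Clusters REST API. DATABRICKS_HOST and DATABRICKS_TOKEN are
    # assumed to hold your workspace URL and a personal access token.
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]

    payload = {
        "cluster_name": "interactive-demo",  # hypothetical cluster name
        "spark_version": "5.5.x-scala2.10",  # any non-Light runtime version key
        "node_type_id": "Standard_DS3_v2",   # hypothetical VM size
        "num_workers": 2,
        "autotermination_minutes": 60,
    }

    resp = requests.post(
        f"{host}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    resp.raise_for_status()
    print(resp.json()["cluster_id"])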

Azure Databricks — Create Data Analytics/Interactive/All-Purpose Cluster using UI

Data Analytics Cluster Modes

The Interactive clusters support two modes:

  • Standard
  • High Concurrency

The differences are summarized really well in the following Best Practices GitHub repo (scroll down a little on the given link to see the table which nicely compares the two modes).
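These modes also surface when creating clusters programmatically. As a hedged sketch (worth verifying against the current docs), my understanding is that a High Concurrency cluster was requested through the Clusters API by adding a cluster profile setting and a resource-class tag to the create payload from the earlier sketch; omitting them gives a Standard mode cluster:

    # Sketch, based on my understanding of the cluster-mode settings of the
    # time: extra fields that mark a Clusters API payload as High Concurrency.
    high_concurrency_settings = {
        "spark_conf": {
            "spark.databricks.cluster.profile": "serverless",  # High Concurrency profile
        },
        "custom_tags": {"ResourceClass": "Serverless"},
    }

    payload.update(high_concurrency_settings)  # merge into the earlier create payload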

Job Cluster Type — Data Engineering

The documentation uses the term Job Clusters for both the Data Engineering and Data Engineering Light cluster types. You might also see these clusters referred to as Automated Clusters because their main purpose is to run a job and then terminate.

Data Engineering Cluster Creation using UI

When you execute a one-time job or schedule a job from the Azure Databricks Workspace, you specify the cluster configuration as part of the job creation setup. The following two screenshots show where the cluster configuration is specified. Clicking the Edit link for the Cluster setting opens the cluster configuration page (shown in the second screenshot below).

Create Job

As shown in the screenshot of the Configure Cluster page below, selecting New Job Cluster for Cluster Type results in a Data Engineering Cluster being created, as long as the Databricks Runtime Version selection is not set to Light (please see the Data Engineering Light section below to understand this better).

Configure Job Cluster

Note: The other option in the Cluster Type dropdown is Existing All-Purpose Cluster; specifying this option results in the job running on an Interactive Cluster (or Data Analytics Cluster).

Data Engineering Cluster Creation using REST API

  • Use the Jobs REST API to create a Data Engineering Cluster
  • Specify the new_cluster field in the POST request to create and run a job on a new Data Engineering Cluster
  • Set the spark_version sub-field of the new_cluster field to any valid value other than a Data Engineering Light runtime (explained in more detail in the Databricks Runtime Version section at the end of this post); see the sketch below
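
For example, here is a minimal sketch of submitting a one-time run on a new Data Engineering cluster via the Jobs REST API (POST /api/2.0/jobs/runs/submit), reusing the host and token from the earlier snippet; the run name, node type, and notebook path are hypothetical:

    # Minimal sketch: a one-time run on a new (automated) cluster. Any
    # non-Light spark_version results in a Data Engineering (Jobs) cluster.
    job_payload = {
        "run_name": "nightly-etl-demo",          # hypothetical run name
        "new_cluster": {
            "spark_version": "5.5.x-scala2.10",  # non-Light runtime version
            "node_type_id": "Standard_DS3_v2",   # hypothetical VM size
            "num_workers": 4,
        },
        "notebook_task": {"notebook_path": "/Shared/etl-demo"},  # hypothetical path
    }

    resp = requests.post(
        f"{host}/api/2.0/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json=job_payload,
    )
    resp.raise_for_status()
    print(resp.json()["run_id"])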

Job Cluster Type — Data Engineering Light

Data Engineering Light is the most basic version and lacks quite a few of the nice features provided by the other cluster types, but there might still be a few folks interested in using it, so I am adding this section for the sake of completeness. You can read more about it here — https://docs.microsoft.com/en-us/azure/databricks/runtime/light

In my opinion, the creation of this cluster type is the least obvious. Since this is also a job cluster, you specify Data Engineering Light as part of the job creation steps.

Data Engineering Light Cluster Creation using UI

On the cluster creation page, in addition to specifying New Job Cluster for Cluster Type, the Data Engineering Light option needs to be selected in the Databricks Runtime Version dropdown (specifying any other runtime version will result in the creation of a Data Engineering cluster type).

Configure Job Cluster — Data Engineering Light

Data Engineering Light Cluster Creation using REST API

  • Use the Jobs REST API to create a Data Engineering Light cluster
  • Specify the new_cluster field on the Job Create HTTP POST request
  • Set the spark_version sub-field of the new_cluster field to one of the Databricks Runtime Versions for Data Engineering Light (Databricks Runtime Versions are explained in more detail in the next section); see the sketch below
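
The request is the same as the Data Engineering sketch above; only the runtime key changes, for example:

    # Sketch: switching the earlier job payload to a Data Engineering Light
    # runtime key is what makes the new cluster a Jobs Light cluster.
    job_payload["new_cluster"]["spark_version"] = "apache-spark-2.4.x-scala2.11"

    resp = requests.post(
        f"{host}/api/2.0/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json=job_payload,
    )
    resp.raise_for_status()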

Databricks Runtime Version

When creating a new cluster using the REST API, the spark_version field value needs to be set to one of the valid values for the Databricks Runtime Version. An HTTP GET request to the Spark Versions endpoint can be used to retrieve the valid values for this spark_version field.
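
As a small sketch (reusing the host and token assumptions from the earlier snippets), the valid keys can be listed like this:

    # Sketch: list the valid spark_version keys available in this workspace.
    resp = requests.get(
        f"{host}/api/2.0/clusters/spark-versions",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    for version in resp.json()["versions"]:
        print(version["key"], "-", version["name"])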

Databricks Runtime Versions for Data Engineering Light follow a slightly different naming convention than the others:

  • Runtime Version example for Data Engineering Light Cluster Type — apache-spark-2.4.x-scala2.11
  • Runtime Version example for other cluster types — 5.5.x-scala2.10

This is documented at the following link — https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/#programmatic-version

Summary

Azure Databricks is one of my favorite and easiest-to-use services on the Azure platform. Some of the aspects, which I refer to as dimensions, can get a little confusing if you are starting off, but I hope this post met its objective of providing you additional clarity.
