Azure Databricks Hands-on

Jean-Christophe Baey
7 min read · Sep 6, 2018



This tutorial explains what Databricks is and walks you through the main steps to get started on Azure.

An updated version using the new Azure ADLS Gen2 is available here

TL;DR

The first part covers the setup of the environment. The second part walks through the steps to get a working notebook that reads data from Azure Blob Storage. The last part gives you some basic queries to check that everything is working properly.

Introduction

What is Spark?

Apache Spark is an open-source cluster-computing framework. Spark is a scalable, massively parallel, in-memory execution environment for running analytics applications.

What is Databricks?

Databricks is a San Francisco-based company that provides the eponymous product, Databricks: a unified analytics platform powered by Apache Spark. Learn more about the Databricks solution here.

Databricks removes the difficulty and complexity of setting up a Spark cluster. It provides a seamless, zero-management Spark experience thanks to its integration with major cloud providers, including Amazon AWS and Microsoft Azure.

You can get a working, fully managed cluster on Azure in five minutes tops.

Likewise, if you are familiar with Jupyter or Zeppelin notebooks, you will feel at home with Databricks notebooks, as they are the central place where development happens.

Both Python and Scala are supported, and a notebook can mix the two.

Create your first cluster on Microsoft Azure

From your Azure subscription, create the Azure Databricks service resource:

Create your Databricks workspace in Azure

Then launch the workspace from the resource you just created:

Launch button in Azure portal

You should now be in the Databricks workspace:

The next step is to create a cluster that will execute the source code present in your notebooks.

Important note: If you want to write your code in Scala in addition to Python, you need to choose a “Standard” cluster instead of a “Serverless” one.

Cluster properties

You can adjust the cluster size later according to the price you are willing to pay. Notice that the cluster will be shut down automatically after a period of inactivity.

The creation of the cluster can take several minutes. In the meantime, we can create our first notebook and attach it to this cluster.

Before going further, let’s go through some glossary terms.

Some Spark glossary

Spark context

Spark Context is an object that tells Spark how and where to access a cluster. In a Databricks notebook, the Spark Context is already defined as a global variable sc.

Spark session

Spark Session is the entry point for reading data, executing SQL queries over it, and getting the results back. Technically speaking, the Spark session combines the older SQLContext and HiveContext and serves as the entry point to the DataFrame API.

Spark Session can also be used to set runtime configuration options.

In a Databricks notebook, the Spark session is already defined as a global variable spark.

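As a quick sanity check, here is a minimal sketch in Python (the same globals are also available from Scala cells) of how these variables can be used:

    # sc and spark are predefined in every Databricks notebook.
    print(sc.version)                 # Spark version, through the Spark context
    print(spark.sparkContext is sc)   # the session wraps the same Spark context

    # The Spark session can also read and set runtime configuration options:
    spark.conf.set("spark.sql.shuffle.partitions", "8")
    print(spark.conf.get("spark.sql.shuffle.partitions"))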

RDD: Resilient Distributed Dataset

An RDD is an immutable distributed collection of data partitioned across the nodes of your cluster, with a low-level API. It is schema-less and used for low-level transformations and actions.

Dataframe (DF)

A DataFrame is a distributed collection of rows under named columns. It is equivalent to a table in a relational database and close to a Pandas DataFrame. A DataFrame can handle petabytes of data, is built on top of RDDs, and is mapped to a relational schema.

Dataset

A Dataset is a strongly typed DataFrame. A DataFrame is an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.

DataFrame and Dataset were merged into a unified API in Spark 2.0. Learn more about the differences between DataFrames, Datasets, and RDDs in this post from the Databricks blog.

Python or Scala notebooks?

This is the main question every new Spark developer asks. If you are familiar with Python (the language of choice for data scientists), you can stick to it, as you can do almost everything in Python.

However, Scala is the native language of Spark. Because Spark is itself written in Scala, you will find most of the examples, libraries, and Stack Overflow discussions in Scala.

The good news is that you don’t have to choose in a Databricks notebook, since you can mix both languages. Because it is simpler to develop with Python, I recommend starting with Python and switching to Scala only when Python is not enough.

The Python library for working with Spark is named PySpark.

Create your first Python notebook

  • From the home page:
Notebook creation from Home
  • You can also create a notebook in a specific location:
Notebook creation from “Workspace”

Then select the notebook’s default language:

Choose your default language

Once the notebook is created, attach it to your cluster:

  • Enter some Python and Scala code. For Scala, you need to add %scala on the first line, since the default language we chose is Python (see the sketch below).
  • To execute the code, use the shortcut CTRL+ENTER or SHIFT+ENTER. If the cluster is not running, a prompt will ask you to confirm its startup.
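Here is a minimal sketch of what two such cells might contain, assuming the notebook’s default language is Python:

    # Cell 1 (default language, Python):
    data = [("python", 1), ("scala", 2)]
    df_langs = spark.createDataFrame(data, ["language", "rank"])
    df_langs.show()

    %scala
    // Cell 2: the %scala magic on the first line switches this cell to Scala
    val greetings = Seq("hello", "world")
    greetings.foreach(println)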

We are now ready for the next step: data ingestion. However, we need some input data to work with, so we will set up the credentials needed to load Azure data into our Spark cluster.

If you don’t want to use Azure Blob Storage, you can skip the Azure storage setup part and get the CSV content directly from the URL with code like the following:
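A minimal sketch of that approach, assuming the pandas library is available on the driver (it is preinstalled on Databricks clusters):

    import pandas as pd

    # Download the CSV on the driver with pandas, then convert it into a
    # distributed Spark DataFrame.
    url = "https://timeseries.surge.sh/usd_to_eur.csv"
    pdf = pd.read_csv(url)
    df = spark.createDataFrame(pdf)
    df.show(5)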

Set up your Azure Storage account

First, get the three parameters of your storage account:

  • The account name
  • The access key
  • The blob container name

Then, there are three ways to get data from Azure Storage in PySpark:

  1. Using a WASB file path formatted like this:
    wasbs://<containername>@<accountname>.blob.core.windows.net/<partialPath>
    WASB (Windows Azure Storage Blob) is an extension built on top of the HDFS APIs. HDFS, the Hadoop Distributed File System, is one of the core Hadoop components that manages data and storage across multiple nodes.
  2. Using a mount point on the worker nodes with the Databricks FS protocol, and requesting files with a path like:
    dbfs:/mnt/<containername>/<partialPath>
  3. Using a mount point and requesting files with a regular file path:
    /dbfs/mnt/<containername>/<partialPath>

If you are using PySpark functions, you should use option 1 or 2. If you are using the regular Python I/O API, you should use option 3.

To set up the file access, you need to do this:

Notice that we are using dbutils, a Databricks utility library that is already imported. dbutils.fs provides file-related functions that cannot be used inside parallel tasks.
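A sketch of this setup (the account name, container name, and access key below are placeholders to replace with your own values):

    storage_account = "<accountname>"   # placeholder
    container = "<containername>"       # placeholder
    storage_key = "<access key>"        # placeholder

    # For option 1: register the account key so that Spark can resolve
    # wasbs:// paths directly.
    spark.conf.set(
        "fs.azure.account.key.%s.blob.core.windows.net" % storage_account,
        storage_key)

    # For options 2 and 3: mount the container under /mnt so that it is
    # reachable through dbfs:/mnt/... and /dbfs/mnt/...
    dbutils.fs.mount(
        source="wasbs://%s@%s.blob.core.windows.net" % (container, storage_account),
        mount_point="/mnt/%s" % container,
        extra_configs={
            "fs.azure.account.key.%s.blob.core.windows.net" % storage_account: storage_key})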

You can also call the filesystem functions of dbutils directly using the %fs prefix:
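For instance, to list the content of the mount point created above:

    %fs ls /mnt/<containername>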

Let’s try to put a file in our Azure container and then load it into a Spark DataFrame to make sure everything is working properly.

For our example, we will use the EUR/USD exchange rate since 2000, in CSV format:

Date, Rate
2000-01-01,0.9954210631096955
2000-01-02,0.9954210631096955

Get the file from https://timeseries.surge.sh/usd_to_eur.csv and store it in a container in your Azure Blob Storage.

Then we load the file into a Spark DataFrame, using either wasbs or dbfs:
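For example (same placeholders as above, assuming the file was uploaded as usd_to_eur.csv at the root of the container):

    # Option 1: read the file directly through its wasbs:// path
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("wasbs://<containername>@<accountname>.blob.core.windows.net/usd_to_eur.csv"))

    # Option 2: read the file through the DBFS mount point
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("dbfs:/mnt/<containername>/usd_to_eur.csv"))

    df.show(5)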

From our DataFrame, let’s compute some basic statistics:
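For example:

    df.printSchema()          # column names and inferred types
    print(df.count())         # number of rows
    display(df.describe())    # count, mean, stddev, min and max of each column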

We can register the input DataFrame as a temporary view named xrate in the SQL context with this command:
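With the DataFrame API this is a one-liner:

    df.createOrReplaceTempView("xrate")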

Then we can run SQL queries and get the data back from the xrate view:
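For instance, to look at the most recent rates (the column names Date and Rate come from the CSV header):

    latest = spark.sql("SELECT Date, Rate FROM xrate ORDER BY Date DESC LIMIT 10")
    display(latest)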

Notice that the display function can do more than render a table: it also provides some basic charting features:

Likewise, we can run SQL directly in a cell thanks to the %sql prefix:
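For example, the yearly average rate could be computed in a dedicated SQL cell like this:

    %sql
    SELECT year(Date) AS year, avg(Rate) AS avg_rate
    FROM xrate
    GROUP BY year(Date)
    ORDER BY year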

Finally, instead of defining the query with a string, we can use the PySpark API:
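A sketch of the same kind of aggregation written with the DataFrame API:

    from pyspark.sql import functions as F

    yearly = (spark.table("xrate")
              .groupBy(F.year(F.to_date("Date")).alias("year"))
              .agg(F.avg("Rate").alias("avg_rate"))
              .orderBy("year"))
    display(yearly)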

If you want to switch back to Scala, you can use the shared SQL context to pass a DataFrame to a %scala cell through its temporary view:
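For example, a %scala cell can pick up the DataFrame registered from Python:

    %scala
    // The xrate temporary view registered from the Python cell is visible here too.
    val xrate = spark.table("xrate")   // sqlContext.table("xrate") works as well
    xrate.show(5)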

This concludes the first part of our hands-on post. Your next step is to practice the PySpark API and learn to think in DataFrames.

If you like what you read, don’t forget to clap!

Part 2: PySpark API by python examples

