Azure Databricks Hands-on

Jean-Christophe Baey
7 min read · Sep 6, 2018



This tutorial explains what Databricks is and walks you through the main steps to get started on Azure.

An updated version using the new Azure ADLS Gen2 is available here

TL;DR

The first part covers the setup of the environment. The second part walks through the steps to get a working notebook that reads data from Azure Blob Storage. The last part gives you some basic queries to check that everything is working properly.

Introduction

What is Spark?

Apache Spark is an open-source cluster-computing framework. Spark is a scalable, massively parallel, in-memory execution environment for running analytics applications.

What is Databricks?

Databricks is a San Francisco-based company that provides the eponymous product, Databricks: a unified analytics platform powered by Apache Spark. Learn more about the Databricks solution here.

Databricks removes the difficulty and complexity of setting up a Spark cluster. It provides a seamless, zero-management Spark experience thanks to its integration with major cloud providers, including Amazon AWS and Microsoft Azure.

You can get a working, fully managed cluster on Azure in five minutes tops.

Likewise, if you are familiar with Jupyter or Zeppelin notebooks, you will feel at home with Databricks notebooks, as they are the central place where development happens.

Both Python and Scala are supported, and a notebook can mix the two.

Create your first cluster on Microsoft Azure

From your Azure subscription, create the Azure Databricks service resource:

Create your Databricks workspace in Azure

Then launch the workspace from the resource you just created:

Launch button in Azure portal

You should now be in the Databricks workspace:

The next step is to create a cluster that will execute the source code present in your notebooks.

Important note: If you want to write your code in Scala in addition to Python, you need to choose a “Standard” cluster instead of a “Serverless” one.

Cluster properties

You can adjust the cluster size later according to the price you are willing to pay. Notice that the cluster will be shut down automatically after a period of inactivity.

The creation of the cluster can take several minutes. In the meantime, we can create our first notebook and attach it to this cluster.

Before going further, let’s go through some glossary terms.

Some Spark glossary

Spark context

Spark Context is an object that tells Spark how and where to access a cluster. In a Databricks notebook, the Spark Context is already defined as a global variable sc.

Spark session

Spark Session is the entry point for reading data, executing SQL queries over it, and getting the results back. Technically speaking, the Spark session combines the older SQLContext and HiveContext and serves as the entry point to the DataFrame API.

Spark Session can also be used to set runtime configuration options.

In a Databricks notebook, the Spark session is already defined as a global variable spark.

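As a quick sanity check, here is a minimal sketch in Python (the same globals are also available from Scala cells) of how these variables can be used:

    # sc and spark are predefined in every Databricks notebook.
    print(sc.version)                 # Spark version, through the Spark context
    print(spark.sparkContext is sc)   # the session wraps the same Spark context

    # The Spark session can also read and set runtime configuration options:
    spark.conf.set("spark.sql.shuffle.partitions", "8")
    print(spark.conf.get("spark.sql.shuffle.partitions"))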

RDD: Resilient Distributed Dataset

An RDD is an immutable distributed collection of data partitioned across the nodes of your cluster, with a low-level API. It is schema-less and used for low-level transformations and actions.

Dataframe (DF)

A DataFrame is a distributed collection of rows under named columns. It is equivalent to a table in a relational database and close to a Pandas DataFrame. A DataFrame can handle petabytes of data, is built on top of RDDs, and is mapped to a relational schema.

Dataset

A Dataset is a strongly typed DataFrame. A DataFrame is an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object.

DataFrame and Dataset were merged into a unified API in Spark 2.0. Learn more about the differences between DataFrames, Datasets, and RDDs in this post from the Databricks blog.

Python or Scala notebooks?

This is the main question every new Spark developer asks. If you are familiar with Python (the language of choice for data scientists), you can stick to it, as you can do almost everything in Python.

However, Scala is the native language of Spark. Because Spark is itself written in Scala, you will find most of the examples, libraries, and Stack Overflow discussions in Scala.

The good news is that you don’t have to choose in a Databricks notebook, since you can mix both languages. Because it is simpler to develop with Python, I recommend starting with Python and switching to Scala only when Python is not enough.

The Python library for working with Spark is named PySpark.

Create your first Python notebook

  • From the home page:
Notebook creation from Home
  • You can also create a notebook in a specific location:
Notebook creation from “Workspace”

Then select the notebook’s default language:

Choose your default language

Once the notebook is created, attach it to your cluster:

  • Enter some Python and Scala code. For Scala, you need to add %scala on the first line, since the default language we chose is Python (see the sketch below).
  • To execute the code, use the shortcut CTRL+ENTER or SHIFT+ENTER. If the cluster is not running, a prompt will ask you to confirm its startup.
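Here is a minimal sketch of what two such cells might contain, assuming the notebook’s default language is Python:

    # Cell 1 (default language, Python):
    data = [("python", 1), ("scala", 2)]
    df_langs = spark.createDataFrame(data, ["language", "rank"])
    df_langs.show()

    %scala
    // Cell 2: the %scala magic on the first line switches this cell to Scala
    val greetings = Seq("hello", "world")
    greetings.foreach(println)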

We are now ready for the next step: data ingestion. However, we need some input data to work with, so we will set up the credentials needed to load Azure data into our Spark cluster.

If you don’t want to use Azure Blob Storage, you can skip the Azure storage setup part and get the CSV content directly from the URL with code like the following:
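A minimal sketch of that approach, assuming the pandas library is available on the driver (it is preinstalled on Databricks clusters):

    import pandas as pd

    # Download the CSV on the driver with pandas, then convert it into a
    # distributed Spark DataFrame.
    url = "https://timeseries.surge.sh/usd_to_eur.csv"
    pdf = pd.read_csv(url)
    df = spark.createDataFrame(pdf)
    df.show(5)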

Set up your Azure Storage account

First, get the three parameters of your storage account:

  • The account name
  • The access key
  • The blob container name

Then, there are three ways to get data from Azure Storage in PySpark:

  1. Using a WASB file path formatted like this:
    wasbs://<containername>@<accountname>.blob.core.windows.net/<partialPath>
    WASB (Windows Azure Storage Blob) is an extension built on top of the HDFS APIs. HDFS, the Hadoop Distributed File System, is one of the core Hadoop components that manages data and storage across multiple nodes.
  2. Using a mount point on the worker nodes with the Databricks FS protocol, and requesting files with a path like:
    dbfs:/mnt/<containername>/<partialPath>
  3. Using a mount point and requesting files with a regular file path:
    /dbfs/mnt/<containername>/<partialPath>

If you are using PySpark functions, you should use option 1 or 2. If you are using the regular Python I/O API, you should use option 3.

To set up the file access, you need to do this:

Notice that we are using dbutils, a Databricks utility library that is already imported. dbutils.fs provides file-related functions that cannot be used inside parallel tasks.
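A sketch of this setup (the account name, container name, and access key below are placeholders to replace with your own values):

    storage_account = "<accountname>"   # placeholder
    container = "<containername>"       # placeholder
    storage_key = "<access key>"        # placeholder

    # For option 1: register the account key so that Spark can resolve
    # wasbs:// paths directly.
    spark.conf.set(
        "fs.azure.account.key.%s.blob.core.windows.net" % storage_account,
        storage_key)

    # For options 2 and 3: mount the container under /mnt so that it is
    # reachable through dbfs:/mnt/... and /dbfs/mnt/...
    dbutils.fs.mount(
        source="wasbs://%s@%s.blob.core.windows.net" % (container, storage_account),
        mount_point="/mnt/%s" % container,
        extra_configs={
            "fs.azure.account.key.%s.blob.core.windows.net" % storage_account: storage_key})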

You can also call the filesystem functions of dbutils directly using the %fs prefix:
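For instance, to list the content of the mount point created above:

    %fs ls /mnt/<containername>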

Let’s try to put a file in our Azure container and then load it into a Spark DataFrame to make sure everything is working properly.

For our example, we will use the EUR/USD exchange rate since 2000, in CSV format:

Date, Rate
2000-01-01,0.9954210631096955
2000-01-02,0.9954210631096955

Get the file from https://timeseries.surge.sh/usd_to_eur.csv and store it in a container in your Azure Blob Storage.

Then we load the file into a Spark DataFrame, using either wasbs or dbfs:
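For example (same placeholders as above, assuming the file was uploaded as usd_to_eur.csv at the root of the container):

    # Option 1: read the file directly through its wasbs:// path
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("wasbs://<containername>@<accountname>.blob.core.windows.net/usd_to_eur.csv"))

    # Option 2: read the file through the DBFS mount point
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("dbfs:/mnt/<containername>/usd_to_eur.csv"))

    df.show(5)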

From our DataFrame, let’s compute some basic statistics:
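For example:

    df.printSchema()          # column names and inferred types
    print(df.count())         # number of rows
    display(df.describe())    # count, mean, stddev, min and max of each column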

We can register the input DataFrame as a temporary view named xrate in the SQL context with this command:
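With the DataFrame API this is a one-liner:

    df.createOrReplaceTempView("xrate")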

Then we can run SQL queries and get the data back from the xrate view:
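For instance, to look at the most recent rates (the column names Date and Rate come from the CSV header):

    latest = spark.sql("SELECT Date, Rate FROM xrate ORDER BY Date DESC LIMIT 10")
    display(latest)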

Notice that the display function can do more than render a table: it also provides some basic charting features:

Likewise, we can run SQL directly in a cell thanks to the %sql prefix:
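For example, the yearly average rate could be computed in a dedicated SQL cell like this:

    %sql
    SELECT year(Date) AS year, avg(Rate) AS avg_rate
    FROM xrate
    GROUP BY year(Date)
    ORDER BY year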

Finally, instead of defining the query with a string, we can use the PySpark API:
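A sketch of the same kind of aggregation written with the DataFrame API:

    from pyspark.sql import functions as F

    yearly = (spark.table("xrate")
              .groupBy(F.year(F.to_date("Date")).alias("year"))
              .agg(F.avg("Rate").alias("avg_rate"))
              .orderBy("year"))
    display(yearly)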

If you want to switch back to Scala, you can use the shared SQL context to pass a DataFrame to a %scala cell through its temporary view:
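For example, a %scala cell can pick up the DataFrame registered from Python:

    %scala
    // The xrate temporary view registered from the Python cell is visible here too.
    val xrate = spark.table("xrate")   // sqlContext.table("xrate") works as well
    xrate.show(5)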

This concludes the first part of our hands-on post. Your next step is to practice the PySpark API and learn to think in DataFrames.

If you like what you read, don’t forget to clap!

Part 2: PySpark API by python examples

