Azure Databricks for Data Noobs Part 1 — The Basics

Farooq Mahmud · Analytics Vidhya · Apr 19, 2020


Introduction

My previous posts explained how to get a local Spark instance up and running in Docker, load data, and query data. I also mentioned that this is an excellent way to get introduced to Spark because it is easy to set up and run.

When you’re ready to graduate to an enterprise-grade Spark instance, you will want to load and analyze data on a Spark cluster because it can handle massive datasets with ease. However, configuring a Spark cluster is far from trivial.

Enter Azure Databricks.

After reading this article, you will have accomplished the following:

  1. Learned the benefits of Azure Databricks and what problems it solves.
  2. Set up an Azure Databricks environment using PowerShell and the Azure CLI. Why? Because it is good to “automate all the things.”
  3. Created an Azure Databricks cluster.
  4. Created a Databricks notebook.
  5. Run notebook code that loads a data set from the article’s GitHub repository, does some basic clean-up, and views the data.

GitHub Repository

All the code mentioned in this article is available on GitHub.

Databricks and Azure Databricks

Yes, these are two different things. Databricks was created in 2013 by the folks who created Apache Spark. The Spark team realized that setting up an enterprise-grade Spark environment was hard. Databricks abstracts a lot of this complexity through automation. Combine this with a cloud provider and you have a Spark platform that automatically scales, is secure, and can connect with a legion of services. There is a nice Databricks overview in the context of AWS on the Databricks documentation site.

Azure Databricks is a Spark platform that automatically scales, is secure, and can connect with a legion of services by leveraging the Azure platform:

  • Azure compute resources provide scalability.
  • Networking resources and Azure Active Directory provide security.
  • Azure Databricks can work with blob storage, data lakes, SQL Servers, IoT Hubs, and many more Azure services.

Here is a nice Databricks overview in the context of Azure.

Set Up the Databricks Environment

The Azure Databricks environment consists of a resource group. In that resource group, we create an Azure Databricks workspace resource. An Azure Databricks workspace is your entry point into the Azure Databricks environment.

Note: Before continuing, log in to Azure from your PowerShell session. Run az login and optionally az account set --subscription <<your subscription name>> if you need to change your subscription from the default.

Create an Azure Resource Group

#Create resource group
$resourceGroupName = "<<name>>"
$resourceGroupLocation = "<<location>>"
az group create `
    --name $resourceGroupName `
    --location $resourceGroupLocation

Create an Azure Databricks Workspace

Unfortunately, Databricks workspaces cannot be created using the Azure CLI. Therefore, we deploy one using a publicly available ARM template. The parameters to the ARM template are dynamically created in the PowerShell script below. Now is an excellent time to get a drink or snack, as the deployment takes several minutes.

#Create Databricks workspace
$workspaceName = '<<your workspace name>>'
$armTemplateUri = 'https://raw.githubusercontent.com/Azure/azure-quickstart-templates/master/101-databricks-workspace/azuredeploy.json'

$armTemplateParameters = @{
    pricingTier = @{ value = 'premium' }
    location = @{ value = $resourceGroupLocation }
    workspaceName = @{ value = $workspaceName }
} | ConvertTo-Json

$tempFileName = New-TemporaryFile
[System.IO.File]::WriteAllText($tempFileName, $armTemplateParameters)

try {
    az group deployment create `
        --resource-group $resourceGroupName `
        --template-uri $armTemplateUri `
        --parameters """$tempFileName"""
} finally {
    Remove-Item -Path $tempFileName -Force
}

Now that the Databricks workspace is running, we can create a cluster within the workspace. A Databricks cluster is what executes the Spark jobs you submit, i.e., Python code in notebooks.

  1. In the Azure portal, go to the Databricks resource and click the Launch Workspace button.
  2. Click the Clusters icon.
  3. Click the Create Cluster button.
  4. Give your cluster a name. Change the Min Workers and Max Workers settings to 2 and 4, respectively. Click Create Cluster.
  5. You are redirected to the cluster listing page. After a few seconds, your cluster will appear in the list in a Pending state. It takes a few minutes for the cluster’s state to transition to Running.

Aside — “Automate All the Things”

If you’re wondering if a cluster creation can be automated, the answer is yes! There is a Python-based Databricks CLI available. I intentionally did not mention it in this article because I did not want to introduce additional complexity. If you are interested in setting up the Databricks CLI, check out the Azure Databricks documentation.
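If you are curious what that automation could look like without installing the CLI, the Databricks Clusters REST API can create a cluster directly. The sketch below is a minimal, illustrative Python example using the requests library; the workspace URL, personal access token, runtime version, and node type are placeholders you would replace with values from your own workspace.

import requests

# Placeholders: replace with your workspace URL and a personal access token.
workspace_url = 'https://<your-workspace>.azuredatabricks.net'
token = '<your-personal-access-token>'

# Create an autoscaling cluster (2 to 4 workers) via the Clusters API 2.0.
response = requests.post(
    f'{workspace_url}/api/2.0/clusters/create',
    headers={'Authorization': f'Bearer {token}'},
    json={
        'cluster_name': 'demo-cluster',
        'spark_version': '6.4.x-scala2.11',  # use a runtime version your workspace offers
        'node_type_id': 'Standard_DS3_v2',   # use a VM size available in your region
        'autoscale': {'min_workers': 2, 'max_workers': 4},
    },
)
print(response.json())  # contains the new cluster's ID on success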

Where Are We?

At this point, we have an Azure Databricks workspace running a Spark cluster! Now we can load the flight data file into Spark and analyze it.

Databricks Infrastructure

Before doing anything with the data, let us take a look at the infrastructure Azure created for us. When I first saw all the resources that are involved in a Databricks cluster, I was thankful I did not have to set this up myself (or find an expert to set this up).

Creating a cluster creates virtual machines, a storage account, and networking resources. These resources are contained in what is known as a managed resource group. You get to this resource group via the Databricks workspace page in the Azure portal.

Note that the URL shown on the workspace overview page is the URL of the Databricks UI. You can bookmark this URL for quick access to the UI.

Click on the link to see the managed resource group’s resources. Notice there are two virtual machines because we specified a minimum of two worker nodes when creating the cluster. The storage account resource contains the Databricks File System (DBFS). Observe that if you try to poke around the storage account containers, you will get an unauthorized message. Azure constrains DBFS access.

Create a Databricks Notebook

Let’s create a Python notebook, which we will use to do our analysis.

  1. Browse to the Databricks UI.
  2. Click Workspace > Create > Notebook. Provide a notebook name.
  3. The notebook opens in edit mode. Now we can write Python code to load and explore flight data!

Running Code

You write Databricks code in cells. A cell can contain one line of code, or it can contain hundreds; it is up to you. You have options when it comes to running code in a cell. Keyboard aficionados can press CTRL+ENTER. GUI-minded folks can click the Run Cell button.

You can also press SHIFT+ENTER to run code. The difference is that this method will execute the code and create a new cell.

Aside — Magic Commands

You can explore the driver node’s file system from a Databricks notebook. Run the following code in a cell to print the current working directory and its contents:

%sh
pwd
ls

You should see the driver node’s current working directory followed by a listing of its contents.

The %sh command is called a magic command. The %sh magic command allows you to run shell commands in a cell.

The magic command %fs allows you to view the DBFS:

%fs
ls

Even though we created a Python notebook, you can run Scala code using a magic command:

%scala
println("Hello, world")

There is a magic command for markdown, which is useful for documenting notebooks. Run the following code in a cell and see the rendered content:

%md # Exploring the Databricks File System *(DBFS)*

Import a Data Set From the GitHub Repository

Running the code below in a notebook cell downloads the flight data set from this article’s GitHub repository.

import requests

csv = requests.get('https://raw.githubusercontent.com/farooq-teqniqly/blog-databricks-for-noobs/master/flight_data.txt').text

Now we need to save the text in the csv variable to a file in the DBFS so that we can load it into a data frame. Running the code below creates a folder named demo_data and saves the downloaded data to a file in that folder named flight_data.csv.

dbutils.fs.mkdirs('/demo_data/')
dbutils.fs.put('/demo_data/flight_data.csv', csv)
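
As a quick sanity check, you can list the folder from the notebook to confirm the file was written. This uses the same dbutils file system utilities as above:

# Confirm flight_data.csv exists in the demo_data folder.
display(dbutils.fs.ls('/demo_data/'))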

Now load the file into a data frame. Note the need for the delimiter option because the column delimiter is not a comma.

flight_data_df = (spark.read
    .format('csv')
    .options(header='true', inferSchema='true', delimiter='|')
    .load('dbfs:/demo_data/flight_data.csv'))

Aside — Loading Files

In an enterprise context, data would be loaded from an Azure service like blob storage or Data Lake storage. I intentionally avoided loading the file that way here to keep the added complexity out of this article. However, I will cover this scenario in a future article.

Spark Execution Model

After executing the previous code, observe that the output contains the schema of the data frame and information about the Spark jobs executing the code.
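
If you want to re-inspect the schema later without re-reading the file, printSchema displays it on demand:

# Print the data frame's column names and inferred types.
flight_data_df.printSchema()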

The data frame includes two empty columns, _c0 and _c17. Let's remove them. This is a typical example of the clean-up data sets usually need before analysis. Run the code below in a cell.

flight_data_df = flight_data_df.drop('_c0', '_c17')

The output shows the new schema of the data frame. The columns are gone, but where is the Spark job information?

In a past post, I mentioned Spark defers query execution until you want to see the data. So let’s see the data (or part of it anyway):

flight_data_df.show()

And this time a Spark job appears in the cell output along with the data.
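
If you want to see the deferred execution for yourself, chain another transformation and notice that no job runs until an action is called. The sketch below is illustrative; it uses whatever the first column happens to be, so it does not depend on the data set's real column names.

# filter() is a transformation: Spark only records it, so no job runs yet.
first_column = flight_data_df.columns[0]
non_null_df = flight_data_df.filter(flight_data_df[first_column].isNotNull())

# count() is an action: this is the point where Spark actually runs a job.
print(non_null_df.count())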

Cleanup

Please remove the resource group to avoid incurring additional costs. Run the following in a PowerShell session:

az group delete --name $resourceGroupName --yes

Conclusion

If you got this far, thank you for reading! Hopefully, you got a useful overview of Azure Databricks and will embark on your own Azure Databricks journey.
