Get Started with Azure Blobs in Databricks

Featuring Azure Key Vault and Spark Pandas

Charlotte Patola
CodeX
Oct 4, 2022


Azure blobs in Databricks

This tutorial will go through how to read and write data to/from Azure blobs using Spark Pandas¹ in Databricks. The blob connection is accessed via Azure Key Vault.

This is part 3 in a series about Databricks:

  1. Get Started with Pandas in Databricks
  2. Get started with Azure SQL in Databricks
  3. Get Started with Azure Blobs in Databricks

Prerequisites

  1. An instance of Databricks via Azure
  2. An Azure Storage Account
  3. Familiarity with Databricks’ basic features
  4. Some familiarity with Python Pandas

Our Demo Case

As example data, we have a CSV file holding course evaluations. After uploading this file to an Azure blob, we enrich the data with a mean of the evaluations per student and then write the result back to another Azure blob. You will find the file here on GitHub (double-click and Save as).

In order to safeguard access to the blob, we will access it via a key vault. We will also adhere to the DRY principle by creating a reusable connection function to be used by our Databricks notebooks.

Blob Setup

We create one blob for the raw data and one for the transformed output data. We then upload the course_feedback CSV file (see Our Demo Case above) to the raw data blob.

Blob setup

Key Vault Setup

A key vault is a “credentials storage” where one can save and access credentials safely. One should always refrain from writing out credentials in the code or associated files, as they can easily get leaked from there or accidentally end up in the version control system.

Key Vault Creation in Azure

To create a key vault in Azure, search for “key vault” in the resource search window at the top of the portal.

When you have created the key vault (with the default settings), go to Secrets in the left pane. From there, click Generate/Import. Set a value for the fields Name and Secret Value.

Creating a Secret
  • Name — As we’ll store the storage account key as our secret value, let’s name it accordingly
  • Secret Value — You’ll find the storage account access keys when you click your storage account and go to Access Keys. You can use either key1 or key2. Click Show to see the key and copy it.
Storage Account Access Keys

Secret Scope Setup in Databricks

After the key vault is set up, we need to create a Databricks secret scope.

Set up Databricks secret scope
  • When you are on the starting page of your Databricks instance, add #secrets/createScope to the end of the URL
  • On the page you are directed to, enter the name of your Azure key vault in the Scope Name field
  • Set the Manage Principal dropdown to All Users. This means that all users are allowed to read and write to this secret scope. You find more info about these settings here
  • In the DNS field, insert the vault URI you find under the Key Vault’s Properties
  • Set Resource ID to the Resource ID you find under the Key Vault’s Properties

The key vault setup is now finished, and the storage account key can be accessed from Databricks.

Accessing the Storage Account Blobs from Databricks

We will access our blobs by mounting them to a folder structure using the WASB driver². This way we can access the CSV files the same way we would access local files.

We will create reusable mounting and unmounting functions that we will save in a separate Databricks notebook. These functions can be called by any notebook in our Databricks instance. The function call is done by running the function notebook from within the calling notebook. After this, all data (functions, variables, constants…) from the other notebook are available to the calling notebook.

You find the whole code for our demo below and here on GitHub (connect_azure_blob & course_feedback_blob).

Mounting the Blob

We start by checking if the blob is already mounted. If not, we proceed with the connection properties, accessed via the secret in our key vault. We finish with an error message in case the mounting is not successful.
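A minimal sketch of what such a mounting function can look like is shown below. The function and parameter names are illustrative assumptions, not necessarily the exact code from the GitHub repo; dbutils is available by default in Databricks notebooks.

# Sketch of a reusable mounting function (names are illustrative)
def mount_blob(storage_account, secret_scope, secret_name, container):
    mount_point = f"/mnt/{container}"

    # Skip mounting if the container is already mounted
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        print(f"{mount_point} is already mounted")
        return

    try:
        dbutils.fs.mount(
            source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
            mount_point=mount_point,
            extra_configs={
                f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                    dbutils.secrets.get(scope=secret_scope, key=secret_name)
            },
        )
        print(f"Mounted {container} at {mount_point}")
    except Exception as e:
        print(f"Mounting {container} failed: {e}")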

Unmounting the Blob

When we are finished and don’t need the blob connection anymore, it is considered good practice to unmount it. To handle the case where the blob has already been unmounted, we include a check for this.
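A matching sketch for the unmounting function, under the same naming assumptions:

# Sketch of a reusable unmounting function (names are illustrative)
def unmount_blob(container):
    mount_point = f"/mnt/{container}"

    # Only unmount if the mount point actually exists
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)
        print(f"Unmounted {mount_point}")
    else:
        print(f"{mount_point} is not mounted")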

Finishing the Mounting Notebook

In order to make the use of the mounting notebook a bit easier, we add some information about its functions. Static info can be written in a markdown (%md) cell. This info will be displayed when the notebook is run. When the notebook is ready, we create a Functions folder under our shared workspace and save it there. I save the notebook with the name connect_azure_blob.
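The markdown cell could look roughly like this (the wording is an example, not the exact cell from the repo):

%md
## connect_azure_blob
mount_blob(storage_account, secret_scope, secret_name, container): mounts the container at /mnt/<container>, using the key stored in the given secret scope
unmount_blob(container): unmounts /mnt/<container> if it is mounted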

Wrangling Notebook

After creating our new notebook — I named it course_feedback_blob — and saving it in the Shared Workspace, on the same level as the Functions folder, we continue by importing Spark Pandas and running the connect_azure_blob notebook with the %run command.
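Assuming the folder layout described above, the wrangling notebook starts with two cells. In the first cell we import pandas-on-Spark (the “Spark Pandas” of this article):

# pandas API on Spark, available in Databricks Runtime 10+
import pyspark.pandas as ps

In the second cell, which may contain nothing else, we run the function notebook:

%run ./Functions/connect_azure_blob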

When the notebook is run, the notebook information will be displayed.

The function info displayed in the calling notebook

Accessing Blob Data

Now, we mount both the rawdata-courses blob and the transformeddata-courses blob. If you often read and write from the same storage account, it is a good idea to give the first three parameters default values in the function definition. read_csv will now read the CSV file and save it as a pandas data frame, just as if it were a local file. We only have to remember that the path has to start with /mnt/name_of_our_blob.
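A sketch of the mounting calls and the read; the storage account, scope, secret, and file names below are hypothetical placeholders, to be replaced with your own values:

# Mount both containers (replace the names with your own values)
mount_blob("mystorageaccount", "databricks-kv", "storage-account-key", "rawdata-courses")
mount_blob("mystorageaccount", "databricks-kv", "storage-account-key", "transformeddata-courses")

# Read the CSV from the mounted raw data blob into a pandas-on-Spark DataFrame
df = ps.read_csv("/mnt/rawdata-courses/course_feedback.csv")
df.head()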

Wrangling

The only modification we will make is to compute an overall score for each course feedback row.
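A sketch of that step is below. The evaluation column names are assumptions, to be adjusted to the columns in the course_feedback file, and the option on the first line is presumably the setting footnote 3 refers to, needed when assigning a column built from another frame.

# Allow operations across different frames (see footnote 3)
ps.set_option("compute.ops_on_diff_frames", True)

# Row-wise mean of the evaluation columns (column names are assumptions)
eval_cols = ["content", "teacher", "materials"]
df["overall_score"] = df[eval_cols].mean(axis=1)
df.head()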

Writing the Output to a Blob

The transformed data frame is now ready to be written to the transformeddata-courses blob. To make sure that it arrived safely, we can read it back and inspect it.
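A sketch of the write and verification step; the output path is an example. Note that pandas-on-Spark to_csv writes a directory of part files, so num_files=1 keeps the output to a single file:

# Write the enriched data to the transformed-data blob
df.to_csv("/mnt/transformeddata-courses/course_feedback_enriched", num_files=1)

# Read it back to confirm the write succeeded
check = ps.read_csv("/mnt/transformeddata-courses/course_feedback_enriched")
check.head()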

The last step is to unmount the used blobs.
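With the function from connect_azure_blob, that amounts to:

# Release both mounts now that we are done
unmount_blob("rawdata-courses")
unmount_blob("transformeddata-courses")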

Status

We have now successfully set up a key vault, connected it to Databricks, created a notebook with functions to mount and unmount Azure blobs, and finally used them in a notebook where we read data from one blob, wrangle it with Spark Pandas and write it to another blob.

[1]: The well-known Pandas functionality, but on Spark — you can read more about it here.

[2]: There are also other ways to connect to an Azure blob, for example, here is information about the ABFS driver.

[3]: This setting is needed in order to be able to create a new column in the pandas-on-Spark DataFrame. You find more info and examples in this notebook.
