Azure Databricks for Data Noobs Part 2 — Run Notebooks as Jobs

Farooq Mahmud · Published in Analytics Vidhya · 7 min read · Jun 17, 2020

Running a Databricks notebook as a job is an easy way to operationalize all the great notebooks you have created. I think the two biggest benefits are:

  1. Jobs allow you to run notebooks on a scheduled basis.
  2. Jobs run on on-demand clusters. This means you don’t need a cluster running around the clock just to run jobs, which keeps costs in check.

In this article we will do the following:

  1. Create a notebook that downloads a dataset, transforms it, and saves the transformed data set to DBFS.
  2. Run the notebook interactively and verify the notebook works.
  3. Run the notebook as a job and verify the job works.

Keep in mind that changes may need to be made to the notebook in order for it to run as a job. The changes mostly relate to resource access.

Environment Setup

First, we need to provision our Azure Databricks workspace. As usual, I will use the Azure CLI. Install the CLI if necessary and then start a PowerShell session.

Log in to Azure

Run the following to log in and set your subscription:
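For example (the subscription name is a placeholder for your own):

az login

# Replace with your subscription name or id
az account set --subscription "<your-subscription-name-or-id>"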

Create a Resource Group

Run the following to create a resource group:
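A minimal example; the resource group name and region below are my choices, so use whatever suits you:

az group create --name databricks-demo-rg --location westus2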

Create a Databricks Workspace

Run the script below to create a Databricks workspace. This will take a few minutes to complete.
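Something along these lines; the workspace name and SKU are my assumptions, and the az databricks commands come from the Azure CLI databricks extension:

# The databricks commands live in an Azure CLI extension
az extension add --name databricks

az databricks workspace create `
  --resource-group databricks-demo-rg `
  --name databricks-demo-ws `
  --location westus2 `
  --sku standard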

Verify your resource group contains a Databricks workspace by running the script below:
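For example, by listing the types of the resources in the group (the group name matches the one created above):

az resource list --resource-group databricks-demo-rg --query "[].type"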

And getting this response:

["Microsoft.Databricks/workspaces"]

Open the Databricks Workspace

Open the Azure Portal, click the Databricks workspace resource, and launch the workspace.

At this point the environment is set up. Now let’s create a notebook!

Create the Notebook

Create a new notebook and add the code below.
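Here is a sketch of that notebook. It reads the public Azure Open Datasets census container documented by Microsoft; the output folder name (demographics) and the hardcoded state and county are assumptions you can change:

# Databricks notebook: US population-by-county data from Azure Open Datasets (public blob container)
blob_account_name = "azureopendatastorage"
blob_container_name = "censusdatacontainer"
blob_relative_path = "release/us_population_county/"

wasbs_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (
    blob_container_name, blob_account_name, blob_relative_path)

# The data set is public, so an empty SAS token is enough
spark.conf.set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name),
    "")

# Starter dataframe
df = spark.read.parquet(wasbs_path)
df.createOrReplaceTempView("census")

# Transform the data set with a SQL query (county and state are hardcoded for now)
transformed = spark.sql("""
    SELECT *
    FROM census
    WHERE stateName = 'Washington' AND countyName = 'King County'
""")

# Write the result to DBFS as parquet and list the output files
transformed.write.mode("overwrite").parquet("dbfs:/demographics")
display(dbutils.fs.ls("dbfs:/demographics"))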

The code above downloads United States census data from Azure Open Datasets, loads it into a starter dataframe, transforms the dataset via a SQL query, and writes the resulting dataset to DBFS as parquet.

You can substitute your own county and state in the query if you like. If you do, be sure to add a space followed by "County" to your county’s name.

Create the Cluster

We have no cluster to run the notebook on, so let’s create one. This is easily done within the workspace UI, but we can also use the Databricks REST API. It’s a good idea to become familiar with the REST API, especially if you ever need to do CI/CD with notebooks.

Create the Bearer Token

A bearer token is needed to authenticate with the REST API. Click the Account icon in the upper-right corner and select User Settings. Then click the Generate New Token button. Give it a name and click Generate. Save the value; we’ll need it later, and once you close the modal the token cannot be retrieved again.

Create the Cluster Using the API

Running the following cURL command creates a cluster named demo-cluster with the following specs:

  • Spark version 6.6
  • Terminate after 30 minutes of inactivity.
  • Spawn a minimum of 1 and a maximum of 2 worker nodes.
  • The driver and worker nodes run Standard_DS3_v2 instances.

Be sure to replace the Databricks workspace URI with your workspace’s URI. Also, replace the bearer token with your bearer token.
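A sketch of the request; the exact spark_version string for runtime 6.6 is an assumption (you can list valid strings via the spark-versions endpoint), and the placeholders are yours to fill in:

curl -X POST https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/create \
  -H "Authorization: Bearer <your-bearer-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_name": "demo-cluster",
        "spark_version": "6.6.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "driver_node_type_id": "Standard_DS3_v2",
        "autotermination_minutes": 30,
        "autoscale": { "min_workers": 1, "max_workers": 2 }
      }'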

The response contains the cluster id. We will need to supply this value in the next step.

Verify the Cluster is Running Using the API

Run the following cURL command and verify the state is shown as RUNNING in the output. Be sure to replace the bearer token, workspace URI, and cluster id with your values.
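For example, using the clusters/get endpoint:

curl -X GET "https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/get?cluster_id=<your-cluster-id>" \
  -H "Authorization: Bearer <your-bearer-token>"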

Run the Notebook

Attach the notebook to the cluster and run it. If all goes well you should see several parquet files listed in the output.

Where Are We?

At this point, we have a working notebook. However, there are some issues:

  1. The county and state are hardcoded. If this notebook were only run interactively, it would be nice to give the user the option of specifying the county and state at runtime. If we want to run the notebook as a job, this is a must.
  2. The parquet files are written to the DBFS on the cluster. You probably don’t want multiple users running these notebooks and adding load to that cluster. And what if the cluster is deleted? We can do better.
  3. The previous parquet files are overwritten on each run. It makes more sense to organize the files so that they can be accessed whenever they are needed.

Let’s address each issue in turn.

Issue #1: Hardcoded County and State

We can add a Databricks widget which allows a user to specify the county and state at runtime. The cool thing about widgets is that values for them can be provided when the notebook is run as a job.

Edit the notebook code to match the following:
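A sketch of the changes, assuming the same query as before; the widget names (state and county) and their empty defaults are my choices:

# Text widgets let the user (or a job) supply the state and county at runtime
dbutils.widgets.text("state", "")
dbutils.widgets.text("county", "")

state = dbutils.widgets.get("state")
county = dbutils.widgets.get("county")

# Same starter dataframe as before, now filtered by the widget values
transformed = spark.sql("""
    SELECT *
    FROM census
    WHERE stateName = '{0}' AND countyName = '{1}'
""".format(state, county))

transformed.write.mode("overwrite").parquet("dbfs:/demographics")
display(dbutils.fs.ls("dbfs:/demographics"))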

Two textbox widgets are added near the top of the notebook, their values are read with dbutils.widgets.get, and the query filters on those values instead of the hardcoded state and county.

Run the notebook; it will fail, but the two widgets will appear at the top of the notebook. Enter a state and county.

Run the notebook again and you should see a list of parquet files like before.

As we will see later, we specify the same values when setting up the notebook to run as a job.

Issue #2: Files Saved to the Cluster

To get the most use out of the data, let’s write the parquet files to Azure Blob Storage.

In the same PowerShell session you used before, create a storage account and a container named data:
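For example; the storage account name must be globally unique, so the placeholder below is yours to fill in:

$storageAccountName = "<your-unique-storage-account-name>"

# Create the storage account
az storage account create `
  --name $storageAccountName `
  --resource-group databricks-demo-rg `
  --location westus2 `
  --sku Standard_LRS

# Grab the primary key; the notebook needs it to write to the container
$storageKey = az storage account keys list `
  --account-name $storageAccountName `
  --resource-group databricks-demo-rg `
  --query "[0].value" `
  --output tsv

# Create the container that will hold the parquet files
az storage container create `
  --name data `
  --account-name $storageAccountName `
  --account-key $storageKey

# Print the key so it can be pasted into the notebook
$storageKey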

The script outputs the storage key value at the end. We will need to supply the key in the notebook when writing the parquet files to the container. Update the notebook so it matches the code below. Don’t forget to provide values for your Azure storage account name and key.
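A sketch of the updated write, keeping the demographics folder name used earlier; the account name and key placeholders are yours to fill in:

storage_account_name = "<your-storage-account-name>"
storage_account_key = "<your-storage-account-key>"
container_name = "data"

# Hand the storage key to Databricks so wasbs:// paths resolve
spark.conf.set(
    "fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name),
    storage_account_key)

# The output path; the write call itself is the same as the DBFS one
output_path = "wasbs://{0}@{1}.blob.core.windows.net/demographics".format(
    container_name, storage_account_name)

transformed.write.mode("overwrite").parquet(output_path)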

We provide the storage connection information to Databricks via the Spark configuration, build the path to save the parquet files, and write the dataframe to blob storage. Observe that the write looks just like the earlier DBFS write; only the path changes.

Run the notebook. When it completes, you should see the files in the demographics folder in the data container.

Now let’s handle the last issue.

Issue #3: Files Are Overwritten On Each Run

Multiple users will presumably be running this query with different counties. Therefore, it makes sense to organize the files by state and county. Edit the notebook code so that the files are written to data/[state]/[county] instead.
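One way to do it; the exact cleanup rules (lowercasing, stripping the " County" suffix, replacing spaces) are my assumptions:

# state, county, container_name, and storage_account_name come from the earlier cells

# Tidy the widget values so they make safe folder names
clean_state = state.strip().lower().replace(" ", "-")
clean_county = county.strip().lower().replace(" county", "").replace(" ", "-")

# Relative path of the form [state]/[county] inside the data container
relative_path = "{0}/{1}".format(clean_state, clean_county)

output_path = "wasbs://{0}@{1}.blob.core.windows.net/{2}".format(
    container_name, storage_account_name, relative_path)

transformed.write.mode("overwrite").parquet(output_path)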

The county and state names are cleaned up first, then the relative path [state]/[county] is constructed and used as the output location.

Run the notebook and you should see files organized by state and county!

Where Are We?

Now we have a notebook that is general enough that it can be turned into a job and run by multiple users using on-demand clusters.

Create the Cluster Pool

On-demand clusters are created when a job runs. As you can imagine, creating a cluster takes time which contributes to the job’s execution time. By creating a cluster pool we can have a set of clusters on stand-by. This reduces the startup time and consequently the overall job execution time.

Let’s use cURL to create a pool using the REST API. Be sure to replace the bearer token and workspace URI with your values.
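A sketch of the request against the instance-pools endpoint; the spark version string is again an assumption:

curl -X POST https://<your-workspace>.azuredatabricks.net/api/2.0/instance-pools/create \
  -H "Authorization: Bearer <your-bearer-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "instance_pool_name": "demo-pool",
        "node_type_id": "Standard_DS3_v2",
        "min_idle_instances": 2,
        "max_capacity": 4,
        "idle_instance_autotermination_minutes": 60,
        "preloaded_spark_versions": ["6.6.x-scala2.11"]
      }'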

The above request creates a pool with the following specs:

  • A minimum of two instances will always be on stand-by.
  • The pool can grow to a maximum of four instances.
  • The instances are preloaded with Spark version 6.6.
  • Idle instances above the minimum terminate after 60 minutes.
  • The instances are Standard_DS3_v2.

If all goes well, the pool id will be in the response. At this point go to the Databricks workspace UI, click Clusters, click Pools, and finally click demo-pool. After a few minutes, you should see at least two idle instances.

Create the Job

We are finally ready to create the notebook job! This time we will use the Workspace UI because the Jobs API requires a very verbose body.

1. Click Jobs.

2. Click Create Job.

3. Enter a job name.

4. Click Select Notebook. Select the notebook you want to run as a job.

5. Click the Edit link next to Parameters.

6. Add the job’s runtime parameters, i.e. the county and state.

7. Click Add.

8. Click Confirm. The parameter values are displayed on the Job page.

9. Click the Edit link next to Cluster.

10. In the Pool dropdown, select demo-pool.

11. In the Databricks Runtime Version dropdown, select Runtime: 6.6.

12. Reduce the workers to one.

13. Click Confirm to go back to the Job page.

14. Click Run Now.

After a few seconds, the job status shows as Running. A few minutes later it should report Succeeded.

Cleanup

Please remember to delete your resource group to avoid incurring additional costs:
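For example, in the same PowerShell session:

az group delete --name databricks-demo-rg --yes --no-wait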

Wrap-up

That was quite an adventure! You wrote a notebook, tweaked it so that it could be run as a job, and actually ran it as a job. Hopefully, you found this helpful. Thanks for reading!
