How to Run Databricks Notebook in Azure Data Factory / Synapse (Scala Code Samples)

Han Wang
8 min read · Dec 7, 2021


Introduction

In a previous article, How to Transform Data with Data Flows in Azure Data Factory / Synapse, I introduced Mapping data flows, a relatively new feature available in both Azure Data Factory and Synapse Analytics.

It provides a visual design experience with zero coding required: Azure Data Factory handles the Scala code translation and Spark cluster execution under the hood for you. As awesome and appealing as that is to “citizen developers”, it imposes limitations on data and AI engineers.

So what if you want to run your own code and notebooks while still taking advantage of cluster management and task orchestration?

The answer is here.

Data Source and Sink

We are going to continue with the same data source, Azure Data Lake Storage, and the same data sink, Azure SQL Database.

This is the data format that will be landing in the data lake: a CSV file in a pivoted (wide) layout, with data spread across both rows and columns.

And below is the final data format we are going to store in the database. It is normalized by un-pivoting the columns into rows.

Azure Databricks Setup

I believe most people who have worked with Spark know Databricks and its cloud platform. If not, here is a one-sentence introduction: Databricks is a cloud platform for Spark that provides automated cluster management and IPython-style notebooks.

There is currently no on-premises or container deployment option, so you can only use Databricks with one of the three major cloud providers: Azure, AWS, and GCP.

There is no upfront cost whatsoever; you pay only for the compute resources you use, at per-second granularity. This is much better than most cloud data warehousing services, which charge by the hour.

Let’s quickly provision a Databricks service on Azure and launch the workspace:

First, let's go to the Compute tab and provision our first all-purpose cluster (aka interactive cluster). The difference between the two cluster types is that all-purpose clusters can be started and terminated manually and can be shared by multiple workloads, whereas job clusters are ephemeral: they are spun up and terminated automatically for a specific workload.

If you are using Python, do not worry: every Databricks runtime comes with Python pre-installed. You can check what’s pre-installed here.

After creation, you need to start the cluster. It usually takes about 5 minutes to spin up the VMs, install the runtime software, and get everything ready for workloads.

For performance-critical workloads, we can use Pools to shorten this “cold start” period: a pool keeps a set of pre-provisioned VMs sitting idle so new clusters can grab them instantly. In our case, we do not need this.

Now that the cluster is up and running, let's create our notebook.

Databricks Notebook

Let’s create a notebook under Workspace — Shared folder and attach the cluster to this notebook:

We are going to replicate what we have done in the previous article with ADF’s Mapping data flows. It’s easier to open them side-by-side to compare the steps.

Scala is used for this notebook because we are not going to use any Python ML libraries for this task, and Scala generally runs faster than Python on Spark.

So in the first cell, we are just setting up the Spark session, declaring variables for the file location, and loading the file.

Here we are using the Databricks utility function dbutils.widgets to get the parameters that will be passed in by Azure Data Factory. During development, we just hardcode the values so the notebook can actually fetch the file.
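The first cell looks roughly like the sketch below. The storage account, container, and file name are illustrative placeholders, and the widget names must match the parameters we will define in Data Factory later:

```scala
// The SparkSession (`spark`) is already provided by the Databricks runtime,
// so there is nothing extra to set up here.

// Parameters passed in by Azure Data Factory. During development these stay
// commented out and we hardcode test values instead.
// val fileName   = dbutils.widgets.get("fileName")
// val pipelineID = dbutils.widgets.get("pipelineID")

// Hardcoded values so the notebook can actually fetch a file while testing.
val fileName   = "sample_data.csv"   // placeholder file name
val pipelineID = "manual-test"       // placeholder run identifier

// Placeholder ADLS Gen2 path: container "raw" in storage account "mydatalake".
val filePath = s"abfss://raw@mydatalake.dfs.core.windows.net/$fileName"

// Load the CSV with schema inference turned off (see the note below):
// every column comes in as a string for now.
val rawDf = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .csv(filePath)

display(rawDf)
```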

Let's run the cell, and the data in the CSV is loaded into a DataFrame.

A caveat when ingesting numbers: if they contain commas as thousands separators, as shown here, Spark won't recognize them as numbers and will import them as strings. So I disabled the inferSchema option during loading to improve performance, and we will fix the comma issue later on.

The next cell un-pivots the columns into rows:
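Here is a sketch of that cell. I'm assuming the descriptive columns are called Country and Region and every other column is a period to be un-pivoted; swap in your actual column names:

```scala
// Columns that stay as identifiers vs. the wide period columns to un-pivot.
val idCols     = Seq("Country", "Region")                 // assumed names
val periodCols = rawDf.columns.filterNot(idCols.contains(_))

// Build a stack() expression: stack(n, 'col1', `col1`, 'col2', `col2`, ...)
// turns the n wide columns into n (Period, Value) rows per input row.
val stackExpr = periodCols
  .map(c => s"'$c', `$c`")
  .mkString(s"stack(${periodCols.length}, ", ", ", ") as (Period, Value)")

val unpivotedDf = rawDf.selectExpr((idCols :+ stackExpr): _*)

display(unpivotedDf)
```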

Here is the result:

Next, let’s finalize the format to match the SQL table schema:
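Assuming the target table has Country, Region, Period, Value, and PipelineID columns (adjust to your real schema), this cell strips the thousands separators mentioned earlier, casts the values to numbers, and tags each row with the pipeline run that loaded it:

```scala
import org.apache.spark.sql.functions._

val finalDf = unpivotedDf
  // Remove the thousands separators and cast the text values to numbers.
  .withColumn("Value", regexp_replace(col("Value"), ",", "").cast("long"))
  // Keep track of which Data Factory pipeline run loaded each row.
  .withColumn("PipelineID", lit(pipelineID))
  .select("Country", "Region", "Period", "Value", "PipelineID")

display(finalDf)
```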

Here is the final result of our transformation:

Now let’s use the built-in JDBC connector from Databricks runtime to connect to our Azure SQL database:
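A minimal sketch of the write, using the generic JDBC data source with the Microsoft SQL Server driver that ships in the Databricks runtime. The server, database, table, and secret scope names are placeholders; for a quick test you could hardcode the credentials, but Databricks secrets are the safer habit:

```scala
// Placeholder Azure SQL connection details.
val jdbcUrl = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

val connectionProps = new java.util.Properties()
connectionProps.setProperty("user",     dbutils.secrets.get("my-scope", "sql-user"))
connectionProps.setProperty("password", dbutils.secrets.get("my-scope", "sql-password"))
connectionProps.setProperty("driver",   "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Append so each triggered run adds its rows to the end of the target table.
finalDf.write
  .mode("append")
  .jdbc(jdbcUrl, "dbo.MyTargetTable", connectionProps)
```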

Let's log into our database to check the result. As you can see, the new data is appended to the end of the table:

The testing looks good. Let’s uncomment the dbutils.widgets commands and comment out the hardcoded lines. Now we are ready to integrate this notebook into Azure data factory.
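After the swap, the parameter cell looks something like this:

```scala
// Production version: the widget lookups are active and the hardcoded
// test values are commented out.
val fileName   = dbutils.widgets.get("fileName")
val pipelineID = dbutils.widgets.get("pipelineID")
// val fileName   = "sample_data.csv"
// val pipelineID = "manual-test"
```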

Don't forget to terminate the cluster. Although it will automatically shut down after 60 minutes of inactivity, we still want to save as much as possible :)

Data Factory Setup

Before leaving the Databricks workspace, let’s go to Settings — User Settings and create an access token for data factory:

If you have not provisioned an Azure data factory, please follow this article How to Transform Data with Data Flows in Azure Data Factory / Synapse to do so.

Now let's log into the Data Factory studio. Similar to the data lake and SQL database, we need to set up a linked service for Azure Databricks. Go to the Manage — Connections — Linked services tab:

After putting in the Databricks connection details, let's select a new job cluster for the workload. The reason why we use job clusters rather than all-purpose / interactive clusters can be found in the Databricks best practices:

All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. Job clusters terminate when your job ends, reducing resource usage and cost.

Data Factory Pipeline

Let's create a new pipeline and drag the Notebook task (under Databricks) onto the pipeline canvas. Also create a parameter to receive the file name from the trigger we set up in the previous article.

Select the Notebook task and switch to the Azure Databricks tab to link it to the linked service we just configured:

Go to the Settings tab and select the notebook path for this task. We also need to create the 2 parameters that will be passed into our notebook. Double-check the names to avoid misspelling; they are case sensitive!

For the file name, assign the pipeline parameter as the value for the task parameter:

For the pipelineID, use one of the system variables:

Before publishing the pipeline, let's add a new trigger, as the previous one will only be fired by Excel files with the xlsx extension. Go to the Manage — Author — Triggers tab and create a new trigger:

Jump back to the pipeline and link the trigger:

Tell the trigger what value will be passed from the storage event to the pipeline. We use a built-in property from the event: @triggerBody().fileName:

Finally we can publish the pipeline to Azure data factory.

Monitor

Let's drop a new file into the data source location:

Go to Monitor — Runs — Trigger runs tab and there’s a triggered run:

Click the number 1 under Pipelines and it will take us to the pipeline run that's actually executing. Click the little glasses icon and it will pop up a page URL for the Databricks job.

The page URL takes us to the Jobs tab inside the Databricks workspace. While the cluster is being provisioned, it shows that the command has been submitted to the cluster for execution.

Once the cluster is up and running, it will execute each cell and show the result right below it, just like what we had in the notebook:

Now let’s log into Azure SQL and see the results:

Thanks for reading and if you have any questions, feel free to reach out to me!
