Best practices for Databricks notebook modularization

Yuheng
Aug 25, 2021


Background

When using Databricks, it is quite common for notebooks to call each other. In other development environments, code can be modularized into files and imported by other files; how do we achieve the same with notebooks in Databricks? In a recent project, we collaborated with a customer to build a data solution that consolidates data from multiple on-premises databases into a data platform on Azure. The customer uses Azure Databricks as the data transformation tool, with the business logic implemented in notebooks. We created some utility notebooks for other notebooks to reuse, and in this blog we summarize the best practices and lessons learned about notebook modularization for your reference.

Challenges and Objectives

To generalize the problem, we created a mock notebook named notebookA (shown below) with two widgets named “foo” and “foo2”; the notebook exits with the two widgets’ values concatenated:

# notebookA
dbutils.widgets.text("foo", "fooDefault", "fooEmptyLabel")
dbutils.widgets.text("foo2", "foo2Default", "foo2EmptyLabel")

result = dbutils.widgets.get("foo") + "-" + dbutils.widgets.get("foo2")

def display():
    print("Function Display: " + result)

dbutils.notebook.exit(result)
Figure 1 snapshot of notebookA running result

We will create another notebook named notebookB, which calls notebookA, passes two values via the widgets, and then prints the output of notebookA. Here is a brief diagram:

Figure 2 Notebooks reference diagram

Solution

There are two methods to run a Databricks notebook from another notebook: the %run command and dbutils.notebook.run().

1. Method #1 “%run” Command

Usage:

%run <notebook> $parameter_name1="value" $parameter_name2="value"

# Example 1: notebookB
%run ./notebookA $foo="Microsoft" $foo2="Azure"

You can run a notebook from another notebook by using the %run <notebook> magic command. All variables defined in <notebook> become available in your current notebook.

“%run must be in a cell by itself, because it runs the entire notebook inline.” — Databricks documentation

The example above also shows how to pass parameters: when there is more than one parameter, separate them with whitespace.

notebookB is in the same folder as notebookA. Below is a snapshot of the result:

Figure 3 snapshot of notebookB method#1 running result

Furthermore, if you want to use a function defined in notebookA, you can call it directly by name, as shown below:
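For reference, here is a minimal sketch of what such cells might contain (the figure below shows the actual output); display is the function defined in notebookA above:

# Cell 1: import notebookA inline
%run ./notebookA $foo="Microsoft" $foo2="Azure"

# Cell 2: call the function defined in notebookA
display()   # prints "Function Display: Microsoft-Azure"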

Figure 4 snapshot of notebookB method#1 function reference running result

2. Method #2 “dbutils.notebook.run()”

Usage:

run(path: String, timeout_seconds: int, arguments: Map): String

# Example 2: notebookB1
dbutils.notebook.run("notebookA", 60, {"foo": "Microsoft", "foo2": "Azure"})

Reference: https://docs.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#api

In this way, we can run a notebook and retrieve its exit value. The method starts an ephemeral job that runs immediately: a new instance of the executed notebook is created, and its computations run in its own scope, completely separate from the main notebook. This means that no functions or variables defined in the executed notebook can be reached from the main notebook.
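Since dbutils.notebook.run() returns the string passed to dbutils.notebook.exit() in the called notebook, you can capture it directly. A minimal sketch (the variable name returned is ours):

# Capture the exit value that notebookA passes to dbutils.notebook.exit()
returned = dbutils.notebook.run("notebookA", 60, {"foo": "Microsoft", "foo2": "Azure"})
print(returned)   # prints "Microsoft-Azure"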

Here we create notebookB1 with the code from Example 2; it is in the same folder as notebookA. Below is a snapshot of the result:

Figure 5 snapshot of notebookB1 method#2 running result
Figure 6a snapshot of notebookB method#2 running result

Method #1 or Method #2?

The key difference between the two approaches is whether the called notebook runs in the same Spark job as the caller or in a separate one.

If you want to import an entire notebook so you can use its variables and functions, choose the first method, the %run command. In addition, the variables and functions of the current notebook are accessible from the executed notebook. However, this can also unintentionally overwrite variables and functions.

If you want to run the notebook in a separate environment without affecting the current notebook, the second method is the better choice. It will not overwrite your variables and functions, but it also cannot access the current notebook’s variables and functions unless you pass them explicitly as arguments.

To better illustrate this, here is an example of the difference between method #1 and method #2: if notebookB defines a variable “foo” before calling notebookA, the result will differ depending on which method is used:

1) Method#1

In this case, notebookB includes the entire notebookA inline, so the value of the variable “foo” in notebookB is overwritten by notebookA and becomes “Microsoft”.
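A minimal sketch of this scenario, following the behavior described above (the initial value "fooB" is our own placeholder):

# notebookB, Cell 1: define foo before calling notebookA
foo = "fooB"

# Cell 2: %run executes notebookA inline, in the same scope
%run ./notebookA $foo="Microsoft" $foo2="Azure"

# Cell 3
print(foo)   # prints "Microsoft": foo was overwritten by notebookA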

2) Method#2

In this case, notebookB runs notebookA in a separate job, so the value of the variable “foo” in notebookB is not overwritten by notebookA and keeps its original value.
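The same scenario with method #2, again as a sketch with our placeholder value:

# notebookB: define foo before calling notebookA
foo = "fooB"

# notebookA runs in its own ephemeral job, in a separate scope
dbutils.notebook.run("notebookA", 60, {"foo": "Microsoft", "foo2": "Azure"})

print(foo)   # still prints "fooB"; the caller's scope is untouched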

Summary

Here is a summary of the differences between the two methods:

%run:
- Runs the called notebook inline, in the same Spark job and scope as the caller.
- Variables and functions are shared in both directions, with the risk of unintentional overwrites.
- Parameters are passed as $parameter_name="value" pairs.
- Does not return the called notebook’s exit value.

dbutils.notebook.run():
- Starts an ephemeral job; the called notebook runs in its own scope.
- Variables and functions are isolated between the two notebooks.
- Parameters are passed explicitly as an arguments map, and a timeout can be set.
- Returns the called notebook’s exit value as a string.
