How to Execute a DataBricks Notebook From Another Notebook

What are the ways of executing a notebook from another notebook in DataBricks? And what pros and cons do these approaches have?

Nikola Valesova
DataSentics
5 min readMay 19, 2020

--

Photo by Christopher Gower on Unsplash

When I was learning to code in DataBricks, it was completely different from what I had worked with so far. To me, as a former back-end developer who had always run code only on a local machine, the environment felt significantly different. I used to divide my code into multiple modules and then simply import them or the functions and classes implemented in them. But in DataBricks, as we have notebooks instead of modules, the classical import doesn’t work anymore (at least not yet). But does that mean you cannot split your code into multiple source files? Definitely not! In this post, I’ll show you two ways of executing a notebook within another notebook in DataBricks and elaborate on the pros and cons of each method.

Method #1: %run command

The first and the most straight-forward way of executing another notebook is by using the %run command. Executing %run [notebook] extracts the entire content of the specified notebook, pastes it in the place of this %run command and executes it. The specified notebook is executed in the scope of the main notebook, which means that all variables already defined in the main notebook prior to the execution of the second notebook can be accessed in the second notebook. And, vice-versa, all functions and variables defined in the executed notebook can be then used in the current notebook.

This approach allows you to concatenate various notebooks easily. On the other hand, there is no explicit way of how to pass parameters to the second notebook, however, you can use variables already declared in the main notebook.

Note that %run must be written in a separate cell, otherwise you won’t be able to execute it.

Here is an example of executing a notebook called Feature_engineering, which is located in the same folder as the current notebook:

Example usage of the %run command

In this example, you can see the only possibility of “passing a parameter” to the Feature_engineering notebook, which was able to access the vocabulary_size variable defined in the current notebook. Note also how the Feature_engineering notebook outputs are displayed directly under the command.

Method #2: Dbutils.notebook.run command

The other and more complex approach consists of executing the dbutils.notebook.run command. In this case, a new instance of the executed notebook is created and the computations are done within it, in its own scope, and completely aside from the main notebook. This means that no functions and variables you define in the executed notebook can be reached from the main notebook. On the other hand, this might be a plus if you don’t want functions and variables to get unintentionally overridden.

The benefit of this way is that you can directly pass parameter values to the executed notebook and also create alternate workflows according to the exit value returned once the notebook execution finishes. This comes in handy when creating more complex solutions.

The dbutils.notebook.run command accepts three parameters:

  • path: relative path to the executed notebook
  • timeout (in seconds): kill the notebook in case the execution time exceeds the given timeout
  • arguments: a dictionary of arguments that is passed to the executed notebook, must be implemented as widgets in the executed notebook

Here is an example of executing a notebook called Feature_engineering with the timeout of 1 hour (3,600 seconds) and passing one argument — vocabulary_size representing vocabulary size, which will be used for the CountVectorizer model:

Example usage of the dbutils.notebook.run command

As you can see, under the command appeared a link to the newly created instance of the Feature_engineering notebook. If you click through it, you’ll see each command together with its corresponding output.

Pros and cons

I personally prefer to use the %run command for notebooks that contain only function and variable definitions. In this case, the %run command itself takes little time to process and you can then call any function or use any variable defined in it. This seems similar to importing modules as we know it from classical programming on a local machine, with the only difference being that we cannot “import” only specified functions from the executed notebook but the entire content of the notebook is always imported.

The drawback of the %run command is that you can’t go through the progress of the executed notebook, the individual commands with their corresponding outputs. All you can see is a stream of outputs of all commands, one by one. I find it difficult and inconvenient to debug such code in case of an error and, therefore, I prefer to execute these more complex notebooks by using the dbutils.notebook.run approach.

In DataSentics, some projects are decomposed into multiple notebooks containing individual parts of the solution (such as data preprocessing, feature engineering, model training) and one main notebook, which executes all the others sequentially using the dbutils.notebook.run command.

Summary

Both approaches have their specific advantages and drawbacks. The best practice is to get familiar with both of them, try them out on a few examples and then use the one which is more appropriate in the individual case.

Summary of pros and cons of the individual approaches

Keep in mind that chaining notebooks by the execution of one notebook from another might not always be the best solution to a problem — the more production and large the solution is, the more complications it could cause. On the other hand, both listed notebook chaining methods are great for their ease of use and, even in production, there is sometimes a reason to use them. In larger and more complex solutions, it’s better to use advanced methods, such as creating a library, using BricksFlow, or orchestration in Data Factory.

Thank you for reading up to this point. If you have any further questions or suggestions, feel free to leave a response. Also, if you have a topic in mind that you would like us to cover in future posts, let us know.

--

--

Nikola Valesova
DataSentics

Data scientist, machine learning enthusiast, diversity & inclusion devotee