Automating Jupyter notebooks with Papermill

Gleb Ivashkevich
Yandex School of Data Science
5 min read · Nov 4, 2019
Photo by Daniele Levis Pelusi on Unsplash

Jupyter notebooks are a great way to explore data, test hypotheses, collaborate and report findings. Using a Jupyter notebook, you can have code, images, and nicely formatted descriptions all in one place, greatly simplifying exploratory analysis.

However, when it comes to launching actual long-running computations, Jupyter notebooks start to become tedious.

First of all, notebooks are rarely executed in order. You jump back and forth through a notebook, changing some code here and there until you reach the desired outcome.

Second, Jupyter notebooks, even when cleaned up for in-order execution, cannot be launched automatically. Instead, you need to open a notebook, launch it manually, and wait for results. If you want to change some parameters inside a notebook, you’ll have to do that manually as well.

All of this makes large-scale execution of multiple notebooks with configurable parameters barely feasible. Actual computations are often done with Python scripts, which are basically stripped-down versions of experimental notebooks.

Keeping a nearly-duplicate codebase in two separate places is inconvenient. Changes in a notebook must propagate to the corresponding script, and it’s often hard to keep them in sync, since Jupyter notebooks are notoriously hard to diff: they are just JSON files.

What if you could launch a computation using just the notebook itself, with all the configurability you need? Papermill helps to achieve exactly that. By inspecting a notebook and modifying it as needed, Papermill lets you run configurable computations, all from the command line, with all the automation that shell scripting provides.

Preparing notebooks for automation

At its core, Papermill is a tool to inject configuration into a notebook, launch it, and collect the results. However, the notebook must be prepared accordingly, since Papermill executes notebooks in order. Exploratory notebooks may not be ready for that, as they are often created out of order.

Papermill has interfaces for both the command line and Python. We will cover command-line use here, although it may be beneficial to use the Python interface to build larger and more elaborate notebook-based pipelines.

Papermill is largely language agnostic, in the sense that it can execute notebooks in any language, provided that a Jupyter kernel is available for that language.

Installing Papermill

Installation is straightforward:
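```bash
# install Papermill from PyPI (it is also available on conda-forge)
pip install papermill
```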

You should now be able to launch Papermill:
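```bash
# a quick check that the CLI is available
papermill --help
```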

Example notebook

In the DVC blog post, we created a Python script to create new features. Let’s reimplement it in a notebook. For reference, the original Python implementation looked roughly like the sketch below:
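(A sketch only: the file name features.py, the make_features logic, and the CSV paths are placeholders, not the exact code from the DVC post.)

```python
# features.py -- a sketch of a featurization script with an argparse CLI;
# the feature logic and file names are placeholders
import argparse

import numpy as np
import pandas as pd


def make_features(df):
    # placeholder feature engineering: log-transform every numeric column
    for col in df.select_dtypes(include="number").columns:
        df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))
    return df


def main():
    parser = argparse.ArgumentParser(description="Create new features.")
    parser.add_argument("--input", required=True, help="input CSV file")
    parser.add_argument("--output", required=True, help="output CSV file")
    args = parser.parse_args()

    df = pd.read_csv(args.input)
    df = make_features(df)
    df.to_csv(args.output, index=False)


if __name__ == "__main__":
    main()
```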

As you can see, to run this as a regular Python script, we need a CLI, created with argparse (or any other Python tool for building CLIs).

The Jupyter version is a bit simpler, as we do not need a CLI:
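(Again a sketch: the cell contents and the input_path/output_path names are assumptions, mirroring the script above. The first cell holds nothing but the configurable values; this is the cell we will tag for Papermill in a moment.)

```python
# Cell 1: parameters only (this is the cell that gets tagged later)
input_path = "data/train.csv"
output_path = "data/features.csv"
```

```python
# Cell 2: the same featurization logic, no CLI needed
import numpy as np
import pandas as pd

df = pd.read_csv(input_path)
for col in df.select_dtypes(include="number").columns:
    df[f"{col}_log"] = np.log1p(df[col].clip(lower=0))
df.to_csv(output_path, index=False)
```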

This notebook can be executed manually, of course. But what if someone needs to change the location of the input or output files? In larger notebooks, it may be even more tedious, as there may be many more parameters to change. Nor is a manual launch convenient if you want to try many different parameter values.

Papermill solves the problem in a simple and efficient way. Jupyter notebooks allow users to tag cells. Rarely used on its own, this feature makes automation simple: Papermill expects you to mark the cell that contains the parameters with the parameters tag.

Tags are displayed in the cell toolbar, which you can enable with View > Cell Toolbar > Tags:

The notebook with a tagged cell.

Once a cell is tagged parameters, Papermill can inject a new cell right after it to override the parameters as needed. The resulting notebook with the injected cell is stored in a specified location for further inspection if needed.

Before and after

To launch the notebook with Papermill, we first place the data files in the same directory layout as in the DVC post:
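(The exact layout depends on the DVC post; as a sketch, assuming the notebook reads from a data/ directory, the paths below are placeholders.)

```bash
# put the input data where the notebook expects it (placeholder paths)
mkdir -p data
cp /path/to/dvc-project/data/train.csv data/
```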

We can now launch the notebook with Papermill:
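(A sketch of the invocation, using the hypothetical notebook and parameter names from above; -p, or --parameters, passes a single name/value pair.)

```bash
papermill Features.ipynb Features-run.ipynb \
    -p input_path data/train.csv \
    -p output_path data/features.csv
```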

When running this command, Papermill injects a cell tagged injected-parameters immediately after our original parameters cell:
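The injected cell contains plain assignments of the values passed on the command line, roughly like this (the values shown here match the sketch invocation above):

```python
# Parameters (cell injected by Papermill, tagged "injected-parameters")
input_path = "data/train.csv"
output_path = "data/features.csv"
```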

After that, the notebook, now named Features-run.ipynb, is executed from top to bottom. You can easily parametrize the name of the resulting notebook using the usual shell tools:
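For example, stamping the output notebook with the current date and time:

```bash
papermill Features.ipynb "Features-run-$(date +%Y%m%d-%H%M%S).ipynb" \
    -p input_path data/train.csv \
    -p output_path data/features.csv
```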

To launch a notebook for various parameter values, you can use shell for loops or GNU Parallel.
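A sketch of a simple sweep over several (hypothetical) input files, one Papermill run per file:

```bash
# GNU Parallel can replace this loop to run the notebooks concurrently
for f in data/train_a.csv data/train_b.csv; do
    name=$(basename "$f" .csv)
    papermill Features.ipynb "Features-run-${name}.ipynb" \
        -p input_path "$f" \
        -p output_path "data/features_${name}.csv"
done
```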

Things to keep an eye on

As simple as it looks, Papermill has two important behaviors to keep in mind:

  • any notebook is executed in order, from top to bottom; remember that your notebook must be cleaned up before it is handed to Papermill, as the order you used during exploration may lead to different results than a strict top-to-bottom run,
  • Papermill injects new parameter values immediately after the first cell tagged parameters; this may lead to conflicts if you define parameters across several cells. For example, derived values may not be updated as needed if they sit in the same cell as the parameters, or parameters may be overwritten with defaults if they are spread across several cells.
Parameter cells misarranged.

The rules of thumb are the following:

  • organize your notebooks so that the entire computation can be configured with a (preferably) small set of parameters,
  • always keep all the parameters you want to configure in a single cell,
  • keep all derived values separate from the parameters cell, so that Papermill-injected parameters propagate appropriately (see the sketch below).
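For example (a sketch, reusing the hypothetical parameter names from above): the tagged cell holds only raw values, and anything computed from them lives in a later, untagged cell, so it is evaluated after the injected-parameters cell.

```python
# Cell tagged "parameters": raw, configurable values only
input_path = "data/train.csv"
output_path = "data/features.csv"
```

```python
# A separate, untagged cell: derived values, computed after the
# injected-parameters cell so they pick up the injected values
from pathlib import Path

output_dir = Path(output_path).parent
output_dir.mkdir(parents=True, exist_ok=True)
```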

Conclusions

Papermill adds new options for how we construct machine learning pipelines. With Papermill, we can significantly reduce the time needed to put new ML developments into real-life use. And since Papermill plays nicely with other tools like DVC, there is no need to change existing experimentation processes much.
