Running notebook pipelines in JupyterLab

In Creating notebook pipelines using Elyra and Kubeflow Pipelines, I introduced Elyra’s Notebook Pipelines visual editor and outlined how you can assemble a machine learning workflow pipeline from a set of notebooks and run it on Kubeflow Pipelines. With the release of Elyra v1.1 you can now run the same notebook pipelines in your local JupyterLab environment.

March 2021 update: Elyra v2.1 introduced support for Apache Airflow. You can now run pipelines locally in JupyterLab or remotely on Kubeflow Pipelines and Apache Airflow.

October 2021 update: You can now use custom components in your pipelines.

Pipelines run locally in JupyterLab, or remotely on Kubeflow Pipelines and Apache Airflow.

Even though running notebook pipelines in a local (likely resource-constrained) environment has its drawbacks and limitations, it can still be a viable solution during development or if you don’t have access to a Kubeflow deployment at all.

If you haven’t read the previous blog post — no worries — all you need to know to get started is covered in this article. If you have read that blog post, there are just four things you need to know, and you can likely skip the rest of this post: (1) choose the new Run in-place locally runtime configuration, (2) notebooks are executed within JupyterLab, (3) runtime logs are displayed in the terminal window, and (4) output artifacts are accessed using the File Browser and are not stored on cloud storage.

To assemble a machine learning workflow pipeline, you

  • drag the desired notebooks from the JupyterLab File Browser onto the Elyra pipeline editor canvas,
  • configure the runtime properties for each notebook, and
  • connect the notebook nodes as desired to define relationships that govern the order in which the notebooks are executed.
Creating a notebook pipeline using Elyra’s pipeline editor

We’ve published a short tutorial in the Elyra examples GitHub repository that walks you through the steps using the example pipeline shown above.

Configuring notebook nodes

Each node in the pipeline represents a notebook execution. The node’s properties define the runtime configuration for the notebook.

During local execution, each notebook is executed in a subprocess of JupyterLab. However, when you configure a notebook node you must select a runtime image (Docker image) that will be used should you decide to run the pipeline in a remote environment, such as Kubeflow Pipelines.

A notebook might depend on a set of local input files, such as Python scripts, configuration files or data files. You should declare those files as input file dependencies to make the pipeline runnable in remote environments as well.
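For instance, if a notebook reads a local configuration file, that file must be declared as an input file dependency; otherwise it won’t be present when the node runs in a remote container. A minimal sketch (the file name is hypothetical):

import json

# config.json lives next to the notebook; declaring it as an input file
# dependency ensures it is also present in a remote container.
with open("config.json") as f:
    config = json.load(f)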

A notebook might also read the values of environment variables, which you can define in the node configuration as necessary.
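For example, a notebook cell might read such a variable and fail fast if it is missing (DATASET_URL is the variable used by the load_data example below):

import os

# Defined in the node configuration; empty or missing means misconfiguration.
dataset_url = os.environ.get("DATASET_URL")
if not dataset_url:
    raise RuntimeError("DATASET_URL is not set; define it in the node configuration")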

Last but not least, a notebook might produce output files, such as trained model files. You should declare those files as output files to make the pipeline runnable in remote environments.

Let’s illustrate this for the load_data notebook from the flow above, which has no input file dependencies, requires the environment variable DATASET_URL to be defined, and produces a data file named data/noaa-weather-data-jfk-airport/jfk_weather.csv as output. The following node configuration works just fine during local execution, but would lead to a failure during pipeline execution in a remote environment when the Part 1 — Data Cleaning notebook is run. (Spoiler alert: execution of that node would fail because an input file is not found. You’ll see why in a minute.)

This node configuration for the load_data notebook declares the execution environment and an environment variable, but no explicit input file dependencies or output files.
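For context, here is a minimal sketch of what a load_data-style notebook might do to produce that file. It assumes DATASET_URL points directly at a CSV file; the tutorial’s actual notebook may differ:

import os
import urllib.request
from pathlib import Path

dataset_url = os.environ["DATASET_URL"]

# Write to the path that downstream nodes expect. Locally the file lands in
# the shared file system; remotely it must also be declared as an output file.
target = Path("data/noaa-weather-data-jfk-airport/jfk_weather.csv")
target.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(dataset_url, target)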

Running a notebook pipeline locally

To run a pipeline from the pipeline editor, click the ▷ (run) button and select the Run in-place locally runtime configuration. (This configuration is only available in Elyra 1.1 and later.)

Choose the “Run in-place locally” runtime configuration to run a notebook pipeline locally in JupyterLab

Each notebook in the pipeline is executed using the kernel that’s specified in the notebook’s kernel spec.
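If you are unsure which kernel that is, you can inspect the notebook’s metadata, for example with nbformat (the file name below is illustrative):

import nbformat

# Read the notebook and print the kernel it declares in its kernel spec.
nb = nbformat.read("load_data.ipynb", as_version=4)
print(nb.metadata.get("kernelspec", {}).get("name"))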

Monitoring a notebook pipeline run

You can monitor the pipeline run progress in the terminal window where JupyterLab is running:

You’ll receive a notification after the pipeline run has completed.

Accessing the notebook pipeline output artifacts

During local execution the output cells of the source notebooks are updated, and all generated artifacts are stored in the local file system. You can therefore access them using the JupyterLab File Browser.

Notebooks are updated in-place and all output artifacts can be accessed from the File Browser

Things to consider if you want to run a pipeline locally and remotely

When you run a notebook pipeline locally, notebooks are executed in a single environment, and input and output artifacts are stored in a shared file system. Therefore, if one notebook produces an output artifact that a subsequent notebook requires, it is readily accessible.

For local execution input and output artifacts are stored in a shared local file system and therefore always accessible to each notebook.
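To make this concrete, here is a sketch of two notebooks sharing an artifact through the local file system; the cleaned-file name is illustrative:

import pandas as pd

# In the producing notebook (e.g. the data-cleaning node):
df = pd.read_csv("data/noaa-weather-data-jfk-airport/jfk_weather.csv")
df.to_csv("data/jfk_weather_cleaned.csv", index=False)

# In a consuming notebook later in the pipeline, the file is simply there:
cleaned = pd.read_csv("data/jfk_weather_cleaned.csv")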

When you are running a notebook pipeline remotely, for example on Kubeflow Pipelines, each notebook is processed in an isolated environment — a Docker container. Each environment has access to a shared cloud storage bucket, from which input artifacts are downloaded and to which output artifacts are saved. The input file dependencies and output file declarations in the node configurations are used to determine which files need to be “imported” from the cloud storage bucket before the notebook is executed and “exported” to cloud storage after notebook execution has completed.

During remote execution input and output artifacts are imported to and exported from cloud storage to allow for sharing.
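Conceptually, the per-node lifecycle during remote execution looks something like the sketch below. This is not Elyra’s actual implementation, and all names are hypothetical; it only illustrates the import/export idea:

def run_node(node, bucket):
    # "Import" declared input file dependencies from the cloud storage bucket.
    for path in node.input_dependencies:
        bucket.download(path)

    # Execute the notebook inside its isolated container.
    execute_notebook(node.notebook)  # hypothetical helper

    # "Export" declared output files so downstream nodes can import them.
    for path in node.output_files:
        bucket.upload(path)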

To make a pipeline runnable both locally and in remote environments, you therefore have to declare input file dependencies and output files in your node configurations, as shown in the example below for the load_data notebook.

This node configuration for the load_data notebook declares the execution environment, an environment variable, no explicit input file dependencies and one output file.

The above configuration yields compatible execution results whether the pipeline runs locally or remotely, because the same input and output files are available in both environments.

To learn more about executing pipelines on Kubeflow Pipelines refer to the Creating notebook pipelines using Elyra and Kubeflow Pipelines blog post.

Running your own pipelines

With the built-in support for local pipeline execution you can now run sets of notebooks within JupyterLab. If you’d like to try this out, using the tutorial or your own set of notebooks, you have several options.

Try JupyterLab and Elyra using the pre-built Docker image

The Elyra community publishes ready-to-use Docker images on Docker Hub, which have JupyterLab v2 and the Elyra extension pre-installed. The latest-tagged image is built from the most recently published version. Docker images pinned to specific versions, such as 1.5.2, are published as well. To get started, run

$ docker run -it -p 8888:8888 elyra/elyra:latest jupyter lab --debug

and open the displayed URL (e.g. http://127.0.0.1:8888/?token=...) in your web browser.

If you already have notebooks stored on your local file system, you should mount the desired directory (e.g. /local/dir/) to make them available:

$ docker run -it -p 8888:8888 -v /local/dir/:/home/jovyan/work -w /home/jovyan/work elyra/elyra:latest jupyter lab --debug

Try JupyterLab and Elyra on Binder

If you don’t have Docker installed or don’t want to download the image (e.g. because of bandwidth constraints), you can try JupyterLab and Elyra in your web browser without having to install anything, thanks to mybinder.org. To do so, open the following URL and click the Launch Binder button.

Install JupyterLab and Elyra

If your local environment meets the prerequisites, you can run JupyterLab and Elyra natively on your own machine by following the installation instructions.

How to get involved

The Elyra extension for JupyterLab is maintained by a small group of open source developers and data scientists. We welcome all kinds of contributions, whether it’s feedback, a bug report, improvements to the documentation, or code. Learn more in the Getting Help documentation topic.

Thanks for reading!
