Creating notebook pipelines using Elyra and Kubeflow Pipelines


As a data scientist, you likely use Jupyter notebooks extensively to perform machine learning workflow tasks such as data exploration, data processing, and model training, evaluation, and tuning. Many of these tasks are performed repeatedly, requiring you to run the notebooks again and again.

The Elyra (https://github.com/elyra-ai/elyra) JupyterLab extension recently introduced the notebook pipelines visual editor, which you can use to create and run pipelines without any coding.

Create a notebook pipeline from a set of notebooks without any coding using Elyra

In version 1.0 Elyra utilizes Kubeflow Pipelines, a popular platform for building and deploying machine learning workflows on Kubernetes, to run the pipelines. Version 1.1 adds support for local pipeline execution, which is convenient if you are just starting to develop your pipeline or process relatively small amounts of data.

March 2021 update: Elyra v2.1 introduced support for Apache Airflow. You can now run pipelines locally in JupyterLab or remotely on Kubeflow Pipelines and Apache Airflow.

Pipelines run locally in JupyterLab, or remotely on Kubeflow Pipelines and Apache Airflow

August 2021 update: Elyra introduced experimental support for custom components. You can now add custom Kubeflow Pipelines components or Apache Airflow operators to pipelines.

Running notebooks as a pipeline

Using Elyra’s visual editor, you assemble a pipeline from multiple notebooks and run it on Kubeflow Pipelines in any Kubernetes environment.

To assemble a pipeline, you drag notebooks from the JupyterLab file browser onto the pipeline editor canvas and connect them as desired. Comment nodes can be added to provide lightweight documentation.

Creating a notebook pipeline in Elyra’s pipeline editor; the pipeline downloads a data set, cleanses the data, analyzes the data, and runs predictions.

Notebooks can be arranged to execute sequentially or in parallel. The pipeline depicted above, from https://github.com/elyra-ai/examples, comprises four notebook nodes. The “load_data” node is executed first and has three downstream nodes. The “Part 1 - Data Cleaning” node is executed after its upstream node “load_data” has completed successfully, and the “Part 2 - …” and “Part 3 - …” nodes are executed in parallel after their upstream node “Part 1 - …” has finished.

The following video by Romeo Kienzler illustrates how to analyze COVID-19 time-series data using a notebook pipeline.

Analyzing COVID-19 time-series data using notebook pipelines

Configuring notebook nodes

A notebook node is implemented in Elyra as a Kubeflow Pipelines component (source repository) that uses papermill to run the notebook in a Docker container. The Docker containers do not share resources aside from an S3-compatible cloud storage bucket, which is used to store input and output artifacts.
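Under the covers, executing a single node boils down to a papermill invocation inside the container. As a minimal sketch (the notebook and output file names here are purely illustrative, not what the component actually uses), it looks something like this, where --log-output streams cell output to the console so it shows up in the container log:

$ papermill load_data.ipynb load_data-output.ipynb --log-output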

You configure a notebook node in the Elyra pipeline editor by providing the following information:

  • The name and location of the Jupyter notebook.
  • The name of the Docker image that will be used to run the notebook. You can bring your own image or choose from predefined public images, such as a Pandas image, a TensorFlow image (with or without GPU support), or a PyTorch image with the CUDA libraries pre-installed. Any Docker image can be used to run a notebook, as long as curl and Python 3 are preinstalled to allow for automatic scaffolding when the notebook component executes (a quick way to verify this is shown below).
  • If your notebook requires access to files that are co-located with the notebook when you assemble a pipeline, such as a set of custom Python scripts, you can declare file dependencies that will be packaged together with the notebook in a compressed archive that is uploaded to the pre-configured cloud storage bucket.
  • You can define environment variables to parametrize the notebook runtime environment.
  • If your notebook produces output files that downstream notebooks consume or that you want to access after the notebook has been processed, specify their names. The output files are uploaded to cloud storage after notebook processing has completed.
Sample configuration for the “load_data” notebook. The notebook runs in a Pandas Docker image, requires the DATASET_URL environment variable and produces a CSV output file in the specified location.
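If you plan to bring your own image, a quick way to verify that it satisfies the curl and Python 3 requirement is to run both commands in a throwaway container (the image name below is just a placeholder for your own image):

$ docker run --rm my-custom-image:latest sh -c "curl --version && python3 --version"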

Before you can run a notebook pipeline, you have to configure a Kubeflow Pipelines runtime.

Configuring a Kubeflow Pipelines runtime

A runtime configuration defines connectivity information for the Kubeflow Pipelines service instance and S3-compatible cloud storage that Elyra uses to process notebook pipelines:

  • The Kubeflow Pipelines API endpoint, e.g. http://kubernetes-service.ibm.com/pipeline. With the release of version 1.1, Elyra can also utilize multi-user, auth-enabled Kubeflow instances.
  • The cloud storage endpoint, e.g. http://minio-service.kubeflow:9000.
  • The cloud storage user id, password, and bucket name.

You can manage runtime configurations using the JupyterLab GUI and the elyra-metadata CLI.

Managing runtime configurations using the GUI

Elyra adds the Runtimes panel to the JupyterLab sidebar, which you can use to add, edit, and remove runtime configurations.

Access pipeline runtime configurations from the pipeline editor or the sidebar

Managing runtime configurations using the CLI

The elyra-metadata CLI can also be used to manage runtime configurations, for example to automate administrative tasks. The examples below illustrate how to add a runtime configuration named kfp_dev_instance, how to list runtime configurations, and how to delete them.

$ elyra-metadata install runtimes --schema_name=kfp  \
--name=kfp_dev_instance \
--display_name="KFP dev instance" \
--api_endpoint=http://.../pipeline \
--cos_endpoint=http://... \
--cos_username=... \
--cos_password=... \
--cos_bucket=...
$ elyra-metadata list runtimes
Available metadata instances for runtimes (includes invalid):

Schema   Instance           Resource
------   --------           --------
kfp      kfp_dev_instance   /Users/.../kfp_dev_instance.json

$ elyra-metadata list runtimes --json
[
  {
    "name": "kfp_dev_instance",
    "display_name": "KFP dev instance",
    "metadata": {
      "api_endpoint": "http://.../pipeline",
      "cos_endpoint": "http://...",
      "description": "...",
      "cos_username": "...",
      "cos_password": "...",
      "cos_bucket": "..."
    },
    "schema_name": "kfp",
    "resource": "/Users/.../kfp_dev_instance.json"
  }
]

$ elyra-metadata remove runtimes --name=kfp_dev_instance
Metadata instance '...' removed from namespace 'runtimes'.

The list command can optionally produce raw JSON output, enabling you to process the results using a command-line JSON processor such as jq.
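For example, to extract just the API endpoint of every configured runtime (assuming jq is installed):

$ elyra-metadata list runtimes --json | jq '.[].metadata.api_endpoint'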

Running a notebook pipeline

To run a pipeline from the pipeline editor, click the ▷ (run) button and select the desired runtime configuration.

Elyra generates, gathers, and packages the required artifacts, uploads them to cloud storage, and triggers pipeline execution in the selected Kubeflow Pipelines environment.

Monitoring a notebook pipeline run

Elyra currently does not provide pipeline run monitoring capabilities, but you can use the Kubeflow Pipelines UI to check the run status and view the log files. You can access the UI and the pipeline output artifacts from the Runtimes panel by expanding the appropriate runtime configuration entry.

The Runtimes panel in Elyra provides access to the pipeline run history and the cloud storage instance where the run results are stored.

Pipeline runs are listed in the Kubeflow Pipelines UI in the Experiments panel.

Notebook pipeline runs can be accessed in the Experiments panel of the Kubeflow Pipelines UI

The Graph panel in the experiment details page displays the execution status of each (notebook) node.

The graph only displays nodes that have already been processed or are currently running.

Accessing the log file for the load-data node in the Kubeflow Pipelines UI

You can access a node’s execution log by selecting the node and opening the Logs panel.

Generally speaking, each node in your notebook pipeline is processed by the Notebook component in Kubeflow Pipelines as follows:

  • Install the prerequisite Python packages.
  • Download the input artifacts — the notebook and its dependencies — from cloud storage. (The input artifacts are stored in a compressed archive.)
  • Run the notebook using papermill and store the completed notebook as a Jupyter notebook file and an HTML file.
  • Upload the completed notebook files to the configured cloud storage bucket.
  • Upload the declared output files to the configured cloud storage bucket.

Note that if processing of a node fails, its downstream nodes are not executed.
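The HTML version of a completed notebook mentioned above is the kind of output you can also produce yourself with nbconvert; a rough equivalent of that step, run on an already executed notebook (file name illustrative), is:

$ jupyter nbconvert --to html load_data-output.ipynb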

Accessing the notebook pipeline output artifacts

You can access the pipeline’s output artifacts using any supported S3 client. As shown earlier, the runtime configuration includes a link that you can use to browse the associated cloud storage.
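For example, with the AWS CLI pointed at the S3-compatible endpoint from your runtime configuration, you could list and download artifacts along these lines (the endpoint, bucket, and file names below are placeholders, and the cloud storage credentials are expected in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables):

$ aws --endpoint-url http://minio-service.kubeflow:9000 s3 ls s3://my-bucket/
$ aws --endpoint-url http://minio-service.kubeflow:9000 s3 cp s3://my-bucket/load_data-output.html .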

The configured object storage bucket contains the input and output artifacts; output artifacts are highlighted in green in this screen capture.

Exporting a notebook pipeline

You can export a notebook pipeline as Kubeflow Pipelines SDK domain-specific language (DSL) Python code or as a YAML-formatted Kubeflow Pipelines configuration file. During export, the input artifacts for each notebook are uploaded to S3-compatible cloud storage.

Export generates all artifacts required to run the pipeline in Kubeflow Pipelines

Caution: The exported Python code or configuration file contains connectivity information (including credentials) for the cloud storage location where input artifacts are stored.

To run an exported pipeline from the Kubeflow Pipelines GUI, upload it and create a new run.

Upload and run an exported notebook pipeline in Kubeflow Pipelines
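If you prefer the command line over the GUI, the kfp SDK ships with a CLI that can also upload pipeline packages. With kfp 1.x, something along these lines should work, though the exact subcommand and flags depend on your kfp version, so check kfp --help (the pipeline name and file name below are placeholders):

$ kfp pipeline upload -p my-notebook-pipeline my-notebook-pipeline.yaml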

Running your own pipelines

If you have access to a Kubeflow Pipelines service running locally or in the cloud, you can start to assemble notebook pipelines in minutes. To get you started, we’ve already published a few pipelines. The weather data pipeline we’ve referenced in this blog post is part of the official samples repository, which you can find at https://github.com/elyra-ai/examples. The tutorials provide step-by-step instructions on how to assemble and run this pipeline.

If you are interested in learning how to process COVID-19 time-series data (for the USA and the European Union), take a look at https://github.com/CODAIT/covid-notebooks.

If you don’t have access to a Kubeflow Pipelines service, take a look at Running notebook pipelines locally, which outlines how to run notebook pipelines in JupyterLab on your machine.

Try JupyterLab and Elyra using the pre-built Docker image

The Elyra community publishes ready-to-use Docker images on Docker Hub, which have JupyterLab v2 and the Elyra extension pre-installed. The latest-tagged image is built from the most recently published version. Docker images with pinned versions, such as 1.1.0, are published as well.

To use one of these images, run

$ docker run -it -p 8888:8888 elyra/elyra:latest jupyter lab --debug

and open the displayed URL (e.g. http://127.0.0.1:8888/?token=...) in your web browser.

If you already have notebooks stored on your local file system, you should mount the desired directory (e.g. /local/dir/) to make them available.

$ docker run -it -p 8888:8888 -v /local/dir/:/home/jovyan/work -w /home/jovyan/work elyra/elyra:latest jupyter lab --debug
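To pin a specific Elyra release instead of the latest image, reference the version tag explicitly, for example:

$ docker run -it -p 8888:8888 -v /local/dir/:/home/jovyan/work -w /home/jovyan/work elyra/elyra:1.1.0 jupyter lab --debug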

Try JupyterLab and Elyra on Binder

If you don’t have Docker installed or don’t want to download the image (e.g. because of bandwidth constraints) you can try JupyterLab and Elyra in your web browser without having to install anything, thanks to mybinder.org. Open https://mybinder.org/v2/gh/elyra-ai/elyra/v1.1.0?urlpath=lab/tree/binder-demo and you are good to go.

Install JupyterLab and Elyra

If your local environment meets the prerequisites, you can run JupyterLab and Elyra natively on your own machine by following the installation instructions.
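At the time of writing, installation boils down to installing the elyra Python package and rebuilding JupyterLab so the extension is picked up; refer to the installation instructions for the exact, current steps:

$ pip install elyra && jupyter lab build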

How to get involved

If you find the notebook pipelines visual editor useful, would like to open an enhancement request, or want to report an issue, head over to https://github.com/elyra-ai/elyra and get the conversation started.

To learn more about how you can contribute to Elyra, take a look at https://github.com/elyra-ai/elyra#contributing-to-elyra.

Thanks for reading!
