Introducing Elyra pipelines with custom component support

Patrick Titzler
IBM Data Science in Practice
5 min read · Aug 5, 2021

The Elyra open source project for JupyterLab aims to simplify common data science tasks. Its most popular feature is the Visual Pipeline Editor, which is used to create machine learning pipelines without the need for coding. You can run these pipelines in JupyterLab or on Kubeflow Pipelines or Apache Airflow.

[Image: a pipeline in which “download data” feeds “Part 1 — Dataset cleaning”, which in turn feeds “Part 2 — Dataset analysis” and “Part 3 — Predictions”; three arrows point from the pipeline to “local execution in JupyterLab”, “Kubeflow Pipelines”, and “Apache Airflow”.]
Pipelines run locally in JupyterLab or remotely on Kubeflow Pipelines and Apache Airflow.

Elyra 3.0 extends the pipeline capabilities by adding experimental support for runtime-specific components. Before I dive into specifics and outline why support is still experimental in the initial releases, let’s recap a few concepts.

October 2021: This blog post was updated to include version 3.1 and 3.2 enhancements. A tutorial for Kubeflow Pipelines is now available. Give it a try!

A pipeline comprises nodes that are connected to form a graph. The graph defines dependencies between the nodes, governing the order in which the nodes are executed. The example pipeline shown below executes a Python script and several Jupyter notebooks.
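The execution order implied by such a graph can be sketched with a few lines of plain Python (this is an illustration of the concept, not Elyra code; the node names follow the example pipeline):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the nodes it depends on, mirroring the example pipeline:
# "download data" feeds "Part 1", which feeds both "Part 2" and "Part 3".
dependencies = {
    "Part 1 - Dataset cleaning": {"download data"},
    "Part 2 - Dataset analysis": {"Part 1 - Dataset cleaning"},
    "Part 3 - Predictions": {"Part 1 - Dataset cleaning"},
}

# A valid execution order is any topological ordering of the graph.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # "download data" always comes first
```

Nodes with no unmet dependencies (here, Part 2 and Part 3) can run in either order, or even in parallel on runtimes that support it.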

[Image: a close-up of the pipeline showing “download data”, “Part 1”, “Part 2”, and “Part 3” as nodes, with the connecting lines indicating their dependencies.]
This pipeline downloads a data set and runs three notebooks in the specified sequence.

Nodes are implemented using components. To create the pipeline shown above, you’ll need components that can execute Python scripts and Jupyter notebooks. Most components are configurable to make them reusable. For file-based components, such a configuration might include the file name and the container image in which the file is executed.

[Image: the same close-up, now with an icon next to each node indicating what the component it represents runs. The “download data” node runs a Python script; the other nodes run Jupyter notebooks.]
Nodes are implemented by components.

In Elyra, the processing of Jupyter notebooks, Python scripts, and R scripts is implemented using a single component. This component is referred to as a generic component because it is supported in all runtime environments.

The pipeline editor then exposes this component under different names in the palette, which is located on the left-hand side. (You can add nodes to the pipeline by selecting a component from the palette and dropping it on the canvas.)

[Image: the pipeline editor, with a left-hand palette listing generic components by type: Notebook, Python, and R.]
The generic pipeline editor. Only generic components are supported to make pipelines runnable in JupyterLab, Kubeflow Pipelines, and Apache Airflow.

Pipelines that only include generic components are referred to as generic pipelines because you can run them in any runtime environment Elyra supports.

If you are new to Elyra, take a look at the tutorials to learn more about using the Visual Pipeline Editor to create a pipeline. If you’ve used Elyra before, we recommend reviewing the recently published Best practices topic in the User Guide. We’ve only now gotten around to documenting some of the things that make your life easier!

Experimental support for custom components

Custom components are similar to generic components in that they implement only a single task, such as loading data, training a model, or sending an email. However, these components are only supported for Kubeflow Pipelines and Apache Airflow and are implemented in a runtime-specific form.

Information about custom components is stored in a local registry, which is exposed in the pipeline editor palette. You can add components that you’ve created or third-party components that others published.

The screen capture below depicts the pipeline editor for Apache Airflow pipelines. The palette, shown on the left, is by default divided into two categories — one for generic components and one for custom components. Note the Airflow specific components in the second category, such as the BashOperator and the SimpleHttpOperator, which process a bash command and an HTTP request, respectively.
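To make the two operators concrete, here is a rough stdlib-only analogy of what such nodes do at run time (plain Python for illustration only; the real `BashOperator` and `SimpleHttpOperator` live in Apache Airflow and accept many more parameters):

```python
import subprocess
import urllib.request


def run_bash_command(command: str) -> str:
    """Roughly what a BashOperator node does: run a shell command."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_http_request(url: str) -> int:
    """Roughly what a SimpleHttpOperator node does: issue an HTTP request."""
    with urllib.request.urlopen(url) as response:
        return response.status


print(run_bash_command("echo hello from a pipeline node"))
```

In a real Airflow pipeline you would not write this code yourself; Elyra generates the DAG that wires the configured operators together.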

[Image: the pipeline editor, with the palette now also listing custom Apache Airflow components: BashOperator, EmailOperator, SimpleHttpOperator, SparkSqlOperator, SparkSubmitOperator, SlackAPIOperator, and SlackAPIPostOperator.]
The Apache Airflow pipeline editor.

Pipelines that utilize custom components are called runtime-specific pipelines because it is not possible to run a pipeline that was created for Kubeflow Pipelines on Apache Airflow, and vice versa.

Get started with pipelines

Once you’ve installed Elyra, it is easy to get started with pipelines.

If you are using Kubeflow Pipelines as your runtime environment, take a look at the tutorial. It guides you through adding custom Kubeflow Pipelines components to the registry and configuring them in the Visual Pipeline Editor.
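For orientation, a custom Kubeflow Pipelines component is described by a small YAML specification that Elyra can register. The sketch below is a hypothetical example (the component name, image, and parameters are made up for illustration), loosely following the Kubeflow Pipelines component spec format:

```yaml
# Hypothetical Kubeflow Pipelines component specification.
# Elyra registers files like this one as custom components.
name: Download data
description: Downloads a file from a public URL.
inputs:
  - {name: url, type: String}
outputs:
  - {name: downloaded_file, type: String}
implementation:
  container:
    image: python:3.8            # any image with the needed tools
    command:
      - sh
      - -c
      - 'mkdir -p "$(dirname "$1")" && wget -q -O "$1" "$0"'
      - {inputValue: url}
      - {outputPath: downloaded_file}
```

Once registered, the component appears in the palette like any other node, and its inputs become configurable node properties.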

The JupyterLab launcher now includes tiles for each pipeline type: one for generic pipelines, and one for each supported pipeline runtime platform.

[Image: close-up of the JupyterLab launcher showing three editor tiles: the generic pipeline editor, the Kubeflow Pipelines editor, and the Apache Airflow pipeline editor.]
The Elyra category now provides access to three pipeline editors.

Click the desired pipeline editor tile and you are ready to compose a pipeline from the components that are supported for the selected platform.

To get you going quickly, the component registry also includes a few example custom components for each runtime platform. The Elyra examples GitHub repository includes information about those components and pipelines that illustrate their usage. These components are included for illustrative purposes only. Unless stated otherwise, the components were not created by the Elyra community and are therefore provided as is.

Opportunities for growth

Some custom component features are still under development, planned for a future release, or in the backlog without a specific target release. Some of the high priority features for the next releases are:

  • Data exchange between custom Apache Airflow components: Components commonly produce outputs that other components require as input. Currently, custom components are isolated from each other and cannot exchange data. (Data exchange between generic components and custom Kubeflow Pipelines components is already supported.)
  • Data exchange between generic components and custom components: Same as above.
  • Support for remote component repositories: The component registry only stores references to custom component specifications, not the specifications themselves. As of version 3.2, the registry can pull those specifications from the local file system and from public web resources. It’s our goal to also enable support for third-party component repositories, such as the Machine Learning Exchange or GitHub repositories.

For an up-to-date feature status, please refer to this forum thread.

Use Watson Studio services in pipelines

Pipelines can also take advantage of external services using custom components. If you are looking for a managed solution for Watson Studio services, check out this IBM Watson Studio Pipeline article. It illustrates how to run notebooks, refine data, run AutoAI experiments, and deploy a model.

[Image: screenshot of IBM Cloud Pak for Data showing a pipeline with the steps Copy Assets, Run notebook, Run AutoAI experiment, Select winning notebook, Create web service, and Export assets; the left-hand pane lists Copy, Run, Create, Update, and Delete node types.]
Watson Studio Orchestration Flow pipeline that utilizes Watson Studio services.

Your opportunity to help us improve Elyra

Elyra is a fairly new open source project that is currently maintained by a small community of JupyterLab enthusiasts. We welcome contributions of any kind, such as feedback, bug reports, bug fixes, features, or documentation. To learn more about how you can make a difference, refer to the Getting help topic in the documentation.

On behalf of the community: Thank You!

Patrick Titzler is a Developer Advocate at the Center for Open-Source Data & AI Technologies.