Elyra 3.3: Pipelines, custom components, and catalogs

Patrick Titzler
IBM Data Science in Practice
Dec 8, 2021
Twin Lakes, California (Photo by Beate Porst)

With the release of version 3.3, the Elyra open source community has delivered a major milestone on our roadmap, enabling users to create pipelines using custom components. In this blog post I’ll summarize what custom components are and how to use them.

Version 3.6 includes extended support for Apache Airflow built-in and community operators.

The Visual Pipeline Editor is Elyra’s most prominent feature. It provides JupyterLab users with the means to visually create pipelines and run them on different platforms.

Create pipelines using the Visual Pipeline Editor and run them locally or remotely

As of December 2021, Kubeflow Pipelines, Apache Airflow, and local execution are the supported platforms for running pipelines.

Work is ongoing to make it easier for users to “bring your own” runtime platform if none of the existing ones meets their needs.

Even though the Visual Pipeline Editor was originally created to make it easier to create machine learning pipelines from Jupyter notebooks, Python scripts, or R scripts, it is no longer limited to those types of pipelines. With support for custom components, you can now perform many different tasks in pipelines, machine learning specific or not.

Pipelines

A pipeline in Elyra comprises nodes that are connected with each other to define execution dependencies. Each node is implemented by a component and configurable using its public properties. Elyra provides built-in support for three generic components to allow for execution of Jupyter notebooks, Python scripts, and R scripts.
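To make the idea of execution dependencies concrete, here is a minimal Python sketch. It is not Elyra's pipeline file format or API; the node names and the `dependencies` mapping are made up for illustration. It only shows how the connections between nodes determine a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a node, each value is the set of
# upstream nodes it is connected to (its execution dependencies).
dependencies = {
    "load_data.ipynb": set(),
    "clean_data.py": {"load_data.ipynb"},
    "train_model.py": {"clean_data.py"},
    "report.r": {"clean_data.py"},
}


def execution_order(deps):
    """Return one valid order in which the nodes could run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Nodes without a dependency between them (here the training script and the report) have no fixed relative order, so a runtime could run them in parallel.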

The screen capture below depicts the Visual Pipeline Editor for the generic runtime. The palette (slide-out panel on the left) includes only generic components. The canvas (center of screen) is where you assemble the pipeline, and the properties view (slide-out panel on the right) is where you configure the runtime properties of a node.

A generic pipeline runs a Jupyter notebook, a Python script, and an R script

A great example of an open source project that builds on generic components is CLAIMED (Component Library for AI, Machine Learning, ETL, and Data Science). In a nutshell, CLAIMED provides a set of Jupyter notebooks that implement common tasks, such as data loading, data transformation, or model training. This approach is very popular with users who prefer no-code or low-code tools for working with data.

Generic components are supported for Kubeflow Pipelines, Apache Airflow, and local execution. Other runtime environments that the community might contribute in the future (see earlier comment about ongoing work around “bring your own” runtime platform) are free to choose whether or not to support generic components.

Custom components

Custom components are different from generic components in several respects:

  • Custom components are usually runtime-specific. For example, a Kubeflow Pipelines component cannot be used in Apache Airflow pipelines.
  • Custom components use runtime-specific mechanisms to exchange data with other components (e.g. XComs for Apache Airflow), whereas generic components exchange data using S3-compatible storage.
  • Custom components are black boxes. That means the Visual Pipeline Editor exposes their input and output properties but does not necessarily have access to the source code that implements the components’ functionality.
  • Custom components are not included with Elyra and must therefore be managed separately.

The screen capture below depicts the Visual Pipeline Editor for Kubeflow Pipelines. Note the content of the palette on the left, which includes a set of user-provided custom components in the dev components category.

Runtime-specific pipelines can run Jupyter notebooks, Python scripts, R scripts, and custom components

So how do you manage custom components? Read on.

Component catalogs

Elyra defers the task of managing components to entities called catalogs. Any existing storage provider (in a general sense) can be used as a catalog, as long as it provides read access to its resources. The graphic below depicts three provider examples: a filesystem-based catalog (components are stored as files in paths), a web-based catalog (components are stored as web resources and are accessible via URLs), and a service-based catalog (where components might be stored in a proprietary format).

Examples of storage providers that could be used as catalogs: a filesystem, web resources, and a proprietary service

Elyra’s component registry uses catalog connectors to access catalogs and feeds the Visual Pipeline Editor’s palette. The connectors serve two purposes: retrieving the list of components that a given catalog makes available and fetching those components.

Catalog connectors retrieve components from catalogs and make them available to the registry, which feeds the palette
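
The two connector responsibilities described above (listing what a catalog offers and fetching the components) can be sketched in Python as follows. This is a simplified illustration, not Elyra's actual connector API: the class names, the method names `list_components` and `fetch_component`, and the `build_palette` helper are all hypothetical, and the assumption that every `.yaml` file is a component definition is made only to keep the example small. Refer to the developer guide for the real interface.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class CatalogConnector(ABC):
    """Hypothetical connector interface (not Elyra's actual API)."""

    @abstractmethod
    def list_components(self) -> list[str]:
        """Return identifiers of the components the catalog offers."""

    @abstractmethod
    def fetch_component(self, component_id: str) -> str:
        """Return the definition of a single component."""


class DirectoryConnector(CatalogConnector):
    """Treats every .yaml file in a directory as a component definition."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def list_components(self) -> list[str]:
        return sorted(p.name for p in self.directory.glob("*.yaml"))

    def fetch_component(self, component_id: str) -> str:
        return (self.directory / component_id).read_text(encoding="utf-8")


def build_palette(connectors):
    """Merge what all connectors offer into one id -> definition mapping."""
    palette = {}
    for connector in connectors:
        for component_id in connector.list_components():
            palette[component_id] = connector.fetch_component(component_id)
    return palette


# Usage: a throwaway directory stands in for a filesystem catalog.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "count_rows.yaml").write_text("name: Count Rows\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("not a component\n", encoding="utf-8")
    palette = build_palette([DirectoryConnector(tmp)])

print(palette)
```

Because the registry only talks to the abstract interface in this sketch, a connector for a web resource or a proprietary service could be added without changing `build_palette`.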

Every Elyra installation includes built-in connectors for filesystem and web catalogs, providing access to components that are stored locally or on the web. You can enable support for other catalogs by installing additional connectors, such as the one for the Machine Learning Exchange, and configuring them.

Configuring access to a Machine Learning Exchange deployment

After a catalog connection is configured, the catalog’s components are available in the Visual Pipeline Editor’s palette.

The Visual Pipeline Editor’s palette exposes the content of connected catalogs

You add custom components to pipelines just like you would add Jupyter notebooks or scripts: drag the components onto the canvas, connect them to define their execution dependencies, and configure their runtime properties.

Configuring the properties of a custom component

Earlier I mentioned that custom components use runtime-specific mechanisms to exchange data. To provide a uniform experience across different runtimes, the Visual Pipeline Editor imposes a restriction: data can only be exchanged between components that are explicitly connected.

In the following pipeline example, Count Rows is connected to Download data and to Truncate File, but not to Download more data. Therefore, Count Rows cannot use the output of Download more data as an input source.

Data can only be exchanged between connected custom components
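
The restriction can be expressed as a small check. The Python sketch below is illustrative only and is not how the Visual Pipeline Editor is implemented. The node names come from the example above, but the direction of the two connections (both feeding into Count Rows) is an assumption, and the sketch only considers direct connections.

```python
# Hypothetical connections, written as (upstream, downstream) pairs.
# The edge directions are assumed for illustration.
connections = {
    ("Download data", "Count Rows"),
    ("Truncate File", "Count Rows"),
}


def allowed_input_sources(node, connections):
    """Return the nodes whose outputs `node` may use as inputs."""
    return {upstream for upstream, downstream in connections if downstream == node}


def can_use_output(consumer, producer, connections):
    """True if `consumer` is directly connected to `producer`."""
    return producer in allowed_input_sources(consumer, connections)


print(can_use_output("Count Rows", "Download data", connections))       # True
print(can_use_output("Count Rows", "Download more data", connections))  # False
```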

Sharing components

Pipeline files contain component metadata, such as information about the origin and configured runtime properties, but don’t include the component definition itself. When you share pipeline files with other users, the referenced custom components therefore need to be separately added to their Elyra deployments using catalogs.

Generic components (supporting Jupyter notebooks, Python scripts, and R scripts) are installed in every Elyra deployment. Therefore there is no need to manually “add” them.
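
As a rough illustration of why shared pipelines depend on catalogs, the following Python sketch checks which components referenced by a pipeline are unavailable in a given deployment. The data structures are hypothetical and much simpler than a real pipeline file; the component identifiers and the `missing_components` helper are invented for this example.

```python
# Hypothetical identifiers for the generic components that every
# deployment includes.
GENERIC_COMPONENTS = {"notebook", "python-script", "r-script"}


def missing_components(pipeline_nodes, catalog_components):
    """Return referenced components that the deployment cannot resolve."""
    available = GENERIC_COMPONENTS | set(catalog_components)
    referenced = {node["component"] for node in pipeline_nodes}
    return sorted(referenced - available)


# Simplified stand-in for the component references in a shared pipeline file.
shared_pipeline = [
    {"label": "Load", "component": "notebook"},
    {"label": "Count Rows", "component": "count-rows"},
    {"label": "Truncate File", "component": "truncate-file"},
]

# The receiving deployment's catalogs only provide "count-rows".
print(missing_components(shared_pipeline, {"count-rows"}))
# ['truncate-file']
```

In this sketch the recipient would need to connect a catalog that provides the `truncate-file` component before the pipeline could be used.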

The conceptual graphic below illustrates a common Elyra enterprise deployment scenario. The deployment on the left has access to the generic components, access to custom components that are stored in a private catalog, and access to custom components that are stored in a shared catalog. The deployment on the right has access to the generic components and the custom components in the shared catalog.

Shared catalogs provide access to a common set of components to multiple Elyra deployments

A shared catalog is a catalog that can be accessed from multiple Elyra deployments. Examples of sharable catalogs are shared file systems or servers/services that can be connected to remotely using APIs.

Frequently asked questions

Custom components and catalogs are a first step towards better support for “Bring Your Own …” — aka community contributions.

  • Sounds great! How do I get started? Welcome aboard! Install Elyra and try one of the tutorials!
  • I can’t find a catalog connector for XYZ in the marketplace. Can you implement one for me? Unfortunately, Elyra already has a large feature request backlog. Therefore we have to rely on community contributions for features (such as connectors) that are not considered core functionality.
  • How do I implement a catalog connector? Please refer to our developer guide or reach out to us using one of the channels listed in the Getting Help topic.
  • I have implemented a catalog connector for XYZ. How do I add it to the marketplace? Fork the examples repository, add a link to your connector repository to the connector list document and open a pull request.
  • I am encountering an issue with the catalog connector for XYZ. Where can I get help? Please create an issue in the connector’s repository.

Thanks for reading and your interest in Elyra! Feel free to reach out to us via instant chat, join our weekly community meeting, or open a new discussion thread in our forum if you are interested in helping us improve the custom component support or other aspects of Elyra.

Patrick Titzler is a Developer Advocate at the Center for Open-Source Data & AI Technologies.