Build Versatile Pipelines in Elyra using Pipeline Parameters
The Elyra extension for JupyterLab enables the creation, configuration and submission of machine learning pipelines with its interactive, no-code Visual Pipeline Editor. Based on feature requests received in Elyra’s community channels, we’ve extended the pipeline editor in Elyra version 3.14 to support pipeline parameters for Kubeflow Pipelines. In this article, we’ll cover what pipeline parameters are, why they are useful, and present an example of how to define and use parameters in Elyra pipelines.
Elyra does not currently support pipeline parameters for Apache Airflow runtime environments and, as a result, also does not support them in “Generic” pipelines that can run in either Apache Airflow or Kubeflow Pipelines runtimes.
Pipeline Parameters
Pipeline parameters are typed variables that are defined before a pipeline is run. They allow for the customization of pipeline node inputs without the need to edit the pipeline directly in the pipeline editor. As a trivial example, consider a pipeline that has a node that analyzes a dataset. The node has a single input property, dataset_url, which is a string that represents the remote location of a dataset to be analyzed. This input property directly affects the output of the node and the pipeline as a whole. In this example, the pipeline is designed to be run multiple times, each time with a different value for dataset_url. For example, the pipeline may need to be run each week on data that is stored at a url that encodes its timestamp information.
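As a concrete illustration of such a timestamped location, the following sketch builds a dated dataset URL; the base URL and path layout are assumptions for illustration only.

```python
from datetime import date

# Hypothetical weekly dataset location whose path encodes the run date;
# the host and directory structure are made up for this example.
def dataset_url_for(run_date: date) -> str:
    return f"https://my-cloud-storage/data/{run_date:%Y/%m/%d}/dataset.csv"

print(dataset_url_for(date(2022, 12, 26)))
# https://my-cloud-storage/data/2022/12/26/dataset.csv
```

Each weekly run would supply a URL like this as the value of dataset_url.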
To accomplish this without parameters, the pipeline — more specifically, the input to the relevant node — would have to be modified in the pipeline editor and then submitted for execution or exported at least once for each change. This is because values for node inputs cannot be changed once the pipeline is compiled. Of course, this process gets much more complicated for pipelines that have many nodes, each with its own set of properties that need to be customized for each run.
Fortunately, this process is greatly simplified by using parameters. When thinking of a pipeline as being represented as a function, pipeline parameters can be considered as arguments to that function. With this model in mind, a single pipeline can be run multiple times simply by supplying new values for any number of arguments. The below snippet is an example of the code generated by Elyra when selecting “Python DSL” as the export format in the pipeline editor. As you can see, the pipeline quite literally is represented as a function with parameters as its arguments. The value provided as the default for the customizable_url argument is configurable per run.
@kfp.dsl.pipeline(name="modified_demo")
def generated_pipeline(
    customizable_url: str = "https://...",
):
    # Task for node 'Analyze dataset'
    task_774bcb05_a32c_4c = factory_8e4384f422a088e4814024df79(
        dataset_url=customizable_url,
    )
For more information on Elyra’s export mechanism for Kubeflow Pipelines, see this previous blog post.
Parameterizing Pipelines in Elyra
The Elyra Visual Pipeline Editor (VPE) interface includes the palette (on the left), the canvas (in the center), and the properties panel (on the right). The properties panel itself has three tabs: Pipeline Properties, Pipeline Parameters, and Node Properties, all shown below.
Note that the Pipeline Parameters tab in the properties panel will not be present for runtime environments that do not support pipeline parameters, such as Apache Airflow.
In this example, we will create a pipeline that includes both a custom node and a generic node in order to illustrate how parameters work for each type. As shown in the screenshot above, a custom Download data node will download a dataset from a provided url and save it in a shared location, and the generic analyze Notebook node will load and perform analysis on the data saved by its parent node.
Defining Parameters
Our pipeline will require two parameters: one for the custom node and one for the generic node. The custom node has an input property, url, that specifies the location of the data to be downloaded via curl. The generic node will accept a parameter that defines a batch size for the data being analyzed. We’ll first define these parameters in the Pipeline Parameters tab of the properties panel. To start, click Add in the parameters tab. The attributes needed to define a single parameter are displayed. For Kubeflow Pipelines, these include a parameter name, an optional description, the parameter type, an optional default value that can be overridden as needed during pipeline export or submission, and a checkbox to indicate whether the parameter requires a value.
In the above screenshot (left), you can see that certain validation criteria must be met. For Kubeflow Pipelines, the parameter name must be a valid Python identifier name, excluding Python keywords (class, def, global, etc.), as specified in the tooltip for the Name attribute.
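The naming rule above can be sketched with Python’s standard library; the helper function name is made up for this illustration.

```python
import keyword

# Minimal sketch of the rule described above: a parameter name must be
# a valid Python identifier and must not be a Python keyword.
def is_valid_parameter_name(name: str) -> bool:
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_parameter_name("customizable_url"))  # True
print(is_valid_parameter_name("class"))             # False (keyword)
print(is_valid_parameter_name("2fast"))             # False (not an identifier)
```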
The available parameter types are a subset of the Kubeflow Pipelines base types. Elyra currently supports Bool, Float, Integer, and String. We’ll start by defining a String-type parameter, customizable_url, to use as input to the custom node’s url property (above screenshot, right). We will also define a second parameter, an Integer-type parameter called batch_size, that will be used by the generic node. This parameter will not be required, and we will set a default value of 1000.
Assigning Parameters to Node Inputs
Now it’s time to assign these parameters to be used by their corresponding nodes. First, select the Download data node and open the properties panel to the Node Properties tab. For the Url input property, select the first dropdown box that enumerates the input formats that this property can accept. Choose the “Please select a parameter to use as input” option. The second dropdown box is now populated with a list of compatible parameters that can be used for this input property. Notice that only customizable_url is available to select, even though we also defined the batch_size parameter. This is intentional, as only parameters that are the same type as that defined for the property in that node’s component definition can be selected. In the below combined screenshot, you can see that the Url input type is defined as String, the same type as the parameter. We will select the customizable_url parameter for the Url property and save the pipeline.
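The type-matching rule described above can be sketched as a simple filter; the function and the dictionary representation of parameters are hypothetical, not Elyra internals.

```python
# Sketch of the rule above: a parameter is offered for a node input only
# when its type matches the input's declared type in the component definition.
def compatible_parameters(input_type: str, parameters: dict) -> list:
    return [name for name, ptype in parameters.items() if ptype == input_type]

params = {"customizable_url": "String", "batch_size": "Integer"}
print(compatible_parameters("String", params))   # ['customizable_url']
print(compatible_parameters("Integer", params))  # ['batch_size']
```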
Parameters are referenced differently for generic nodes. Because generic components are file-based and do not have pre-defined component definitions, parameters are passed to these nodes by setting them as environment variables in the node container. Due to constraints imposed by environment variables, the parameter value will appear as a string when accessed in the generic node, regardless of the parameter type that was selected. For that reason, every defined parameter will be available to select in a generic node’s Pipeline Parameters input property, as shown in the screenshot below. We’ll pass only the batch_size parameter to this node.
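Inside the generic node, the parameter can then be read back from the environment and cast to the intended type, along these lines (the environment variable is set manually here to simulate what happens in the node container at runtime):

```python
import os

# Simulate the runtime environment: the parameter arrives as an
# environment variable, so its value is always a string.
os.environ["batch_size"] = "1000"

# Cast the string value back to the intended Integer type before use.
batch_size = int(os.environ["batch_size"])
print(batch_size)  # 1000
```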
For runtime environments that do not support pipeline parameters, Pipeline Parameters will not show up in the input properties for generic nodes. Similarly, the “Please select a parameter to use as input” option will not show up in the dropdown for custom node input properties.
Exporting and Submitting Parametrized Pipelines
Now that all the relevant node and pipeline properties have been configured, it is time to prepare the pipeline for execution. Elyra has two options for this: export and submit.
Elyra also supports exporting and submitting pipelines via the CLI. It is currently only possible, however, to customize pipeline parameters using the Visual Pipeline Editor (VPE), as shown in the example here.
In this example, we will illustrate how to customize parameters during pipeline export from the VPE; the method for pipeline submission is much the same. After clicking the export button, an export dialog appears that includes selections for runtime configuration and export format. We will select Python DSL for this example. As shown in the below screenshot, any parameters that have been referenced by nodes also appear in this dialog box.
Note that, in this case, all parameters that we defined have been referenced by a pipeline node. Any parameters that are defined in the Pipeline Parameters panel but not referenced by any generic or custom node will not show up in the export or submission dialog.
Any default value that has been defined for a parameter will show up for that parameter as a placeholder in the value input box. A value can be entered here to override the defined default value, if desired. In this case, we will enter a new value for customizable_url but keep the default value for batch_size (shown below). Click OK to begin the export process.
If the parameter is marked as required and neither a default value nor a value at submission time is provided, the OK button will be disabled until a value is entered.
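The validation rule above can be sketched as follows; the function name and structure are hypothetical and only illustrate the resolution order (submitted value first, then default):

```python
# Hypothetical sketch of the export-time rule: a required parameter must
# receive a value either from its default or from the export dialog.
def resolve_parameter(required: bool, default=None, submitted=None):
    value = submitted if submitted is not None else default
    if required and value is None:
        raise ValueError("required parameter has no value; OK stays disabled")
    return value

print(resolve_parameter(required=False, default=1000))           # 1000
print(resolve_parameter(required=True, submitted="https://..."))  # https://...
```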
Once export has completed, we can view the representation of the parametrized pipeline as Python code. As seen in the snippet below, the value provided during export for customizable_url has replaced the parameter default value, whereas the default value provided for batch_size is used. Each parameter is type-hinted according to the corresponding Kubeflow Pipelines base type indicated for the parameter. Each node is represented by a variable prefixed with task_. The parameters referenced by each node are passed to the task function as arguments. In order to allow arguments to be passed to generic node tasks, inputs to these tasks are defined in the component definition that appears in the same file.
@kfp.dsl.pipeline(name="parameters-ex")
def generated_pipeline(
    customizable_url: str = "https://my-cloud-storage/data/2022/12/26",
    batch_size: int = 1000,
):
    # Task for node 'Download data'
    task_057d9c11_01f9_462a_be4e_47886f108d8b = (
        factory_bcf3d5779dce44765a282(
            url=customizable_url,
            curl_options="""--location""",
        )
    )

    # Task for node 'analyze'
    task_b7de9c8c_a672_44c4_bb2d_323daaec3660 = (
        factory_9621724082033a96dd013b(
            batch_size=batch_size,
        )
    )
Below is an excerpt of the component definition for the generic node, which appears toward the beginning of the generated code. As shown, a separate input is defined for each parameter referenced by the generic node, with the input name matching the parameter name.
component_def_9621724082033a96dd013b = """
name: Run a file
description: Run a Jupyter notebook or Python/R script

inputs:
- {name: batch_size, type: Integer, description: 'Number of samples processed before the model is updated', default: 1000, optional: true}

implementation:
  container:
    image: pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
    command: [sh, -c]
    args:
    - |
      batch_size="$0"
      sh -c "mkdir -p ./jupyter-work-dir && cd ./jupyter-work-dir"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/kfp/bootstrapper.py' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/kfp/bootstrapper.py --output bootstrapper.py"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra.txt' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra.txt --output requirements-elyra.txt"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra-py37.txt' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra-py37.txt --output requirements-elyra-py37.txt"
      sh -c "python3 -m pip install packaging && python3 -m pip freeze > requirements-current.txt && python3 bootstrapper.py --pipeline-name 'parameters-ex' --cos-endpoint 'http://cloning1.fyre.ibm.com:30965' --cos-bucket 'cloning1-test' --cos-directory 'parameters-ex-1229164352' --cos-dependencies-archive 'analyze-b7de9c8c-a672-44c4-bb2d-323daaec3660.tar.gz' --file 'parameters/analyze.ipynb' --pipeline-parameters 'batch_size=$batch_size' --parameter-pass-method 'env' "
    - {inputValue: batch_size}
"""

factory_9621724082033a96dd013b = (
    kfp.components.load_component_from_text(
        component_def_9621724082033a96dd013b
    )
)
Run the generated Python DSL to compile the representation into a YAML file that Kubeflow Pipelines understands. This YAML file can then be uploaded in the Kubeflow Pipelines UI, as we will see next.
Alternatively, this pipeline can be submitted directly to Kubeflow Pipelines from the Elyra pipeline editor to the same effect.
Editing a Parametrized Pipeline in the Kubeflow Pipelines UI
Pipelines can be uploaded to the Kubeflow Pipelines UI in YAML format. The Run parameters section shown below exposes the parameters that were defined for this pipeline. For each pipeline run, new values for each parameter can be supplied directly, without the need to modify the pipeline in Elyra’s pipeline editor.
Conclusion
Thank you for reading! You can learn more about our additional capabilities by checking out the published resources page of the Elyra documentation. As an open source project, we welcome community and user involvement of any kind. Find information about how you can reach out with questions and suggestions here, or learn more about contributing here. We look forward to your contributions!