Build Versatile Pipelines in Elyra using Pipeline Parameters
The Elyra extension for JupyterLab enables the creation, configuration and submission of machine learning pipelines with its interactive, no-code Visual Pipeline Editor. Based on feature requests received in Elyra’s community channels, we’ve extended the pipeline editor in Elyra version 3.14 to support pipeline parameters for Kubeflow Pipelines. In this article, we’ll cover what pipeline parameters are, why they are useful, and present an example of how to define and use parameters in Elyra pipelines.
Elyra does not currently support pipeline parameters for Apache Airflow runtime environments and, as a result, also does not support them in “Generic” pipelines that can run in either Apache Airflow or Kubeflow Pipelines runtimes.
Pipeline Parameters
Pipeline parameters are typed variables that are defined before a pipeline is run. They allow for the customization of pipeline node inputs without the need to edit the pipeline directly in the pipeline editor. As a trivial example, consider a pipeline that has a node that analyzes a dataset. The node has a single input property, dataset_url, which is a string that represents the remote location of a dataset to be analyzed. This input property directly affects the output of the node and the pipeline as a whole. In this example, the pipeline is designed to be run multiple times, each time with a different value for dataset_url. For example, the pipeline may need to be run each week on data that is stored at a url that encodes its timestamp information.
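As a concrete illustration of such a timestamped location, the following sketch builds a dated dataset URL; the base URL and path layout are assumptions for illustration only.

```python
from datetime import date

# Hypothetical weekly dataset location whose path encodes the run date;
# the host and directory structure are made up for this example.
def dataset_url_for(run_date: date) -> str:
    return f"https://my-cloud-storage/data/{run_date:%Y/%m/%d}/dataset.csv"

print(dataset_url_for(date(2022, 12, 26)))
# https://my-cloud-storage/data/2022/12/26/dataset.csv
```

Each weekly run would supply a URL like this as the value of dataset_url.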
To accomplish this without parameters, the pipeline — more specifically, the input to the relevant node — would have to be modified in the pipeline editor and then submitted for execution or exported at least once for each change. This is because values for node inputs cannot be changed once the pipeline is compiled. Of course, this process gets much more complicated for pipelines that have many nodes, each with its own set of properties that need to be customized for each run.
Fortunately, this process is greatly simplified by using parameters. When thinking of a pipeline as being represented as a function, pipeline parameters can be considered as arguments to that function. With this model in mind, a single pipeline can be run multiple times simply by supplying new values for any number of arguments. The below snippet is an example of the code generated by Elyra when selecting “Python DSL” as the export format in the pipeline editor. As you can see, the pipeline quite literally is represented as a function with parameters as its arguments. The value provided as the default for the customizable_url argument is configurable per run.
@kfp.dsl.pipeline(name="modified_demo")
def generated_pipeline(
    customizable_url: str = "https://...",
):
    # Task for node 'Analyze dataset'
    task_774bcb05_a32c_4c = factory_8e4384f422a088e4814024df79(
        dataset_url=customizable_url,
    )
For more information on Elyra’s export mechanism for Kubeflow Pipelines, see this previous blog post.
Parameterizing Pipelines in Elyra
The Elyra Visual Pipeline Editor (VPE) interface includes the palette (on the left), the canvas (in the center), and the properties panel (on the right). The properties panel itself has three tabs: Pipeline Properties, Pipeline Parameters, and Node Properties, all shown below.
Note that the Pipeline Parameters tab in the properties panel will not be present for runtime environments that do not support pipeline parameters, such as Apache Airflow.
In this example, we will create a pipeline that includes both a custom node and a generic node in order to illustrate how parameters work for each type. As shown in the screenshot above, a custom Download data node will download a dataset from a provided url and save it in a shared location, and the generic analyze Notebook node will load and perform analysis on the data saved by its parent node.
Defining Parameters
Our pipeline will require two parameters: one for the custom node and one for the generic node. The custom node has an input property, url, that specifies the location of the data to be downloaded via curl. The generic node will accept a parameter that defines a batch size for the data being analyzed. We’ll first define these parameters in the Pipeline Parameters tab of the properties panel. To start, click Add in the parameters tab. The attributes needed to define a single parameter are displayed. For Kubeflow Pipelines, these include a parameter name, an optional description, the parameter type, an optional default value that can be overridden as needed during pipeline export or submission, and a checkbox to indicate whether the parameter requires a value.
In the above screenshot (left), you can see that certain validation criteria must be met. For Kubeflow Pipelines, the parameter name must be a valid Python identifier name, excluding Python keywords (class, def, global, etc.), as specified in the tooltip for the Name attribute.
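The naming rule above can be sketched with Python’s standard library; the helper function name is made up for this illustration.

```python
import keyword

# Minimal sketch of the rule described above: a parameter name must be
# a valid Python identifier and must not be a Python keyword.
def is_valid_parameter_name(name: str) -> bool:
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_parameter_name("customizable_url"))  # True
print(is_valid_parameter_name("class"))             # False (keyword)
print(is_valid_parameter_name("2fast"))             # False (not an identifier)
```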
The available parameter types are a subset of the Kubeflow Pipelines base types. Elyra currently supports Bool, Float, Integer, and String. We’ll start by defining a String-type parameter, customizable_url, to use as input to the custom node’s url property (above screenshot, right). We will also define a second parameter, an Integer-type parameter called batch_size, that will be used by the generic node. This parameter will not be required, and we will set a default value of 1000.
Assigning Parameters to Node Inputs
Now it’s time to assign these parameters to be used by their corresponding nodes. First, select the Download data node and open the properties panel to the Node Properties tab. For the Url input property, select the first dropdown box that enumerates the input formats that this property can accept. Choose the “Please select a parameter to use as input” option. The second dropdown box is now populated with a list of compatible parameters that can be used for this input property. Notice that only customizable_url is available to select, even though we also defined the batch_size parameter. This is intentional, as only parameters that are the same type as that defined for the property in that node’s component definition can be selected. In the below combined screenshot, you can see that the Url input type is defined as String, the same type as the parameter. We will select the customizable_url parameter for the Url property and save the pipeline.
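The type-matching rule described above can be sketched as a simple filter; the function and the dictionary representation of parameters are hypothetical, not Elyra internals.

```python
# Sketch of the rule above: a parameter is offered for a node input only
# when its type matches the input's declared type in the component definition.
def compatible_parameters(input_type: str, parameters: dict) -> list:
    return [name for name, ptype in parameters.items() if ptype == input_type]

params = {"customizable_url": "String", "batch_size": "Integer"}
print(compatible_parameters("String", params))   # ['customizable_url']
print(compatible_parameters("Integer", params))  # ['batch_size']
```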
Parameters are referenced differently for generic nodes. Because generic components are file-based and do not have pre-defined component definitions, parameters are passed to these nodes by setting them as environment variables in the node container. Due to constraints imposed by environment variables, the parameter value will appear as a string when accessed in the generic node, regardless of the parameter type that was selected. For that reason, every defined parameter will be available to select in a generic node’s Pipeline Parameters input property, as shown in the screenshot below. We’ll pass only the batch_size parameter to this node.
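Inside the generic node, the parameter can then be read back from the environment and cast to the intended type, along these lines (the environment variable is set manually here to simulate what happens in the node container at runtime):

```python
import os

# Simulate the runtime environment: the parameter arrives as an
# environment variable, so its value is always a string.
os.environ["batch_size"] = "1000"

# Cast the string value back to the intended Integer type before use.
batch_size = int(os.environ["batch_size"])
print(batch_size)  # 1000
```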
For runtime environments that do not support pipeline parameters, Pipeline Parameters will not show up in the input properties for generic nodes. Similarly, the “Please select a parameter to use as input” option will not show up in the dropdown for custom node input properties.
Exporting and Submitting Parametrized Pipelines
Now that all the relevant node and pipeline properties have been configured, it is time to prepare the pipeline for execution. Elyra has two options for this: export and submit.
Elyra also supports exporting and submitting pipelines via the CLI. It is currently only possible, however, to customize pipeline parameters using the Visual Pipeline Editor (VPE), as shown in the example here.
In this example, we will illustrate how to customize parameters during pipeline export from the VPE; the method for pipeline submission is much the same. After clicking the export button, an export dialog appears that includes selections for runtime configuration and export format. We will select Python DSL for this example. As shown in the below screenshot, any parameters that have been referenced by nodes also appear in this dialog box.
Note that, in this case, all parameters that we defined have been referenced by a pipeline node. Any parameters that are defined in the Pipeline Parameters panel but not referenced by any generic or custom node will not show up in the export or submission dialog.
Any default value that has been defined for a parameter will show up for that parameter as a placeholder in the value input box. A value can be entered here to override the defined default value, if desired. In this case, we will enter a new value for customizable_url but keep the default value for batch_size (shown below). Click OK to begin the export process.
If the parameter is marked as required and neither a default value nor a value at submission time is provided, the OK button will be disabled until a value is entered.
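The validation rule above can be sketched as follows; the function name and structure are hypothetical and only illustrate the resolution order (submitted value first, then default):

```python
# Hypothetical sketch of the export-time rule: a required parameter must
# receive a value either from its default or from the export dialog.
def resolve_parameter(required: bool, default=None, submitted=None):
    value = submitted if submitted is not None else default
    if required and value is None:
        raise ValueError("required parameter has no value; OK stays disabled")
    return value

print(resolve_parameter(required=False, default=1000))           # 1000
print(resolve_parameter(required=True, submitted="https://..."))  # https://...
```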
Once export has completed, we can view the representation of the parametrized pipeline as Python code. As seen in the snippet below, the value provided during export for customizable_url has replaced the parameter default value, whereas the default value provided for batch_size is used. Each parameter is type-hinted according to the corresponding Kubeflow Pipelines base type indicated for the parameter. Each node is represented by a variable prefixed with task_. The parameters referenced by each node are passed to the task function as arguments. In order to allow arguments to be passed to generic node tasks, inputs to these tasks are defined in the component definition that appears in the same file.
@kfp.dsl.pipeline(name="parameters-ex")
def generated_pipeline(
    customizable_url: str = "https://my-cloud-storage/data/2022/12/26",
    batch_size: int = 1000,
):
    # Task for node 'Download data'
    task_057d9c11_01f9_462a_be4e_47886f108d8b = (
        factory_bcf3d5779dce44765a282(
            url=customizable_url,
            curl_options="""--location""",
        )
    )

    # Task for node 'analyze'
    task_b7de9c8c_a672_44c4_bb2d_323daaec3660 = (
        factory_9621724082033a96dd013b(
            batch_size=batch_size,
        )
    )
Below is an excerpt of the component definition for the generic node, which appears toward the beginning of the generated code. As shown, a separate input is defined for each parameter referenced by the generic node, with the input name matching the parameter name.
component_def_9621724082033a96dd013b = """
name: Run a file
description: Run a Jupyter notebook or Python/R script

inputs:
- {name: batch_size, type: Integer, description: 'Number of samples processed before the model is updated', default: 1000, optional: true}

implementation:
  container:
    image: pytorch/pytorch:1.4-cuda10.1-cudnn7-devel
    command: [sh, -c]
    args:
    - |
      batch_size="$0"
      sh -c "mkdir -p ./jupyter-work-dir && cd ./jupyter-work-dir"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/kfp/bootstrapper.py' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/elyra/kfp/bootstrapper.py --output bootstrapper.py"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra.txt' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra.txt --output requirements-elyra.txt"
      sh -c "echo 'Downloading https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra-py37.txt' && curl --fail -H 'Cache-Control: no-cache' -L https://raw.githubusercontent.com/elyra-ai/elyra/main/etc/generic/requirements-elyra-py37.txt --output requirements-elyra-py37.txt"
      sh -c "python3 -m pip install packaging && python3 -m pip freeze > requirements-current.txt && python3 bootstrapper.py --pipeline-name 'parameters-ex' --cos-endpoint 'http://cloning1.fyre.ibm.com:30965' --cos-bucket 'cloning1-test' --cos-directory 'parameters-ex-1229164352' --cos-dependencies-archive 'analyze-b7de9c8c-a672-44c4-bb2d-323daaec3660.tar.gz' --file 'parameters/analyze.ipynb' --pipeline-parameters 'batch_size=$batch_size' --parameter-pass-method 'env' "
    - {inputValue: batch_size}
"""

factory_9621724082033a96dd013b = (
    kfp.components.load_component_from_text(
        component_def_9621724082033a96dd013b
    )
)
Run the generated Python DSL to compile the representation into a YAML file that Kubeflow Pipelines understands. This YAML file can then be uploaded in the Kubeflow Pipelines UI, as we will see next.
Alternatively, this pipeline can be submitted directly to Kubeflow Pipelines from the Elyra pipeline editor to the same effect.
Editing a Parametrized Pipeline in the Kubeflow Pipelines UI
Pipelines can be uploaded to the Kubeflow Pipelines UI in YAML format. The Run parameters section shown below exposes the parameters that were defined for this pipeline. For each pipeline run, new values for each parameter can be supplied directly, without the need to modify the pipeline in Elyra’s pipeline editor.
Conclusion
Thank you for reading! You can learn more about our additional capabilities by checking out the published resources page of the Elyra documentation. As an open source project, we welcome community and user involvement of any kind. Find information about how you can reach out with questions and suggestions here, or learn more about contributing here. We look forward to your contributions!