Automate Elyra tasks using the CLI

Sunset over Lake Pontchartrain
Lake Pontchartrain, Louisiana (Photo by Rachel Boykin)

The Elyra JupyterLab extension makes it easy to create, manage, and run pipelines and supporting assets such as runtime configurations and runtime images with an intuitive GUI. The Visual Pipeline Editor enables these common data science tasks without the need for coding.

The Elyra Visual Pipeline Editor with 4 nodes
The Visual Pipeline Editor

For users who prefer a CLI, or in situations where a set of otherwise tedious tasks needs to be automated, the features and options available in Elyra’s CLI can be exceptionally useful. The Elyra CLI can be used to manage metadata (elyra-metadata) and to work with pipelines (elyra-pipeline). Both tools are part of the Elyra installation and can be used without a running JupyterLab instance. In this blog post, I’ll go over some of the available subcommands for each and cover some examples in the process.

Managing Elyra Metadata

In order to facilitate building and running pipelines, Elyra needs to store certain information such as runtime configurations, runtime images, component catalogs, and code snippets. The elyra-metadata command can be used to modify this metadata using the available subcommands: list, create, update, import, export and remove.

Currently, there are four types of metadata that Elyra stores and that can be modified using the CLI: code-snippets, component-catalogs, runtimes, and runtime-images. These types are also referred to as schemaspaces. When running any elyra-metadata subcommand, the schemaspace name must be provided. A quick explanation of each schemaspace is shown in the table below.

A table explaining the purpose of the 4 schemaspaces
Elyra schemaspaces and their purposes

The structure of an elyra-metadata command follows the format:

$ elyra-metadata SUBCOMMAND SCHEMASPACE [OPTIONS]

Run elyra-metadata and its subcommands with the --help flag at any time for a summary of the usage and available subcommands covered below.

Listing Metadata

The list subcommand lists all the instances of the specified schemaspace. For example, run the command below to see a list of all installed runtime images.

$ elyra-metadata list runtime-images
Output of the `list` subcommand for the runtime-images schemaspace

Note that the list subcommand also displays any invalid instances present in the Elyra metadata storage folder, so the list may include more entries than appear in the GUI. Invalid instances are marked with **INVALID** on the relevant row. Use the --valid-only flag (e.g. elyra-metadata list runtime-images --valid-only) to exclude invalid instances, if desired.

You can also see metadata instances in their raw JSON form by supplying the --json flag to the above command. Below is the representation for the image named new-pytorch-runtime:

JSON representation of a single runtime image

The structure of an instance can be helpful to have handy when adding or modifying an instance, as it hints at which fields may need to be supplied to the create or update subcommands.

Creating Metadata

Using the CLI, the command to create the instance shown in the screen capture above would look something like this:

$ elyra-metadata create runtime-images \
--name "new-pytorch-runtime" \
--display_name "Pytorch 1.4 with CUDA-runtime" \
--description "PyTorch 1.4 (with GPU support)" \
--image_name "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime"

For any subcommand, if a required parameter is not included or is improperly formatted, the output of that command will display an appropriate error message.

Error output showing a missing parameter and other option information

Two additional parameters are available for the create subcommand:

  • --file, which takes a file path to a JSON-formatted file that contains values for the properties of a metadata instance
  • --json, which takes a JSON string that contains values for the properties of a metadata instance

These options are covered in more detail in the following subsection.

Updating Metadata

The update subcommand modifies an existing instance and is similar in structure to create. Note that for updates, only the name parameter and the parameters for the fields being modified need to be supplied. The value of the name parameter is the one displayed in the Instance column of the list output, or in the name field of the JSON representation of an instance. For example, to change the pull policy for the same example image:

$ elyra-metadata update runtime-images \
--name "new-pytorch-runtime" \
--pull_policy "Never"

You can review the output of the elyra-metadata list runtime-images --json command again to confirm that the pull_policy field is now set to “Never”.

JSON representation of the updated runtime-image

The resource_name field of a metadata instance should not be modified in most cases. The display_name field and any field within the metadata stanza, however, can be modified freely.

The examples shown above all focus on the runtime-images schemaspace, the instances of which are quite simple. As seen in the JSON representation of a runtime image, there are only a handful of properties saved for each. In the case where the instances are more complicated, however, such as with the runtimes schemaspace, the --file and --json options can greatly simplify the creation or update of instances.

The --file option takes a file path to a JSON-formatted file that contains either the entire JSON representation of an instance (that shown with the list --json subcommand) or only the JSON that comprises the metadata stanza. For example, the file shown in the below picture represents a metadata stanza of an instance in the runtime-images schemaspace.

JSON representation of runtime image metadata within a file
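In case the picture does not render, a metadata-stanza file like the one described might look as follows. The field values here are illustrative, not the exact contents of the pictured file:

```shell
# Create an illustrative metadata stanza file; the field values are
# examples, not the exact contents of the pictured file
cat > /tmp/metadata.json <<'EOF'
{
  "description": "PyTorch 1.4 (with GPU support)",
  "image_name": "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime",
  "pull_policy": "Always"
}
EOF
```

Only the keys that make up the metadata stanza are needed; the surrounding instance wrapper shown by list --json is optional.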

Run the following command to replace the pull_policy value with Always for the instance named new-pytorch-runtime:

$ elyra-metadata update runtime-images \
--name "new-pytorch-runtime" \
--file /path/to/metadata.json

Review the output of the elyra-metadata list runtime-images --json command, if desired, to confirm that the appropriate change has been made. Any value supplied in the file can be overridden by a value given on the command line if the appropriate option (e.g. --image_name) is supplied. The command below changes the pull_policy to Always (as defined in the file) and changes the image_name to my_repo/pytorch:1.4-cuda10.1-cudnn7-runtime, even though the image_name given in the file is different.

$ elyra-metadata update runtime-images \
--name "new-pytorch-runtime" \
--file /path/to/metadata.json \
--image_name "my_repo/pytorch:1.4-cuda10.1-cudnn7-runtime"

The --json option works similarly to --file, but allows the bulk JSON to be specified as a string. Its behavior is the same as --file with respect to the expected data and the ability to override. This option can be helpful when creating a file is inconvenient or the metadata is small.

$ elyra-metadata update runtime-images \
--name "new-pytorch-runtime" \
--json '{"description": "Pytorch 1.4 CUDA from my_repo"}'

Removing Metadata

Removing an instance is as simple as supplying the name parameter:

$ elyra-metadata remove runtime-images --name "my_image_name"

Exporting Metadata

As of version 3.8, Elyra also supports bulk export and import of instances using the elyra-metadata command. Exporting metadata instances allows for improved testing, simplified cloning of deployments, and easy backups for disaster recovery. A directory must be supplied when exporting:

$ elyra-metadata export runtime-images --directory "/tmp/foo"
Output of the elyra-metadata `export` subcommand

As noted in the output, instances are copied into a subfolder named for the schemaspace of interest within the given directory: /tmp/foo/runtime-images for the example above.

Note that the exported metadata is not encrypted and, therefore, should be stored in an access-restricted location.

Two additional flags are available to work with export:

  1. --include-invalid: include invalid instances when exporting
  2. --clean: clean out the directory for the given schemaspace before exporting the instances; this empties the /tmp/foo/runtime-images directory before exporting the current instances

Importing Metadata

Metadata instance import works in a similar manner and has analogous use cases. The below command imports all runtime image instances from the /tmp/foo/runtime-images directory that was created during the previous export command into the applicable metadata folder for access by Elyra:

$ elyra-metadata import runtime-images \
--directory "/tmp/foo/runtime-images"

Note that any subdirectories present in the supplied directory will be ignored.

If an imported instance has the same name as an instance already present in the Elyra metadata folder, the existing instance is preferred. Use the --overwrite flag to indicate that imported instances should overwrite existing ones in the case of a name collision.

Example Use Case

Now that we’ve covered how to use the metadata CLI tool to modify metadata, we can move on to the why. One particularly helpful use for elyra-metadata is environment setup, such as in a Dockerfile. In fact, Elyra uses this strategy in building its own container image. Below is an excerpt from Elyra’s Dockerfile. This is a relatively simple example, but it can easily be extended to include a set of custom runtime configurations, runtime images, or additional component catalogs that you might require in your environment, eliminating the need to manually create these metadata instances in the GUI at every startup.

Install Elyra and all optional dependencies, then create the component-catalog instance that makes the Kubeflow Pipelines example components available
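A sketch of what such an excerpt might look like is shown below. The base image, package specifier, and catalog option values are illustrative, not a verbatim copy of Elyra’s Dockerfile:

```dockerfile
# Sketch only: base image and option values are illustrative
FROM jupyter/minimal-notebook

# Install Elyra with all optional dependencies
RUN pip install --no-cache-dir "elyra[all]"

# Register the Kubeflow Pipelines example components so they are
# available in Elyra when the container is run (schema and category
# names here are examples)
RUN elyra-metadata create component-catalogs \
    --schema_name "elyra-kfp-examples-catalog" \
    --display_name "Kubeflow Pipelines examples" \
    --categories '["examples"]'
```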

The all-inclusive stand-alone installation (e.g. pip install elyra[all]) of Elyra includes the Kubeflow Pipelines example components, which provide an illustration of how custom components and component catalogs work. The RUN elyra-metadata command in the excerpt above ensures that these example components are available in Elyra when the container is run.

Working with Elyra Pipelines

Once the metadata groundwork has been laid, the elyra-pipeline command can be used to perform pipeline tasks. The following subcommands are available for this purpose: submit, run, validate, describe, and export. The elyra-pipeline --help command provides a summary of the usage and meaning of each.

Consider reading this page in the Elyra documentation for a more in-depth discussion of some of the pipeline-related concepts referenced here.

Describing Pipelines

The describe and validate subcommands are best used before the other three subcommands in order to check that pipelines are valid and that all the necessary prerequisites for pipeline execution are in place. The describe subcommand enables the latter by providing a short summary of the given pipeline. Among other things, this summary lists any file and component dependencies that will be referenced during pipeline construction or execution. A path to a valid pipeline file must be supplied to the describe subcommand:

$ elyra-pipeline describe /path/to/read_and_write.pipeline

All elyra-pipeline subcommands can accept either a relative or absolute path to a pipeline file. If a relative path is used, it must be relative to the active Jupyter working directory in order for the command to succeed.

The output of describe is shown below for the read_and_write pipeline. As shown, the pipeline has one file dependency, one_line_file.txt, meaning that this file must exist and be co-located with the pipeline file whenever it is run or submitted for remote execution.

Output of the elyra-pipeline `describe` subcommand

The above information can also be shown as JSON output, if desired, by supplying the --json flag in the describe subcommand.

Validating Pipelines

The validate subcommand runs a full validation check on a given pipeline. While validation always occurs before a pipeline is run, submitted, or exported, it can also be useful to run validation separately as a “dry run”:

$ elyra-pipeline validate /path/to/read_and_write.pipeline

If validation fails, a relevant message is displayed for each error encountered, and the command returns with a non-zero exit code.

Note that this exit code rule applies to all Elyra CLI commands. An exit code of 0 indicates success, and a non-zero exit code indicates failure.

Consider a case where the above command is run from a directory that is not the active Jupyter working directory. The output will look something like the following:

Output of elyra-pipeline `validate` command with failures

As noted in the validation errors listed above, Elyra cannot find the notebook and script node files and their related file dependencies from this directory. After changing to the correct root directory and re-running the command, validation passes, and the pipeline can be successfully run, submitted, or exported.

Running Pipelines

The run subcommand executes a generic pipeline in JupyterLab. Log messages are displayed indicating the progress of node processing.

$ elyra-pipeline run /path/to/read_and_write.pipeline

Note that pipelines that include runtime-specific components cannot be executed locally. Validation will fail, indicating that a runtime configuration must be supplied or that any runtime-specific components must be removed from the pipeline.

Submitting Pipelines

Where the run subcommand is used for local execution, the submit subcommand is the analogous command for remote pipeline execution. Elyra currently provides built-in support for Kubeflow Pipelines and Apache Airflow as remote execution engines. The submit command therefore requires that a runtime configuration metadata instance name be supplied as a parameter. Use the elyra-metadata list runtimes command to see the available runtime configurations.

Available runtime configurations given by the elyra-metadata `list` command

The Schema column indicates the execution engine to which this instance applies, and the Instance column indicates the name of the instance. The latter is the value that needs to be supplied to the submit subcommand.

$ elyra-pipeline submit /path/to/read_and_write.pipeline \
--runtime-config kfp_dev_env

If the specified --runtime-config is not compatible with the given pipeline, an error is raised during validation. If the pipeline is successfully submitted, the command returns a link that you can use to monitor the progress of the pipeline run and a link to the S3-based cloud storage where the run artifacts are stored.

Pipeline submissions to a Kubeflow Pipelines runtime can also be monitored from the command line. The --monitor option is used to indicate that the CLI should monitor the submission for up to --monitor-timeout minutes (the default is 60). More details on how to use this option can be found here.

Exporting Pipelines

On export, Elyra performs two tasks: 1. it packages dependencies for generic components and uploads them to S3-based cloud storage, and 2. it generates pipeline code for the target runtime. The export subcommand therefore does not submit pipelines but instead prepares them for later execution by creating a runtime-specific file. It works similarly to submit, requiring a runtime configuration parameter, and also supports two additional optional parameters:

  1. --output: specifies the path and name of the exported file, defaulting to the current directory and pipeline name
  2. --overwrite: indicates to overwrite the output file if it already exists

$ elyra-pipeline export /path/to/read_and_write.pipeline \
--runtime-config kfp_dev_env \
--output /path/to/exported.yaml \
--overwrite

The exported file (YAML for Kubeflow Pipelines or Python DAG for Apache Airflow) can then be manually uploaded at any time for execution.

Example Use Case

The elyra-pipeline CLI is great for automating pipeline processing. For example, say you have a pipeline that loads and processes data to ready it for model training. The model must be trained with the latest data set every week, meaning the pipeline must be periodically executed. With the elyra-pipeline CLI, a single command can take care of this situation.

$ elyra-pipeline submit /path/to/read_and_write.pipeline \
--runtime-config kfp_dev_env

If you have several pipelines with the same requirements, you could create a shell script to automate the entire workflow. You can also add elyra-metadata commands as needed. The below script creates a runtime configuration and component catalog, then loops through to validate each pipeline file in the my_pipelines directory, submitting the pipeline for execution if validation succeeds. Finally, it removes the created metadata resources.

A bash script to automate pipeline submission
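The pictured script might look something like the sketch below, which writes an illustrative simple_script.sh. The runtime configuration properties, instance names, and directory layout are placeholders, not working values:

```shell
# Write an illustrative simple_script.sh; the metadata property values
# below are placeholders, not working endpoints or credentials
cat > simple_script.sh <<'EOF'
#!/bin/bash
# Create the runtime configuration the pipelines will be submitted to
elyra-metadata create runtimes \
  --schema_name kfp \
  --name "kfp_dev_env" \
  --display_name "Kubeflow Pipelines dev" \
  --api_endpoint "https://kubeflow.example.com/pipeline" \
  --cos_endpoint "https://minio.example.com" \
  --cos_username "minio" \
  --cos_password "minio123" \
  --cos_bucket "pipeline-artifacts"
# A component catalog would be created here in the same way, using
# `elyra-metadata create component-catalogs`

# Validate each pipeline; submit it only if validation succeeds (exit code 0)
for pipeline in my_pipelines/*.pipeline; do
  if elyra-pipeline validate "$pipeline"; then
    elyra-pipeline submit "$pipeline" --runtime-config kfp_dev_env
  fi
done

# Remove the metadata created above
elyra-metadata remove runtimes --name "kfp_dev_env"
EOF
chmod +x simple_script.sh
```

The loop relies on the exit-code convention noted earlier: validate returns 0 on success, so submission only happens for pipelines that pass validation.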

The Elyra CLI tools have clearly simplified this otherwise complicated and time-consuming workflow to a single command: bash simple_script.sh.

Conclusion

I appreciate that you took the time to read this post, and I hope the examples given here have provided a few ideas on how you could use the Elyra CLI tools to simplify your pipeline workflows. You can learn more about Elyra’s capabilities and features by checking out our other published resources. If there is any functionality that you feel would be useful for Elyra to support, let us know by opening an issue, reaching out in the community chatroom, or by joining our weekly community call. As an open source project, we welcome contributions and feedback of any kind. We look forward to hearing from you!

Kiersten Stokes
Open Source Software Developer at the Center for Open-Source Data & AI Technologies