Automate your machine learning workflow tasks using Elyra and Apache Airflow

Patrick Titzler
IBM Data Science in Practice
7 min read · Mar 23, 2021

In this blog post I’ll introduce a GUI-driven approach to creating Apache Airflow workflows that implement machine learning pipelines using Jupyter notebooks and Python scripts.

Update (October 2021): This blog post refers to an older version of Elyra. Refer to the latest blog post for recent developments, such as the ability to add custom Apache Airflow operators to pipelines.

Apache Airflow is an open source workflow management platform that allows for programmatic creation, scheduling, and monitoring of workflows. In Apache Airflow, a workflow (or pipeline) is represented by a Directed Acyclic Graph (DAG) that comprises one or more related tasks. A task represents a unit of work (such as the execution of a Python script) and is implemented by an operator. Airflow comes with built-in operators, such as the PythonOperator, and can be extended using provider packages.
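To make the terminology concrete, here is a minimal hand-written DAG with a single PythonOperator task. It is only a sketch: the DAG id, schedule, and callable are made up, and the import path shown is the Airflow 1.10.x one (module paths differ in Airflow 2).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def say_hello():
    # The unit of work for this task: any Python callable will do.
    print("Hello from Apache Airflow!")


# The DAG groups related tasks and defines when they run.
dag = DAG(
    dag_id="hello_world",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
)

# Each operator instance becomes one task (one node) in the DAG.
hello_task = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
)
```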

Let’s say you want to use an Apache Airflow deployment on Kubernetes to periodically execute a set of Jupyter notebooks or scripts that load and process data in preparation for machine learning model training.

A basic two-step workflow that loads and pre-processes data

Without going into the details of how, you could implement this workflow using the generic PythonVirtualenvOperator to run the Python script and the special-purpose PapermillOperator to run the notebook. If the latter doesn’t meet your needs, you’d have to implement its functionality yourself by writing code that performs custom pre-processing, uses the papermill Python package to run the notebook, and performs custom post-processing.
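To make that concrete, a hand-rolled version of the notebook task might look roughly like the sketch below, which runs the notebook with the papermill package inside a PythonOperator callable. The file names and parameters are placeholders, and the pre- and post-processing steps are left as comments.

```python
from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def run_notebook():
    # Custom pre-processing would go here (e.g. fetching input data).
    # papermill executes the notebook and writes the executed copy,
    # including cell outputs, to a new file.
    pm.execute_notebook(
        "load_data.ipynb",                      # placeholder input notebook
        "load_data-output.ipynb",               # executed copy with cell outputs
        parameters={"data_dir": "/tmp/data"},   # placeholder notebook parameter
    )
    # Custom post-processing would go here (e.g. uploading results).


dag = DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@once",
)

run_notebook_task = PythonOperator(
    task_id="run_notebook",
    python_callable=run_notebook,
    dag=dag,
)
```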

An easier way to create a pipeline from scripts and notebooks is to use Elyra’s Visual Pipeline Editor. This editor lets you assemble pipelines by dragging and dropping supported files onto the canvas and defining their dependencies.

This pipeline comprises multiple notebooks and a Python script, which implement a basic machine learning workflow.

Once you’ve assembled the pipeline and are ready to run it, the editor takes care of generating the Apache Airflow DAG code on the fly, eliminating the need for any coding.

If you are new to the Elyra open source project, take a look at the overview documentation.

The pipeline editor can also generate code that runs the pipeline in JupyterLab or on Kubeflow Pipelines for greater flexibility.

Pipelines can run in local environments and remote environments, such as Kubeflow Pipelines or Apache Airflow.

For in-depth information about pipelines, the pipeline editor, and how to run them, refer to the Elyra tutorials.

Prerequisites

To create pipelines in Elyra and run them on Apache Airflow, you need:

  • a JupyterLab environment with the Elyra extensions installed
  • access to an Apache Airflow (v1.10.x) deployment that loads its DAGs from a GitHub repository
  • an S3-compatible cloud storage bucket that Elyra can use to store pipeline artifacts

Creating a pipeline

Pipelines are created in Elyra with the Visual Pipeline Editor by adding Python scripts or notebooks, configuring their execution properties, and connecting the files to define dependencies.

Add notebooks to a pipeline and define their dependencies

Each pipeline node represents a task in the DAG and is executed in Apache Airflow with the help of a custom NotebookOp operator. The operator also performs pre- and post-processing operations that, for example, make it possible to share data between tasks using shared cloud storage.

Node properties define the container image that the operator will run in, optional CPU, GPU, or RAM resource requests, file dependencies, environment variables, and output files. Output files are files that need to be preserved for later use after the node has been processed. For example, a notebook that trains a model might save the model files so they can be consumed later, for instance during deployment.

Defining remote execution properties for a Python script
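As an illustration of how environment variables and output files come together, a data-preparation script node might look like the sketch below. The variable name, URL, and file names are made up; the output file would be listed in the node’s properties so Elyra preserves it for downstream nodes.

```python
import os

import pandas as pd

# DATA_URL is assumed to be set as an environment variable in the node properties.
data_url = os.environ.get("DATA_URL", "https://example.com/data.csv")

# Load and pre-process the data.
df = pd.read_csv(data_url)
df = df.dropna()

# 'clean_data.csv' would be declared as an output file in the node properties,
# so Elyra preserves it on cloud storage for downstream nodes to consume.
df.to_csv("clean_data.csv", index=False)
```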

A pipeline that consists only of file nodes (nodes that execute a Python script or Jupyter notebook) can be run “as-is” locally in JupyterLab, or remotely on Apache Airflow or Kubeflow Pipelines.

Future releases of Elyra might provide support for node types that are specific to a runtime platform. Pipelines that include such nodes can take advantage of platform-specific features but won’t be portable.

A pipeline definition does not include any target environment information, such as the host name of the Apache Airflow web server. In Elyra, this information is encapsulated in runtime configurations.

Creating a runtime configuration

A runtime configuration for an Apache Airflow deployment includes the Airflow web server URL, details about the GitHub repository where DAGs are stored, and connectivity information for the cloud storage service, which Elyra uses to store pipeline-run specific artifacts.

A sample runtime configuration for an Apache Airflow deployment

Elyra supports repositories on github.com and GitHub Enterprise. Note that some of the runtime configuration information is embedded in the generated DAGs to give the NotebookOp operator access to the configured cloud storage. Therefore, you should always use a private repository to store DAGs that were produced by Elyra.

Running pipelines on Apache Airflow

Once you have created a pipeline and a runtime configuration for your Apache Airflow cluster, you are ready to run the pipeline.

To run a pipeline, select the runtime platform and runtime configuration.

When you submit a pipeline for execution from the Visual Pipeline Editor, Elyra performs the following pre-processing steps:

  • package the input artifacts (files and dependencies) for each task in a compressed archive
  • upload the archives to the cloud storage bucket referenced in the runtime configuration
  • generate a DAG comprising one task for each notebook or script
  • upload the DAG to the GitHub repository that Apache Airflow is monitoring

The uploaded DAG is pre-configured to run only once.

Within limits, you can customize the generated DAG by exporting the pipeline instead of running it. The main difference between running and exporting a pipeline for Apache Airflow is that the latter does not upload the generated DAG file to the GitHub repository.
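For example, since the uploaded DAG is configured to run only once, you might export the pipeline, change the schedule in the generated DAG file, and push it to the repository yourself. A minimal sketch of such a tweak, assuming the generated code follows standard Airflow conventions (the DAG id is a placeholder and the real file will contain more arguments):

```python
# Excerpt of an exported DAG definition: to run the pipeline periodically instead
# of once, change the schedule before pushing the file to the DAG repository.
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="my_exported_pipeline",    # placeholder DAG id
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",       # was "@once" in the generated file
    catchup=False,                    # avoid back-filling past runs
)
```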

Monitoring a pipeline run on Apache Airflow

How soon a DAG is executed after it has been uploaded to the repository depends on the git-sync refresh interval and the scheduler settings in your Apache Airflow configuration.

The generated DAG is listed in the Apache Airflow web GUI.
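You can also check on the run programmatically. The sketch below uses Airflow 1.10’s experimental REST API; it assumes that API is enabled on your deployment, and the host name and DAG id are placeholders.

```python
import requests

AIRFLOW_URL = "https://airflow.example.com"  # placeholder web server URL
DAG_ID = "hello_world_pipeline_0321"         # placeholder generated DAG id

# List the runs of the generated DAG and print their states.
response = requests.get(f"{AIRFLOW_URL}/api/experimental/dags/{DAG_ID}/dag_runs")
response.raise_for_status()

for run in response.json():
    print(run["execution_date"], run["state"])
```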

Each notebook or script in the pipeline is executed as a task using Elyra’s NotebookOp operator in its own Kubernetes pod.

The example pipeline’s DAG is executed in Apache Airflow.

Once a task has been processed, its outputs can be downloaded from the associated cloud storage bucket. Outputs include the completed notebooks, an HTML version of each notebook, a log file for each Python script, and files that were declared as output files.

The produced outputs are stored in the preconfigured cloud storage bucket.
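If you’d rather fetch the outputs programmatically than browse the bucket, a small sketch using boto3 might look like this. The endpoint, credentials, bucket name, and prefix are placeholders for the values in your runtime configuration.

```python
import os

import boto3

# Connect to the S3-compatible cloud storage referenced in the runtime configuration.
s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.com:9000",   # placeholder endpoint
    aws_access_key_id=os.environ["COS_USERNAME"],    # placeholder credentials
    aws_secret_access_key=os.environ["COS_PASSWORD"],
)

bucket = "airflow-pipeline-artifacts"   # placeholder bucket name
prefix = "hello_world_pipeline-0321/"   # placeholder pipeline run folder

# Download every artifact that the pipeline run produced.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    key = obj["Key"]
    local_path = os.path.basename(key)
    s3.download_file(bucket, key, local_path)
    print(f"downloaded {key} -> {local_path}")
```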

Getting started

If you already have access to a v1.10.x Apache Airflow cluster, you can start running pipelines in minutes: install JupyterLab with the Elyra extensions, create a runtime configuration for your deployment, assemble a pipeline in the Visual Pipeline Editor, and submit it for execution.

If you are interested in running pipelines on Apache Airflow on the Red Hat OpenShift Container Platform, take a look at Open Data Hub. Open Data Hub is an open source project (just like Elyra) that should include everything you need to start running machine learning workloads in a Kubernetes environment.

Using Watson Studio services in pipelines

We’ve introduced pipelines in Elyra to make it easy to run Jupyter notebooks or scripts as batch jobs, and thus automate common repetitive tasks.

The recently launched beta of IBM Watson Studio Orchestration Flow takes this a step further. The flow orchestrator integrates with various data and AI services in Watson Studio, enabling users to ingest data and to train, test, and deploy machine learning models at scale.

The orchestration flow editor, which is shown in the screen capture below, is based on the Elyra canvas open source project.

Example machine learning pipeline in Watson Studio Orchestration Flow.

Questions and feedback

We’d love to hear from you! You can reach us in the community chatroom, the discussion forum, or by joining our weekly community call.

Patrick Titzler is a Developer Advocate at the Center for Open-Source Data & AI Technologies.