Automate your machine learning workflow tasks using Elyra and Apache Airflow
In this blog post I’ll introduce a GUI-driven approach to creating Apache Airflow workflows that implement machine learning pipelines using Jupyter notebooks and Python scripts.
October 2021: This blog post refers to an older version of Elyra. Refer to the latest blog post for the latest developments. You can now add custom Apache Airflow operators to pipelines.
Apache Airflow is an open source workflow management platform that allows for programmatic creation, scheduling, and monitoring of workflows. In Apache Airflow, a workflow (or pipeline) is represented by a Directed Acyclic Graph (DAG) that comprises one or more related tasks. A task represents a unit of work (such as the execution of a Python script) and is implemented by an operator. Airflow comes with built-in operators, such as the PythonOperator, and can be extended using provider packages.
Let’s say you want to use an Apache Airflow deployment on Kubernetes to periodically execute a set of Jupyter notebooks or scripts that load and process data in preparation for machine learning model training.
Without going into the details of how, you could implement this workflow using the generic PythonVirtualEnvOperator to run the Python script and the special-purpose PapermillOperator to run the notebook. If the latter doesn’t meet your needs, you’d have to implement its functionality yourself by developing code that performs custom pre-processing, uses the papermill Python package to run the notebook, and performs custom post-processing.
An easier way to create a pipeline from scripts and notebooks is to use Elyra’s Visual Pipeline Editor. This editor lets you assemble pipelines by dragging and dropping supported files onto the canvas and defining their dependencies.
Once you’ve assembled the pipeline and are ready to run it, the editor takes care of generating the Apache Airflow DAG code on the fly, eliminating the need for any coding.
If you are new to the Elyra open source project, take a look at the overview documentation.
The pipeline editor can also generate code that runs the pipeline in JupyterLab or on Kubeflow Pipelines for greater flexibility.
For in depth information about pipelines, the pipeline editor, and how to run them refer to the Elyra tutorials.
Prerequisites
To create pipelines in Elyra and run them on Apache Airflow, you need
- JupyterLab 3.0 with the Elyra extensions v2.1 (or later) installed
- Apache Airflow v1.10.8 or later (v2.x has not been tested!)
- A configuration of Apache Airflow that uses the Kubernetes Executor with git-sync enabled and that has been set up for use with Elyra
Creating a pipeline
Pipelines are created in Elyra with the Visual Pipeline Editor by adding Python scripts or notebooks, configuring their execution properties, and connecting the files to define dependencies.
Each pipeline node represents a task in the DAG and is executed in Apache Airflow with the help of a custom NotebookOp operator. The operator also performs pre- and post-processing operations that, for example, make it possible to share data between multiple tasks using shared cloud storage.
Node properties define the container image that the operator will run in, optional CPU, GPU, or RAM resource requests, file dependencies, environment variables, and output files. Output files are files that need to be preserved for later use after the node has been processed. For example, a notebook that trains a model might save the model files for later consumption, such as deployment.
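To give a feel for what these properties cover, here is a simplified sketch of a node's configuration as a Python dict. The actual `.pipeline` file uses a JSON schema with different field names; everything below is illustrative:

```python
# Illustrative sketch of the properties a pipeline node carries.
# Field names are simplified, not Elyra's actual .pipeline schema.
node_properties = {
    "filename": "train_model.ipynb",                 # notebook or script to run
    "runtime_image": "tensorflow/tensorflow:2.4.1",  # container image for the task
    "cpu": 2,                                        # optional resource requests
    "gpu": 1,
    "memory": "4G",
    "env_vars": ["DATA_DIR=/tmp/data"],              # environment for the run
    "dependencies": ["utils.py"],                    # files shipped with the task
    "outputs": ["model.h5"],                         # preserved after the run,
                                                     # e.g. for deployment
}
```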
A pipeline that is comprised only of file nodes (nodes that execute a Python script or Jupyter notebook) can be run “as-is” locally in JupyterLab, or remotely in Apache Airflow or Kubeflow Pipelines.
Future releases of Elyra might provide support for node types that are specific to a runtime platform. Pipelines that include such nodes can take advantage of platform specific features but won’t be portable.
A pipeline definition does not include any target environment information, such as the host name of the Apache Airflow web server. This information is encapsulated in Elyra in runtime configurations.
Creating a runtime configuration
A runtime configuration for an Apache Airflow deployment includes the Airflow web server URL, details about the GitHub repository where DAGs are stored, and connectivity information for the cloud storage service, which Elyra uses to store pipeline-run specific artifacts.
Elyra supports repositories on github.com and GitHub Enterprise. Note that some of the runtime configuration information is embedded in the generated DAGs to provide the NotebookOp operator access to the configured cloud storage. Therefore, you should always use a private repository to store DAGs that were produced by Elyra.
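The pieces such a runtime configuration holds can be sketched as a dict. The field names below only approximate Elyra's schema, and all values are placeholders:

```python
# Approximate sketch of an Apache Airflow runtime configuration in Elyra.
# Field names roughly follow Elyra's schema; all values are placeholders.
airflow_runtime_config = {
    "api_endpoint": "https://your-airflow-webserver:8080",   # Airflow web server URL
    "github_api_endpoint": "https://api.github.com",
    "github_repo": "your-org/your-private-dag-repo",         # keep this private!
    "github_branch": "main",
    "github_repo_token": "<personal-access-token>",
    "cos_endpoint": "https://your-object-storage:9000",      # S3-compatible storage
    "cos_username": "<access-key>",
    "cos_password": "<secret-key>",
    "cos_bucket": "elyra-pipeline-artifacts",                # run artifacts land here
}
```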
Running pipelines on Apache Airflow
Once you’ve created a pipeline and a runtime configuration for your Apache Airflow cluster, you are ready to run the pipeline.
When you submit a pipeline for execution from the Visual Pipeline Editor, Elyra performs the following pre-processing steps:
- package the input artifacts (files and dependencies) for each task in a compressed archive
- upload the archives to the cloud storage bucket referenced in the runtime configuration
- generate a DAG comprising one task for each notebook or script
- upload the DAG to the GitHub repository that Apache Airflow is monitoring
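The first packaging step can be sketched with Python's standard library. The file names stand in for a task's notebook and its declared dependency:

```python
import pathlib
import tarfile

# Placeholder files standing in for a task's notebook and a declared dependency.
pathlib.Path("load_data.ipynb").write_text("{}")
pathlib.Path("utils.py").write_text("# helper code\n")

# Bundle the task's input artifacts into one compressed archive,
# ready to be uploaded to the cloud storage bucket.
with tarfile.open("load_data-archive.tar.gz", "w:gz") as tar:
    for name in ["load_data.ipynb", "utils.py"]:
        tar.add(name)
```

At run time, each task's pod pulls down its archive from cloud storage, unpacks it, and executes the notebook or script.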
The uploaded DAG is pre-configured to run only once.
Within limits, you can customize the generated DAG by exporting the pipeline instead of running it. The main difference between running and exporting a pipeline for Apache Airflow is that the latter does not upload the generated DAG file to the GitHub repository.
Monitoring a pipeline run on Apache Airflow
How soon a DAG is executed after it is uploaded to the repository depends on the git-sync refresh interval and the scheduler settings in your Apache Airflow configuration.
Each notebook or script in the pipeline is executed as a task in its own Kubernetes pod, using Elyra’s NotebookOp operator.
Once a task has been processed, its outputs can be downloaded from the associated cloud storage bucket. Outputs include the completed notebooks, an HTML version of each notebook, a log file for each Python script, and files that were declared as output files.
Getting started
If you already have access to a v1.10.x Apache Airflow cluster, you can start running pipelines in minutes:
- Prepare your cluster for use by Elyra
- Install Elyra from PyPI or pull the ready-to-use container image.
- Step through the tutorial to learn how to create a pipeline, configure a pipeline, and run or export it.
If you are interested in running pipelines on Apache Airflow on the Red Hat OpenShift Container Platform, take a look at Open Data Hub. Open Data Hub is an open source project (just like Elyra) that aims to include everything you need to start running machine learning workloads in a Kubernetes environment.
Using Watson Studio services in pipelines
We’ve introduced pipelines in Elyra to make it easy to run Jupyter notebooks or scripts as batch jobs, and thus automate common repetitive tasks.
The recently launched beta of IBM Watson Studio Orchestration Flow takes this a step further. The flow orchestrator integrates with various data and AI services in Watson Studio, enabling users to ingest data, or train, test and deploy machine learning models at scale.
The orchestration flow editor, which is shown in the screen capture below, is based on the Elyra canvas open source project.
Questions and feedback
We’d love to hear from you! You can reach us in the community chatroom, the discussion forum, or by joining our weekly community call.