Getting Started with Apache Airflow Operators in Elyra
With the release of version 3.6, Elyra has made several enhancements that provide more complete support for the development of Apache Airflow workflows. In this blog post, I’ll go over the basics of Apache Airflow and how to use Elyra to create pipelines that include a variety of Apache Airflow operators.
Consider reading this blog post for a more in-depth discussion of some of the Elyra concepts referenced here, such as custom components and component catalogs, and to learn more about Elyra’s Kubeflow Pipelines support.
Apache Airflow
Apache Airflow is an open-source workflow orchestration tool that supports the creation, execution, and tracking of workflows. A workflow, or pipeline, can be thought of as a sequence of steps taken to achieve a certain data transformation goal. Each step, also called a task, corresponds to a single unit of work in the sequence. In Apache Airflow, a workflow is represented by a Directed Acyclic Graph (DAG) where each node of the graph corresponds to a task and each edge represents a dependency between tasks. The implementation of a task in Airflow is detailed in a reusable Python class called an operator; each task is therefore an instantiation of an operator. Airflow deployments come with a set of built-in operators for common tasks, and Airflow can be extended with additional operators by installing any number of provider packages.
Apache Airflow is code-first, meaning that a workflow DAG and its associated operator tasks are defined programmatically. This is what makes Airflow so powerful and flexible, but it can also be cumbersome for those who aren’t familiar with Python, or when writing a DAG for a very large or complex pipeline. This is where Elyra’s Visual Pipeline Editor (VPE) can simplify things.
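To make “code-first” concrete, the sketch below shows roughly what a hand-written two-task DAG looks like in Airflow 1.10.x. The dag_id, task names, and bash commands are illustrative, not from any real deployment:

```python
# A minimal hand-written DAG with two dependent tasks
# (illustrative names; Airflow 1.10.x import paths).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="hello_airflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # run only when triggered manually
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The >> operator draws the DAG edge: 'load' runs after 'extract' succeeds.
    extract >> load
```

Every task instantiates an operator, and the `>>` bitshift syntax declares the edges of the graph. Writing files like this by hand is exactly the step the VPE removes.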
Elyra Pipelines and the VPE
The Elyra VPE makes it possible to build a pipeline without writing any code. The interface includes a palette (on the left), the canvas (in the center), and a properties panel (on the right), as shown below.
The palette is where the available operators (more generally referred to as components in the Elyra documentation) are listed. Assemble a pipeline by dragging components from the palette onto the canvas and connecting them as appropriate. Elyra also exposes each component’s input and output properties in the properties panel.
Note that only input properties will be displayed for Airflow components (operators).
As seen in the screenshot above, when building an Apache Airflow-specific pipeline, there are two types of components available in the palette:
- generic components for executing Jupyter notebooks, Python scripts, and R scripts. These can be found under the Elyra category in the palette
- runtime-specific components (also called custom components), each of which corresponds to a single Airflow operator and is shown with the Apache Airflow logo to the left of its name. The set seen here is organized under a Core packages category
Custom components are not included with Elyra by default and must therefore be managed separately using Elyra’s component catalog feature.
Component Catalogs
Elyra does not include its own component repository and relies on component catalogs and their associated connectors to manage the components available in the palette. Catalog connectors serve two purposes:
- retrieve a list of components that a given catalog makes available
- fetch the component definition in order to expose its input and output properties
Elyra provides three built-in component catalog connector types: filesystem catalogs and directory catalogs, which provide access to components stored in a filesystem either individually or together in a directory, and URL catalogs, which provide access to components available via an anonymous HTTP GET request. Additional connectors are available in the Elyra examples repository; unlike the built-in types, these must be explicitly installed before their components can be accessed. If none of the pre-built connectors fits your needs, you can also build your own connector by following the instructions here.
Elyra version 3.6 builds on this by adding two new types of built-in catalog connectors that are Airflow-specific: an Airflow package catalog connector and an Airflow provider package catalog connector. These connectors are available in the Elyra distribution with no need to install any additional packages.
Airflow Package Catalog Connector
The Airflow package catalog connector gives users access to the built-in (or ‘core’) operators of a given Airflow distribution. These include the BashOperator, which executes a bash command, and the EmailOperator, which sends an email to a list of recipients. A full list can be found here.
To configure an Airflow package catalog, first open the ‘Component Catalogs’ panel from JupyterLab. Select the + symbol, then ‘New Apache Airflow package operator catalog’.
Enter a name for the catalog and modify the description, if desired. The runtime is set automatically as Apache Airflow. Optionally specify one or more categories for the operators that will be retrieved by this catalog. These category names are used to organize the operators in the palette. The single category ‘Core packages’ is assigned by default.
Lastly, you’ll need to configure the Airflow package download URL. The URL must meet a few constraints:
- it must point to a built distribution (wheel) file
- it must reference a location that Elyra can access using an HTTP GET request without the need to authenticate
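Before pasting a URL into the catalog form, you can sanity-check the first constraint yourself. The helper below is a hypothetical pre-check, not part of Elyra; a complete check would also issue an unauthenticated HTTP GET (for example with urllib.request.urlopen) to confirm the second constraint:

```python
from urllib.parse import urlparse

def looks_like_wheel_url(url: str) -> bool:
    """Cheap pre-check that a catalog URL points at a wheel (.whl) file.

    This only inspects the URL path; confirming anonymous access
    requires actually fetching the URL without credentials.
    """
    return urlparse(url).path.endswith(".whl")

# Hypothetical URLs for illustration:
looks_like_wheel_url("https://example.com/dist/apache_airflow-1.10.15-py2.py3-none-any.whl")  # True
looks_like_wheel_url("https://example.com/dist/apache-airflow-1.10.15.tar.gz")  # False: sdist, not a wheel
```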
For example, for a distribution available on PyPI, open the package’s ‘Release history’ and choose your desired version.
From the appropriate version’s page, click on the ‘Download files’ link and copy the link to the package wheel file.
Paste the copied link into the appropriate field and click ‘Save & Close’. The core operators should now be available in the VPE palette under the specified categories, which in this case is the default category, ‘Core packages’.
While not strictly required, you should also make sure that the version of the package referenced by the download URL matches the version of the Apache Airflow package in your cluster. If they do not match, changes made to an operator between versions could cause an error during pipeline execution.
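Because wheel filenames follow a standard naming convention (PEP 427: name-version-…-platform.whl), the version can be read straight out of the download URL. A small sketch:

```python
def parse_wheel(filename: str) -> tuple:
    """Extract the distribution name and version from a wheel filename.

    PEP 427 wheel names have the form:
    {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    so the first two dash-separated fields are name and version.
    """
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    return parts[0], parts[1]

name, version = parse_wheel("apache_airflow-1.10.15-py2.py3-none-any.whl")
print(name, version)  # apache_airflow 1.10.15
```

Compare the printed version against the Airflow version running in your cluster before saving the catalog.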
Like the existing catalog connector types, the new Airflow package catalog connector also facilitates sharing pipelines between Elyra deployments. For example, assume User A adds operators from archive apache_airflow-1.10.15-py2.py3-none-any.whl to their Elyra deployment and creates a pipeline using those operators. User B has their own Elyra deployment and configures a catalog to add operators from the older archive apache_airflow-1.10.12-py2.py3-none-any.whl. In this situation, User B can still run pipelines that User A created, and vice versa, because operators are internally identified using a set of descriptors that does not include the version number of the configured archive download URL. As mentioned above, operators may differ slightly between versions, and changes to certain node properties may be required in order to guarantee the same functionality in the shared pipeline.
Note that the connector should already support Airflow version 2.x packages, but these have not been tested as Elyra does not yet support Airflow 2.x.
Airflow Provider Package Catalog Connector
If an operator isn’t installed with Airflow by default, there is a large selection of provider packages that very likely includes one implementing the needed functionality. Provider packages are written and maintained by the Apache Airflow community and must be installed separately on the Airflow cluster. The apache-airflow-providers-amazon package and the apache-airflow-providers-google package are two examples of community-maintained provider packages that implement support for a variety of Amazon and Google-related tasks, respectively. A full list of provider packages can be found here. Users can also write their own providers to extend core functionality as required.
The instructions for configuring an Airflow provider package catalog to make its operators available in Elyra are much the same as those for configuring an Airflow package catalog. When adding a new catalog from the ‘Component Catalogs’ panel, select ‘New Apache Airflow provider package operator catalog’. The remaining instructions are the same, including the requirements for the Provider package download URL: it must point to a .whl file accessible via an anonymous HTTP GET request. Provider packages are commonly available on PyPI. Ensure that you are selecting the best version for your Airflow cluster.
The screenshot below shows a palette with several custom components imported from the Slack provider package on PyPI.
Again, keep in mind that the provider package connector has not been tested with Airflow 2.x provider packages, as Elyra does not yet support Airflow 2.x. If you are interested in implementing support for Airflow 2.x, please consider contributing this feature or suggesting related improvements as part of Elyra’s ‘Bring Your Own Runtime’ objective.
Now that the package operators have been added to the palette, there is just one last step to complete in order to successfully export and submit Airflow pipelines that include custom components.
Configuring Elyra to Support Custom Airflow Components
Elyra and its VPE simplify the pipeline creation and submission process for Airflow by removing the need to manually write the code to define the DAG. However, in order to successfully render the DAG for a given pipeline, Elyra must be configured with the fully qualified class names of the custom operators used in that pipeline. This configuration supports Elyra’s ability to correctly render the import statements for all operator nodes in a given DAG.
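As an illustration of why the fully qualified name matters (this is a conceptual sketch, not Elyra’s actual implementation), rendering an operator’s import statement is essentially a split at the last dot:

```python
def render_import(fully_qualified_class: str) -> str:
    """Turn 'package.module.ClassName' into an import statement."""
    module, cls = fully_qualified_class.rsplit(".", 1)
    return f"from {module} import {cls}"

print(render_import("airflow.operators.bash_operator.BashOperator"))
# from airflow.operators.bash_operator import BashOperator
```

Without the full dotted path, there is no way to know which module the class should be imported from, which is why a bare class name is not enough.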
If you do not correctly configure an operator package name and try to export or submit a pipeline, Elyra will give you an error message similar to the following:
As noted in the screenshot, the operators’ fully qualified package names must be added to the available_airflow_operators variable. This variable is a list and a configurable trait in Elyra. To configure available_airflow_operators, first create a configuration file from the command line:
$ jupyter elyra --generate-config
Open the created file and find the PipelineProcessor(LoggingConfigurable) header, then add the following line under that header:
c.AirflowPipelineProcessor.available_airflow_operators.extend([...])
Replace the … with the fully qualified package string for each operator class as needed. For example, if you want to use the SlackAPIPostOperator from the Slack provider package and the PapermillOperator from the core package in your pipelines, your configuration will look like this:
#------------------------------------------------------------------
# PipelineProcessor(LoggingConfigurable) configuration
#------------------------------------------------------------------
c.AirflowPipelineProcessor.available_airflow_operators.extend(
    [
        "airflow.providers.slack.operators.slack.SlackAPIPostOperator",
        "airflow.operators.papermill_operator.PapermillOperator"
    ]
)
There is no need to restart JupyterLab in order for these changes to be picked up. Pipeline export and submit will now succeed!
The Elyra team is currently working on automating this configuration step for the Airflow package connectors covered here. The step will still be required when using other types of catalog connectors, such as URL and filesystem catalogs.
Conclusion
Thank you for reading and for your interest in Elyra. As an open-source project, we welcome contributions and feedback of any kind, from bug reports to new features to improved documentation. Find details about how you can reach out with questions, comments, or suggestions in the Getting help topic in the documentation. We look forward to hearing from you!