Apache Airflow is a popular open-source data orchestration framework that allows developers to programmatically author and schedule workflows. At Omada, we use Airflow to dynamically create, schedule, and monitor complex workflows that support various business needs.
This post assumes basic knowledge of Airflow. It gives a high-level overview of how the data engineers at Omada have integrated Airflow as a core tool for creating and managing Data and Machine Learning pipelines, triggering report generation, and running data extracts from third-party applications.
Omada’s Airflow Fork
At Omada, we forked the Airflow repository and modified the original implementation to create a custom “DAG Dependency” feature, which enables explicitly defining upstream DAG dependencies in a DAG’s definition. Many of our pipeline designs contained multiple DAGs with various upstream and downstream DAGs, so we depended heavily on Airflow Sensors to maintain the correct order of execution. Adding a sensor to every DAG put extra overhead on workers and used up many of the worker slots, sometimes blocking other operators. Instead of using Airflow’s sensors, we implemented our own solution that lets us explicitly define DAG dependencies within the DAG definition. Let’s look at the DAG dependency feature in more detail.
DAG Dependency Feature
Consider an example of defining non-cyclic DAG dependencies. As shown in the image below, dag_2 has an upstream dependency on dag_1. Just as a successful task completion triggers the next task in Airflow, dag_2 will wait for dag_1 to complete before executing its own tasks. Omada’s “DAG Dependency” feature lets us build a data pipeline by linking different workflows together without relying on Airflow sensors.
Creating explicit inter-DAG dependencies means we do not need to define a sensor for each DAG: worker slots are freed up for other tasks, yet we still control the order in which DAGs execute. This removed our dependence on sensors across our Airflow deployment.
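Conceptually, the feature boils down to resolving an explicit dependency graph before triggering DAG runs. The sketch below is plain Python, not code from our Airflow fork, and the DAG names beyond dag_1/dag_2 from the example above are hypothetical; it shows how explicit upstream declarations can be resolved into a valid execution order with a topological sort (which also rejects cyclic dependencies):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each DAG declares its upstream DAG dependencies explicitly,
# mirroring the "DAG Dependency" feature described above.
upstream_deps = {
    "dag_1": [],          # no upstream dependencies: runs first
    "dag_2": ["dag_1"],   # dag_2 waits for dag_1 to complete
    "dag_3": ["dag_1"],
    "dag_4": ["dag_2", "dag_3"],
}

def execution_order(deps):
    """Return a valid run order; raises CycleError for cyclic dependencies."""
    return list(TopologicalSorter(deps).static_order())

print(execution_order(upstream_deps))
```

Note that DAGs with no ordering relationship between them (dag_2 and dag_3 here) may run in either order, or concurrently.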
Data Pipeline

One of the important workflows currently supported by Apache Airflow at Omada is the Data Pipeline. It consists of multiple extract and transform DAGs that, as the names suggest, extract data from various internal Omada applications and transform that data for reporting and analytics before it lands in our warehouse. We will expand on both types of DAGs and explain the dependencies between them.
Extract DAGs serve as the starting point of our pipeline and are the first to run. “Extract DAGs” are all the DAGs we have defined that extract data from internal Omada applications and store it in our data warehouse (Redshift). We run all of the Data Pipeline’s DAGs once a day.
We define a single DAG for each table we extract from an Omada application. For instance, if we are extracting `table_1`, then we have a corresponding `app_table_1` DAG that carries out the tasks to extract data from the corresponding “app” and insert it into Redshift. Likewise, we have multiple DAGs corresponding to the different tables we extract from many Omada applications. Collectively, these DAGs are classified as Extract DAGs; the classification is purely contextual, to make the structure of our Data Pipeline easy to understand.
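The one-DAG-per-table naming convention can be sketched in plain Python. This is illustrative only; apart from `app_table_1` from the example above, the application and table names are hypothetical:

```python
# Hypothetical mapping of source applications to the tables we extract.
extract_config = {
    "app": ["table_1", "table_2"],
    "billing": ["invoices"],
}

def extract_dag_ids(config):
    """Build one Extract DAG id per (application, table) pair."""
    return [f"{app}_{table}"
            for app, tables in config.items()
            for table in tables]

print(extract_dag_ids(extract_config))
# → ['app_table_1', 'app_table_2', 'billing_invoices']
```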
Since Extract DAGs are the first to run, none of the other DAGs will run until all of the Extract DAGs have completed, because they serve as upstream dependencies for every other DAG in the pipeline (remember DAG Dependency? That’s right).
In addition to Extract DAGs, “Transform DAGs” carry the Data Pipeline forward. Each uses an Operator that executes SQL against the extracted tables and writes the results to a new set of tables. These Transform DAGs transform and/or aggregate data from various sources into tables that are then used for analytics and reporting. Since they depend on data from other Omada applications, they must wait for all the extracts to complete before they can run; therefore, they have upstream dependencies on the Extract DAGs.
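The basic pattern such an SQL-executing operator follows — materialize a transform’s SELECT into a fresh target table — might look like the sketch below. This is an assumption about the shape of the SQL, not our actual operator, and the table names are hypothetical:

```python
def build_transform_sql(target_table, select_sql):
    """Render SQL that rebuilds target_table from a transform's SELECT."""
    return (
        f"DROP TABLE IF EXISTS {target_table};\n"
        f"CREATE TABLE {target_table} AS\n{select_sql};"
    )

# Hypothetical transform joining two extracted tables.
sql = build_transform_sql(
    "analytics.member_activity",
    "SELECT m.id, COUNT(a.id) AS actions\n"
    "FROM app_members m JOIN app_actions a ON a.member_id = m.id\n"
    "GROUP BY m.id",
)
print(sql)
```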
Together, these two sets of DAGs make up our Data Pipeline workflow in Airflow; once all of the DAGs have run in the correct order and completed successfully, the pipeline is done. At the end of this workflow, all of the latest data from the various Omada applications is in Redshift, ready for our downstream consumers.
Self-Serve DAG Definition
While creating an automated workflow for extracting and transforming data is awesome, our approach wasn’t very scalable. As our company grows and changes shape, so does our data. Therefore, our Data Analysts and Data Scientists must have the ability to change the type of data that they have access to. It would be inefficient to require a Data Engineer’s time to make each modification. We have created a “self-serve” option for Data Analysts and Data Scientists to create, update and delete data extracts and transforms.
The “self-serve” feature for creating new extracts works by updating a configuration file containing the details of every table and its columns; Airflow uses this file to create the corresponding Extract DAGs. Similarly, new transforms are created by uploading a SQL file for the transform and updating a configuration file that lists all of that transform’s upstream DAG dependencies.
These configuration files are then used to dynamically create the DAGs included in the data pipeline. This has allowed us to standardize all of our DAGs, which makes it easier to update them across the board.
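To make this concrete, a self-serve configuration might look something like the fragment below. The field names and file layout here are illustrative assumptions, not Omada’s actual schema:

```yaml
# extracts.yml — hypothetical shape of a self-serve extract entry
extracts:
  - app: app
    table: table_1
    columns: [id, member_id, created_at]

# transforms.yml — a transform points at its SQL file and lists
# the upstream DAGs it depends on
transforms:
  - name: member_activity
    sql_file: sql/member_activity.sql
    upstream_dags: [app_table_1, app_table_2]
```

An analyst edits this file and drops in a SQL file; the pipeline picks up the change on the next deploy without a Data Engineer writing any DAG code.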
Report Generation

In addition to the Data Pipeline workflow, Airflow also triggers the generation of various internal and external reports, which we send out to our clients and internal users. These reports have different delivery cadences based on the report type or client requirements.
Airflow knows which reports need to be generated and on what cadence. Report generation is triggered through Report DAGs in Airflow, where each Report DAG corresponds to a single report and contains a report-generation task that triggers generation in another application. Every day, at the conclusion of the Data Pipeline, each Report DAG is scheduled to carry out its report-generation task and produce the corresponding report.
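The cadence logic amounts to a per-report schedule check at the end of each day’s pipeline. A minimal sketch, with a hypothetical report registry and cadence rules (weekly on Mondays, monthly on the 1st) that stand in for real client requirements:

```python
import datetime

# Hypothetical report registry: each report has a delivery cadence.
REPORTS = {
    "client_summary": "daily",
    "engagement_trends": "weekly",   # delivered on Mondays
    "billing_rollup": "monthly",     # delivered on the 1st
}

def reports_due(today):
    """Which Report DAGs should trigger report generation today?"""
    due = []
    for name, cadence in REPORTS.items():
        if (cadence == "daily"
                or (cadence == "weekly" and today.weekday() == 0)
                or (cadence == "monthly" and today.day == 1)):
            due.append(name)
    return due

print(reports_due(datetime.date(2021, 3, 1)))  # a Monday and the 1st
```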
Machine Learning Pipeline
Another important workflow supported by Airflow is the Machine Learning Pipeline. Our Data Scientists at Omada have created various models that provide valuable insights to engineering teams, product teams, and other stakeholders. These models need to run every day with the most recent data to generate predictions and insights.
We have created a DAG for each of these models in Airflow, triggered as soon as all the extracts the model depends on are complete. Thus, we can automate running the models and generating predictions daily, so all stakeholders receive fresh insights every day. The following diagram illustrates how each Data Science DAG depends on upstream Extract DAGs before it can generate predictions.
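The gating described above — a model DAG becomes runnable only once every extract it depends on has completed — can be sketched as a simple readiness check. The model and extract DAG names here are hypothetical:

```python
# Each model DAG declares the Extract DAGs it depends on.
model_upstreams = {
    "churn_model": {"app_table_1", "app_table_2"},
    "engagement_model": {"app_table_2", "billing_invoices"},
}

def runnable_models(completed_extracts):
    """Model DAGs whose upstream Extract DAGs have all completed."""
    return [model for model, deps in model_upstreams.items()
            if deps <= completed_extracts]  # subset check

print(runnable_models({"app_table_1", "app_table_2"}))
# → ['churn_model']  (engagement_model still waits on billing_invoices)
```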
More importantly, this lets Data Scientists focus entirely on building new models or improving existing ones, without worrying about the pipeline at all. Once a model is changed or refactored, the pipeline picks it up, runs the newer model on its schedule, and delivers insights ready to be consumed.
In conclusion, at Omada we truly leverage the power of Airflow to carry out various operations, and it has become an integral part of our Data Platform. It has empowered Data Analysts and Scientists by providing them with data from other Omada apps and various external sources, consistently and without data quality issues. This enables Data Analysts to report on data gathered from different sources, which is then sent to our clients, and Data Scientists to run their models on a regular cadence, providing new insights.