Cross-DAG Dependencies in Apache Airflow: A Comprehensive Guide

Exploring four methods to effectively manage and scale your data workflow dependencies with Apache Airflow.

Frederic Vanderveken
datamindedbe
Jul 4, 2023

As organizations scale their data operations, the need for structure and streamlined processes becomes paramount. This is typically handled by tools like Apache Airflow, which offer a code-driven approach to orchestrating and monitoring data workloads through directed acyclic graphs (DAGs). Within this framework, each DAG embodies a distinct data pipeline or workflow. Hence, as the number of data processes grows while data freshness still needs to be guaranteed, properly managing these DAGs and their interdependencies becomes crucial.

In this blog post, we explore four prominent methods to manage and implement cross-DAG dependencies in Airflow: Triggers, Sensors, Datasets, and the Airflow API. We explain their behavior, provide example code snippets, and compare the advantages and disadvantages of each method, empowering you to make informed decisions tailored to your specific use case.

Why are cross-DAG dependencies useful?

Each DAG comprises several interconnected tasks, enabling an organization to easily consolidate all its data workflows into a set of DAGs. An example DAG with four tasks is illustrated in Fig. 1. To adhere to Airflow’s best practices, each DAG should have a well-defined scope [1,2]: it should cover a single project or process, be managed by a single team, and have a specific execution schedule. However, as organizations grow, these independent workflows often need to interact or synchronize with each other, and this is where the issue of cross-DAG dependencies arises. An illustrative example of this complexity is shown in Fig. 2, where all data workflows are captured in DAGs and several dependencies exist between them.

Fig. 1: DAG with four tasks.
Fig 2: DAG dependencies become complicated when the number of DAGs increases.

Cross-DAG dependencies refer to scenarios where a DAG relies on the successful execution of one or more tasks in another DAG. Unmanaged cross-DAG dependencies can create several problems:

  • Data inconsistencies: if a DAG begins processing before its input data is ready, it can produce incorrect results.
  • Inefficient resource usage: if a DAG begins processing before its data is ready, it has to be re-executed later.
  • Bugs: when inaccurate results appear due to a mismatch between different DAGs, identifying the root cause can be very time-consuming, even though the resolution often simply involves a coordinated re-execution of the affected DAGs.

One way to mitigate this problem is to schedule the dependent DAGs after each other and hope for the best. This is neither a scalable nor a robust approach: it only works for a limited number of coupled DAGs and assumes that DAGs never fail. Another naive solution would be to merge the DAGs that depend on each other into one single DAG. However, this violates the best practice of having each DAG reflect a single process maintained by a single team. Moreover, merging DAGs into each other leads to large monolithic DAGs that are hard to maintain.

As such, understanding and effectively handling cross-DAG dependencies is a crucial aspect of working with Apache Airflow. A well-managed system allows you to efficiently orchestrate complex workflows, maintain data consistency, and optimally utilize resources, all while preserving the simplicity and readability of individual DAGs. Luckily, Airflow provides several approaches for managing cross-DAG dependencies without violating the best practices.

1. Triggers

The trigger method is a straightforward way to activate downstream DAGs when a task in an upstream DAG completes successfully. This push-based mechanism is simple, yet effective, making it an ideal solution for activating one or multiple downstream DAGs. Figure 3 shows a schematic illustration of how DAG coupling with the trigger method can work.

Fig. 3: Schematic illustration of cross-DAG coupling via the TriggerDagRunOperator. Trigger task A and trigger task B in the upstream DAG respectively trigger downstream DAG A and downstream DAG B.

To put this method into action, the TriggerDagRunOperator is added as an auxiliary task within the upstream DAG, as shown in the code snippet below. The operator takes a trigger_dag_id argument, which specifies the downstream DAG to be triggered. Optionally, the conf argument can be used to pass variables from the upstream DAG to the downstream DAG, adding extra flexibility to the cross-DAG coupling.

# upstream DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="upstream_trigger_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
    )
    trigger_A = TriggerDagRunOperator(
        task_id="trigger_A",
        trigger_dag_id="downstream_dag_A",
        # Optional: pass variables to the downstream DAG run
        conf={"message": "Message to pass to downstream DAG A."},
    )
    trigger_B = TriggerDagRunOperator(
        task_id="trigger_B",
        trigger_dag_id="downstream_dag_B",
    )
    start_task >> trigger_A >> trigger_B

# downstream DAG A
with DAG(
    dag_id="downstream_dag_A",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # runs only when triggered by the upstream DAG
) as dag:
    downstream_task = BashOperator(
        task_id="downstream_task",
        # Read the value passed via conf from the triggering DAG run
        bash_command='echo "Upstream message: $message"',
        env={"message": '{{ dag_run.conf.get("message") }}'},
    )
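
By default, the TriggerDagRunOperator marks itself successful as soon as the downstream run has been created; it does not wait for that run to finish. If the upstream DAG should block until the triggered run completes, recent Airflow versions provide a wait_for_completion flag. Below is a minimal sketch that drops into the upstream DAG above; the task id is hypothetical.

    # Variant of trigger_A that waits until downstream DAG A has finished,
    # polling its run state every 30 seconds instead of returning immediately.
    trigger_A_blocking = TriggerDagRunOperator(
        task_id="trigger_A_blocking",
        trigger_dag_id="downstream_dag_A",
        wait_for_completion=True,
        poke_interval=30,
    )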

2. Sensors

The sensor method is another way to establish cross-DAG dependencies in Apache Airflow. It is particularly useful when the execution of a downstream DAG relies on the completion of tasks in one or more upstream DAGs. In this scenario, sensors serve as unique task types in Airflow, flexibly responding to the status of upstream tasks, upstream task groups, or entire upstream DAGs. A schematic illustration of the sensor DAG coupling is shown in Fig. 4.

Fig. 4: Schematic illustration of cross-DAG coupling via the sensor method. Sensor task A and sensor task B in the downstream DAG respectively wait on the completion of the upstream end and start tasks in DAG A and B.

To establish cross-DAG dependencies using a sensor, the downstream DAG needs to include the ExternalTaskSensor, as shown in the code snippet below. The upstream DAG and task it depends on are referenced through the external_dag_id and external_task_id arguments. Once these are set, the ExternalTaskSensor continuously monitors the upstream task and only succeeds once it has completed.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_sensor_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
    )
    sensor_A = ExternalTaskSensor(
        task_id="sensor_A",
        # Wait for the end_task of upstream DAG A with the same execution date
        external_dag_id="upstream_dag_A",
        external_task_id="end_task",
    )
    sensor_B = ExternalTaskSensor(
        task_id="sensor_B",
        external_dag_id="upstream_dag_B",
        external_task_id="start_task",
    )
    start_task >> sensor_A >> sensor_B

In the downstream DAG, the sensor task executes only when all upstream tasks share the same execution date and are marked as successful. Hence, if one of the upstream tasks fails, the downstream sensor task won’t proceed. This behavior is illustrated in Fig. 5 for a slightly more complicated scenario of a single downstream sensor task that depends on two upstream tasks with different schedules.

Fig. 5: Schematic illustration of scheduling with sensors. In the downstream DAG, the sensor task executes only when all upstream tasks share the same execution date and are marked successful.

Another scenario that can lead to problems is when the upstream and downstream schedules are shifted in time, so that their execution dates never align and, consequently, the downstream sensor task never starts. This issue can be resolved with the execution_delta argument of the ExternalTaskSensor, which takes a timedelta and shifts the execution date the sensor looks for so that it matches the upstream task’s run.
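
For example, if the upstream DAG runs three hours before the downstream DAG, a sensor along the following lines bridges the offset. This is a minimal sketch: the DAG and task ids are hypothetical, and the three-hour shift is an assumed schedule difference.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_shifted_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag_A",
        external_task_id="end_task",
        # Look for the upstream run whose logical date is 3 hours earlier
        # than this DAG run's logical date.
        execution_delta=timedelta(hours=3),
    )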

Comparing this sensor approach with the trigger approach, a clear operational difference emerges. The sensor follows a pull-based model, where the downstream DAG repeatedly checks whether the upstream tasks have succeeded, whereas the trigger method is push-based: the upstream tasks directly activate the downstream DAGs after a successful run.

It is also possible to include an ExternalTaskMarker in the upstream DAGs, which connects to the ExternalTaskSensor in the downstream DAG. Although these marker tasks are optional, they are highly recommended because they make the dependency between the two DAGs explicit. Furthermore, when a marker is included in the upstream DAG, the Airflow UI displays an additional button that directs users straight from the upstream DAG to the downstream one (see Fig. 6), which enhances visibility and navigation between the coupled DAGs.

Fig. 6: ExternalTaskMarker view in the Airflow UI. The “External DAG” button directs to the coupled downstream DAG.
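
A minimal sketch of how the upstream DAG might include such a marker is shown below. It reuses the DAG and task ids from the sensor example; the marker’s own task id is hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskMarker

with DAG(
    dag_id="upstream_dag_A",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    end_task = BashOperator(
        task_id="end_task",
        bash_command="echo 'End task'",
    )
    # Points at the sensor task in the downstream DAG, so the UI can link the
    # two DAGs and clearing this task can also clear the downstream sensor.
    downstream_marker = ExternalTaskMarker(
        task_id="downstream_marker",
        external_dag_id="downstream_sensor_dag",
        external_task_id="sensor_A",
    )
    end_task >> downstream_marker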

3. Datasets

The dataset approach in Apache Airflow provides a powerful method for realizing cross-DAG dependencies by creating links between datasets and DAGs. Upstream tasks declare the datasets they update, downstream DAGs subscribe to those datasets, and the subscribed DAGs are executed whenever the datasets change.

Scheduling DAG runs based on dataset updates enables you to create dynamic and event-driven workflows. This approach is particularly useful when downstream DAGs need to run after dataset updates from upstream DAGs, especially in cases where the updates are irregular. Hence, establishing dependencies based on dataset changes ensures that downstream workflows react effectively to data updates. A schematic overview of this approach is illustrated in Fig. 7.

Fig. 7: Schematic illustration of cross-DAG coupling via datasets. The downstream DAG only starts when both datasets are updated.

To implement cross-DAG dependencies using datasets, the datasets need to be defined in both the upstream and downstream DAGs. It’s important to note that a dataset is simply a string identifier: Airflow doesn’t establish a connection to the actual data, nor does it know its content, location, or the modifications applied to it. This makes the coupling very flexible, as no specific connectors or plugins are needed to communicate with the underlying data store. On the other hand, keep in mind that dataset updates that do not originate from Airflow tasks are not captured.

In the code, the method is implemented by explicitly indicating where dataset changes happen. In the upstream DAG, each task that updates a dataset should include the outlets parameter, specifying the dataset(s) that task updates. In the downstream DAG, the datasets on which the DAG depends should be passed as the schedule parameter; multiple datasets can be provided as a list. An example implementation of cross-DAG coupling with datasets is presented in the code block below.

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

# Datasets are plain string identifiers (URIs); Airflow does not connect to them
dataset1 = Dataset("s3://folder1/dataset_1.txt")
dataset2 = Dataset("s3://folder1/dataset_2.txt")

with DAG(
    dag_id="upstream_dag_A",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
        # Mark this task as a producer of dataset1
        outlets=[dataset1],
    )

with DAG(
    dag_id="upstream_dag_B",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
        outlets=[dataset2],
    )

with DAG(
    dag_id="downstream-dataset-dag",
    start_date=datetime(2023, 1, 1),
    # Run when both datasets have been updated since the last run
    schedule=[dataset1, dataset2],
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
    )
    end_task = BashOperator(
        task_id="end_task",
        bash_command="echo 'End task'",
    )
    start_task >> end_task

During runtime, the DAG is scheduled to run when all the datasets it consumes have been updated at least once since the last time it was run. This ensures that the downstream DAG only runs when the required datasets have been modified or refreshed by the corresponding upstream tasks. An example is shown in Fig. 8.

Fig. 8: Schematic of scheduling with datasets. The black dots represent dataset updates and the orange lightning icon represents the DAG run. The downstream DAG only runs when all the datasets it consumes have been updated at least once since the last time it was run.

To help users keep an overview of all cross-DAG dataset connections, Airflow provides a dedicated dataset view in the user interface. This serves as a powerful tool for monitoring and optimizing your data pipelines. It provides visibility into dataset update times as well as the relationships between upstream and downstream DAGs, allowing the user to trace the data flow and cross-DAG dependencies. Hence, this feature can also be used for data lineage purposes. An example of the dataset view is shown in Fig. 9.

Fig. 9: Dataset view in the Airflow interface.

By leveraging the dataset approach, you can create flexible and reactive workflows that adapt to changes in your data environment. Whether you need to trigger downstream DAGs after dataset updates or build complex data pipelines with intricate dependencies, datasets in Airflow provide a robust mechanism for achieving cross-DAG coupling based on data awareness.

4. API

The Airflow API offers a powerful approach to trigger a DAG from an external source. When the external source is another Airflow instance, it is possible to create cross-DAG dependencies with the API, complementing the triggers, sensors, and datasets methods discussed earlier. Hence, this method is especially beneficial when your dependent DAGs are deployed in different Airflow environments. By leveraging the Airflow API, you can seamlessly couple tasks and trigger DAG runs across diverse environments, enabling efficient cross-environment workflow coordination.

Cross-DAG coupling via the API is realized by a POST request to the DAGRuns endpoint, which initiates a DAG run on the remote Airflow instance. To implement this, the SimpleHttpOperator is included in the upstream DAG with its endpoint parameter set to /api/v1/dags/<dag-id>/dagRuns, where <dag-id> is the ID of the target DAG. The desired execution date can be specified in the data parameter; this can be the execution date of the upstream DAG or any other date. Make sure to configure the HTTP connection (http_conn_id) with the necessary authentication credentials; new connections can be created under Admin > Connections in the Airflow user interface. Once the API call triggering the downstream DAG returns successfully, the corresponding task in the upstream DAG is marked as complete. See the code block below for an example implementation.

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.http.operators.http import SimpleHttpOperator

# Pass the upstream run's execution date to the remote DAG run
date = "{{ execution_date }}"
request_body = {"execution_date": date}
json_body = json.dumps(request_body)

with DAG(
    dag_id="upstream_api_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
) as dag:
    start_task = BashOperator(
        task_id="start_task",
        bash_command="echo 'Start task'",
    )
    api_trigger_dag = SimpleHttpOperator(
        task_id="api_trigger_dag",
        # HTTP connection pointing to the remote Airflow instance (with credentials)
        http_conn_id="airflow-api",
        endpoint="/api/v1/dags/dependent-dag/dagRuns",
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json_body,
    )
    start_task >> api_trigger_dag
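
The same endpoint can also be called from outside Airflow altogether, for example from a script or another orchestration tool. Below is a minimal sketch using the requests library, assuming the API’s basic-auth backend is enabled on the remote instance; the host, credentials, and conf payload are placeholders.

import requests

AIRFLOW_HOST = "https://airflow.example.com"  # hypothetical remote Airflow webserver

# Create a new run of 'dependent-dag' on the remote Airflow instance
response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/dependent-dag/dagRuns",
    auth=("api_user", "api_password"),  # hypothetical credentials
    headers={"Content-Type": "application/json"},
    json={"conf": {"triggered_by": "external-system"}},
)
response.raise_for_status()
print(response.json()["dag_run_id"])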

Comparing the cross-DAG dependency methods

In the previous sections, we elaborated on four different cross-DAG coupling methods. In this section, we highlight their key differences and similarities. This comparison should enable you to identify the appropriate method when you encounter a situation where you need to implement a cross-DAG dependency.

Table 1: Comparison of the four cross-DAG coupling methods.

From my experience, the dataset approach is typically the most suitable method for creating cross-DAG dependencies. Its greatest advantage, in my opinion, is its event-driven coupling principle, which abstracts away the time-scheduling aspect and simplifies implementation and maintenance. In addition, it facilitates clear data lineage and easily accommodates both one-to-many and many-to-one links. Hence, if you’re using Airflow 2.4 or above, I recommend the dataset approach when implementing cross-DAG dependencies.

Conclusion

Implementing cross-DAG dependencies in Airflow empowers you to create complex and interdependent workflows without sacrificing maintainability and organizational structure. In this blog post, we explored four methods for implementing cross-DAG dependencies in Apache Airflow:

  • Triggers: activate downstream DAGs when an upstream task completes, offering a push-based approach.
  • Sensors: use a pull-based model and wait for specific upstream events to finish.
  • Datasets: ideal for handling irregular updates and for creating event-driven, dynamic workflows. Furthermore, they let you reason about data dependencies instead of the code-level or technical connections used by the other methods.
  • Airflow API: enables efficient coordination by linking tasks and initiating DAG runs for dependencies across different Airflow environments.

By choosing the appropriate method based on your workflow needs, you can ensure efficient coordination between tasks and easily scale the number of pipelines while keeping them maintainable. Hence, by leveraging the cross-DAG capabilities provided by Airflow, it is possible to build a robust and interconnected DAG ecosystem that drives your data-driven processes forward.

Thanks to Jonny Daenen for providing valuable input and feedback on this post.

Sources

  1. https://www.astronomer.io/blog/10-airflow-best-practices/
  2. Xenonstack, “Apache Airflow Best Practices and Advantages,” Digital Transformation and Platform Engineering Insights, Medium.
