Mastering Task Dependencies in Apache Airflow: Building Efficient Workflows

Agusmahari
3 min read · May 19, 2023


Apache Airflow provides a flexible and intuitive way to define dependencies between tasks in a DAG.


Apache Airflow is a platform for authoring, scheduling, and monitoring complex workflows. One of its key features is the ability to define Directed Acyclic Graphs (DAGs), which let you express intricate task dependencies. In this article, we will explore how to use Airflow to build dependencies between tasks in a DAG.

Dependencies in Airflow: Dependencies in Airflow refer to the relationships between tasks within a DAG. By specifying dependencies, you can control the order in which tasks are executed. In Airflow, a task can have one or more upstream tasks (dependencies) and one or more downstream tasks (dependents).
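To make the upstream/downstream vocabulary concrete, here is a minimal sketch in plain Python (no Airflow required). The task names extract, transform, and load are hypothetical, and the dictionaries only illustrate the relationship; they are not how Airflow stores it internally.

```python
# Plain-Python illustration of upstream and downstream relationships.
# The task names are hypothetical; this is not Airflow's internal data model.

# Map each task to the set of tasks it depends on (its upstream tasks).
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def invert(upstream_map):
    """Derive each task's downstream tasks (dependents) from the upstream map."""
    downstream = {task: set() for task in upstream_map}
    for task, parents in upstream_map.items():
        for parent in parents:
            downstream[parent].add(task)
    return downstream


downstream = invert(upstream)
for task in upstream:
    print(task, "| upstream:", sorted(upstream[task]),
          "| downstream:", sorted(downstream[task]))
```

Every dependency appears twice, once from each side: transform lists extract as upstream, and extract lists transform as downstream.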

Defining a DAG: Let’s dive into the code example that demonstrates how to define a DAG with task dependencies:

# Note: these import paths come from older Airflow releases. Later Airflow 2.x
# releases provide an equivalent EmptyOperator in airflow.operators.empty.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 5, 19),
}

# schedule_interval=None: the DAG runs only when triggered
dag = DAG('dependency_dag', default_args=default_args, schedule_interval=None)

# Define tasks
task_1 = DummyOperator(task_id='task_1', dag=dag)
task_2 = DummyOperator(task_id='task_2', dag=dag)
task_3 = DummyOperator(task_id='task_3', dag=dag)
task_4 = DummyOperator(task_id='task_4', dag=dag)

# Define dependencies
task_1 >> task_2  # task_2 depends on task_1
task_1 >> task_3  # task_3 depends on task_1
task_2 >> task_4  # task_4 depends on task_2
task_3 >> task_4  # task_4 depends on task_3

Understanding the Code: In the provided example, we start by importing the necessary modules and defining default arguments for the DAG. The DAG is named “dependency_dag” and sets schedule_interval=None, meaning the scheduler will not run it on a regular basis; it runs only when triggered manually or by an external trigger.

Next, we define four tasks: task_1, task_2, task_3, and task_4. The DummyOperator is used here, which represents a task that does nothing but serves as a placeholder. You can replace these with other operators based on your specific requirements.

To establish dependencies, we use the >> operator. In this case:

  • task_2 and task_3 both depend on task_1. Under Airflow's default trigger rule, they will execute only if task_1 completes successfully, and they can then run in parallel because neither depends on the other.
  • task_4 depends on both task_2 and task_3, so it will execute only after both task_2 and task_3 have completed successfully.
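Airflow's operators also accept a list on either side of >>, so the four dependency lines above can be written as task_1 >> [task_2, task_3] >> task_4. To show why that chaining works, here is a toy model of the operator in plain Python. The Task class below is hypothetical and heavily simplified; it is a sketch of the idea, not Airflow's implementation.

```python
# Toy model of how ">>" can record dependencies. The Task class is a
# hypothetical stand-in for illustration, not Airflow's implementation.


class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()    # ids of tasks that must finish first
        self.downstream = set()  # ids of tasks that wait for this one

    def set_downstream(self, other):
        # Accept either a single task or a list of tasks.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.add(target.task_id)
            target.upstream.add(self.task_id)

    def __rshift__(self, other):
        # Handles: task >> other, and task >> [a, b]
        self.set_downstream(other)
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # Handles: [a, b] >> task. Python falls back to this method
        # because a plain list does not define ">>" itself.
        for task in others:
            task.set_downstream(self)
        return self


task_1, task_2, task_3, task_4 = (Task(f"task_{i}") for i in range(1, 5))

# The same diamond as in the article, written as one chained expression.
task_1 >> [task_2, task_3] >> task_4

print("task_1 downstream:", sorted(task_1.downstream))
print("task_4 upstream:", sorted(task_4.upstream))
```

The chain is evaluated left to right: task_1 >> [task_2, task_3] returns the list, and the list is then wired to task_4.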

Apache Airflow provides a flexible and intuitive way to define dependencies between tasks in a DAG. By specifying these dependencies, you can orchestrate the execution order of tasks and create complex workflows. The example code showcased in this article demonstrates how to set up task dependencies using the >> operator, enabling you to build robust and efficient data pipelines with Airflow.

Remember, the key to designing effective DAGs lies in understanding the dependencies between tasks and structuring them accordingly.
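One way to reason about a dependency structure is to ask what happens when a task fails. The sketch below simulates the four-task diamond in plain Python, assuming every task uses Airflow's default all_success trigger rule. The simulate function is hypothetical and ignores retries and other trigger rules, so treat it as a thinking aid rather than a model of the scheduler. It uses graphlib from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Upstream sets for the four-task diamond from the article.
upstream = {
    "task_1": set(),
    "task_2": {"task_1"},
    "task_3": {"task_1"},
    "task_4": {"task_2", "task_3"},
}


def simulate(upstream_map, failing):
    """Return a final state per task, assuming each task runs only when all
    of its upstream tasks succeeded (the all_success trigger rule)."""
    states = {}
    # static_order() yields every task after all of its upstream tasks.
    for task in TopologicalSorter(upstream_map).static_order():
        if any(states[parent] != "success" for parent in upstream_map[task]):
            states[task] = "upstream_failed"  # the task is never started
        elif task in failing:
            states[task] = "failed"
        else:
            states[task] = "success"
    return states


result = simulate(upstream, failing={"task_2"})
for task in sorted(result):
    print(task, result[task])
```

With task_2 failing, task_3 still succeeds because it depends only on task_1, while task_4 ends up as upstream_failed because one of its two upstream tasks did not succeed.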

Bonus

Here is an example of Apache Airflow code that defines a more complex Directed Acyclic Graph (DAG) with additional dependencies:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 5, 19),
}

dag = DAG('complex_dependency_dag', default_args=default_args, schedule_interval=None)

# Define tasks
task_1 = DummyOperator(task_id='task_1', dag=dag)
task_2 = PythonOperator(
    task_id='task_2',
    python_callable=lambda: print('Running task 2'),
    dag=dag,
)
task_3 = PythonOperator(
    task_id='task_3',
    python_callable=lambda: print('Running task 3'),
    dag=dag,
)
task_4 = PythonOperator(
    task_id='task_4',
    python_callable=lambda: print('Running task 4'),
    dag=dag,
)
task_5 = PythonOperator(
    task_id='task_5',
    python_callable=lambda: print('Running task 5'),
    dag=dag,
)
task_6 = DummyOperator(task_id='task_6', dag=dag)
task_7 = PythonOperator(
    task_id='task_7',
    python_callable=lambda: print('Running task 7'),
    dag=dag,
)

# Define dependencies
task_1 >> task_2  # task_2 depends on task_1
task_1 >> task_3  # task_3 depends on task_1
task_2 >> task_4  # task_4 depends on task_2
task_3 >> task_4  # task_4 depends on task_3
task_4 >> task_5  # task_5 depends on task_4
task_5 >> task_6  # task_6 depends on task_5
task_6 >> task_7  # task_7 depends on task_6
task_3 >> task_7  # task_7 depends on task_3
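To see the execution order these dependencies imply, the sketch below rebuilds the same graph in plain Python and groups the tasks into "waves" that could run at the same time. The execution_waves function is a hypothetical helper built on the standard library's graphlib module (Python 3.9 or later); it approximates the ordering and does not reproduce Airflow's scheduler.

```python
from graphlib import TopologicalSorter

# Upstream sets copied from the dependency lines of complex_dependency_dag.
upstream = {
    "task_1": set(),
    "task_2": {"task_1"},
    "task_3": {"task_1"},
    "task_4": {"task_2", "task_3"},
    "task_5": {"task_4"},
    "task_6": {"task_5"},
    "task_7": {"task_6", "task_3"},
}


def execution_waves(upstream_map):
    """Group tasks into waves; each wave only needs the earlier waves done."""
    sorter = TopologicalSorter(upstream_map)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstreams are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(upstream)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
# 1 ['task_1']
# 2 ['task_2', 'task_3']
# 3 ['task_4']
# 4 ['task_5']
# 5 ['task_6']
# 6 ['task_7']
```

Only task_2 and task_3 can run in parallel. The extra edge task_3 >> task_7 does not change the order, because task_7 already waits for task_3 indirectly through task_4, task_5, and task_6.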

Agus mahari — agusmahari@gmail.com


Data Engineer | Big Data Platform at PT Astra International Tbk. Let's connect on LinkedIn: https://www.linkedin.com/in/agus-mahari/