Apache Airflow

Technocrat · Published in CoderHack.com · Sep 14, 2023

Apache Airflow is an open-source platform to programmatically author, schedule and monitor workflows. Airflow has a modular architecture and uses directed acyclic graphs (DAGs) to organize tasks and their dependencies.

Airflow has a clean UI to visualize pipelines running in production, monitor progress and troubleshoot issues when needed. It handles dependencies between tasks automatically and can retry failed tasks, so a failure causes as little disruption as possible.

The main components of Airflow are:

  • DAGs: Directed Acyclic Graphs, the building blocks of Airflow
  • Operators: Define the tasks in DAGs
  • Tasks: Nodes in a DAG, define the work to execute
  • Pools: Define resources to execute tasks
  • Connections: Define connections to external systems
  • Variables: Parameters that can be reused
  • Schedules: Define schedules (e.g. daily, hourly) to trigger DAGs
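
As a rough illustration of how a couple of these pieces are used from code (the variable key and connection id below are placeholders), Variables and Connections are looked up by name inside a DAG file or task:

from airflow.models import Variable
from airflow.hooks.base import BaseHook

api_key = Variable.get('my_api_key')                # a reusable parameter stored in Airflow
conn = BaseHook.get_connection('my_postgres_conn')  # credentials for an external system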

DAGs: The building blocks of Airflow

A DAG defines the tasks and their dependencies in a workflow. Each DAG has a schedule interval, a start date, an optional end date and other parameters.

DAGs are defined in .py files and have the following structure:

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

dag = DAG('dag_name', start_date=days_ago(1))

t1 = BashOperator(task_id='task_1', bash_command='echo 1', dag=dag)
t2 = BashOperator(task_id='task_2', bash_command='echo 2', dag=dag)
t3 = BashOperator(task_id='task_3', bash_command='echo 3', dag=dag)

# Set the dependencies: t1 runs first, then t2, then t3
t1 >> t2 >> t3

This defines a simple DAG with three tasks that depend on each other: t1 runs first, then t2, then t3.

We can trigger DAGs by scheduling them to run at certain intervals (daily, hourly, etc.) or trigger them manually through the UI, where we can also monitor and visualize DAG runs.
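
For instance, here is a minimal sketch of giving a DAG a daily schedule (the DAG id is illustrative), along with the CLI equivalent of a manual trigger:

from airflow import DAG
from airflow.utils.dates import days_ago

# The scheduler will create one run of this DAG per day
dag = DAG('my_daily_dag', start_date=days_ago(1), schedule_interval='@daily')

# A manual trigger from the command line has the same effect as the "Trigger DAG" button in the UI:
#   airflow dags trigger my_daily_dag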

Core Operators

Airflow has many built-in operators to perform various tasks. Some core operators are:

  • BashOperator: Execute a bash command
  • PythonOperator: Execute a Python function
  • EmailOperator: Send an email
  • SQL operators (e.g. PostgresOperator, MySqlOperator): Execute a SQL query against a database
  • Sensors: Wait for a condition to be met, such as a certain time, a file or a database row

For example, a BashOperator can be defined as:

t1 = BashOperator(task_id='task_1', bash_command='echo 1')

This will execute the bash command echo 1 for the task task_1.

Similarly, a PythonOperator can execute a Python function:

from airflow.operators.python import PythonOperator

def print_hello():
    print("Hello!")

t2 = PythonOperator(task_id='task_2', python_callable=print_hello)
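
Along the same lines, a sensor such as the FileSensor can pause a workflow until a file appears (the file path and poke interval here are just an illustration):

from airflow.sensors.filesystem import FileSensor

# Check for the file every 60 seconds until it exists
wait_for_file = FileSensor(task_id='wait_for_file', filepath='/tmp/data.csv', poke_interval=60)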

Examples and Use Cases

Here are a few examples of using Airflow:

  • A simple ETL DAG (a minimal sketch follows this list):

Extract data from a source -> Transform the data -> Load into a database

  • DAG to pull data from an API and push to a database:

Make API call -> Extract data from response -> Load into database

  • DAG with a sensor and branching tasks:

Wait for a file (using a FileSensor) -> Process the file once it arrives -> Perform task A -> Branch: perform task B OR perform task C

  • Building an ML pipeline:

Extract training data -> Train an ML model -> Evaluate model metrics -> Re-train the model if needed

  • And many more! Airflow can be used for most batch workflow orchestration use cases.
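
As a concrete example, here is a minimal sketch of the ETL DAG described above; the function bodies are stubs, and the DAG id, transformations and schedule are purely illustrative:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def extract():
    # Pull raw records from a source system (stubbed)
    return [1, 2, 3]

def transform(**context):
    # Read the upstream result from XCom and transform it
    raw = context['ti'].xcom_pull(task_ids='extract')
    return [x * 10 for x in raw]

def load(**context):
    # Load the transformed rows into a database (stubbed with a print)
    rows = context['ti'].xcom_pull(task_ids='transform')
    print(f"Loading {rows}")

dag = DAG('simple_etl', start_date=days_ago(1), schedule_interval='@daily')

extract_task = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load, dag=dag)

extract_task >> transform_task >> load_task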

Best Practices

Some best practices for using Airflow are:

  • Modularize DAGs and reuse common tasks
  • Parameterize DAGs and pass runtime parameters
  • Add error handling and set up alerting (see the sketch after this list)
  • Use pools to limit how many tasks execute concurrently
  • Have a DAG deployment pipeline to push DAGs to production
  • Monitor, log and debug DAG runs to check for issues
  • Run Airflow on a cluster for scalability and high availability
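
To make the error-handling and pool points concrete, here is a rough sketch of a DAG that configures retries and failure e-mails through default_args and caps concurrency with a pool (the pool name and e-mail address are placeholders, and e-mail alerts assume SMTP has been configured for your deployment):

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'retries': 2,                         # retry failed tasks before giving up
    'retry_delay': timedelta(minutes=5),
    'email': ['alerts@example.com'],      # placeholder address
    'email_on_failure': True,
}

dag = DAG('robust_dag', default_args=default_args, start_date=days_ago(1), schedule_interval='@daily')

heavy_task = BashOperator(
    task_id='heavy_task',
    bash_command='echo crunching...',
    pool='heavy_jobs',  # placeholder pool created through the UI to cap concurrent tasks
    dag=dag,
)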

Airflow is a very useful tool for any data engineer who needs to build and manage both simple and complex data pipelines. By following the best practices above, you can use Airflow to handle most of your workflow orchestration needs!
