Hands-On Apache Airflow Tutorial

How to get started with Apache Airflow on your local machine, write and test your pipeline, and deploy it to the Azure Data Factory Airflow service.

DataFairy
3 min read · Aug 28, 2023

Astronomer.io

Astronomer.io offers a very easy way to get started with the latest version of Apache Airflow in your local environment. You just need a working installation of Docker and Docker Compose. Once you have that, follow the steps in this guide: https://docs.astronomer.io/astro/cli/overview
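In short, once Docker is running and the Astro CLI is installed (see the linked docs for the install command for your operating system), the workflow looks roughly like this:

astro dev init   # scaffold a new Airflow project (creates a dags/ folder, a Dockerfile and example files)
astro dev start  # build the image and start Airflow in Docker, with the UI on http://localhost:8080
astro dev stop   # shut the local containers down again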

The ETL pipeline

Without going into much detail about how DAGs used to be written, I want to explain Apache Airflow with a very simple example in Python: the ETL pipeline from the official TaskFlow tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html

ETL pipeline in Airflow.
import json
from pendulum import datetime

from airflow.decorators import (
    dag,
    task,
)


@dag(
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
    },
    tags=["example"],
)
def example_dag_basic():
    """
    ### Basic ETL Dag
    This is a simple ETL data pipeline example that demonstrates the use of
    the TaskFlow API using three simple tasks for extract, transform, and load.
    For more information on Airflow's TaskFlow API, reference documentation here:
    https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
    """

    @task()
    def extract():
        """
        #### Extract task
        A simple "extract" task to get data ready for the rest of the
        pipeline. In this case, getting data is simulated by reading from a
        hardcoded JSON string.
        """
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'

        order_data_dict = json.loads(data_string)
        return order_data_dict

    @task(
        multiple_outputs=True
    )  # multiple_outputs=True unrolls dictionaries into separate XCom values
    def transform(order_data_dict: dict):
        """
        #### Transform task
        A simple "transform" task which takes in the collection of order data and
        computes the total order value.
        """
        total_order_value = 0

        for value in order_data_dict.values():
            total_order_value += value

        return {"total_order_value": total_order_value}

    @task()
    def load(total_order_value: float):
        """
        #### Load task
        A simple "load" task that takes in the result of the "transform" task and
        prints it out, instead of saving it to end user review
        """

        print(f"Total order value is: {total_order_value:.2f}")

    order_data = extract()
    order_summary = transform(order_data)
    load(order_summary["total_order_value"])


example_dag_basic()

This is a basic ETL pipeline in Airflow. It consists of a pipeline definition (@dag) and the pipeline steps (@task). The pipeline doesn’t care about the content of the tasks, only about the order and frequency in which they are run. The tasks are written in Python and can contain any kind of code.

Here extract runs first and its return value is order_data. order_data is then passed to transform, which outputs order_summary. Finally, load receives the value stored under the key “total_order_value” in order_summary.

The pipeline is registered by calling the decorated function as the final step: example_dag_basic().
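The multiple_outputs=True flag on transform is what unrolls the returned dictionary into separate XCom values, so that load can pick out just the total. A variant without the flag would pass the whole dictionary to load and index it inside the task; here is a minimal sketch (the dag_id example_dag_single_output is made up for illustration):

import json
from pendulum import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def example_dag_single_output():
    @task()
    def extract():
        return json.loads('{"1001": 301.27, "1002": 433.21, "1003": 502.22}')

    @task()  # no multiple_outputs: the dictionary travels as a single XCom value
    def transform(order_data_dict: dict):
        return {"total_order_value": sum(order_data_dict.values())}

    @task()
    def load(order_summary: dict):
        # The task indexes the dictionary itself instead of receiving a single float.
        print(f"Total order value is: {order_summary['total_order_value']:.2f}")

    load(transform(extract()))


example_dag_single_output()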

Testing your DAG locally

Put the above Python file into the dags/ folder in your Astro project directory. Once this is done, start your Airflow instance with astro dev start.

Running astro dev start locally.

You should be able to see at least one DAG in the Airflow UI at http://localhost:8080.

After triggering the DAG manually, you should see it run and finish after a few seconds.

When clicking on the last task, load, and selecting “rendered” in the logs, you should be able to see the output value:

This is the result of load(order_summary["total_order_value"]).
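Besides clicking through the UI, you can catch broken DAG files early with a small DAG integrity test. This is a common pattern, not part of the original example; placed in the tests/ folder of your Astro project, it should be picked up by astro dev pytest:

from airflow.models import DagBag


def test_dags_import_without_errors():
    # Parse everything in dags/ and fail if any file raises on import.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"Import errors: {dag_bag.import_errors}"


def test_example_dag_basic_has_three_tasks():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    dag = dag_bag.get_dag("example_dag_basic")
    assert dag is not None
    assert len(dag.tasks) == 3  # extract, transform, load

You can also run a single DAG end to end from a shell inside the Airflow containers with airflow dags test example_dag_basic 2023-01-01.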

Using this etl.py file as a template, you can update the different tasks and run your own ETL pipelines in Apache Airflow.
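For example, the hardcoded JSON string in extract could be swapped for a real data source. A minimal sketch, assuming a hypothetical HTTP endpoint that returns a JSON object of order IDs and values (the URL is a placeholder, and requests would need to be added to the project’s requirements.txt):

import requests
from airflow.decorators import task


@task()
def extract():
    """Fetch order data from an HTTP endpoint instead of a hardcoded string."""
    # Placeholder endpoint; replace it with your own source system.
    response = requests.get("https://example.com/api/orders", timeout=10)
    response.raise_for_status()
    return response.json()  # expected: a dict mapping order IDs to order values

Anything the task returns is passed on via XCom exactly as before, so transform and load do not need to change as long as the shape of the data stays the same.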

Next up: Tutorial: Managed Airflow on Azure

If you found this article useful, please follow me.

