Hands-On Apache Airflow Tutorial
How to get started with Apache Airflow on your local machine, write and test your pipeline, and deploy it to the Azure Data Factory Airflow service.
astronomer.io
Astronomer.io offers a very easy way to get started with the latest version of Apache Airflow in your local environment. You just need working installations of Docker and Docker Compose. Once you have those, follow the steps in this link: https://docs.astronomer.io/astro/cli/overview
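Assuming you have the Astro CLI installed (per the link above) and Docker running, the local setup boils down to a few commands; a minimal sketch:

```shell
# Scaffold a new Astro project in the current directory
# (creates dags/, a Dockerfile, and example configuration)
astro dev init

# Start a local Airflow environment in Docker containers
astro dev start

# Stop the local environment when you are done
astro dev stop
```

After `astro dev start`, the Airflow UI is served at http://localhost:8080.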
The ETL pipeline
Without getting into much detail about how DAGs used to be written before the TaskFlow API, I want to explain Apache Airflow with a very simple example in Python, the ETL pipeline from the official tutorial: https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html
import json

from pendulum import datetime

from airflow.decorators import (
    dag,
    task,
)


@dag(
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
    },
    tags=["example"],
)
def example_dag_basic():
    """
    ### Basic ETL Dag
    This is a simple ETL data pipeline example that demonstrates the use of
    the TaskFlow API using three simple tasks for extract, transform, and load.
    For more information on Airflow's TaskFlow API, reference documentation here:
    https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html
    """

    @task()
    def extract():
        """
        #### Extract task
        A simple "extract" task to get data ready for the rest of the
        pipeline. In this case, getting data is simulated by reading from a
        hardcoded JSON string.
        """
        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
        order_data_dict = json.loads(data_string)
        return order_data_dict

    @task(
        multiple_outputs=True
    )  # multiple_outputs=True unrolls dictionaries into separate XCom values
    def transform(order_data_dict: dict):
        """
        #### Transform task
        A simple "transform" task which takes in the collection of order data and
        computes the total order value.
        """
        total_order_value = 0
        for value in order_data_dict.values():
            total_order_value += value
        return {"total_order_value": total_order_value}

    @task()
    def load(total_order_value: float):
        """
        #### Load task
        A simple "load" task that takes in the result of the "transform" task
        and, instead of saving it for end user review, just prints it out.
        """
        print(f"Total order value is: {total_order_value:.2f}")

    order_data = extract()
    order_summary = transform(order_data)
    load(order_summary["total_order_value"])


example_dag_basic()
This is a basic ETL pipeline in Airflow. It consists of a pipeline definition, @dag, and the pipeline steps, @task. The pipeline doesn't care about the content of the tasks, only about the order and frequency in which they run. The tasks are written in Python and can contain any kind of code.
Here extract runs first and its return value is stored in order_data. order_data is then passed to transform, which outputs order_summary. Finally, load receives the value stored under the key "total_order_value" in order_summary.
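Stripped of the Airflow decorators, the data flow is just three plain functions chained together. A minimal sketch of the same logic in plain Python:

```python
import json


def extract() -> dict:
    # Simulate pulling source data by reading a hardcoded JSON string.
    data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
    return json.loads(data_string)


def transform(order_data_dict: dict) -> dict:
    # Sum all order values into a single total.
    return {"total_order_value": sum(order_data_dict.values())}


def load(total_order_value: float) -> None:
    print(f"Total order value is: {total_order_value:.2f}")


order_data = extract()
order_summary = transform(order_data)
load(order_summary["total_order_value"])  # prints: Total order value is: 1236.70
```

In the Airflow version, the return values are not passed directly between function calls; they travel between tasks as XComs, which is why transform needs multiple_outputs=True to expose each dictionary key as its own XCom value.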
The pipeline is called as the final step: example_dag_basic().
Testing your DAG locally
Put the above Python file into the /dags folder of your astro project directory. Once this is done, start your Airflow instance with astro dev start.
You should see at least one DAG in your UI at http://localhost:8080.
After triggering the DAG manually, you should see it run and finish after a few seconds.
When you click on the last task, load, and inspect its logs, you should see the output value:
This is the result of load(order_summary["total_order_value"]).
Using this etl.py file as a template, you can update the individual tasks and run your own ETL pipelines in Apache Airflow.
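For example, you might swap the extract task's hardcoded JSON for another source format. A hedged sketch using only the standard library (the CSV text and the column names order_id and value are made up for illustration):

```python
import csv
import io


def extract_from_csv() -> dict:
    # Illustrative stand-in for the extract task: parse CSV text into the
    # same {order_id: value} dict shape the rest of the pipeline expects.
    csv_text = "order_id,value\n1001,301.27\n1002,433.21\n1003,502.22\n"
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["order_id"]: float(row["value"]) for row in reader}
```

Because the function returns the same dict shape as before, the transform and load tasks can stay unchanged.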
Next up: Tutorial: Managed Airflow on Azure
If you found this article useful, please follow me.