Introduction to Apache Airflow

A beginner’s guide to the industry standard for batch ETL jobs written in Python. Get started with Apache Airflow, the open-source tool you don’t want to miss out on.

DataFairy
3 min read · Aug 24, 2023

What is Airflow?

Airflow started at Airbnb in 2014, was open source from the start, and was publicly announced on GitHub in 2015. It later moved to the Apache Software Foundation. It is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows, which you define in Python code.

Airflow web UI.

Because pipelines are plain Python code, you can connect Airflow to virtually any technology. Another important feature is the polished web interface, which offers detailed insight into every pipeline.

Insights into a pipeline run.

Airflow can run as a single process on your laptop, on virtual machines in the cloud, or in a distributed setup, for instance on Kubernetes.

ETL pipeline in Airflow.
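
To make the idea of an ETL pipeline a bit more concrete, here is a minimal plain-Python sketch. It deliberately does not use Airflow’s API: the task names, the sample rows and the run_pipeline helper are all made up for this illustration, and the standard-library graphlib module (Python 3.9+) stands in for the ordering work that Airflow’s scheduler would normally do.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory "source" standing in for a database or an API.
RAW_ROWS = [
    {"city": "Amsterdam", "temp_c": "18.5"},
    {"city": "Utrecht", "temp_c": "17.0"},
]


def extract(results):
    """Read raw rows from the (fake) source."""
    return list(RAW_ROWS)


def transform(results):
    """Convert the temperature strings produced by 'extract' to floats."""
    return [
        {"city": row["city"], "temp_c": float(row["temp_c"])}
        for row in results["extract"]
    ]


def load(results):
    """Pretend to load the data; a real task would write to a warehouse."""
    return len(results["transform"])


# Illustrative task registry: task name -> function.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it can run.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    results = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        results[name] = TASKS[name](results)
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline()
    print("execution order:", order)
    print("rows loaded:", results["load"])
```

In Airflow itself, each of these functions would typically become a task inside a DAG file, and the scheduler would take care of ordering, scheduling and retries, with each run visible in the web interface.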

How to get started:

There are different ways to get started with Airflow. Here are the ones I think are the most common:

  • locally: docker-compose (Apache Airflow documentation)
  • locally: astro dev start (astronomer.io)
  • cloud: AWS, GCP, Azure or a managed service (astronomer.io)
  • distributed: Kubernetes and Helm charts (self-hosted)

Running Airflow locally using docker-compose.

In Q1 2023, Azure added an Airflow service to Azure Data Factory, which lets users create an Airflow instance based on virtual machines.

Why Airflow?

Normally I would write an essay here, but with Airflow I’d rather ask: why not give it a go?

Astronomer.io stats.

  • Airflow has a great web interface
  • it lets you write pipelines in pure Python
  • there are extensions for practically all common use cases
  • it’s extremely popular in the Data Engineering community
  • the community is huge, and so is the activity in its Slack channel

Data from the 2022 Airflow summit. Jobs using Airflow.

Airflow popularity increase over the last 4 years.

https://airflow.apache.org/blog/airflow-survey-2022/

In the next article I will show how a simple ETL job is constructed and deployed locally and in Azure.

Next up: Hands-On Apache Airflow Tutorial

If you found this article useful, please follow me.

DataFairy

Senior Data Engineer, Azure Warrior, PhD in Theoretical Physics, The Netherlands. I write about Data Engineering, Machine Learning and DevOps on Azure.