Introduction to Apache Airflow
A beginner's guide to the industry standard for batch ETL jobs, written in Python. Get started with Apache Airflow, an open-source tool you don't want to miss out on.
What is Airflow?
Airflow was created at Airbnb in 2014 and open-sourced as a GitHub project shortly after; it joined the Apache Software Foundation's incubator in 2016 and became a top-level Apache project in 2019. It is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows, written in Python.
Because workflows are defined in Python and extended through provider packages, Airflow can connect to virtually any technology. Another important feature is the polished web interface, which offers precise insights into all your pipelines.
Airflow can run as a single process on your laptop, on virtual machines in the cloud, or in a distributed setup, for instance on Kubernetes.
How to get started:
There are several ways to get started with Airflow; here are the ones I find most common.
- locally: docker-compose (Apache Airflow documentation)
- locally: astro dev start (astronomer.io)
- Cloud: AWS, GCP, Azure or managed service (astronomer.io)
- distributed: Kubernetes and Helm charts (self-hosted)
Recently (Q1 2023), Azure added a managed Airflow service to Azure Data Factory, allowing users to create an Airflow instance backed by virtual machines.
Why Airflow?
Normally I would write an essay here, but with Airflow I'd rather ask: why not give it a go?
- Airflow has a great web interface
- it lets you write pipelines in pure Python
- there are provider packages (extensions) for virtually all common use cases
- it’s extremely popular among the Data Engineering community
- the community is huge and so is the Slack channel activity
https://airflow.apache.org/blog/airflow-survey-2022/
In the next article, I will show how a simple ETL job is constructed and deployed, both locally and in Azure.
Next up: Hands-On Apache Airflow Tutorial
If you found this article useful, please follow me.