Apache Airflow: You need to know this before getting started
I assume you already know this since you are here, but Apache Airflow is a workflow orchestration and scheduling platform, mainly for data pipelines, though it is not limited to them.
Since Airflow is pure Python, it’s as versatile as Python can get. It’s scalable, pluggable and highly customisable, with very good community support.
It’s also one of the very few tools dedicated to high-performance scheduling and orchestration. While it has its pros and cons, which we’ll discuss in a while, let’s first see how Airflow works.
This article doesn’t teach how to create workflows in detail; the Airflow documentation does that well.
Just to give a brief overview of Airflow: a workflow is called a DAG (Directed Acyclic Graph), which consists of tasks and the relationships between them. A task is the smallest unit of the workflow and performs some action.
Each task is an instance of an operator. For example, PythonOperator runs a Python function, BashOperator runs a bash command, and DummyOperator (renamed EmptyOperator in newer Airflow releases) basically does nothing. Apart from these, Airflow has built-in support for numerous operators, which provide robust integrations with external systems.
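To make the idea of tasks and relationships concrete, here is a minimal sketch in plain Python. It does not use the Airflow API at all: the task functions, the `TASKS` and `DEPENDENCIES` dictionaries and `run_workflow` are hypothetical names I made up for illustration, and the standard library's `graphlib` stands in for the ordering work that Airflow's scheduler does for you.

```python
from graphlib import TopologicalSorter


# Each "task" is the smallest unit of work; here it is just a function.
def extract():
    return "raw rows"


def transform():
    return "clean rows"


def load():
    return "rows loaded"


# Hypothetical task registry, playing the role of operators in a DAG.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Relationships: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_workflow(tasks, dependencies):
    """Run every task once, upstream tasks first, and return the order used."""
    # static_order() raises graphlib.CycleError if the graph has a cycle,
    # which is exactly why a workflow has to be acyclic.
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


print(run_workflow(TASKS, DEPENDENCIES))
```

In a real DAG file you would declare operators and their dependencies instead, and Airflow would decide when and where each task runs.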
Now, let’s see how Airflow looks under the hood. Airflow is made up of the following components:
- Scheduler and Executor: For executing tasks at the right time. You have plenty of executors to choose from; CeleryExecutor and KubernetesExecutor are the popular ones.
- Worker: For actually running the tasks.
- Metadata database: For maintaining the application state information as well as historical metadata.
- WebServer: For the Airflow UI.
- DAG Directory: For storing all the DAG code, accessible to the scheduler, workers and webserver. If you are using a container-based system, the DAG files are baked into the Airflow Docker image itself at the time of deployment.
Although this architecture diagram shows multiple instances of workers but not of the other components, the scheduler and webserver are also horizontally scalable in newer versions of Airflow.
The only limitation that I can see is in scaling the database. But most of the metadata is historical and you probably won’t need it for long, so you should purge the database regularly.
- Highly customisable, both in terms of functionality and infrastructure. You can check this configuration reference document to see everything you can control and configure yourself.
- A lot of providers for major cloud platforms and technologies for easy and robust integration.
- Good community support with very active maintenance. Also backed by the Apache Software Foundation.
- As Airflow provides configuration as code, it supports dynamic DAG creation, i.e. code that creates and runs pipelines dynamically.
- Highly scalable and can be easily hosted on Kubernetes for better orchestration. Airflow provides an official Helm chart for deploying to Kubernetes, which makes things very easy.
- Big enterprises get skeptical of using open source tools because of the operational overhead and security concerns. But in the case of Airflow, you’ll be able to find enterprise solutions that are native incarnations of Airflow from GCP (Cloud Composer), AWS (MWAA), Astronomer.io etc. It’s good to see these big cloud companies taking bets on Airflow, ultimately providing more choices to the consumers.
- It provides a nice user interface, which has a lot of good features and gives good insights into your workflows.
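The dynamic DAG creation mentioned in the list above usually boils down to a loop over some configuration. Here is a rough sketch of that pattern in plain Python. `PipelineSpec`, `SOURCES` and `build_pipelines` are hypothetical names used only for illustration, and `PipelineSpec` is a stand-in for a DAG object, not an Airflow class.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineSpec:
    """Hypothetical stand-in for an Airflow DAG object (not an Airflow class)."""
    dag_id: str
    schedule: str
    tasks: list = field(default_factory=list)


# Hypothetical config: one entry per data source, with its schedule.
SOURCES = {"orders": "@daily", "customers": "@hourly"}


def build_pipelines(sources):
    """Create one pipeline definition per configured source."""
    pipelines = {}
    for name, schedule in sources.items():
        dag_id = f"ingest_{name}"
        pipelines[dag_id] = PipelineSpec(
            dag_id=dag_id,
            schedule=schedule,
            tasks=[f"extract_{name}", f"load_{name}"],
        )
    return pipelines


PIPELINES = build_pipelines(SOURCES)
print(sorted(PIPELINES))
```

In an actual DAG file, the loop would build real DAG objects and expose them at module level so that Airflow can discover them. Adding a new source to the config then gives you a new pipeline without writing a new file.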
Too many components and concepts to learn.
You’ll need at least a basic understanding of how the components interact with each other in order to work properly with Airflow.
Configuration overhead right from the start.
With all the customizations and flexibility that Airflow provides comes a big configuration overhead. And the worst part is, there is no single right configuration; it always depends on your use case. So, you’ll either have to spend a long time doing trial and error, or you’ll have to spend hours trying to find similar use cases.
Scheduling is a bit complicated.
I find the Airflow scheduling logic to be a little complicated, with all the combinations of start_date, execution_date, schedule_interval and catchup, and how each affects the actual DAG runs.
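The part that trips most people up is that a run is triggered at the end of its interval, while execution_date marks the start of that interval. Here is a simplified sketch of that rule in plain Python. `planned_runs` is a hypothetical helper, not an Airflow function, and it assumes a fixed interval and ignores time zones, cron alignment and everything else the real scheduler handles.

```python
from datetime import datetime, timedelta


def planned_runs(start_date, interval, now, catchup=True):
    """Return (execution_date, triggered_at) pairs for intervals fully elapsed by `now`.

    Simplified model: the run for an interval is triggered once the interval
    has ended, and execution_date is the start of that interval.
    """
    runs = []
    execution_date = start_date
    while execution_date + interval <= now:
        runs.append((execution_date, execution_date + interval))
        execution_date += interval
    if not catchup:
        # Without catchup, only the most recent elapsed interval gets a run.
        runs = runs[-1:]
    return runs


start = datetime(2024, 1, 1)
now = datetime(2024, 1, 4, 12, 0)

for execution_date, triggered_at in planned_runs(start, timedelta(days=1), now):
    print(f"execution_date={execution_date:%Y-%m-%d} triggered_at={triggered_at:%Y-%m-%d}")
```

So a daily DAG with a start_date of 1 January does not run on 1 January. Its first run happens on 2 January, carrying an execution_date of 1 January.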
Too much dependency on the metadata database
Since Airflow uses the database as the single source of truth, all the worker, scheduler and webserver nodes connect to the database directly, and the load on the database can increase pretty easily as you scale. This also makes the database a single point of failure in the system. I suggest using some kind of connection pooling here, for example PgBouncer for PostgreSQL.
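To get a feel for why this adds up so quickly, here is a back-of-the-envelope sketch. It assumes every process keeps its own client-side connection pool and can open up to pool size plus overflow connections. The process counts and pool numbers are illustrative assumptions, not measured values, and `max_db_connections` is a hypothetical helper.

```python
def max_db_connections(process_counts, pool_size=5, max_overflow=10):
    """Rough upper bound on simultaneous database connections.

    Assumes each process holds its own pool and may open up to
    pool_size + max_overflow connections (illustrative numbers).
    """
    per_process = pool_size + max_overflow
    return sum(process_counts.values()) * per_process


# Hypothetical deployment: number of processes per component.
small = {"scheduler": 1, "webserver": 4, "worker": 5}
large = {"scheduler": 2, "webserver": 4, "worker": 20}

print(max_db_connections(small))
print(max_db_connections(large))
```

A pooler such as PgBouncer sits between these processes and the database and lets many client connections share a much smaller number of actual server connections.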
Airflow itself is secure with all its user management and RBAC, but under the hood it uses a lot of open-source technologies, some of which regularly face CVEs. Not everyone cares about those, but if you are one of those who do, then you are going to have a bad time dealing with all that. The worst thing is, you can’t control which ones to use and which ones to remove.
It is already established that Airflow is a really good tool for scheduling and orchestration purposes. You can easily test and run a few pipelines on your local system with a simple docker-compose file.
If you are already using GCP or AWS, you can try their Airflow services in production as well, without the headache of completely managing the underlying infrastructure. Astronomer.io is a really good option too, being cloud-agnostic and backed by the support they can provide.
If you are going down the route of managing everything yourself, you’ll need a fair understanding of how to deploy and manage Airflow. And to be honest, that only comes with experience. So, you’ll either need to hire an expert or upskill yourself over time.
As you scale, there are going to be issues, especially if you are not following the best practices, or if you are using it for something it’s not meant to do, such as running compute-heavy tasks. Airflow is meant for scheduling and orchestration only, which it does really well; everything else needs to be outsourced.
With this, I’ll finish this article here. You might find it a little bit opinionated. If you agree or disagree with this, I would love to hear your thoughts. If you want to reach out personally, drop me a note at email@example.com.