Apache Airflow is a tool for building workflows, such as an extract-load-transform pipeline on AWS. A workflow is a directed acyclic graph (DAG) of tasks, and Airflow can distribute those tasks across a cluster of nodes. Let’s see how it does that.
RabbitMQ is a message broker. Its job is to manage communication between multiple services by operating message queues. It provides an API for other services to publish and to subscribe to the queues.
Celery is a task queue. It can distribute tasks on multiple workers by using a protocol to transfer jobs from the main application to Celery workers. It relies on a message broker to transfer the messages.
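To make these roles concrete, here is a conceptual sketch, not Celery or RabbitMQ themselves: a queue stands in for the broker, a thread stands in for a worker, and the main program publishes jobs. In a real deployment these pieces run as separate processes on separate machines.

```python
import queue
import threading

broker = queue.Queue()  # stands in for a RabbitMQ queue
results = []

def worker():
    # A worker consumes messages from the broker until it is told to stop.
    while True:
        message = broker.get()
        if message is None:  # shutdown signal
            break
        results.append(f"processed {message}")

t = threading.Thread(target=worker)
t.start()

# The main application publishes jobs to the broker.
for job in ("job-1", "job-2"):
    broker.put(job)
broker.put(None)
t.join()
print(results)  # ['processed job-1', 'processed job-2']
```

The key property this illustrates is decoupling: the publisher never calls the worker directly, it only writes to the queue, which is what lets Celery scale workers out horizontally.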
Inside Apache Airflow, tasks are carried out by an executor. The main types of executors are:
- Sequential Executor: Each task is run locally (on the same machine as the scheduler) in its own Python subprocess. Tasks run sequentially, which means that only one task can be executed at a time. It is the default executor.
- Local Executor: It is the same as the Sequential Executor except that multiple tasks can run in parallel. It needs a metadata database (where DAG and task statuses are stored) that supports parallelism, such as MySQL or PostgreSQL. Setting up such a database requires some extra work, since the default configuration uses SQLite.
- Celery Executor: The workload is distributed across multiple Celery workers, which can run on different machines. It is the executor to use for availability and scalability.
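Airflow reads its configuration from airflow.cfg, and each setting can equivalently be supplied as an `AIRFLOW__{SECTION}__{KEY}` environment variable. A minimal sketch of switching to the Celery Executor might look like the following; the connection strings are placeholders to adapt to your own database and broker:

```shell
# Select the Celery Executor instead of the default Sequential Executor.
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
# Metadata database that supports parallelism (placeholder credentials).
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql://user:password@localhost/airflow
# RabbitMQ broker URL and a result backend for Celery (placeholders).
export AIRFLOW__CELERY__BROKER_URL=amqp://guest:guest@localhost:5672//
export AIRFLOW__CELERY__RESULT_BACKEND=db+mysql://user:password@localhost/airflow
```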
Distributed Apache Airflow Architecture
Apache Airflow is split into different processes which run independently from each other.
When setting up Apache Airflow with the Celery executor to get a distributed architecture, you have to launch several processes and services:
- A metadata database (MySQL): it contains the status of the DAG runs and task instances.
- Airflow web server: a web interface to query the metadata to monitor and execute DAGs.
- Airflow scheduler: checks the status of the DAGs and tasks in the metadata database, creates new task instances if necessary, and sends the tasks to the queues.
- A message broker (RabbitMQ): it stores the task commands to be run in queues.
- Airflow Celery workers: they retrieve the commands from the queues, execute them and update the metadata.
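The interaction between these components can be sketched with in-memory stand-ins: a dict plays the metadata database, a queue plays RabbitMQ, and two functions play the scheduler and a Celery worker. All names and state values here are illustrative only.

```python
import queue

metadata_db = {}          # task_instance_id -> state (plays MySQL)
rabbitmq = queue.Queue()  # broker queue of task commands (plays RabbitMQ)

def scheduler():
    # The scheduler records a new task instance, then queues its command.
    metadata_db["tutorial.Hello_World"] = "scheduled"
    rabbitmq.put(("tutorial.Hello_World", "airflow run tutorial Hello_World ..."))
    metadata_db["tutorial.Hello_World"] = "queued"

def celery_worker():
    # A worker pulls the command from the broker, runs it, and updates
    # the metadata database with the result.
    task_id, command = rabbitmq.get()
    metadata_db[task_id] = "running"
    # ... execute the command here ...
    metadata_db[task_id] = "success"

scheduler()
celery_worker()
print(metadata_db)  # {'tutorial.Hello_World': 'success'}
```

Note that the worker never talks to the scheduler: all coordination goes through the broker and the metadata database, which is what allows workers to live on different machines.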
You can follow the Clairvoyant blog post to set everything up.
In the end, a typical architecture looks like this:
Journey of a task instance
Let’s now dive deeper into each step to understand the complete process and to know how to monitor and debug them.
1. In the beginning, the scheduler creates a new task instance in the metadata database with the scheduled state:
"execution_date": "2019-03-24 14:52:56.271392",
2. Then the scheduler uses the Celery Executor, which sends the task to the message broker. RabbitMQ queues can be explored from the management UI on port 15672, where you can see the number of messages in each queue and open individual messages.
If you look closely at the payload created by the Celery Executor, you will see that it contains the command that the Celery worker should execute:
airflow run tutorial Hello_World 2019-03-24T13:52:56.271392+00:00
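This command is just the task-instance fields from the metadata database rendered into the Airflow 1.x CLI format. A small reconstruction, assuming the `dag_id` is `tutorial` and the `task_id` is `Hello_World` as above:

```python
from datetime import datetime, timezone

# Fields as stored for the task instance (execution_date in UTC).
dag_id, task_id = "tutorial", "Hello_World"
execution_date = datetime(2019, 3, 24, 13, 52, 56, 271392, tzinfo=timezone.utc)

# The executor formats them into the command a worker will run.
command = f"airflow run {dag_id} {task_id} {execution_date.isoformat()}"
print(command)
# airflow run tutorial Hello_World 2019-03-24T13:52:56.271392+00:00
```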
3. The Celery worker then receives the command from the queue, which is now empty.
Read the full article on Sicara’s blog here.