Airflow is a powerful and flexible workflow automation and scheduling system, powered by Python with a rich set of integrations and tools available out-of-the-box.
Although the vast amount of documentation and large configuration files can make learning Airflow look daunting, it is easy to get either a simple or a more complex configuration up and running quickly so developers can begin writing code and learning how the product actually works. In this article I will demonstrate how to use Airflow with Docker to achieve this, using a public GitHub repo I authored here:
Simplified Airflow Concepts
DAG — directed acyclic graph; the definition of a workflow, visible in the dashboard on the web interface.
Worker — one or more systems responsible for running the code in the workflow task queues
Webserver — displays UI for managing Airflow, manages user requests for running tasks, and receives updates from DAG runs via workers
Scheduler — determines if a Task needs to be run and triggers work to be processed by a Worker
Operator — a template for a single step of work that gets run inside a DAG
Task — an instantiated Operator created by the scheduler, a single unit of work
Task Instance — stored state of a task
Step 1: Simple Architecture
The directory https://github.com/suburbanmtman/airflow-intro/tree/master/versions/simple contains the Docker compose configuration to run a simple version of the Airflow architecture containing all necessary components on a single system.
The simple architecture combines the Airflow webserver, scheduler, and SQLite database in a single Docker container. The scheduler is configured to use the SequentialExecutor, meaning the scheduler will execute tasks itself in a single python process without needing any workers running. This definitely will not scale in a production environment, but the simplicity will allow us to focus on how the system works.
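The executor choice lives in airflow.cfg (or can be set with the AIRFLOW__CORE__EXECUTOR environment variable). A minimal fragment might look like the following — the SQLite path is illustrative and the exact layout varies by Airflow version:

```ini
[core]
# Run tasks one at a time inside the scheduler process itself
executor = SequentialExecutor
# SQLite only supports the SequentialExecutor
sql_alchemy_conn = sqlite:////usr/local/airflow/airflow.db
```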
With Docker running, navigate to the versions/simple directory in your environment and run the following to start the environment:
docker-compose up -d --build
This will start the container orchestration defined in https://github.com/suburbanmtman/airflow-intro/blob/master/versions/simple/docker-compose.yml, which is just the single container forwarding the webserver port to localhost 8080. Navigate to http://localhost:8080 to verify the server is running.
Run the included DAG a_simple_dag to see things work. Slide the Off toggle to the left of the name to On, then click the Play button under Links (the first icon) to trigger the DAG. After a few minutes refresh the page — the DAG run should be dark green (success).
So what happened?
1. A manual DAG run was written to the SQLite database by the webserver after the Play button was clicked
2. The scheduler ran the DAG in a python process, where Operators were executed as Tasks as determined by the DAG decision tree.
3. The DAG reached a terminal success state
To dive in further, click on a_simple_dag to see a tree view of the latest DAG runs.
This breaks down the outcomes of each Operator in each DAG run. In this case some were success and some were skipped. Why were some Operators skipped? Click on Graph View in the top menu for a better view of the DAG logic.
This view shows how the Operators are related in the DAG as well as their outcomes. This DAG contains three different Operators:
1. DummyOperator — does nothing besides mark state
2. PythonOperator — runs a python function
3. BranchPythonOperator — runs a python function like PythonOperator does, except the return value determines the next Task to run
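To make the wiring concrete, here is a minimal sketch of a DAG using all three operator types. This is not the repo's actual code: the DAG id, task ids, dates, and the classic airflow.operators.* import paths (which moved in later Airflow versions) are assumptions, and it only runs inside an environment where Airflow is installed:

```python
import random
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator, PythonOperator


def generate():
    # An ordinary PythonOperator callable; its return value is stored as an XCom
    return random.randint(-10, 10)


def print_and_branch():
    # For simplicity this sketch generates its own number rather than
    # pulling the previous task's result from XCom.
    number = random.randint(-10, 10)
    print(number)
    if number > 0:
        return "positive_number"
    if number < 0:
        return "negative_number"
    return "zero_number"


with DAG("a_branching_sketch", start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    generate_number = PythonOperator(task_id="generate_number",
                                     python_callable=generate)
    print_number = BranchPythonOperator(task_id="print_number",
                                        python_callable=print_and_branch)
    branches = [DummyOperator(task_id=t)
                for t in ("positive_number", "negative_number", "zero_number")]
    generate_number >> print_number >> branches
```

The `>>` operator sets task dependencies; the BranchPythonOperator's return value names the one downstream task id that should run, and its siblings are skipped.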
Click on the Code menu item to see the actual source code of the DAG (this will match the DAG code on your disk). Here’s an overview of the DAG:
- A random number is generated in the first Task.
- print_number is a BranchPythonOperator task that prints the number generated in the previous Task and returns a task id such as positive_number based on the value.
- The DummyOperator tasks such as positive_number are run or skipped based on the Task returned in the previous step.
Given the logic of the DAG only one of the DummyOperators will be followed based on the number result, and the other two will be skipped. We can verify the right choice was made in this run by clicking on the print_number task and viewing its log output.
Try running the DAG a few more times; odds are each of the DummyOperator tasks will be run in at least one DAG run.
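The randomness driving this behavior can be sketched as a plain-python version of the callable a BranchPythonOperator might run (the task ids here are illustrative, not taken from the repo):

```python
import random


def choose_branch():
    """Mimic a BranchPythonOperator callable: print a random number and
    return the task_id of the single downstream task to follow."""
    number = random.randint(-10, 10)
    print(f"generated {number}")
    if number > 0:
        return "positive_number"
    if number < 0:
        return "negative_number"
    return "zero_number"
```

Over repeated runs each of the three task ids will eventually be returned, which is why every DummyOperator tends to appear as success in at least one DAG run.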
Since the DAG source code is linked to the Docker container using volumes, try making a change on your own to the python DAG file and running it.
Congratulations, you now have a running local Airflow environment that allows you to immediately see the feedback from your python code changes!
Step 2: Distributed Architecture with Celery
The directory https://github.com/suburbanmtman/airflow-intro/tree/master/versions/celery contains a more production-like architecture of Airflow in its Docker compose configuration.
- PostgreSQL is now the metadata database for Airflow and task state and runs its own container.
- Airflow Webserver and Scheduler each run in their own container and use PostgreSQL to store and retrieve DAG and Task status.
- Two Airflow workers are now responsible for running the tasks, as opposed to the previous example where the scheduler was configured to execute them itself sequentially. With this configuration the scheduler uses the CeleryExecutor to communicate with the worker processes, allowing the workers to process multiple tasks concurrently. Two workers are chosen arbitrarily to demonstrate that multiple workers can exist in the environment, but many more can easily be configured in the cloud for higher throughput.
- Redis is used as the Celery task broker and can be monitored directly using Airflow's Flower UI at http://localhost:5555 to verify the workers are listening.
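In airflow.cfg terms, the pieces above map to a handful of settings. The fragment below is a sketch for Airflow 1.10-era setting names; the hostnames and credentials are assumptions that depend on the service names in the compose file:

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow

[celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres:5432/airflow
```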
Stop the previous docker compose environment, navigate to versions/celery, and start the docker compose environment there to load all of the containers. Navigate to http://localhost:8080 once the webserver has started to view the Airflow UI, and try running a_simple_dag a few more times. Although the tasks and logs are written by different containers, Airflow conveniently groups them together in the UI to avoid having to check individual worker containers when investigating runs.
As you can see, the experience of running the DAGs in Airflow and writing the python code for them is exactly the same as in the first architecture; in fact it points to the same exact python module in the repo! Using either of these docker compose configurations as a starting point, you can have a running Airflow system in your local environment and can dive into writing your own production-ready python code.