Write Code in Airflow Within Minutes

David Smith · Published in The Startup · 6 min read · Sep 5, 2020

Airflow is a powerful and flexible workflow automation and scheduling system, powered by Python with a rich set of integrations and tools available out-of-the-box.

Although the vast amount of documentation and large configuration files can make learning Airflow look daunting, it is easy to get either a simple or a more complex configuration up and running quickly so developers can begin writing code and learn how the product actually works. In this article I will demonstrate how to do this using Airflow with Docker, based on a public GitHub repo I authored here:

https://github.com/suburbanmtman/airflow-intro

Simplified Airflow Concepts

DAG — directed acyclic graph; the definition of a workflow, visible in the dashboard on the web interface.

Worker — one or more systems responsible for running the code in the workflow task queues.

Webserver — displays the UI for managing Airflow, handles user requests for running tasks, and receives updates about DAG runs from workers.

Scheduler — determines whether a Task needs to be run and triggers work to be processed by a Worker.

Operator — the definition of a single step that actually gets run inside a DAG.

Task — an instantiated Operator created by the scheduler; a single unit of work.

Task Instance — the stored state of a single run of a Task.
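
To make these terms concrete, here is a minimal, hypothetical DAG. It is not one of the files in the repo, the dag_id and task IDs are made up, and it assumes Airflow 1.10-style import paths:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def say_hello():
    # The callable that a PythonOperator task executes.
    print("hello from a worker")


# The DAG: the workflow definition that shows up on the dashboard.
dag = DAG(
    dag_id="hello_dag",
    schedule_interval=None,          # only runs when triggered manually
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# Operators describe the steps; the scheduler instantiates them as Tasks.
start = DummyOperator(task_id="start", dag=dag)
hello = PythonOperator(task_id="say_hello", python_callable=say_hello, dag=dag)

# The dependency forms the directed acyclic graph: start -> say_hello.
start >> hello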

Step 1: Simple Architecture

The directory https://github.com/suburbanmtman/airflow-intro/tree/master/versions/simple contains the Docker compose configuration to run a simple version of the Airflow architecture containing all necessary components on a single system.

Simple Architecture — a single Docker image containing the Airflow webserver, scheduler, and a SQLite database on disk

The simple architecture combines the Airflow webserver, scheduler, and SQLite database in a single Docker container. The scheduler is configured to use the SequentialExecutor, meaning the scheduler executes tasks itself in a single Python process without needing any workers running. This definitely will not scale in a production environment, but the simplicity will allow us to focus on how the system works.

With Docker running, navigate to the versions/simple directory and run the following to start the environment:

docker-compose up -d --build

This will start the container orchestration defined in https://github.com/suburbanmtman/airflow-intro/blob/master/versions/simple/docker-compose.yml, which is just the single container with the webserver port forwarded to localhost:8080. Navigate to http://localhost:8080 to verify the server is running.

Airflow dashboard with examples

There are already several DAGs listed on the dashboard: a_simple_dag was imported from our project, and the others are built-in examples that Airflow loads when enabled in airflow.cfg.

Let’s run a_simple_dag to see things work. Slide the Off toggle to the left of the name to On, then click the Play button under Links (the first icon) to trigger the DAG. After a few minutes, refresh the page; the DAG run should be dark green (success).

So what happened?

1. A manual DAG run was written to the SQLite database by the webserver after the Play button was clicked

2. The scheduler ran the DAG in a Python process, where some Operators were executed as Tasks, as determined by the DAG decision tree.

3. The DAG reached a terminal success state

To dive in further, click on a_simple_dag to see a tree view of the latest DAG runs.

Tree view of latest DAG runs

This breaks down the outcome of each operator in each DAG run. In this case some were successful and some were skipped. Why were some Operators skipped? Click on Graph View in the top menu for a better view of the DAG logic.

Graph view of DAG

This view shows how the Operators are related in the DAG as well as their outcomes. This DAG contains three different Operators:

1. DummyOperator — does nothing besides mark state
2. PythonOperator — runs a Python callable
3. BranchPythonOperator — runs a Python callable like PythonOperator does, except the return value determines the next Task to run.

Click on the Code menu item to see the actual source code of the DAG (this will match the DAG code on your disk). Here’s an overview of the DAG:

  1. A random number is generated in the random_number_generator PythonOperator task
  2. print_number is a BranchPythonOperator task that prints the number generated in the previous Task and returns negative_number, zero, or positive_number based on the value.
  3. The DummyOperators negative_number, zero, and positive_number are run based on the task_id returned by print_number (a condensed sketch of this DAG follows below)
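
For reference, here is a condensed sketch of how such a branching DAG could be written. This is not a copy of the repo’s DAG file (check the Code tab or the file on disk for the real thing); it assumes Airflow 1.10-style import paths and XCom for passing the number between tasks:

import random
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator, PythonOperator


def generate_number():
    # The return value is pushed to XCom so downstream tasks can read it.
    return random.randint(-10, 10)


def print_and_branch(**context):
    number = context["ti"].xcom_pull(task_ids="random_number_generator")
    print(f"generated number: {number}")
    if number < 0:
        return "negative_number"   # task_id of the branch to follow
    if number == 0:
        return "zero"
    return "positive_number"


dag = DAG(
    dag_id="a_branching_dag_sketch",  # illustrative only; the repo already defines a_simple_dag
    schedule_interval=None,
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

random_number_generator = PythonOperator(
    task_id="random_number_generator",
    python_callable=generate_number,
    dag=dag,
)

print_number = BranchPythonOperator(
    task_id="print_number",
    python_callable=print_and_branch,
    provide_context=True,  # Airflow 1.10 style; the context is passed automatically in Airflow 2
    dag=dag,
)

# Only the branch whose task_id is returned above will run; the others are skipped.
branches = [DummyOperator(task_id=name, dag=dag)
            for name in ("negative_number", "zero", "positive_number")]

random_number_generator >> print_number >> branches

The key design point is that the BranchPythonOperator’s callable returns the task_id of the branch to follow, and every other directly downstream task is marked as skipped.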

Given the logic of the DAG only one of the DummyOperators will be followed based on the number result, and the other two will be skipped. We can verify the right choice was made in this run by clicking on the print_number task and View Log.

Try running the DAG a few more times; odds are each of the DummyOperator tasks will be run in at least one DAG run.

Since the DAG source code is linked to the Docker container using volumes, try making a change of your own to the Python DAG file and running it.
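
For example, a small, hypothetical edit could add a final task that runs after whichever branch was chosen. The variable names below assume the branch operators are bound to negative_number, zero, and positive_number in the file; adjust them to match the actual code:

from airflow.operators.dummy_operator import DummyOperator  # if not already imported

done = DummyOperator(
    task_id="done",
    trigger_rule="none_failed",  # run even though two of the three branches were skipped
    dag=dag,                     # the DAG object already defined in the file
)

# Fan the three branches back into the new task.
[negative_number, zero, positive_number] >> done

After saving the file, the scheduler should pick up the change on its next parse of the DAG folder, and the new task should appear in the Graph View for the next run.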

Congratulations, you now have a running local Airflow environment that allows you to immediately see the feedback from your python code changes!

Step 2: Distributed Architecture with Celery

The directory https://github.com/suburbanmtman/airflow-intro/tree/master/versions/celery contains a more production-like Airflow architecture in its Docker compose configuration.

Distributed Airflow Architecture using Celery, PostgreSQL, and Redis
  • PostgreSQL is now the metadata database for Airflow and task state, and runs in its own container.
  • The Airflow webserver and scheduler each run in their own container and use PostgreSQL to store and retrieve DAG and Task status.
  • Two Airflow workers are now responsible for running the tasks, as opposed to the previous example where the scheduler was configured to execute them itself sequentially. With this configuration the scheduler uses the CeleryExecutor to communicate with the worker processes, allowing the workers to process multiple tasks concurrently (see the sketch after this list). Two workers are chosen arbitrarily to demonstrate how multiple workers can exist in the environment, but many more can easily be configured in the cloud for higher throughput.
  • Redis is used as the Celery task broker and can be monitored directly through the Flower UI at http://localhost:5555 to verify the workers are listening.
Airflow Flower Dashboard
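
To see the benefit of the distributed setup, consider a hypothetical DAG (not in the repo) made of independent tasks. With the CeleryExecutor these can run on different workers at the same time, whereas the SequentialExecutor from Step 1 would run them one after another:

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def slow_task():
    time.sleep(30)  # stand-in for real work


dag = DAG(
    dag_id="parallel_demo",
    schedule_interval=None,
    start_date=datetime(2020, 1, 1),
    catchup=False,
)

# No dependencies between these tasks, so the executor is free to run them in parallel.
for i in range(4):
    PythonOperator(
        task_id=f"slow_task_{i}",
        python_callable=slow_task,
        dag=dag,
    )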

Stop the previous Docker compose environment, then navigate to versions/celery and start the compose environment there to load all of the containers. Navigate to http://localhost:8080 once the webserver has started to view the Airflow UI, and try running a_simple_dag a few more times. Although the tasks and logs are written by different containers, Airflow conveniently groups them together in the UI to avoid having to check individual worker containers when investigating runs.

As you can see, the experience of running the DAGs in Airflow and the Python code behind them is exactly the same as with the first architecture; in fact it points to the same exact Python module in the repo! Using either of these Docker compose configurations as a starting point, you can have a running Airflow system in your local environment and can dive into writing your own production-ready Python code.

David Smith is a Data, Software, and Cloud Architecture consultant at Gentle Valley Digital, building scalable big data and SaaS solutions.