A Complete Guide to Setting up a Local Development Environment for Airflow (Docker, PyCharm, and Tests)

Teddy Hartanto
Published in Ninja Van Tech · 12 min read · May 23, 2020

Introduction

Note: In theory, this setup should work no matter which OS you are using. But we have yet to test it on a Linux or Windows machine, since all of us here use macOS.

Six months ago (when the latest Airflow was 1.10.4), I was tasked with setting up a local Airflow development environment. I did a bit of research and was not quite satisfied with the solutions out there. There did not seem to be a widely agreed-upon standard. Some were running Airflow on their local machine in a manner similar to the instructions in Airflow’s Quick Start. Others were using docker-compose, which is more reasonable. But even then, I could not find a setup that sufficiently met the criteria below. We wanted a local development environment that is:

  • Lightweight
  • Closely mimics the production environment
    We run Airflow in our Kubernetes cluster, like the rest of our microservices
  • Easy and fast to start up
  • Easy to work with
  • Promotes collaboration

Eventually, we built our own complete setup — with the IDE interacting seamlessly with our Docker environment, as well as some basic DAG validation tests.

In this article, I am going to share a setup that has worked pretty well for my team here in Ninja Van. I am going to lay out the thoughts and processes that went into setting up the local development environment, too. This article should be read in tandem with the boilerplate code that I have assembled and hosted at:

https://github.com/ninja-van/airflow-boilerplate

By the end of this article, you will have a better understanding of our local Airflow development environment, which consists of:

  1. A Docker environment containing the Airflow DB backend, Scheduler, and Webserver, for:
    a) Running a single DAG, or complex workflows involving multiple DAGs
    b) Experimenting with Airflow configurations in airflow.cfg
    c) Adding Airflow connections/variables
  2. A Local Environment, for:
    a) Running a single task instance
    b) Running your tests
  3. (BONUS) Simple tests that you can include in your CI/CD pipeline to catch errors early on

We also use the tools listed below in our development environment:

  1. Virtual environments
  2. black (formatter), flake8 (linter), pre-commit hook

However, I am not covering them here, to keep the article short. I highly recommend looking into them for a better developer experience. I have included their setup in the boilerplate.

Pre-requisites

I shall assume that:

  1. You have a basic working knowledge of docker & docker-compose. You also have them installed locally on your machine. To install them, visit:
    - https://docs.docker.com/install/
    - https://docs.docker.com/compose/install/
  2. You have a basic working knowledge of Airflow and its components: Airflow DB, Scheduler, Webserver, Connections, Variables, DAGs, etc.

Now, let us get started!

First off, let us see how Airflow’s out-of-the-box setup fares against the criteria mentioned in the Introduction.

Airflow out-of-the-box setup: good for playing around

Based on the Quick Start guide, here is what we need to do to get started.

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page

Notice the following.

  1. A default airflow.cfg is generated and placed at $AIRFLOW_HOME
  2. A sqlite database airflow.db is initialised and placed at $AIRFLOW_HOME. Note that using the default sqlite database narrows down the choice of Airflow executor to only SequentialExecutor. The implication is that there will not be any concurrent task execution. This is different from our actual production environment where tasks are executed concurrently. Thus, if we stick to SequentialExecutor, we might run into problems that only emerge in the production environment.
  3. To get a working environment for airflow, we need to run 3 components:
    a) DB backend (essential*)
    b) Webserver (optional*)
    c) Scheduler (optional*)

*Note: Webserver & Scheduler are optional because you can use the Airflow CLI to run tasks. The Airflow CLI doesn’t depend on the Webserver or Scheduler, but it depends on the DB backend.
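If you want to see what a given environment actually resolves to, Airflow’s configuration module can be queried from Python. A minimal sketch (the file name is made up, and the printed values depend on your setup):

# check_setup.py: optional sanity check, not part of the Quick Start
from airflow.configuration import conf

# With the out-of-the-box setup, these print "SequentialExecutor" and a
# sqlite connection string pointing at $AIRFLOW_HOME/airflow.db.
print(conf.get("core", "executor"))
print(conf.get("core", "sql_alchemy_conn"))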

The out-of-the-box setup is good enough if you just want to have a taste of Airflow. But, we think it is not suitable as a development environment because:

  • It does not closely resemble our production environment, given the SequentialExecutor constraint
  • It is neither easy nor fast to set up, given the multiple commands needed to run the whole suite of Airflow components. And that is before taking into account moving off sqlite, which would add yet another step to start up a local RDBMS.

It is because of those drawbacks that we decided to build our own development environment.

The first decision was whether to use minikube or docker-compose. Initially, we considered minikube because it was as close as we could get to our actual production environment. But eventually we went with docker-compose, because it is more lightweight and approachable; exposing the Kubernetes side of things would have steepened the learning curve of the project.

After a fair bit of research and iteration, we ended up with the following project structure:

airflow-boilerplate/            # project root, the $AIRFLOW_HOME
    dags/                       # airflow dags
    docker/                     # for local dev env
        spark-conf/
        docker-compose.yml
        Dockerfile
        entrypoint.sh
    plugins/                    # custom airflow plugins
    tests/                      # tests
    variables/                  # airflow variables, for documentation
    .gitignore
    airflow.cfg                 # airflow config
    env.sh                      # setting env vars for local testing
    requirements-airflow.txt

The setup can be divided into 3 sections:

  1. The Docker Environment, where I will mainly cover the Dockerfile, entrypoint.sh, and docker-compose.yml
  2. The Local Environment, where I will cover env.sh and PyCharm setup
  3. The DAG Validation tests
The overall setup

The Docker Environment

Main interface: Airflow UI

Our Docker image extends the puckel/docker-airflow image. This was well before Airflow introduced official production Docker image support in 1.10.10.

Inside docker/Dockerfile:

# NOTE: paths are relative to the project root since the build context we specify is the project root

FROM puckel/docker-airflow:1.10.8
USER root

# .............
# EXTRA STUFFS
# .............
COPY docker/entrypoint.sh /entrypoint.sh
RUN rm $AIRFLOW_HOME/airflow.cfg

COPY requirements-airflow.txt requirements.txt
RUN pip3 install -r requirements.txt


# .............
# EXTRA STUFFS
# .............
USER airflow

What we have done differently from puckel/docker-airflow here is as follows.

  1. We use a custom entrypoint.sh instead of the one that came with puckel/docker-airflow
  2. We install the Python requirements at image build time, instead of doing it in the entrypoint.sh, as puckel/docker-airflow does. As we pointed out earlier, we wanted a local development environment that is fast to set up, and installing the requirements in the entrypoint would prolong the startup time. The trade-off is that whenever we add a new PyPI package, we need to rebuild the image.

Inside docker/entrypoint.sh:

#!/usr/bin/env bash

# Modified from the original version:
# https://github.com/puckel/docker-airflow/blob/master/script/entrypoint.sh

TRY_LOOP="20"

: "${REDIS_HOST:="redis"}"
: "${REDIS_PORT:="6379"}"
: "${REDIS_PASSWORD:=""}"

: "${POSTGRES_HOST:="airflow_postgres"}"
: "${POSTGRES_PORT:="5432"}"
: "${POSTGRES_USER:="airflow"}"
: "${POSTGRES_PASSWORD:="airflow"}"
: "${POSTGRES_DB:="airflow"}"

# Defaults and back-compat
: "${AIRFLOW_HOME:="/usr/local/airflow"}"

export \
  AIRFLOW_HOME \
  AIRFLOW__CELERY__BROKER_URL \
  AIRFLOW__CELERY__RESULT_BACKEND \
  AIRFLOW__CORE__LOAD_EXAMPLES \
  AIRFLOW__CORE__SQL_ALCHEMY_CONN


wait_for_port() {
  local name="$1" host="$2" port="$3"
  local j=0
  while ! nc -z "$host" "$port" >/dev/null 2>&1 < /dev/null; do
    j=$((j+1))
    if [ $j -ge $TRY_LOOP ]; then
      echo >&2 "$(date) - $host:$port still not reachable, giving up"
      exit 1
    fi
    echo "$(date) - waiting for $name... $j/$TRY_LOOP"
    sleep 5
  done
}

wait_for_port "Postgres" "$POSTGRES_HOST" "$POSTGRES_PORT"

case "$1" in
  webserver)
    airflow scheduler &
    exec airflow webserver
    ;;
  worker|scheduler)
    # To give the webserver time to run initdb.
    sleep 10
    exec airflow "$@"
    ;;
  flower)
    sleep 10
    exec airflow "$@"
    ;;
  version)
    exec airflow "$@"
    ;;
  *)
    # The command is something like bash, not an airflow subcommand. Just run it in the right environment.
    exec "$@"
    ;;
esac

Here, besides removing all the bells and whistles that came with puckel/docker-airflow's entrypoint.sh, we have:

  1. Removed most AIRFLOW-related environment variables. They should be set in airflow.cfg. By mounting airflow.cfg into the Docker container, we enjoy the benefits of:
    a) Changing the configurations in airflow.cfg and having the Webserver & Scheduler instantly pick up the changes (hot-reload)
    b) Having a fixed value for $AIRFLOW__CORE__FERNET_KEY regardless of image rebuilding. This fernet key is used to encrypt sensitive metadata in the DB backend (see the snippet after this list if you ever need to generate your own)
  2. Removed airflow initdb, for reasons that will be explained later
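
Should you ever need to generate a fernet key of your own (for example, for a fresh environment), the cryptography package that Airflow already depends on can produce one. A minimal sketch; the file name is made up:

# generate_fernet_key.py: run once, then paste the output into airflow.cfg
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())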

Make sure to make docker/entrypoint.sh executable:

chmod +x docker/entrypoint.sh

Inside docker/docker-compose.yml:

version: '3.7'
services:
  airflow_postgres:
    image: postgres:9.6.2
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    volumes:
      - "airflow_dbdata:/var/lib/postgresql/data"
    ports:
      - "5432:5432"

  airflow_initdb:
    build:
      context: ..
      dockerfile: docker/Dockerfile
    depends_on:
      - airflow_postgres
    volumes:
      - ../airflow.cfg:/usr/local/airflow/airflow.cfg
      - ../variables:/usr/local/airflow/variables
    command:
      - /bin/bash
      - -c
      - |
        airflow initdb
        if [[ -e /usr/local/airflow/variables/dev/all.json ]]; then
          airflow variables -i /usr/local/airflow/variables/dev/all.json
        fi
        # Enable this if you choose to have RBAC UI activated in the webserver
        # airflow create_user -r Admin -u airflow -e airflow@airflow.com -f Air -l Flow -p airflow

  airflow_webserver:
    build:
      context: ..
      dockerfile: docker/Dockerfile
    restart: always
    depends_on:
      - airflow_initdb
    volumes:
      - ../airflow.cfg:/usr/local/airflow/airflow.cfg
      - ../dags:/usr/local/airflow/dags
      - ../plugins:/usr/local/airflow/plugins
      - ./spark-conf:/spark-conf
    ports:
      - "8080:8080"
      - "4040:4040"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3

volumes:
  airflow_dbdata:

Notable points here:

  1. An airflow_postgres DB backend. When you run this setup, you will have a postgres DB running at localhost:5432
  2. An airflow_initdb container, responsible for the initialisation steps — things like airflow initdb, or airflow variables ... to load all the Airflow variables we have documented in variables/, rather than having to add them manually (a sketch of how a DAG reads one of these variables follows this list).
  3. An airflow_webserver container, which runs both the Airflow Scheduler & Webserver. We have configured volume mounts for airflow.cfg, dags, and plugins. Effectively, this means changes to the DAGs, plugins, and configuration will be picked up by Airflow at runtime. When you run this setup, you will have access to your Airflow UI at localhost:8080
  4. An airflow_dbdata persistent volume, because we do not want to lose the connections that we have set. Notice that we did not do any initialisation of the Airflow connections, because they contain sensitive data. As we develop new workflows, we will add new connections, and since we do not want to lose that metadata, we use a Docker persistent volume.
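
As mentioned in point 2 above, here is a sketch of how a DAG would read one of the variables loaded by airflow_initdb. The variable name some_feature_flag is hypothetical; use whichever keys you actually keep in variables/dev/all.json:

# inside a DAG file under dags/ (illustrative only)
from airflow.models import Variable

# Falls back to a default if the variable has not been loaded into the DB yet
some_feature_flag = Variable.get("some_feature_flag", default_var="off")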

Inside the airflow.cfg (refer to the boilerplate), we specify the sql_alchemy_conn to point to our local airflow_postgres container:

sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@airflow_postgres:5432/airflow
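
Note that the hostname airflow_postgres only resolves inside the Docker network; from your host machine, the same database is reachable at localhost:5432 thanks to the port mapping. If you want to sanity-check connectivity from the host, here is a minimal sketch using SQLAlchemy (installed alongside Airflow); the file name is made up:

# check_db.py: optional connectivity check, run on the host machine
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost:5432/airflow")
with engine.connect() as conn:
    # Prints the Postgres version if the Docker DB is up and reachable
    print(conn.execute("SELECT version()").scalar())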

Last but not least, requirements-airflow.txt:

apache-airflow[postgres]==1.10.8
cryptography==2.9.2
werkzeug==0.16.1

With these, we now have a working Docker environment where we can run complex DAGs, experiment with airflow.cfg, and make use of the Airflow UI.

To run the complete set of Airflow components (DB, scheduler, webserver), in your project root, run:

docker-compose -f docker/docker-compose.yml up -d

To run only the DB (with initialisation):*

docker-compose -f docker/docker-compose.yml up -d airflow_initdb

*Note: if you are only interested in using the Local Environment, running only the DB is sufficient.

The Local Environment

Main interface: Airflow CLI

The main components of the Local Environment

Let us take a look at env.sh:

# Set these env vars so that:
# 1. airflow commands locally are run against the docker postgres
# 2. `airflow test` runs properly

AIRFLOW_HOME=$(pwd)

# Note: env vars set here should be printed out in the print statement below
export AIRFLOW_HOME
export AIRFLOW__CORE__EXECUTOR=SequentialExecutor
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
export AIRFLOW__CORE__FERNET_KEY=mkA0ggJccF5BSlGBIY5adyXAyPqpYizW9KhdJFjgdaQ=
export AIRFLOW__CORE__DAGS_FOLDER=$AIRFLOW_HOME/dags
export AIRFLOW__CORE__PLUGINS_FOLDER=$AIRFLOW_HOME/plugins
export AIRFLOW__CORE__BASE_LOG_FOLDER=$AIRFLOW_HOME/logs
export AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=$AIRFLOW_HOME/logs/dag_processor_manager/dag_processor_manager.log
export AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=$AIRFLOW_HOME/logs/scheduler

print \
"===================================================================
Environment Variables for Local Execution (vs execution in Docker)
===================================================================
AIRFLOW_HOME=$AIRFLOW_HOME
AIRFLOW__CORE__EXECUTOR=$AIRFLOW__CORE__EXECUTOR
AIRFLOW__CORE__SQL_ALCHEMY_CONN=$AIRFLOW__CORE__SQL_ALCHEMY_CONN
AIRFLOW__CORE__FERNET_KEY=$AIRFLOW__CORE__FERNET_KEY
AIRFLOW__CORE__DAGS_FOLDER=$AIRFLOW__CORE__DAGS_FOLDER
AIRFLOW__CORE__PLUGINS_FOLDER=$AIRFLOW__CORE__PLUGINS_FOLDER
AIRFLOW__CORE__BASE_LOG_FOLDER=$AIRFLOW__CORE__BASE_LOG_FOLDER
AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION=$AIRFLOW__CORE__DAG_PROCESSOR_MANAGER_LOG_LOCATION
AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY=$AIRFLOW__SCHEDULER__CHILD_PROCESS_LOG_DIRECTORY
"

The main purpose of env.sh is to ensure that the Airflow CLI will:

  1. Make this project the $AIRFLOW_HOME
  2. Connect to the DB Backend (our Docker Postgres DB)

Notice that:

  1. The Docker and the Local Environment both share the same DB backend
  2. The DB backend has to be up, no matter which environment you wish to use.

Similarly important is the .gitignore, where we ignore the miscellaneous files that Airflow outputs at runtime:

# stuffs that airflow output during runtime
logs/
unittests.cfg

With these, in your project root, you can now use the Airflow CLI as such:

source env.sh && airflow test <dag_id> <task_id> <execution_date> 

to run a single task instance! But for 10x productivity, read on.

PyCharm setup

1. Make sure that your Project Interpreter is pointing to the correct virtual environment and you have all your project requirements installed — airflow, etc

Project interpreter pointing to the virtual environment, with all dependencies installed

2. Mark both dags/ and plugins/ directories as source:
The reason is that, when you run any airflow command, Airflow adds $AIRFLOW__CORE__DAGS_FOLDER and $AIRFLOW__CORE__PLUGINS_FOLDER to sys.path at runtime. Because our IDE does static analysis, we need to add these directories to the PYTHONPATH ourselves, and marking them as Sources Root does exactly that.

Mark dags and plugins directories as “Sources Root”

3. On the terminal, run source env.sh and copy the environment variables to your clipboard

Run env.sh and copy the env vars

4. Add a new Run/Debug Configuration with the following parameters:
a) Name: <whatever_you_want>
b) Script path: <path_to_your_virtualenv_airflow_executable>
c) Parameters: test <dag_id> <task_id> <execution_date>
d) Environment variables: paste your env vars here
Now you can run the task instance you want! :-) Just duplicate this configuration for any new task

Run/debug configurations

5. Also, add those environment variables to your test configuration template (pytest in my case), so that you can just hit the run/debug button next to your test functions

pytest template with environment variables filled in

Now, you are good to go :-)

(Bonus) DAG validation tests

In tests/conftest.py:

import os
import pytest

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.models import DagBag
from pyspark.sql import SparkSession

DAGS_FOLDER = os.environ["AIRFLOW__CORE__DAGS_FOLDER"]


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder=DAGS_FOLDER, include_examples=False)


@pytest.fixture
def dag():
    default_args = {"owner": "airflow", "start_date": days_ago(1)}
    return DAG("test_dag", default_args=default_args, schedule_interval="@daily")
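
The dag_bag fixture powers the global checks in the next file, while the throwaway dag fixture is handy for unit-testing individual tasks and operators. As an illustration (the task ids below are hypothetical and not part of the boilerplate):

# tests/test_example_structure.py (illustrative only)
from airflow.operators.dummy_operator import DummyOperator


def test_extract_runs_before_load(dag):
    extract = DummyOperator(task_id="extract", dag=dag)
    load = DummyOperator(task_id="load", dag=dag)
    extract >> load

    # The dependency is registered on the throwaway test DAG
    assert "extract" in load.upstream_task_ids
    assert "load" in extract.downstream_task_ids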

In tests/test_dags.py, we do a global check on all DAGs:

"""Test the validity of dags."""


def test_import_errors(dag_bag):
"""
Tests that the DAG files can be imported by Airflow without errors.
ie.
- No exceptions were raised when processing the DAG files, be it timeout or other exceptions
- The DAGs are indeed acyclic
DagBag.bag_dag() checks for dag.test_cycle()
"""
assert len(dag_bag.import_errors) == 0


def test_dags_has_task(dag_bag):
for dag in dag_bag.dags.values():
assert len(dag.tasks) > 0
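
These two checks already catch most day-to-day mistakes. If your team enforces additional conventions, they are easy to encode here too; for instance, a sketch of a check that every task declares an explicit owner (adjust it to whatever rule you actually enforce):

def test_every_task_has_an_owner(dag_bag):
    for dag in dag_bag.dags.values():
        for task in dag.tasks:
            # "airflow" is Airflow's fallback owner, so flag tasks relying on it
            assert task.owner and task.owner != "airflow", (
                "Task %s in DAG %s has no explicit owner" % (task.task_id, dag.dag_id)
            )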

To run the tests:

source env.sh && pytest -v

Caveats

  • As mentioned previously, the PyPI packages are installed at build time instead of run time, to minimise the start-up time of our development environment. As a side effect, whenever there are new PyPI packages, the images need to be rebuilt. You can do so by passing the extra --build flag:
docker-compose -f docker/docker-compose.yml up -d --build
  • PyCharm cannot recognise custom plugins. That is because an IDE does static analysis, while the custom plugins are registered dynamically by Airflow during runtime.
PyCharm failing to recognise custom plugin
  • Not related to the development environment, but rather to how Airflow works — some of the configurations you change in airflow.cfg (like rbac = True) might not be reflected immediately at runtime. That is because they are static configurations, evaluated only once at startup. To solve that problem, just restart your webserver:
docker-compose -f docker/docker-compose.yml restart airflow_webserver
  • Not related to the development environment, but rather to how Airflow works — you cannot have a package/module with the same name in both dags/ and plugins/. This will likely give you a ModuleNotFoundError (illustrated below).
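
To make that last caveat concrete, here is a hypothetical layout that triggers it (the package name common is made up and not part of the boilerplate):

# dags/common/__init__.py and plugins/common/__init__.py both exist.
# Airflow puts both dags/ and plugins/ on sys.path, so only one of the two
# "common" packages wins the import below; code expecting the other one may
# fail with ModuleNotFoundError, or silently import the wrong module.
import common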

There you have it! :-)

You can access this setup at:
https://github.com/ninja-van/airflow-boilerplate

Leave some comments below if you have any questions or suggestions on how it can be improved! Let me know if this has helped you!

Many thanks to my colleagues at Ninja Van for the inputs that went into the shaping of this article. Special thanks to my colleagues in the Data team (SK Sim, Kat Guevara, and Luqi Chen) for the valuable feedback on the development environment.
