Airflow 2.0 Docker Development Setup (Docker Compose, PostgreSQL)

Setting up Airflow or migrating to the new Airflow 2.0 can be time-consuming and get complicated fast. In this tutorial, the AVA team will take you through a quick and easy setup of the latest Airflow version to unlock some serious performance gains.

Nikola Tomic
AVA Information
4 min read · May 5, 2021


Photo by Luo Lei on Unsplash

A quick introduction

Apache Airflow is an open-source platform for scheduling, monitoring, and workflow management. It can be used to automate a wide range of tasks in a highly scalable manner and is very extensible. Those are only a few of the reasons why we at AVA, alongside companies like Tesla, Airbnb, Slack, and many others, use Airflow.

As a Data Scientist at AVA Information Systems, running and maintaining pipelines that train Machine Learning models over large volumes of data is a daily routine, and it can definitely become time-consuming and repetitive. This is where Airflow comes to the rescue.

Airflow 2.0 improvements

There is a lot new in the latest Airflow, including a UI redesign, Task Groups, Functional DAGs, and security improvements, all of which are good reasons to consider a migration. But the biggest reason we migrated is the massive performance improvement that Airflow 2.0 promised and delivered in practice.

Prerequisites

To follow along you will need Docker and Docker Compose installed, plus git for cloning the Airflow repository.

Solution Architecture

To give a better perspective on what we will be doing, below is a bird’s eye view of the proposed architecture. It might look more complicated than it actually is.

Airflow 2.0 solution architecture with Docker and Docker Compose

Airflow 2.0 image

In this step, we will shed quite a bit of weight from Airflow’s original image.

Important: this step is optional, and if you’re in a rush you can visit the accompanying git repository for single-command run and configuration instructions. In this post I will explain each step of building the entire architecture above in a bit more detail.

If you want a lean image, the next few steps go through how to remove unnecessary dependencies from the official image. First, let’s clone Airflow’s latest stable version from the git repository:
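The clone command itself was in an embedded gist; assuming the official apache/airflow GitHub repository and the 2.0.2 tag (a stable release at the time of writing), it looks roughly like this:

# Clone the Airflow repository and check out a stable 2.0 tag
git clone https://github.com/apache/airflow.git
cd airflow
git checkout 2.0.2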

Now open the Dockerfile (not Dockerfile.ci) from the cloned repository in your favorite editor. We will remove all unnecessary dependencies from AIRFLOW_EXTRAS (line 37). You can also pin Python to a specific version by changing PYTHON_BASE_IMAGE and PYTHON_MAJOR_MINOR_VERSION (lines 47 and 48). After these changes, in our case, the file should start like the gist below:

First part of Dockerfile
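The gist itself isn’t reproduced here, but the top of the edited Dockerfile could look roughly like this; the ARG names come from the upstream Airflow 2.0 Dockerfile, while the Python version and the trimmed extras list below are only illustrative choices:

# Base Python image and version (pick the version you need)
ARG PYTHON_BASE_IMAGE="python:3.8-slim-buster"
ARG PYTHON_MAJOR_MINOR_VERSION="3.8"

# Keep only the extras we actually use; every extra removed shrinks the image
ARG AIRFLOW_EXTRAS="celery,postgres,redis,ssh,statsd"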

By editing AIRFLOW_EXTRAS alone, we reduced the image size by almost 300 MB. Of course, you can go even further and make the image smaller if you wish. Now let’s build the image and tag it as airflow-slim:
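The exact build invocation was in a gist; from the root of the cloned repository, something along these lines does the job (the tag name is the one we reference later):

# Build the slimmed-down Airflow image and tag it
docker build . -t airflow-slim:latest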

Development image

Why are we building a development image? Our development image extends the airflow-slim image we created previously, and we use it to add the additional dependencies a specific use case requires. Create a new Dockerfile in your chosen project directory. In our case, we’ll install system dependencies with apt-get and then install additional Python dependencies from the requirements.txt file.

Development Dockerfile
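The original gist isn’t embedded here; a minimal sketch of such a development Dockerfile could look like the following, where the apt-get package and the requirements.txt file are placeholders for whatever your use case needs:

FROM airflow-slim:latest

# Switch to root so apt-get has the privileges it needs
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Switch back to the airflow user for Python dependencies
USER airflow
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt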

Note that we have to switch USER to root to get the privileges needed to install dependencies with apt-get, and then switch back to the airflow user to install the Python dependencies. Now let’s build and tag the image:
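Again, assuming the tag name that the docker-compose file below will reference:

# Build the development image from the project directory
docker build . -t airflow-dev:latest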

Building Docker Compose file

Our last step is building a docker-compose file to orchestrate containers as shown in the architecture image. First, make sure the project has the following structure:
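The directory listing from the original post isn’t reproduced here; reconstructed from the files discussed below, the layout looks roughly like this:

.
├── dags/                 # DAG definitions, hot-reloaded into the containers
├── Dockerfile            # development image built on top of airflow-slim
├── requirements.txt      # extra Python dependencies
├── docker-compose.yml
├── entrypoint.sh
└── .env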

Next up, we will quickly go through the contents of each file shown above and explain what they do.

Here we create separate services for the Airflow webserver, the scheduler, and our database (PostgreSQL 13). Volumes are mounted so that local code changes are reflected inside the containers. Since Airflow hot-reloads DAGs, there is no need for a restart, which speeds up development a lot. We also expose port 8080 so the new web UI is reachable at localhost:8080.

docker-compose.yml
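The gist isn’t embedded here; a minimal sketch of a compose file matching the description above could look like this (the image tag, credentials, and paths are the assumed ones from the earlier steps):

version: "3"

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres_data:/var/lib/postgresql/data

  scheduler:
    image: airflow-dev:latest
    depends_on:
      - postgres
    env_file: .env
    volumes:
      - ./dags:/opt/airflow/dags
      # entrypoint.sh must be executable on the host
      - ./entrypoint.sh:/entrypoint.sh
    entrypoint: /entrypoint.sh

  webserver:
    image: airflow-dev:latest
    depends_on:
      - scheduler
    env_file: .env
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: webserver

volumes:
  postgres_data: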

In the entrypoint script, we have the commands that the scheduler service executes on startup: database initialization, creating an admin account for web UI login (user: admin, password: admin), and starting the scheduler itself.

entrypoint.sh
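A sketch of what that script might contain, using the standard Airflow 2.0 CLI commands; remember to make the file executable (chmod +x entrypoint.sh):

#!/usr/bin/env bash
set -e

# Initialize or migrate the metadata database in PostgreSQL
airflow db upgrade

# Create the admin account for web UI login (user: admin, password: admin)
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com || true

# Finally, start the scheduler in the foreground
exec airflow scheduler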

The environment (.env) file is where we set Airflow variables that override the default configuration generated in the airflow.cfg file after the initial run. A complete list of possible configuration options can be found in the Airflow configuration reference. Note that the AIRFLOW__CORE__SQL_ALCHEMY_CONN value has the following pattern:
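postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>

Here <host> is the name of the database service from docker-compose (postgres in our setup).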

If you want to change any of these parameters, make sure the changes are reflected in the postgres service’s environment variables in docker-compose.yml.

.env
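A minimal example, assuming the postgres credentials and service name used in the compose sketch above (variable names follow Airflow’s AIRFLOW__{SECTION}__{KEY} convention):

# Run tasks with the LocalExecutor inside the scheduler container
AIRFLOW__CORE__EXECUTOR=LocalExecutor

# Point Airflow at the PostgreSQL service defined in docker-compose
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres:5432/airflow

# Skip loading the example DAGs
AIRFLOW__CORE__LOAD_EXAMPLES=False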

Now run docker-compose up and the webserver will be running on localhost:8080.
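Start everything in detached mode if you prefer it to run in the background:

docker-compose up -d

Then open http://localhost:8080 and sign in with the admin / admin account created by entrypoint.sh.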

And we are officially done with the setup. Thank you all for staying until the end and I hope this was helpful to you.

For more information and updates from us visit AVA.info

Follow us on LinkedIn & Twitter
