Airflow Installation Simplified — using Docker Compose/Podman Compose

Raviteja Tholupunoori · Published in Apache Airflow · Jul 3, 2024 · 5 min read

Contents

  • Introduction
  • Installing Apache Airflow
  • Running Docker Compose
  • Interacting with Airflow UI
  • Clean up

GitHub link: https://github.com/raviteja10096/Airflow/tree/main/Airflow_Docker
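If you want to follow along with the exact files used in this post, you can clone the repository (the repo URL and folder below are inferred from the link above):

git clone https://github.com/raviteja10096/Airflow.git
cd Airflow/Airflow_Docker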

Introduction

Airflow is a popular tool that simplifies complex workflows. It allows you to programmatically define, schedule, and monitor your workflows, all in one place. While Airflow is a powerful option, installation can sometimes feel overwhelming.

This guide breaks the setup down into two easy-to-follow options, Docker Compose and Podman Compose, getting you up and running with Airflow in no time.

[Screenshot: Sample Airflow UI]

Installing Airflow:
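The docker-compose.yaml used in this post is based on the official Airflow template. If you would rather start from the official file itself, the Airflow docs publish it for download; the command below assumes the stable docs path (substitute a pinned version such as 2.9.2 if you prefer):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'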

Airflow Components

Airflow runs as a set of cooperating services: the scheduler, webserver, worker, Redis, Postgres, and (optionally) Flower.

The docker-compose.yaml file includes the following service definitions:

  • airflow-scheduler: Manages and schedules tasks and DAGs.
airflow-scheduler:
  <<: *airflow-common
  command: scheduler
  healthcheck:
    test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 30s
  restart: always
  depends_on:
    <<: *airflow-common-depends-on
    airflow-init:
      condition: service_completed_successfully
  • airflow-webserver: Hosts the web interface accessible at localhost:8080.
airflow-webserver:
  <<: *airflow-common
  command: webserver
  ports:
    - "8080:8080"
  healthcheck:
    test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 30s
  restart: always
  depends_on:
    <<: *airflow-common-depends-on
    airflow-init:
      condition: service_completed_successfully
  • airflow-worker: Executes tasks assigned by the scheduler.
airflow-worker:
  <<: *airflow-common
  command: celery worker
  healthcheck:
    # yamllint disable rule:line-length
    test:
      - "CMD-SHELL"
      - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 30s
  environment:
    <<: *airflow-common-env
    # Required to handle warm shutdown of the celery workers properly
    # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
    DUMB_INIT_SETSID: "0"
  restart: always
  depends_on:
    <<: *airflow-common-depends-on
    airflow-init:
      condition: service_completed_successfully
  • airflow-init: Initializes the Airflow setup.
airflow-init:
  <<: *airflow-common
  entrypoint: /bin/bash
  # yamllint disable rule:line-length
  command:
    - -c
    - |
      if [[ -z "${AIRFLOW_UID}" ]]; then
        echo
        echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
        echo "If you are on Linux, you SHOULD follow the instructions below to set "
        echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
        echo "For other operating systems you can get rid of the warning with manually created .env file:"
        echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
        echo
      fi
      one_meg=1048576
      mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
      cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
      disk_available=$$(df / | tail -1 | awk '{print $$4}')
      warning_resources="false"
      if (( mem_available < 4000 )) ; then
        echo
        echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
        echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
        echo
        warning_resources="true"
      fi
      if (( cpus_available < 2 )); then
        echo
        echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
        echo "At least 2 CPUs recommended. You have $${cpus_available}"
        echo
        warning_resources="true"
      fi
      if (( disk_available < one_meg * 10 )); then
        echo
        echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
        echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
        echo
        warning_resources="true"
      fi
      if [[ $${warning_resources} == "true" ]]; then
        echo
        echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
        echo "Please follow the instructions to increase amount of resources available:"
        echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
        echo
      fi
      mkdir -p /sources/logs /sources/dags /sources/plugins
      chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
      exec /entrypoint airflow version
  # yamllint enable rule:line-length
  environment:
    <<: *airflow-common-env
    _AIRFLOW_DB_MIGRATE: 'true'
    _AIRFLOW_WWW_USER_CREATE: 'true'
    _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
    _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    _PIP_ADDITIONAL_REQUIREMENTS: ''
  user: "0:0"
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}:/sources
  • flower: Monitors the Celery workers and provides insights into the environment. It is available at localhost:5555. (Optional; see the note after these service definitions for how to enable it via its Compose profile.)
flower:
  <<: *airflow-common
  command: celery flower
  profiles:
    - flower
  ports:
    - "5555:5555"
  healthcheck:
    test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 30s
  restart: always
  depends_on:
    <<: *airflow-common-depends-on
    airflow-init:
      condition: service_completed_successfully
  • postgres: Serves as the database.
postgres:
  image: postgres:13
  environment:
    POSTGRES_USER: airflow
    POSTGRES_PASSWORD: airflow
    POSTGRES_DB: airflow
  volumes:
    - postgres-db-volume:/var/lib/postgresql/data
  healthcheck:
    test: ["CMD", "pg_isready", "-U", "airflow"]
    interval: 10s
    retries: 5
    start_period: 5s
  restart: always
  • redis: Facilitates message forwarding from the scheduler to the workers. (Optional)
redis:
  # Redis is limited to 7.2-bookworm due to licencing change
  # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
  image: redis:7.2-bookworm
  expose:
    - 6379
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 30s
    retries: 50
    start_period: 30s
  restart: always
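One detail worth calling out from the definitions above: the flower service is guarded by the flower profile, so it is not started by default. To bring it up, pass the profile flag (Podman Compose also accepts --profile in recent versions):

docker compose --profile flower up -d
or
podman compose --profile flower up -d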

Airflow Volumes:

Besides the common environment variables for the Airflow services, the compose file mounts four volumes: dags, logs, config, and plugins. So we need to create four matching folders on the local machine for these volumes (see the command after this list):

  1. dags: For placing the DAG scripts
  2. logs: For the Airflow task and scheduler logs
  3. config: For Airflow configurations
  4. plugins: For any extra plugins
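A quick way to create all four folders at once on Linux/macOS, matching the paths the compose file mounts:

mkdir -p ./dags ./logs ./plugins ./config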

Permissions:

For seamless volume synchronization, we need to confirm that the UID (user ID) and GID (group ID) permissions on the Docker volumes align with those on the local filesystem.

In the YAML file, if you look at line no. 74 (inside the shared airflow-common block), the line below is what lets the Airflow containers use the volumes as your local user:

user: "${AIRFLOW_UID:-50000}:0"

In your local environment, run the commands below:

echo -e "AIRFLOW_UID=$(id -u)" > .env
echo -e "AIRFLOW_GID=0" >> .env

After these steps, your .env file should look like this:

AIRFLOW_UID=501 
AIRFLOW_GID=0

The full docker-compose.yaml file is available in the GitHub repository linked above.

Run Docker Compose:

Once everything is set up, we can run the compose file. Start with the airflow-init service, which runs the database migrations and creates the first user account. In my case I'm using Podman; feel free to use either Docker or Podman.

docker-compose up airflow-init
or
podman compose up airflow-init

Once initialization completes, start all the services in detached mode:

docker compose up -d
or
podman compose up -d
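To confirm that everything came up healthy, you can list the running containers and their health status with the standard Compose status command:

docker compose ps
or
podman compose ps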

Accessing the web interface:

Once all the containers are up and running, open localhost:8080 in your browser and log in to Airflow with the default admin credentials:

Username : airflow

Password : airflow

After logging in, you can see the DAGs.
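You can also use the Airflow CLI directly by executing it inside one of the running containers; for example, the command below (using the scheduler service from the compose file and the standard airflow dags list CLI command) prints all registered DAGs:

docker compose exec airflow-scheduler airflow dags list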

We’ve successfully installed the full version of Airflow in just a few minutes using Docker.

Clean up

docker compose down
or
podman compose down
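This stops and removes the containers but keeps the Postgres volume and the downloaded images. For a complete reset, the official Airflow instructions use the standard docker compose down flags to also remove volumes and images (Podman Compose support for --rmi may vary):

docker compose down --volumes --rmi all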
