Set up Apache Airflow on a Multi-Node Cluster With PostgreSQL and RabbitMQ

Ali Aminzadeh
6 min read · Aug 20, 2023


Introduction:

Apache Airflow is an open-source platform widely used for orchestrating, scheduling, and monitoring complex workflows. It provides a flexible and scalable solution for managing data pipelines, making it a popular choice among data engineers and data scientists.

In this tutorial, we will guide you through the process of setting up Apache Airflow on a multi-node cluster. By leveraging the power of a distributed environment, you can harness the full potential of Airflow’s capabilities and ensure high availability, scalability, and fault tolerance.

To enhance the reliability and scalability of our Airflow deployment, we will utilize PostgreSQL as the metadata database and RabbitMQ as the message broker. PostgreSQL offers robust data storage capabilities, while RabbitMQ enables efficient communication between Airflow components.

Throughout this tutorial, we will cover the step-by-step installation and configuration of the multi-node cluster, including the setup of PostgreSQL and RabbitMQ. By the end, you will have a fully functional Airflow environment ready to execute and manage your workflows efficiently.

Whether you are a data engineer, data scientist, or anyone interested in workflow automation, this tutorial will provide you with the knowledge and guidance to establish a powerful Airflow setup on a multi-node cluster. Let’s dive in and unlock the potential of Apache Airflow for your data pipeline management needs.

1. Set Up the System & Install PostgreSQL

First, update your Ubuntu system:

sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get autoremove -y && sudo apt-get autoclean

Now we need to install PostgreSQL:

sudo apt-get -y install postgresql postgresql-contrib libpq-dev postgresql-client postgresql-client-common

Make sure PostgreSQL is installed successfully and its service is running:

$ sudo systemctl status postgresql


● postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
Active: active (exited) since Tue 2023-08-01 14:22:15 +0330; 2 weeks 4 days ago
Main PID: 21633 (code=exited, status=0/SUCCESS)
CPU: 4ms

Then we configure our PostgreSQL database for Airflow:

sudo -u postgres psql

Create airflow user:

CREATE USER airflow PASSWORD 'your-password';

Create airflow database:

CREATE DATABASE airflow;

Grant proper permissions to the airflow user. Note that on PostgreSQL 15 and later, the public schema is no longer writable by everyone, so you must also grant schema access inside the airflow database:

GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
\c airflow
GRANT ALL ON SCHEMA public TO airflow;
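
You can verify the role and database before leaving psql (\du and \l are psql meta-commands that list roles and databases matching a pattern):

\du airflow
\l airflow
\q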

Finally, make sure you configure access management to PostgreSQL based on your cluster's needs and layout, in the following configuration files:

/etc/postgresql/15/main/pg_hba.conf

/etc/postgresql/15/main/postgresql.conf
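
For the other Airflow nodes to reach this database, PostgreSQL must listen on a non-loopback interface and pg_hba.conf must allow connections from the cluster network. A minimal sketch, assuming a hypothetical 10.0.0.0/24 cluster subnet (adjust the address range and auth method to your environment):

# in /etc/postgresql/15/main/postgresql.conf
listen_addresses = '*'

# in /etc/postgresql/15/main/pg_hba.conf
host    airflow    airflow    10.0.0.0/24    scram-sha-256

Restart PostgreSQL for the changes to take effect:

sudo systemctl restart postgresql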

2. Install & Configure RabbitMQ

RabbitMQ is a queueing service that implements the Advanced Message Queuing Protocol (AMQP). It is a fast and dependable open-source message server that supports a wide range of use cases including reliable integration, content-based routing and global data delivery, and high volume monitoring and data ingestion.

As of this writing, the latest version of RabbitMQ only works with Erlang version 25 or below, so make sure you install the proper Erlang version, or you will face errors and your Airflow components won't work properly.

First add Erlang version 25 repository to apt sources:

sudo add-apt-repository ppa:rabbitmq/rabbitmq-erlang-25

Then install Erlang and RabbitMQ:

## Update package indices
sudo apt-get update -y

## Install Erlang packages
sudo apt-get install -y erlang-base \
erlang-asn1 erlang-crypto erlang-eldap erlang-ftp erlang-inets \
erlang-mnesia erlang-os-mon erlang-parsetools erlang-public-key \
erlang-runtime-tools erlang-snmp erlang-ssl \
erlang-syntax-tools erlang-tftp erlang-tools erlang-xmerl

## Install rabbitmq-server and its dependencies
sudo apt-get install rabbitmq-server -y --fix-missing

Install Essential Dependencies:

sudo apt-get update -y

sudo apt-get install curl gnupg -y

After installation, the RabbitMQ service is started and enabled to start on boot. To check its status, run:

$ sudo systemctl status rabbitmq-server.service

● rabbitmq-server.service - RabbitMQ broker
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-05-12 23:57:22 EAT; 40s ago
Main PID: 5631 (beam.smp)
Tasks: 28 (limit: 9460)
Memory: 95.0M
CPU: 3.177s
CGroup: /system.slice/rabbitmq-server.service
├─5631 /usr/lib/erlang/erts-12.2.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -sbwt none -sbwtdcpu>
├─5642 erl_child_setup 32768
├─5672 /usr/lib/erlang/erts-12.2.1/bin/epmd -daemon
├─5699 inet_gethost 4
└─5700 inet_gethost 4

Enable RabbitMQ Dashboard:

sudo rabbitmq-plugins enable rabbitmq_management

The web UI should now be listening on TCP port 15672:

$ sudo ss -tunelp | grep 15672

tcp LISTEN 0 128 0.0.0.0:15672 0.0.0.0:* users:(("beam.smp",pid=9525,fd=71)) uid:111 ino:39934 sk:9 <->

If you have an active UFW firewall, open both ports 5672 and 15672:

sudo ufw allow proto tcp from any to any port 5672,15672

Access it by opening the URL http://[server IP|Hostname]:15672

By default, the guest user exists and can connect only from localhost. You can log in with this user locally using the password "guest".

To be able to log in over the network, create an admin user as shown below:

sudo rabbitmqctl add_user admin StrongPassword

sudo rabbitmqctl set_user_tags admin administrator

Create a new virtual host named airflow:

sudo rabbitmqctl add_vhost airflow

Grant user permissions for vhost “airflow” and user “admin”:

sudo rabbitmqctl set_permissions -p airflow admin ".*" ".*" ".*"
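
To confirm the grants took effect, you can list the permissions on the new vhost:

sudo rabbitmqctl list_permissions -p airflow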

RabbitMQ service is configured and ready to use.

3. Install & Configure Airflow Web Server and Scheduler

Now that PostgreSQL and RabbitMQ are installed and configured properly, we can install Airflow and set up its components.

In this tutorial we are going to install Airflow version 2.7.0 using pip and Python 3.9.

First create a venv:

python3.9 -m venv airflow-2

Activate created venv:

source airflow-2/bin/activate
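
Optionally, upgrade pip inside the fresh venv before installing anything, which avoids resolver quirks with older pip releases:

pip install --upgrade pip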

Now install Airflow with the celery and postgres extras using the command below:

pip install "apache-airflow[celery,postgres]==2.7.0" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.0/constraints-3.9.txt
pip install psycopg2
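
Verify that the installation succeeded:

airflow version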

After Airflow is installed successfully, we must change its configuration. The airflow.cfg file lives under $AIRFLOW_HOME (usually the Airflow user's home directory) and is generated the first time any airflow command runs:

sudo nano /home/airflow/airflow/airflow.cfg

Apply the following configuration so that the Airflow services can connect to your PostgreSQL and RabbitMQ services with the proper credentials:

Here we change the default SequentialExecutor to the CeleryExecutor, which distributes task execution across workers via Celery and RabbitMQ.

SQLite, the default metadata database for Apache Airflow, cannot handle parallel execution, so we also point the SQLAlchemy connection at PostgreSQL.

[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow:your-password@host/airflow

[celery]
broker_url = amqp://admin:StrongPassword@host/airflow
result_backend = db+postgresql://airflow:your-password@host/airflow

You can change other configurations based on your needs.
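
Before initializing the schema, you can confirm that Airflow can actually reach the metadata database:

airflow db check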

After you have set the proper configuration, initialize the database using the command below:

airflow db init

Then create a user so we can log in to the Airflow web UI:

airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email admin@admin.com

Start Airflow Web Server:

airflow webserver

Start Airflow Scheduler:

airflow scheduler

If everything is set up properly, these services should now be running without any errors.

Open http://localhost:8080, enter the credentials you just created, and you are good to go.
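
Note that both entrypoints run in the foreground. For anything beyond a quick test, you can daemonize them with the -D flag (a systemd unit is the more robust option for production):

airflow webserver -D
airflow scheduler -D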

4. Install Airflow Web Server and Scheduler on Multiple Nodes

To make our Airflow web server and scheduler highly available across multiple nodes, follow all of the steps from section 3 (Install & Configure Airflow Web Server and Scheduler) on the other machines with the exact same configuration.

Before that, make sure all of these machines can reach each other, and that the master node hosting PostgreSQL and RabbitMQ is accessible from every one of them. A quick connectivity check is shown below.
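
A minimal sketch of such a check, assuming a hypothetical master-node hostname (swap in your own; pg_isready ships with the postgresql-client package, and remember to open port 5432 in UFW on the master if the firewall is active):

pg_isready -h master-node -p 5432
nc -zv master-node 5672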

5. Install Airflow Worker

Now that our web servers and schedulers are up and running, install Airflow on your worker nodes using the previous commands, with the same configuration as on the web server and scheduler machines. Then start a worker listening on a queue with the command below:

airflow celery worker -q airflowtasks
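
The -q flag subscribes this worker to the airflowtasks queue only. Tasks will land there only if they are routed to it, either by setting queue='airflowtasks' on individual operators or by making it the default queue in airflow.cfg, for example:

[operators]
default_queue = airflowtasks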

Congratulations, you have now successfully installed Airflow with multiple web servers and schedulers, giving you a highly available Airflow setup.
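
As an optional last step, you can monitor the Celery workers through the Flower UI, which Airflow wraps with its own CLI command (the flower package must be present in the venv; pip install flower if it is not). By default it listens on port 5555:

airflow celery flower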
