Setting up Apache Airflow with Celery Executor on Your Linux Machine
Apache Airflow is a powerful open-source platform for orchestrating workflows. In this guide, we’ll walk through setting up Apache Airflow with the Celery Executor, turning your Linux machines into a distributed environment with one master node and two worker nodes. This configuration enables parallel task execution, giving you scalability and efficient resource utilization.
Step 1: Update System Packages
Start by ensuring that all system packages on your Linux machine are up-to-date. This is a crucial initial step to prevent potential compatibility issues during the installation process.
sudo apt update && sudo apt upgrade -y
Step 2: Install Dependencies and Create Virtual Environment
Install necessary dependencies and create a virtual environment:
sudo apt install -y build-essential python3-dev libsqlite3-dev openssl sqlite3 default-libmysqlclient-dev
sudo apt install -y python3.8-venv
python3 -m venv ve
source ve/bin/activate
Step 3: Install Apache Airflow
Install Apache Airflow within the virtual environment:
pip install 'apache-airflow[celery]==2.6.3' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.8.txt"
pip install mysqlclient
airflow version
Step 4: Set Up DAGs Directory
Create a directory for DAG files under the Airflow home (by default ~/airflow, so /root/airflow when running as root):
mkdir -p /root/airflow/dags
cd /root/airflow/dags
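To confirm that the scheduler will pick up DAG files from this directory, you can drop in a minimal test DAG. The filename hello_dag.py and the DAG contents below are illustrative, not part of the original setup:

```shell
# Write a minimal "hello" DAG into the dags folder so the scheduler has
# something to parse. AIRFLOW_HOME defaults to ~/airflow (/root/airflow as root).
DAGS_DIR="${AIRFLOW_HOME:-$HOME/airflow}/dags"
mkdir -p "$DAGS_DIR"
cat > "$DAGS_DIR/hello_dag.py" <<'EOF'
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")
EOF
```

Once the scheduler is running, this DAG should appear in the UI within dag_dir_list_interval seconds.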
Step 5: Mount DAGs & Logs Directory to a DFS
Mount the DAGs and logs directories on a distributed file system (for example NFS) shared by the master and both workers, so every node sees the same DAG files and task logs.
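For example, with NFS exports (the server name nfs-server and export paths below are placeholders, not from the original setup), the shares can be mounted on every node via /etc/fstab entries such as:

```
# /etc/fstab — hypothetical NFS server and export paths
nfs-server:/export/airflow/dags   /root/airflow/dags   nfs  defaults  0  0
nfs-server:/export/airflow/logs   /root/airflow/logs   nfs  defaults  0  0
```

After adding the entries, sudo mount -a mounts both shares; the nfs-common package must be installed on each node.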
Step 6: Edit Airflow Configuration
Edit the airflow.cfg configuration file (found in the Airflow home directory) with the following settings, each under the section shown:

[core]
executor = CeleryExecutor

[database]
sql_alchemy_conn = mysql://*****:**@******:3306/airflow_demo

[celery]
broker_url = amqp://****:****@*****:5672/airflowhost

[scheduler]
dag_dir_list_interval = 30

If not set explicitly, the Celery result backend defaults to the metadata database; you can also set result_backend under [celery] yourself.
Step 7: Initialize Airflow
Initialize the Airflow metadata database after editing the configuration, then create an initial admin account for the web UI (you will be prompted for a password):
airflow db init
airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com
Step 8: Repeat Steps 1–7 on Worker Nodes
Repeat steps 1 to 7 on the worker nodes, then replace each worker's airflow.cfg with the master's airflow.cfg so all nodes share the same configuration.
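Copying the configuration out to the workers can be scripted. The sketch below only prints the scp commands for two hypothetical hosts, worker1 and worker2 — remove the echo to actually run them:

```shell
# Print the scp commands that would sync the master's airflow.cfg to each
# worker (worker1/worker2 are placeholder hostnames; drop "echo" to execute).
for host in worker1 worker2; do
  echo scp /root/airflow/airflow.cfg "root@${host}:/root/airflow/airflow.cfg"
done
```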
Step 9: Start Airflow Webserver and Scheduler on the Master
Start the Airflow webserver and scheduler on the master node (in separate terminals, or daemonized with the -D flag):
airflow webserver -p 8080
airflow scheduler
Step 10: Start Airflow Workers on Worker Nodes
Start Airflow workers on both worker nodes:
airflow celery worker
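To keep a worker running across reboots, its process can be wrapped in a systemd unit. The paths below (the ve virtualenv under /root and AIRFLOW_HOME=/root/airflow) are assumptions based on the steps above — adjust them to your layout:

```
# /etc/systemd/system/airflow-worker.service — illustrative sketch only
[Unit]
Description=Airflow Celery worker
After=network.target

[Service]
Environment=AIRFLOW_HOME=/root/airflow
ExecStart=/root/ve/bin/airflow celery worker
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with sudo systemctl enable --now airflow-worker.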
Conclusion
Your Apache Airflow setup with the Celery Executor in a distributed environment is now complete. You can access the Airflow UI on the master node at http://<master-node-ip>:8080 and start managing and monitoring your workflows.