Introduced by Airbnb, Airflow is a platform for scheduling and monitoring data pipelines. It is especially useful for complex pipelines that consist of many tasks with dependencies among them. At WorkGenius we use Airflow for a variety of purposes, such as extracting freelancers' performance and skills and pre-processing data for the recommendation model.
In this article, we first give an overview of the basic setup needed to run Airflow, including an example use case. Then we discuss how it can be optimized through its configuration parameters and the internal tools it provides.
Key concepts of Airflow, such as the DAG, are skipped because they are well explained in the official documentation. If you are already familiar with the installation process, you can skip ahead to the “Airflow Tips” section.
For the demo we use Airflow 1.10.3, installed via pip on Ubuntu.
Airflow requires a directory in which to store its config file and other files and directories (logs, database, etc.). This directory is defined by the environment variable AIRFLOW_HOME, which defaults to ~/airflow. To use another directory, export this variable with the desired path before running any Airflow command. Airflow itself is installed with pip:
pip install apache-airflow==1.10.3
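As a minimal sketch of setting AIRFLOW_HOME, assuming a hypothetical directory ~/airflow_demo (any writable path works), the export could look like this:

```shell
# Hypothetical location used only for illustration; pick any writable directory.
export AIRFLOW_HOME="$HOME/airflow_demo"

# Airflow creates the directory on first use; creating it up front is harmless.
mkdir -p "$AIRFLOW_HOME"

echo "AIRFLOW_HOME is set to $AIRFLOW_HOME"
```

The variable only lasts for the current shell session, so adding the export line to your shell profile is one way to make it permanent.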
After installing Airflow, the first task is to initialize the metadata database. In Airflow 1.10.x this is done with the airflow initdb command.
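The initialization step can be sketched as follows, assuming Airflow 1.10.x (later major versions renamed this subcommand). The AIRFLOW_DIR variable and the PATH check are illustrative additions, not part of Airflow itself:

```shell
# Resolve the home directory the same way Airflow does: AIRFLOW_HOME, else ~/airflow.
AIRFLOW_DIR="${AIRFLOW_HOME:-$HOME/airflow}"

if command -v airflow >/dev/null 2>&1; then
    # Creates the metadata tables; with the default settings this is a SQLite file.
    airflow initdb
    # The config file and the database file should now be listed here.
    ls -l "$AIRFLOW_DIR"
else
    echo "airflow is not on PATH; install it with pip first" >&2
fi
```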
When the initialization is finished, we can check whether the installation succeeded by running the airflow version command. It should print the Airflow banner followed by the installed version number, which is 1.10.3 in this demo.
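To check the installation, one possible sketch is shown here. The cross-check against pip and the pip_version variable are assumptions added for illustration:

```shell
if command -v airflow >/dev/null 2>&1; then
    # Prints the version of the Airflow installation found on PATH.
    airflow version
else
    echo "airflow is not on PATH" >&2
fi

# Cross-check against the package metadata recorded by pip.
pip_version="$(pip show apache-airflow 2>/dev/null | sed -n 's/^Version: //p')"
echo "pip reports apache-airflow version: ${pip_version:-not installed}"
```

If the two versions differ, the airflow executable on PATH most likely belongs to a different Python environment than the pip being queried.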
Airflow runs as two main processes. One is the webserver, which provides a web UI to monitor and manage the DAGs. The other is the scheduler, which is responsible for scheduling and running the DAGs. They can be started separately with the airflow webserver and airflow scheduler commands.
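Both processes run in the foreground until stopped, so they are usually started in separate terminals. A minimal sketch follows, with each command wrapped in a hypothetical helper function (start_webserver and start_scheduler are illustrative names, not Airflow commands):

```shell
# Web UI; -p sets the port, and 8080 is also the default.
start_webserver() {
    airflow webserver -p 8080
}

# Scheduler; parses the DAG files and triggers task runs.
start_scheduler() {
    airflow scheduler
}
```

Calling start_webserver in one terminal and start_scheduler in another starts both processes, and the UI should then be reachable at http://localhost:8080.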
By default, all of Airflow's configuration is defined in the file airflow.cfg under the AIRFLOW_HOME directory. For example, initially, Airflow…