Airflow: The Beginners Guide

Vipul Pandey
Walmart Global Tech Blog
Dec 5, 2019 · 4 min read

Airflow is an open source platform for programmatically authoring, scheduling and managing workflows.

In Airflow, workflows are defined as a Directed Acyclic Graph (DAG) of tasks. The main components of Airflow are the Scheduler, the Workers and the Webserver, which work together in the following way —

  1. The scheduler executes your tasks on an array of workers while following the specified dependencies.
  2. Rich command line utilities make it easy to perform complex operations on DAGs.
  3. The rich user interface provided by the Airflow Webserver makes it easy to visualise pipelines, monitor their progress and troubleshoot issues.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

In this tutorial (the first part of the Airflow series) we will understand the basic functionality of Airflow through an example and compare it with the traditional Cron-based approach.

Problem Statement -

Suppose we have to automate a pipeline consisting of a set of tasks that runs daily at 9 am UTC and performs the following steps in sequence -

  • Polls for data availability
  • Extracts and processes the data
  • Stores the result in some data store (say HDFS)

Conventional Method -

In the conventional method this can be achieved by creating three scripts, plus a wrapper script that runs them as a single unit; the wrapper script is then scheduled through a Cron entry for 9 am UTC.

Drawbacks of the Conventional Method -

  • No implicit alerting
  • Monitoring Cron logs is a complicated task
  • No implicit retry logic
  • No visualisation
  • No/Complex distributed computing
  • Complex Parallelism
  • No statistical data
  • ……[list will never end]

Airflow to the Rescue -

Airflow combines scheduling, alerting and monitoring in a single platform, and it can work independently without any modification to the main job code, i.e. the actual tasks remain untouched.

The above sequence of tasks can be achieved by writing a DAG in Airflow, which is a collection of all the tasks you want to run, organised in a way that reflects their relationships and dependencies.

Sample Airflow DAG code (https://gist.github.com/vipul007ravi)

In this DAG code (say, my_first_dag.py) the wrapper script of the conventional method is replaced by an Airflow DAG definition, which runs the same three shell scripts and creates a workflow.
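
The gist linked above contains the full definition; what follows is only a minimal sketch of what such a DAG might look like. The script paths, owner, start date and retry delay are assumptions made purely for illustration.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Default values shared by every task in the DAG; owner, start date
    # and retry delay here are illustrative placeholders.
    args = {
        'owner': 'airflow',
        'start_date': datetime(2019, 12, 1),
        'retries': 1,                          # retry a failed task once
        'retry_delay': timedelta(minutes=5),
    }

    # Schedule the DAG daily at 9 am UTC (cron expression '0 9 * * *').
    dag = DAG(
        dag_id='my_first_dag',
        default_args=args,
        schedule_interval='0 9 * * *',
        catchup=False,
    )

    # Each task wraps one of the existing shell scripts; the paths are placeholders.
    # The trailing space after '.sh' stops Airflow from treating the value as a Jinja template file.
    check_data_availability = BashOperator(
        task_id='Check_Data_Availability',
        bash_command='/scripts/check_data_availability.sh ',
        dag=dag,
    )

    extract_process_data = BashOperator(
        task_id='Extract_Process_Data',
        bash_command='/scripts/extract_process_data.sh ',
        dag=dag,
    )

    insert_into_hdfs = BashOperator(
        task_id='Insert_Into_Hdfs',
        bash_command='/scripts/insert_into_hdfs.sh ',
        dag=dag,
    )

    # Dependencies: check availability first, then extract/process, then load into HDFS.
    check_data_availability >> extract_process_data >> insert_into_hdfs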

The above Airflow DAG can be broken down into 3 main components of Airflow -

  1. Default Arguments — the ‘args’ dictionary in the DAG definition specifies the default values that remain the same across the DAG. For example, the default arguments specify the number of retries, which is set to 1 for this DAG. These values can be overridden at the task level.
  2. Operators — Tasks in Airflow are created by operators, i.e. a task is defined by one of the many operators available in Airflow. For example, in the above code Check_Data_Availability is a task which runs a shell script and hence is defined as a BashOperator.
  3. Dependencies — Dependencies define the flow of the Airflow DAG. They can be specified as downstream or upstream. For instance, in the above code Extract_Process_Data depends on Check_Data_Availability and is executed once the Check_Data_Availability task is complete (equivalent ways of declaring this are shown in the snippet after this list).
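
The dependency between two tasks can be declared in several equivalent ways; assuming the task variables from the sketch above, all of the following say that Extract_Process_Data runs after Check_Data_Availability:

    # Equivalent ways of declaring the same dependency
    check_data_availability >> extract_process_data               # downstream, bitshift style
    extract_process_data << check_data_availability               # upstream, bitshift style
    check_data_availability.set_downstream(extract_process_data)
    extract_process_data.set_upstream(check_data_availability)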

These are the main building blocks of Airflow. Once the DAG is defined, it is picked up by the Scheduler (the replacement for Cron) at the time specified in the DAG, and its tasks are sent to the workers for execution.

Airflow UI -

Once the DAG file is placed in the DAGs folder it automatically gets picked up and becomes available in the UI for visualisation and monitoring. Below is a snapshot of the DAG as it is seen in the UI -

Webserver UI of Airflow

We can see the DAG dependencies and visualise the workflow in the Graph View of the DAG -

Graph View of the DAG in Airflow

The above image describes the workflow, i.e. the sequence in which the tasks have to be executed (Check_Data_Availability -> Extract_Process_Data -> Insert_Into_Hdfs).

Advantages -

  1. Airflow provides implicit alerting. We can define alerting at the DAG level by specifying the email id of the user who needs to be notified on retry, failure, etc. (a sketch of this is shown after this list).
  2. The Airflow UI provides real-time logs of the running jobs. Further, it provides the ability to access older logs by archiving them.
  3. Airflow is highly scalable. Tasks can be distributed across workers, which also makes the system fault tolerant and highly available.
  4. The Airflow UI provides statistical information about jobs, such as the time taken by the DAG/task over the past x days, a Gantt chart, etc.
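
As an illustration of the first point, below is a minimal sketch of DAG-level alerting configured through the default arguments; the email address is a placeholder and the actual DAG may configure this differently. Note that sending mail also requires SMTP settings in airflow.cfg.

    # Alerting settings added to the default arguments; the address is a placeholder.
    args = {
        'owner': 'airflow',
        'retries': 1,
        'email': ['data-alerts@example.com'],
        'email_on_failure': True,    # notify when a task finally fails
        'email_on_retry': True,      # notify on each retry attempt
    }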

Vipul Pandey
Walmart Global Tech Blog

SDE-3, Personalisation, @WalmartLabs, Bengaluru | IIIT Allahabad