Introduction to Apache Airflow

(A Gentle Introduction to a Workflow Management Platform)

Mukesh Kumar
Accredian
7 min read · Aug 3, 2022


Written in collaboration with Hiren Rupchandani

Logo of Apache Airflow (Source: Wikipedia)

Preface

In the previous story, I gave an overview of data and data pipelines, their necessity, the dynamic components, and the tools required to build data pipelines. In this story, I will walk you through a popular workflow management framework called Apache Airflow. Let’s dive in…

About Apache Airflow

A formal definition of Airflow from Airbnb is as follows:

Airflow is a platform to programmatically author, schedule, and monitor workflows.

Apache Airflow is an open-source workflow management platform, originally built at Airbnb, that manages directed acyclic graphs (DAGs) and their associated tasks. Airbnb released the project as open source in October 2014, and it has since gained popularity in the open-source community, especially among people working on data-oriented projects.

Workflows are written in Python: we define DAGs as Python code containing an interrelated set of tasks.
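
To make this concrete, here is a minimal sketch of what a DAG definition looks like in Python (Airflow 2.x-style imports). The DAG id, schedule, and task below are illustrative placeholders, not part of any real project.

# A minimal, illustrative DAG with a single task (Airflow 2.x imports).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_airflow",          # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",      # create one run per day
    catchup=False,                   # do not backfill past runs
) as dag:
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello, Airflow!'",
    )

Dropping a file like this into Airflow’s dags/ folder is enough for the scheduler to pick it up.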

Airflow Principles

Airflow provides a rich interface for creating and maintaining data-oriented project workflows. The following principles set it apart from its competitors:

  • Dynamic: Pipelines are defined in Python code, so they can be generated and instantiated dynamically.
  • Scalable: Its modular architecture makes it possible to scale the number of workers up or down as requirements change.
  • Extensible: We can define custom operators and integrate with third-party tools such as StatsD, MySQL, etc.
  • Elegant: Operator arguments can be parametrized with Jinja (a web template engine for Python), as sketched below. It also features an easy-to-learn user interface.
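
As a small sketch of the “Elegant” point, operator fields can carry Jinja templates that Airflow renders at run time; {{ ds }} expands to the logical date of the run. The DAG and task names here are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="templating_demo",            # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "{{ ds }}" is a Jinja template rendered to the logical date (YYYY-MM-DD)
    # of the current run when the task executes.
    print_date = BashOperator(
        task_id="print_logical_date",
        bash_command="echo 'Processing data for {{ ds }}'",
    )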

History of Apache Airflow

Logo of Airbnb (Source: Wikimedia)

Airbnb developed Airflow to manage its large and complex network of computational jobs. The company open-sourced the project in October 2014; it entered Apache’s Incubator program in March 2016 and became a Top-Level Project of the Apache Software Foundation in January 2019. Today it is used by more than 400 companies, including Airbnb, Robinhood, and Twitter, in their data architectures.

Number of contributions to Airflow’s GitHub from October 2014 to July 2021

Airflow Architecture

Airflow is a workflow scheduler and management program mainly used for developing and maintaining data pipelines. Jobs are expressed as Directed Acyclic Graphs (DAGs) containing a set of interrelated tasks. Before diving into the architecture, let’s look at a high-level overview of some basic terms in Airflow:

Basic Concepts

Airflow has some basic terms that I will explain throughout the series while building and monitoring data pipelines. These terms are as follows:

Tasks

Processes/Tasks represented as individual modules

A Task is the basic unit of execution. It can be anything from reading data from a database, to processing it, to storing it back in a database. There are three basic types of Tasks in Airflow:

  • Operators: They are pre-defined templates used to build most of the Tasks.
  • Sensors: They are a unique subclass of Operators and have only one job — to wait for an external event to take place so they can allow their downstream tasks to run.
  • TaskFlow: A newer addition to Airflow that lets plain Python functions, decorated with @task, act as tasks and makes it easy to pass data between them. A short sketch of all three types appears below.
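
For illustration, the sketch below puts all three kinds of tasks in one DAG: a BashOperator, a FileSensor waiting for a (placeholder) file, and a TaskFlow function decorated with @task. All names, paths, and intervals are assumptions made up for the example.

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="task_types_demo",              # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,                # only run when triggered manually
    catchup=False,
) as dag:
    # 1. Sensor: waits for an external event, here the arrival of a file.
    wait_for_file = FileSensor(
        task_id="wait_for_input_file",
        filepath="/tmp/input.csv",         # placeholder path
        poke_interval=30,                  # check every 30 seconds
    )

    # 2. Operator: a pre-defined template, here a simple shell command.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )

    # 3. TaskFlow: a plain Python function turned into a task with @task;
    #    its return value can be passed to downstream tasks.
    @task
    def transform():
        return {"rows_processed": 100}

    wait_for_file >> extract >> transform()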

Directed Acyclic Graphs

Tasks related via dependencies, forming a DAG

In basic terms, a DAG is a graph with nodes connected via directed edges and has no cyclic edges between the nodes. In Airflow, the Tasks are the nodes, and the directed edges represent the dependencies between Tasks.

Control Flow

The directed edges that join two or more tasks define the control flow, i.e., the order in which the work inside a DAG is carried out.
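
As a sketch, control flow between tasks is usually declared with the bitshift operators >> and <<, or the equivalent set_downstream/set_upstream methods. The DAG and task names below are placeholders (EmptyOperator is called DummyOperator in Airflow versions before 2.3).

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator   # DummyOperator before Airflow 2.3

with DAG(
    dag_id="control_flow_demo",            # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    task_a = EmptyOperator(task_id="task_a")
    task_b = EmptyOperator(task_id="task_b")
    task_c = EmptyOperator(task_id="task_c")

    # task_a runs first; task_b and task_c depend on it and may run in parallel.
    task_a >> [task_b, task_c]

    # The same edges can also be declared explicitly:
    # task_a.set_downstream(task_b)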

Task instance

A task instance is a single execution of a task. It also records the state of the task, such as “running”, “success”, “failed”, “skipped”, or “up for retry”. The color codes for the various task states are as follows:

States of Tasks indicated by color codes in Airflow UI

DAGrun

When a DAG is triggered in Airflow, a DAGrun object is created. DAGrun is the instance of an executing DAG. It contains a timestamp at which the DAG was instantiated and the state (running, success, failed) of the DAG. DAGruns can be created by an external trigger or at scheduled intervals by the scheduler.
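
As a hedged sketch, the DAGruns recorded for a DAG can be inspected programmatically through the metadata database. This assumes an initialized Airflow environment and reuses the hypothetical hello_airflow DAG id from the earlier sketch.

# Listing the DAGruns of a DAG, with their timestamp and state.
from airflow.models import DagRun

for run in DagRun.find(dag_id="hello_airflow"):   # hypothetical DAG id
    # Each DAGrun records its run id, when it was instantiated, and its state.
    print(run.run_id, run.execution_date, run.state)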

Example

For example, assume we have the following four tasks, as shown in the image:

  • Reading Data: To read data from the source.
  • Processing Categorical Data: To process the categorical data.
  • Processing Continuous Data: To process the continuous data.
  • Merging Data: To combine the processed categorical and continuous data.

Reading Data is an upstream task for Processing Categorical Data and Processing Continuous Data, while Merging Data is a downstream task of both processing tasks.
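
A hedged sketch of how these four tasks might be wired up with the TaskFlow API is shown below; the actual reading and processing logic is replaced by small placeholders, and the DAG id is made up for the example.

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="process_and_merge",            # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,                # run only when triggered
    catchup=False,
) as dag:

    @task
    def reading_data():
        # Placeholder for reading from the actual source.
        return {"categorical": ["a", "b"], "continuous": [1.0, 2.0]}

    @task
    def processing_categorical_data(data):
        return [value.upper() for value in data["categorical"]]

    @task
    def processing_continuous_data(data):
        return [value * 10 for value in data["continuous"]]

    @task
    def merging_data(categorical, continuous):
        return {"categorical": categorical, "continuous": continuous}

    raw = reading_data()
    merging_data(
        processing_categorical_data(raw),
        processing_continuous_data(raw),
    )

Passing the return values between the decorated functions is what creates the upstream/downstream edges described above.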

Example: Execution of a DAG

The execution steps are as follows:

  • When a DAG is triggered, a DAGrun instance is created. It sets the state of the DAG to “running” and also tracks the states of the individual tasks. The task states are set to “queued”, and the tasks are scheduled according to their dependencies.
  • The first task (Reading Data) is executed. Its state is set to “running” while the other tasks remain “queued”. After a successful execution its state becomes “success”, otherwise “failed”. You can refer to panel (a) in the figure above.
  • According to the dependencies set, the processing tasks (Processing Categorical Data and Processing Continuous Data) are scheduled to execute after the Reading Data task.
  • Their state is set to “running” and their execution begins, either serially or in parallel depending on the executor. You can refer to panel (b) in the figure above.
  • After their successful execution, their status changes to “success” and the next task’s (Merging Data) state is set to “running”. You can refer to panel (c) in the figure above.
  • Because of the dependencies set, the final task of merging the data is executed only after both processing tasks have finished successfully.
  • After the final task has been executed successfully, its state is set to “success”. For reference, see panel (d) in the figure above.
  • Since the DAG has been executed successfully, its state is set to “success”. A small sketch of inspecting these states programmatically follows below.
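
The state transitions above can also be observed in code. The sketch below assumes the hypothetical process_and_merge DAG from the previous example has run at least once, and lists each task instance and its state for every DAGrun.

from airflow.models import DagRun

for run in DagRun.find(dag_id="process_and_merge"):   # hypothetical DAG id
    print("DAGrun:", run.run_id, run.state)
    for ti in run.get_task_instances():
        # Task states progress through "queued" -> "running" -> "success"/"failed".
        print("   ", ti.task_id, ti.state)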

Airflow’s Components

Now that the basic terminologies are clear with an example, let’s see the primary components that form the architecture of Airflow:

A generalized Airflow Architecture with essential components

Webserver: A simple user interface to inspect, trigger, and debug DAGs with the help of logs. It displays task states and lets users interact with the metadata database.

Executor: The mechanism that handles the running of tasks. Airflow ships with several executors, primarily the Sequential Executor, Local Executor, and Debug Executor. There are also remote executors for distributed workloads, such as the Celery Executor, Dask Executor, Kubernetes Executor, and CeleryKubernetes Executor.

Workers: The processes that actually carry out the tasks. The executor assigns tasks waiting in the queue to the workers.

Scheduler: A scheduler has two tasks:

  • To trigger the scheduled DAGs.
  • To submit the tasks to the executor to run.

It is a multi-threaded Python process that uses the DAG information to schedule the tasks. It stores the information of each DAG in the metadata database.

Metadata Database: It stores the state of the DAGs, tasks, and other components, and is read and written by the webserver, scheduler, and executor. Airflow connects to it through SQLAlchemy, and commonly used databases include SQLite, MySQL, and PostgreSQL.
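
As a small, hedged sketch (assuming Airflow 2.3 or later), the executor and the metadata database connection in use can be read from Airflow’s configuration:

from airflow.configuration import conf

# The executor class configured in airflow.cfg, e.g. "SequentialExecutor",
# "LocalExecutor", or "CeleryExecutor".
print("Executor:", conf.get("core", "executor"))

# The metadata database connection string. It lives under [database] in
# Airflow 2.3+ (under [core] in older versions) and can also be set through
# an environment variable such as AIRFLOW__DATABASE__SQL_ALCHEMY_CONN.
print("Metadata DB:", conf.get("database", "sql_alchemy_conn"))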

Working of the Airflow Components

Working of the Components in the Airflow architecture
  • The scheduler continuously scans the DAG directory and makes an entry for each DAG in the metadata database.
  • The DAG files are parsed and DAGruns are created. The scheduler also creates instances of the tasks that require execution.
  • All these tasks are marked “scheduled” in the database.
  • The scheduler then sends these tasks to the executor’s queue, and they are marked as “queued”.
  • The executor fetches the tasks from the queue and assigns them to the workers.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science, as it covers your foundations plus machine learning algorithms (basic to advanced).

And that’s it. I hope you enjoyed this introduction to Apache Airflow and learned something valuable. If you have anything to share with me, let me know in the comments. I would love to hear your thoughts.

Follow me for more upcoming articles on Python, R, Data Science, Machine Learning, and Artificial Intelligence.

If you find this read helpful, then hit the Clap👏. Your encouragement will catalyze inspiration to keep me going and develop more valuable content.

What’s Next?
