What is Airflow? Read this overview before using it for the first time

Gaël Charles
Published in Axionable
10 min read · Jul 1, 2022

In this article, I will introduce Airflow, a platform for managing, automating, and monitoring workflows. You will learn everything you need to know before using the tool for the first time: this article acts as a near-complete guide to the tool's main functionalities.

Most of the information given here comes from the official documentation, which I can only encourage you to check out if you want to go further.

Some short background

Back in 2014, Airbnb was growing fast and needed to handle more and more data workflows. Synchronizing every task they had was becoming complex, so they wrote an open-source tool in Python to automate and monitor their workflows.

The project grew quickly over the following two years thanks to outside contributions. It joined the Apache Incubator in 2016 and is today one of the most important projects handled by the Foundation.

Basic concepts

The DAG — Airflow core concept

A DAG (Directed Acyclic Graph) is a set of interdependent tasks that get executed in a precise order. In graph theory terminology, directed means that each link has a direction, and acyclic means that the graph contains no closed loops. The tasks are the vertices (nodes) of the graph, and the dependencies are the edges (links).

Here is a simple example of a DAG (or workflow):

A DAG with four tasks. Each task is a square and they are linked with arrows.

The great thing about Airflow (and basically its main feature) is workflow-as-code: all DAGs are written as Python code in separate files. The simple DAG above can be written in about ten lines of code:

import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id='example_1',
    schedule_interval="0 8 * * *",
    start_date=datetime.datetime(2022, 1, 1)
) as dag:
    task1 = DummyOperator(task_id='task1')
    task2 = DummyOperator(task_id='task2')
    task3 = DummyOperator(task_id='task3')
    task4 = DummyOperator(task_id='task4',
                          trigger_rule="all_success")

    task1 >> [task2, task3] >> task4

Here, and also in the graph, task1 will be executed first; when it's done, task2 and task3 will run concurrently; when both are done, task4 will be executed last.

A DAG can be scheduled: you can use crontab syntax to set up an execution frequency. Once triggered (either by the scheduler or manually), a DAG creates a DAG run, which is an instance of the DAG at a precise point in time.
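For example, assuming the DAG above has been loaded by Airflow, you could trigger it manually from the command line (the dag_id is the one declared in the file):

airflow dags trigger example_1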

Tasks

The second main Airflow component is the task. Tasks are the interdependent building blocks of DAGs and can be of many different types. A task instance is a task executed within a given DAG run, i.e. for a given execution date. There are two main types:

Operators are the most common type. These tasks execute a predefined action; for example:

  • BashOperator: will run a bash command.
  • PythonOperator: will run Python code.
  • DockerOperator: will run a command in a Docker container.
  • MySqlOperator / PostgresOperator, etc.: will run an SQL query on a database server.
  • And many more.

Sensors are the second main type. When run, these tasks wait for a specific event to happen, then trigger their downstream workflow from that point; a short example combining operators and sensors follows the list below. They can be:

  • SFTPSensor: will wait for a specific file / folder to appear on an SFTP server.
  • SqlSensor: will repeatedly run an SQL query until a criterion is met.
  • HttpSensor: will repeatedly run an HTTP request until a criterion is met.
  • And many more.
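To make this more concrete, here is a minimal sketch combining both types (the DAG, the file path, and the callable are made up for the illustration; the FileSensor relies on Airflow's default filesystem connection):

import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def _transform():
    print("transforming the data...")


with DAG(
    dag_id='example_2',
    schedule_interval="@daily",
    start_date=datetime.datetime(2022, 1, 1)
) as dag:
    # Sensor: poke every 60 seconds until the file shows up
    wait_for_file = FileSensor(task_id='wait_for_file',
                               filepath='/tmp/input.csv',
                               poke_interval=60)

    # Operators: run a bash command, then some Python code
    extract = BashOperator(task_id='extract',
                           bash_command="echo 'extracting...'")
    transform = PythonOperator(task_id='transform',
                               python_callable=_transform)

    wait_for_file >> extract >> transform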

If you look back at the last line of the DAG examples above, you'll notice the use of Python's bitshift operators (>> and <<). You can use them with Airflow to define task dependencies, and your Python code will gain a lot of readability. A >> B simply means that task A will be executed before task B: it's as simple as that!
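For reference, the bitshift syntax is just a shorthand for the explicit dependency methods (task_a and task_b below stand for any two tasks):

# These three lines are equivalent ways of saying "A runs before B"
task_a >> task_b
task_b << task_a
task_a.set_downstream(task_b)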

Airflow’s functional architecture

Diagram of Airflow architecture. It is later described in the article.

Before deep-diving into Airflow's main functionalities, it's worth getting a glimpse of Airflow's functional architecture. The two components that the regular user interacts with are the DAG folder and the web server (used to visualize DAGs and tasks, monitor and debug them, etc.). However, the other components run actions in the background that are worth knowing about.

The scheduler parses the Python DAG files and builds the dependency graphs, which are then stored. It detects tasks that are ready to run and submits them to the workers through a queuing system. The parsing is done automatically, and DAG files can be edited while Airflow is running to add a feature, fix a bug, and so on.

The workers are the components doing the processing (running the tasks). The basic Airflow installation has only one worker, the local executor, which runs on the same machine as the scheduler. More complex installations define many independent workers that parallelize the workload across different nodes. Workers retrieve ready tasks and update their statuses.

All of these components interact with a metadata database, which holds logs, DAGs, task instances, and their states.
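As an illustration, the executor and the metadata database are chosen in the airflow.cfg configuration file. Here is a minimal sketch (section and key names are those of recent Airflow 2.x releases, and the connection string is a placeholder):

[core]
# LocalExecutor runs tasks on the scheduler's machine;
# CeleryExecutor or KubernetesExecutor distribute them across dedicated workers
executor = LocalExecutor
dags_folder = /opt/airflow/dags

[database]
# The metadata database shared by all components
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow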

Airflow functionalities

We’ll now dive into Airflow’s main functionalities. We’ll focus on eight main features that you should know about.

Resilience

Airflow is resilient thanks to callbacks, which let users define a strategy in case their workflow fails. This configuration can be applied at the DAG level or at the task level.

When a task fails, Airflow can be configured to retry it after a specific delay, up to a predefined number of times. A specific action can be executed if and only if the task gets retried. When all the retries have failed, we can configure a different action (such as sending an e-mail to notify the user). Callbacks can also be configured to trigger when a task succeeds.

A task fails when an error occurs during its execution. This is helpful for defining a custom failing scenario in a PythonOperator, for example, by raising a Python exception in the cases you want to treat as failures.
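Here is a minimal sketch of how this looks in practice (the callback function and the retry values are made up for the illustration):

import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _notify(context):
    # Called with the task instance's context once all retries have failed
    print(f"Task {context['task_instance'].task_id} failed!")


def _process():
    # Raising any exception marks the task as failed
    raise ValueError("custom failing scenario")


with DAG(
    dag_id='example_resilience',
    schedule_interval=None,
    start_date=datetime.datetime(2022, 1, 1)
) as dag:
    process = PythonOperator(
        task_id='process',
        python_callable=_process,
        retries=3,                                  # retry up to 3 times...
        retry_delay=datetime.timedelta(minutes=5),  # ...waiting 5 minutes between attempts
        on_failure_callback=_notify,                # runs once the task has definitively failed
    )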

Backfill

In the event of an error or a new feature, you can rerun a DAG (or part of it) over a specific period of time.

Let’s take an example:

A diagram shows a buggy task circled in red that blocks the downstream tasks, circled in orange. The upstream tasks are circled in green.

We have a DAG with six tasks. For the past three days, the DAG has been stopping at “buggy_task” because it fails, and the downstream tasks have not been executed since.

The developer will then fix the bug and rerun the buggy task and its downstream dependencies over the given time range. This can be done in a single action with the following command in your terminal:

airflow tasks clear <dag_id> -t <task_name> \
    -s <start_date> -e <end_date> \
    -d

Note that, this way, only the buggy task and its downstream tasks are cleared: the upstream tasks will not be executed again, whereas clearing the whole DAG would rerun them too.

The same diagram, but the buggy task and its downstream tasks are no longer circled. The upstream tasks remain green.
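Relatedly, if you need to re-run a whole DAG over a past period (after adding a new feature, for instance), Airflow has a dedicated backfill command. A sketch of its basic form:

airflow dags backfill <dag_id> \
    -s <start_date> -e <end_date>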

Interfaces

How can the regular user interact with Airflow? Well, with interfaces. Airflow has three main interfaces.

The Command Line Interface (CLI) is the “rawest” way to interact with Airflow. It supports precise operations and reproducible commands. We won't go into more detail, because the goal of this article is to give an overview of Airflow and how it works, not to make you an Airflow expert; lots of guides can be found on the net for that.

The CLI works on any of Airflow's nodes.

The web User Interface (UI) is the most common way to interact with Airflow. It's a hosted website that lets you view DAGs, tasks, and their instances, trigger them, view logs, access environment variables, etc.

A screenshot of the web UI, showing the list of DAGs with task and execution details.

The third and least common way is the REST API. Airflow lets the user call specific endpoints to perform Airflow actions programmatically, such as listing or updating DAGs.
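As a small sketch, here is how you could list DAGs through the stable REST API of Airflow 2.x (the URL and credentials are placeholders, and the authentication method depends on how your instance is configured; basic auth is assumed here):

import requests

# List the DAGs known by the Airflow instance (stable REST API, Airflow 2.x)
response = requests.get(
    "http://localhost:8080/api/v1/dags",
    auth=("admin", "admin"),
)
print(response.json())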

Monitoring

With Airflow, you can monitor every component in a very customizable way. Most of the monitoring can be done through the UI, but you can also go one step further: Airflow can be configured to connect to third-party tools (such as StatsD or Sentry) to send metrics, user data, or error data.

Airflow's health is also available through an API endpoint or via the CLI, which can be useful at times.
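For instance, sending metrics to StatsD boils down to a few lines in airflow.cfg (a sketch assuming a recent Airflow 2.x release, where these keys live in the [metrics] section; the host and port are placeholders, and the statsd extra must be installed):

[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow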

Macros

Airflow gains even more flexibility with Jinja templating. It lets us use injectable macros in the code of the DAG files (more precisely, in the templated fields of operators); they are rendered when the task runs. Let's look at some examples:

  • {{ ds }}: the DAG run's execution date (YYYY-MM-DD)
  • {{ dag }}: the DAG name
  • {{ task }}: the task name
  • {{ var.value.my_var }}: a custom user-defined global variable

I'm sure you've already guessed how it works: when rendered, a macro gets literally replaced by its value. A common case is to adapt a task's behavior based on the DAG's execution date, for example.
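As a small sketch (the DAG and the command are made up for the illustration), bash_command is a templated field, so the macros inside it are rendered at runtime:

import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id='example_macros',
    schedule_interval="@daily",
    start_date=datetime.datetime(2022, 1, 1)
) as dag:
    # Prints something like "Run of example_macros.print_context on 2022-01-01"
    print_context = BashOperator(
        task_id='print_context',
        bash_command="echo 'Run of {{ dag.dag_id }}.{{ task.task_id }} on {{ ds }}'",
    )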

XComs

XComs (“cross-communications”) allow tasks to communicate by exchanging small amounts of data, which is needed because tasks are completely isolated from each other. An XCom can be seen as a DAG-level local variable, stored in Airflow's metadata database.

Let's take an example. Task A needs to do some computation. When done, it writes an XCom value linked to its task ID. Later in the workflow, task B has to retrieve this value and perform a specific operation if it is greater than 100. It uses task A's ID to retrieve the XCom value and adjusts its behavior from there.

Important: XComs are made to exchange small amounts of data, like numbers or strings. Do not use them for dataframes, entire file contents, etc., or you'll end up with memory errors. However, real-world Airflow use cases often need to manipulate large amounts of data, so what's the trick? The good practice is to use an external database that your tasks interact with through CRUD operations. Depending on your use case, you can also use a data processing framework such as Spark.
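The example above could look like this minimal sketch (the task names and the threshold are made up; the return value of a PythonOperator callable is automatically pushed as an XCom):

import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _compute():
    # The returned value is automatically pushed as an XCom for this task
    return 42


def _check(**context):
    # Pull the XCom pushed by task_a, using its task ID
    value = context['ti'].xcom_pull(task_ids='task_a')
    if value > 100:
        print("Value is greater than 100: running the specific operation")
    else:
        print(f"Value is only {value}")


with DAG(
    dag_id='example_xcom',
    schedule_interval=None,
    start_date=datetime.datetime(2022, 1, 1)
) as dag:
    task_a = PythonOperator(task_id='task_a', python_callable=_compute)
    task_b = PythonOperator(task_id='task_b', python_callable=_check)

    task_a >> task_b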

Libraries and plugins

Airflow's functionality can be extended even further with new operators, sensors, hooks, or connections. Since it's an open-source solution, lots of external plugins and libraries are made by the community. You can find, for instance, a collection of open-source plugins in this repository: https://github.com/orgs/airflow-plugins/repositories

Official libraries offer further integrations with external tools such as cloud providers (AWS, Azure, GCP) or other Apache tools (Spark, HDFS, Hive), for instance.

The reason anyone can create an external plugin is that it's very easy to develop your own set of operators using class inheritance. The following example should speak for itself: we inherit Airflow's BaseOperator and override the execute() method to define what the operator should do.

from airflow.models.baseoperator import BaseOperator


class HelloOperator(BaseOperator):
    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        message = f"Hello {self.name}"
        print(message)
        return message
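Inside a DAG, this custom operator is then used like any built-in one:

hello = HelloOperator(task_id='say_hello', name='Airflow')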

Deployment possibilities

If you plan on using Airflow, you'll have to ask yourself how you need to deploy it. We'll present three different approaches that suit different needs:

The “ready-to-go” deployment

This option runs the official docker-compose file on your own machine. You should consider it only if you want to test Airflow quickly and experiment with some development work.

  • Setup: only two commands
  • Performance: restricted by the local machine’s configuration
  • Resilience: none
  • Cost: minimal

Please note that you can also install Airflow as a standalone instance on your computer, without needing docker-compose.
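As a sketch, the docker-compose route boils down to something like this (the exact URL of the compose file depends on the Airflow version you target):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.2/docker-compose.yaml'
docker compose up airflow-init   # initialize the metadata database and the admin account
docker compose up                # start the web server, the scheduler, and the workers

The standalone alternative mentioned above is even shorter: install the apache-airflow package with pip, then run airflow standalone.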

The custom installation

This option deploys Airflow on a Kubernetes cluster, while remaining very portable to the cloud.

  • Setup: you’ll need some time to configure everything
  • Performance: scalable and parallelizable
  • Resilience: high by duplicating each component’s instance
  • Cost: depends on how many nodes you need; maintenance also comes into the equation.

The managed option

This option suits business cases, as Airflow is deployed on a cloud-managed cluster. The integration is really easy.

  • Setup: a few clicks
  • Performance: automatic scaling
  • Resilience: high
  • Cost: no maintenance, but a high price (depends on the cloud provider, your needs, etc.)

Limits

Airflow pays off when your pipeline count gets higher. However, it's not suitable in every situation. Let's take a look at its limitations.

  • Static pipelines: every DAG must have a static schedule (e.g. a cron expression) and may only be executed once per scheduled interval. You also can't repeat part of your DAG an unknown number of times: everything must be defined up front.
  • Limited data transfers: Airflow is not made to handle dataflow. You can exchange metadata with XComs, but only in very small amounts. If bigger data needs to be handled (the most common case), you'll have to use external data stores.
  • No universal configuration: configuring an Airflow cluster involves lots of parameters to take into account, and it's not the easiest task.

Other competing tools exist and try to address these flaws.

Last words

In summary, Airflow is a great orchestration tool that can suit a lot of typical workflow use cases, even with these limitations. When you have one or more workflows with complex dependencies, you'll probably save time by using Airflow.

Have fun orchestrating!

Readers: feel free to comment on this story! Tell me if you learned something, if you'd like another general article on Airflow, or one on a specific Airflow topic.

Warm thanks to:
Antoine Marchais for the contributions, and all my reviewers at Axionable.
