Understanding Apache Airflow’s key concepts

Part Three of a Four-part Series

In Part I and Part II of Quizlet's Hunt for the Best Workflow Management System Around, we motivated the need for workflow management systems (WMS) in modern business practices and laid out a wish list of features and functions that led us to choose Apache Airflow as our WMS. This post aims to give the curious reader a detailed overview of Airflow's components and operation. We'll cover Airflow's key concepts by implementing the example workflow introduced in Part I of the series (see Figure 3.1).

Figure 3.1: An example data processing workflow.

Workflows are “DAGs”

Workflows in Airflow are collections of tasks that have directional dependencies. Specifically, Airflow uses directed acyclic graphs (DAGs) to represent a workflow. Each node in the graph is a task, and edges define dependencies among tasks. (The graph is enforced to be acyclic so that there are no circular dependencies that could cause infinite execution loops.)

Figure 3.2: Screenshots from the Airflow UI, representing the example workflow DAG. Top Subpanel: the Graph View of the DagRun for Jan. 25th. Dark green nodes indicate TaskInstances with "success" states; the light green node depicts a TaskInstance in the "running" state. Bottom Subpanel: the Tree View of the example_workflow DAG. The main components of Airflow (Sensors, Operators, Tasks, DagRuns, and TaskInstances) are highlighted in the screenshots. DagRuns are represented as columns in the tree view; the DagRun for Jan. 25th is outlined in cyan. Each square in the tree view represents a TaskInstance; the TaskInstance for the ("running") perform_currency_conversion task on Jan. 25th is outlined in blue.

Operators, Sensors, and Tasks

Although the DAG is used to organize tasks and set their execution context, DAGs do not perform any actual computation. Instead, tasks are the elements of Airflow that actually "do the work" we want performed. Tasks come in two flavors: those that execute some explicit operation, called Operators, and those that pause the execution of dependent tasks until some criterion has been met, called Sensors. In principle, Operators can perform any function that can be executed in Python. Similarly, Sensors can check the state of any process or data structure.
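As a concrete sketch of the distinction (the task names, connection id, and query below are hypothetical, loosely following the example workflow, and the import paths follow Airflow 1.x, current at the time of writing), a Sensor might wait for the day's orders to appear in a database before an Operator converts their currencies:

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.sensors import SqlSensor

    # A minimal DAG to attach the tasks to (DAG parameters are covered below).
    dag = DAG(dag_id='example_workflow', start_date=datetime(2017, 1, 1))

    def convert_currencies(**kwargs):
        # Placeholder for the actual currency-conversion business logic.
        pass

    # A Sensor: pauses downstream tasks until its check succeeds
    # (here, until the query returns a nonzero count).
    wait_for_orders = SqlSensor(
        task_id='wait_for_new_orders',
        conn_id='orders_db',
        sql="SELECT COUNT(*) FROM orders WHERE order_date = CURRENT_DATE",
        dag=dag)

    # An Operator: performs an explicit piece of work, in this case
    # running a Python callable.
    perform_currency_conversion = PythonOperator(
        task_id='perform_currency_conversion',
        python_callable=convert_currencies,
        dag=dag)

    # The Operator only runs once the Sensor's criterion has been met.
    wait_for_orders.set_downstream(perform_currency_conversion)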

Defining Task Dependencies

A key strength of Airflow lies in its concise and intuitive conventions for defining dependencies among tasks. The code below shows how we would define the task dependency graph for our example workflow:
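Here is a sketch of what such a definition can look like, with DummyOperator standing in for the real tasks and hypothetical task names following Figure 3.1; Airflow accepts both the explicit set_upstream/set_downstream methods and (in newer versions) the equivalent >> / << bit-shift shorthand:

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(dag_id='example_workflow', start_date=datetime(2017, 1, 1))

    # Stand-in tasks; in the real workflow these are the Operators and
    # Sensors that extract, convert, aggregate, and load the data.
    extract_orders = DummyOperator(task_id='extract_order_data', dag=dag)
    extract_customers = DummyOperator(task_id='extract_customer_data', dag=dag)
    perform_currency_conversion = DummyOperator(
        task_id='perform_currency_conversion', dag=dag)
    aggregate_orders = DummyOperator(task_id='aggregate_orders', dag=dag)
    load_to_warehouse = DummyOperator(task_id='load_to_data_warehouse', dag=dag)

    # Dependencies can be declared with explicit methods...
    extract_orders.set_downstream(perform_currency_conversion)
    perform_currency_conversion.set_downstream(aggregate_orders)

    # ...or with the equivalent bit-shift shorthand.
    extract_customers >> aggregate_orders
    aggregate_orders >> load_to_warehouse

Either form produces the same edges in the DAG; which one you use is a matter of taste.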

DagRuns and TaskInstances

Once we've defined a DAG (i.e. instantiated its tasks and defined their dependencies), we can execute those tasks based on the parameters of the DAG. A key concept in Airflow is the execution_date: the logical date and time for which a DAG's tasks are run. When the Airflow scheduler is running, it defines a regularly-spaced schedule of dates on which to execute a DAG's associated tasks. The execution dates begin at the DAG's start_date and repeat every schedule_interval. For our example, the scheduled execution dates would be ('2017-01-01 00:00:00', '2017-01-02 00:00:00', ...). For each execution date, a DagRun is created and operates under the context of that date. Thus a DagRun is simply a DAG with a particular execution date (see the Bottom Subpanel of Figure 3.2). Each task executed within a DagRun is a TaskInstance: a single run of a task, tied to that DagRun and its execution date (each square in the Tree View of Figure 3.2 is one TaskInstance).
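As a minimal sketch (the dag_id and interval shown are the ones assumed in this example), the scheduling behavior described above comes from just two DAG parameters:

    from datetime import datetime, timedelta

    from airflow.models import DAG

    # With this start_date and schedule_interval, the scheduler produces
    # execution dates 2017-01-01 00:00:00, 2017-01-02 00:00:00, ... and
    # creates one DagRun (and one TaskInstance per task) for each of them.
    dag = DAG(
        dag_id='example_workflow',
        start_date=datetime(2017, 1, 1),
        schedule_interval=timedelta(days=1))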

Airflow’s Architecture

At its core, Airflow is simply a queuing system built on top of a metadata database. The database stores the state of queued tasks, and a scheduler uses these states to prioritize how other tasks are added to the queue. This functionality is orchestrated by four primary components (refer to Figure 3.3):

  • Metadata Database: stores the state of tasks and workflows.
  • Scheduler: uses the DAG definitions, together with the task states in the metadata database, to decide which tasks need to be executed and in what priority.
  • Executor: a message-queuing process, tightly bound to the Scheduler, that determines which worker processes actually execute each scheduled task.
  • Workers: the processes that actually execute the logic of the tasks.

Figure 3.3: Airflow's General Architecture. Airflow's operation is built atop a Metadata Database, which stores the state of tasks and workflows (i.e. DAGs). The Scheduler and Executor send tasks to a queue for Worker processes to perform. The Webserver (often running on the same machine as the Scheduler) communicates with the database to render task state and task execution logs in the Web UI. The colored boxes indicate that each component can exist in isolation from the others, depending on the deployment configuration.
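How these components find one another is controlled largely by Airflow's configuration file. A minimal sketch of the relevant airflow.cfg entries (the connection string and executor choice here are purely illustrative, not a recommendation):

    [core]
    # Where the Metadata Database lives; the Scheduler, Workers, and
    # Webserver all read and write task state through this connection.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

    # Which Executor hands tasks to Workers, e.g. SequentialExecutor,
    # LocalExecutor, or CeleryExecutor for distributed deployments.
    executor = LocalExecutor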

Scheduler Operation

At first, the operation of Airflow’s scheduler can seem more like black magic than a logical computer program. That said, understanding the workings of the scheduler can save you a ton of time if you ever find yourself debugging its execution. To save the reader from having to dig through Airflow’s source code (though we DO highly recommend it!), we outline the basic operation of the scheduler in pseudo-code:
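(This is a paraphrase of the scheduler's behavior, not its actual source; the real scheduler has considerably more moving parts.)

    Step 0. Load the available DAG definitions from disk (fill the DagBag).

    While the scheduler is running:
        Step 1. Use the DAG definitions to identify and, if necessary,
                initialize any DagRuns in the metadata database.
        Step 2. Check the states of the TaskInstances associated with active
                DagRuns, resolve any dependencies among them, identify the
                TaskInstances that need to be executed, and add them to the
                worker queue, setting their state to "queued" in the database.
        Step 3. Each available worker pulls a TaskInstance from the queue and
                executes it, updating the task's state from "queued" to
                "running" in the database.
        Step 4. When a TaskInstance finishes, the worker records its final
                state (e.g. "success" or "failed") in the database.
        Step 5. Update the states of all active DagRuns ("running", "failed",
                "success") according to the states of their TaskInstances.
        Step 6. Repeat Steps 1-5.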

Web UI

In addition to the primary scheduling and execution components, Airflow also includes components that support a full-featured Web UI (refer to Figure 3.2 for some UI examples), including:

  • Webserver: serves the Web UI, communicating with the metadata database to render the state of tasks and workflows.
  • Web UI: the browser-based interface for inspecting DAGs, DagRuns, and TaskInstances.
  • Execution Logs: written by the worker processes and rendered in the Web UI, which makes them invaluable for debugging individual TaskInstances.

Command Line Interface

In addition to the Scheduler and Web UI, Airflow offers robust functionality through a command line interface (CLI). In particular, we found the following commands helpful when developing our workflows:

  • airflow backfill DAG_ID -s START_DATE -e END_DATE (optionally restricted to particular tasks with -t TASK_REGEX): performs backfills of historical data between START_DATE and END_DATE without the need to run the scheduler. This is great when you need to change some business logic of an existing workflow and update historical data. (Note that backfills do not create DagRun entries in the database, as they are not run by the SchedulerJob class.)
  • airflow clear DAG_ID: removes the TaskInstance records in the metadata database for the given DAG_ID. This can be useful when you're iterating on the functionality of a workflow/DAG.
  • airflow resetdb: though you generally do not want to run this command often, it can be very helpful if you ever need a "clean slate," a situation that may arise when setting up Airflow initially. (Note: this command only affects the database and does not remove logs.)

