Getting started with Apache Airflow

Rauf Onur Cullu
Published in KoçDigital · 6 min read · Apr 7, 2022

(Image credit: Apache Airflow official site)

In this post, I am going to discuss Apache Airflow, a workflow management system originally developed by Airbnb and now maintained as an Apache project.

What is Airflow?

The official definition of Airflow on its Apache homepage is:

Airflow is a platform to programmatically author, schedule, and monitor workflows.

Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

What is a DAG?

In mathematics and computer science, a directed acyclic graph (DAG) is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again. Equivalently, a DAG is a directed graph that has a topological ordering: a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

Airflow DAG (Credit: Apache Airflow)

At a high level, a DAG can be thought of as a container that holds tasks and their dependencies and sets the context for when and how those tasks should be executed. Each DAG has a set of properties, the most important of which are its dag_id, a unique identifier amongst all DAGs, its start_date, the point in time at which the DAG’s tasks are to begin executing, and the schedule_interval, or how often the tasks are to be executed. In addition to the dag_id, start_date, and schedule_interval, each DAG can be initialized with a set of default_arguments. These default arguments are inherited by all tasks in the DAG.
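
To make these properties concrete, here is a minimal sketch of how such a DAG is typically declared (not code from this article; the dag_id, dates, and argument values are hypothetical placeholders):

# A minimal, illustrative DAG declaration; every name and value is a placeholder.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 1,                        # inherited by every task in the DAG
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_workflow",           # unique identifier amongst all DAGs
    start_date=datetime(2022, 1, 1),     # when the DAG's tasks begin executing
    schedule_interval="@daily",          # how often the tasks are executed
    default_args=default_args,           # defaults inherited by all tasks
)

Any task attached to this dag object picks up the default_args unless it overrides them explicitly.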

Figure 1.1: Screenshot from the Airflow UI, representing the example workflow DAG (top subpanel).

In Airflow, all workflows are DAGs. A DAG consists of operators; an operator defines an individual task that needs to be performed. There are different types of operators available (as listed on the Airflow website), and a short usage sketch follows the list:

  • BashOperator - executes a bash command
  • PythonOperator - calls an arbitrary Python function
  • EmailOperator - sends an email
  • SimpleHttpOperator - sends an HTTP request
  • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - executes a SQL command
  • Sensor - waits for a certain time, file, database row, S3 key, etc…
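
As a rough usage sketch (assuming Airflow 2.x import paths), tasks are created by instantiating operators inside a DAG and wiring them together; the task ids, bash command, and callable below are illustrative placeholders:

# A hedged sketch: two tasks built from the operators listed above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _transform():
    print("transforming data")


with DAG(dag_id="example_workflow", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily") as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data'",
    )

    transform = PythonOperator(
        task_id="transform",
        python_callable=_transform,
    )

    extract >> transform   # transform runs only after extract succeeds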

You can also write a custom operator to fit your needs, as in the sketch below.
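
A custom operator is, roughly, a subclass of BaseOperator with an execute() method; the class below is purely illustrative and not part of Airflow itself:

# A minimal, hypothetical custom operator (Airflow 2.x import path).
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    def __init__(self, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # execute() holds the task's actual logic and runs on a worker.
        self.log.info("Hello, %s!", self.name)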

Airflow’s Architecture

At its core, Airflow is simply a queuing system built on top of a metadata database. The database stores the state of queued tasks, and a scheduler uses these states to prioritize how other tasks are added to the queue. This functionality is orchestrated by four primary components (refer to Figure 1.2):

1. Metadata Database: This database stores information regarding the state of tasks. Database updates are performed using an abstraction layer implemented in SQLAlchemy. This abstraction layer cleanly separates the function of the remaining components of Airflow from the database.
2. Scheduler: The Scheduler is a process that uses DAG definitions in conjunction with the state of tasks in the metadata database to decide which tasks need to be executed, as well as their execution priority. The Scheduler is generally run as a service.
3. Executor: The Executor is a message queuing process that is tightly bound to the Scheduler and determines the worker processes that actually execute each scheduled task. There are different types of Executors, each of which uses a specific class of worker processes to execute tasks. For example, the LocalExecutor executes tasks with parallel processes that run on the same machine as the Scheduler process. Other Executors, like the CeleryExecutor, execute tasks using worker processes that exist on a separate cluster of worker machines. (A short sketch for checking which Executor is configured follows the figure caption below.)
4. Workers: These are the processes that actually execute the logic of tasks, and are determined by the Executor being used.

Figure 1.2: Airflow's General Architecture. Airflow's operation is built atop a Metadata Database which stores the state of tasks and workflows (i.e. DAGs). The Scheduler and Executor send tasks to a queue for Worker processes to perform. The Webserver (often running on the same machine as the Scheduler) communicates with the database to render task state and task execution logs in the Web UI. Each colored box indicates that each component can exist in isolation from the others, depending on the deployment configuration.
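
Which Executor is used is controlled by the executor option in the [core] section of airflow.cfg. As a small sketch (not from this article), it can be read back programmatically:

# Read the configured executor class; the print is purely illustrative.
from airflow.configuration import conf

print(conf.get("core", "executor"))   # e.g. SequentialExecutor, LocalExecutor, CeleryExecutor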

Scheduler Operation

At first, the operation of Airflow’s scheduler can seem more like black magic than a logical computer program. That said, understanding the workings of the scheduler can save you a ton of time if you ever find yourself debugging its execution. To save the reader from having to dig through Airflow’s source code, we outline the basic operation of the scheduler in pseudo-code:

While the scheduler is running:

Step 1. The scheduler uses the DAG definitions to
identify and/or initialize any DagRuns in the
metadata DB.

Step 2. The scheduler checks the states of the
TaskInstances associated with active DagRuns,
resolves any dependencies amongst TaskInstances,
identifies TaskInstances that need to be executed,
and adds them to a worker queue, updating the status
of newly-queued TaskInstances to "queued" in the
database.

Step 3. Each available worker pulls a TaskInstance from
the queue and starts executing it, updating the
database record for the TaskInstance from "queued"
to "running".

Step 4. Once a TaskInstance is finished running, the
associated worker reports back to the queue
and updates the status for the TaskInstance
in the database (e.g. "finished", "failed",
etc.)

Step 5. The scheduler updates the states of all active
DagRuns ("running", "failed", "finished") according
to the states of all completed associated
TaskInstances.

Step 6. Repeat Steps 1-5
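
For reference, the state strings that the scheduler and workers write to the metadata database are defined as constants in airflow.utils.state (note that Airflow's terminal "good" state is called "success" rather than "finished"):

# A reference sketch of the TaskInstance states paraphrased in the pseudo-code above.
from airflow.utils.state import State

State.QUEUED    # "queued"  - handed to the executor's queue
State.RUNNING   # "running" - a worker has picked the task up
State.SUCCESS   # "success" - finished without error
State.FAILED    # "failed"  - raised an error / exhausted its retries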

Web UI

In addition to the primary scheduling and execution components, Airflow also includes components that support a full-featured Web UI (refer to Figure 1.1 for some UI examples), including:

1. Webserver: This process runs a simple Flask application which reads the state of all tasks from the metadata database and renders these states for the Web UI.
2. Web UI: This component allows a client-side user to view and edit the state of tasks in the metadata database. Because of the coupling between the Scheduler and the database, the Web UI allows users to manipulate the behavior of the scheduler.
3. Execution Logs: These logs are written by the worker processes and stored either on disk or a remote file store (e.g. GCS or S3). The Webserver accesses the logs and makes them available to the Web UI.
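
As a hedged sketch, where those execution logs end up is driven by Airflow's configuration; in 2.x releases the relevant options live in the [logging] section of airflow.cfg (older releases kept them under [core]):

# Inspect the log destination settings (Airflow 2.x option names).
from airflow.configuration import conf

print(conf.get("logging", "base_log_folder"))         # local directory for worker logs
print(conf.getboolean("logging", "remote_logging"))   # True when logs go to S3, GCS, etc.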

Airflow Login Page

You will be able to see a lot of DAGs already present. These are the example DAGs that ship with the Apache Airflow image by default.

Airflow Web UI

Though these additional components are not necessary to the basic operation of Airflow, they offer functionality that really sets Airflow apart from other current workflow managers. Specifically, the UI and integrated execution logs allow users to inspect and diagnose task execution, as well as view and manipulate task state.

Command Line Interface

In addition to the Scheduler and Web UI, Airflow offers robust functionality through a command-line interface (CLI). In particular, we found the following commands to be helpful when developing Airflow:

  • airflow test DAG_ID TASK_ID EXECUTION_DATE. Allows the user to run a task in isolation, without affecting the metadata database, or being concerned about task dependencies. This command is great for testing the basic behavior of custom Operator classes in isolation.
  • airflow backfill DAG_ID TASK_ID -s START_DATE -e END_DATE. Performs backfills of historical data between START_DATE and END_DATE without the need to run the scheduler. This is great when you need to change some business logic of a currently-existing workflow and need to update historical data. (Note that backfills do not create DagRun entries in the database, as they are not run by the SchedulerJob class).
  • airflow clear DAG_ID. Removes TaskInstance records in the metadata database for the DAG_ID. This can be useful when you’re iterating on the functionality of a workflow/DAG.
  • airflow resetdb. Though you generally do not want to run this command often, it can be very helpful if you ever need to create a “clean slate,” a situation that may arise when setting up Airflow initially. (Note: this command only affects the database and does not remove logs.)
