Advanced Airflow: Scheduling Complex ETL Tasks

Patrick Cockwell
Published in FLYR Engineering
Jul 24, 2019 · 7 min read

At FLYR we use Apache Airflow for many of our scheduled data ingestion jobs. Airflow is a scheduler and monitor with a graphical interface that displays workflows as Directed Acyclic Graphs (DAGs).

We’ve learned through trial and error that Airflow is great in some situations, but has concrete limitations that make it inefficient to use for certain applications. In this article we’ll discuss the benefits of using Airflow, solutions to common issues, and situations in which Airflow might not be the best tool to use for implementation.

Why Use Airflow?

Airflow is likely one of the best open-source schedulers available for straightforward ETL tasks. It has a gentle learning curve for simple tasks because it uses Python and is fast to start up, making it a great tool when you need to quickly set up a scheduler.

Open Source

Airflow is an open-source tool, meaning that its source code is available for anyone to use or modify. In other words, Airflow can be forked from its GitHub repository and adjusted to fit a company’s unique workflow. In addition, Airflow has an open-source community that submits improvements to the code base. Companies using Airflow can also contribute code improvements upstream, and one of FLYR’s own employees has contributed to the Airflow project.

Easy Startup

Airflow’s DAGs are relatively simple to create and efficient to use once they’ve been set up. After a DAG has been created, Airflow’s GUI shows a visual history of workloads and readable logs, which makes debugging easier. The DAGs are also defined in Python, making the learning curve to use them gentle for many startup engineers.
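
For anyone who hasn’t seen one, a DAG definition is just a Python file. The sketch below is purely illustrative (the DAG ID, schedule, and callable are made-up placeholders), but it shows how little code a working DAG needs:

```python
# A minimal illustrative DAG (Airflow 1.10-era API); all names are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_data(**context):
    # Placeholder for a real extraction step; "ds" is the run's date stamp.
    print("extracting data for", context["ds"])


dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2019, 7, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
)

extract = PythonOperator(
    task_id="extract",
    python_callable=extract_data,
    provide_context=True,
    dag=dag,
)
```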

Reliable Scheduling

DAGs in Airflow can be scheduled in several ways: at regular time intervals without a concrete starting point, according to cron schedules, or as a one-time execution using the @once preset. DAGs also support non-time-based triggering (by simply omitting a schedule), in which case the DAG has to be triggered manually or by an external actor, whether that’s another DAG, an HTTP request from a service, or some other method.
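
As a rough sketch of the options just described (the DAG IDs and dates are illustrative, not from a real deployment):

```python
# Illustrative schedule_interval values covering the cases above.
from datetime import datetime, timedelta

from airflow import DAG

common = dict(start_date=datetime(2019, 7, 1), catchup=False)

interval_dag = DAG("every_six_hours", schedule_interval=timedelta(hours=6), **common)
cron_dag = DAG("nightly_at_2am", schedule_interval="0 2 * * *", **common)
once_dag = DAG("run_exactly_once", schedule_interval="@once", **common)
manual_dag = DAG("externally_triggered", schedule_interval=None, **common)  # triggered manually or by another actor
```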

Backfilling

Though backfilling is primarily relevant to ETL processes, it can be useful for other reasons as well. Backfilling allows Airflow to replay tasks as though the task were being run in the past. Backfilled tasks can be set to be dependent upon previous executions of DAG tasks or can be set to run when resources become available.

When using backfilling, it’s best to make sure your workload ensures data idempotency. Data idempotency is the guarantee that running a job will have the same end result regardless of the initial state of the data’s destination. In other words, as long as a job is data idempotent, the time at which you run the job won’t affect the results your job will produce.
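
One common way to get idempotency is to key every write on the run’s execution date and replace that slice of data wholesale, so rerunning or backfilling a given date always converges to the same end state. The sketch below assumes a hypothetical warehouse connection and table names; it illustrates the pattern rather than our production code:

```python
# Hypothetical idempotent load: the partition for the execution date ("ds")
# is deleted and reloaded, so running the task twice for the same date
# produces the same result as running it once.
from airflow.hooks.postgres_hook import PostgresHook


def load_partition(ds, **_):
    hook = PostgresHook(postgres_conn_id="warehouse")  # assumed connection ID
    hook.run("DELETE FROM fact_bookings WHERE booking_date = %s", parameters=(ds,))
    hook.run(
        "INSERT INTO fact_bookings "
        "SELECT * FROM staging_bookings WHERE booking_date = %s",
        parameters=(ds,),
    )
```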

Implementing Airflow

While Airflow is easy to set up when your use cases are simple, implementing complex tasks requires some in-depth knowledge of how Airflow works. In this section, we’ll talk about common situations you might come across and some ways to implement Airflow efficiently.

Extending Data Formats

Because Airflow is open-source, you can extend the source code when you need Airflow to handle data formats it doesn’t inherently support. To do this, you will likely want to subclass existing operators. Extending Airflow with subclassed operators lets you avoid copying boilerplate code into every DAG that needs to support your data format. Subclassing your operators will also improve the rigor and consistency with which you handle nonstandard data formats, though it requires a bit more work up front.
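
As an example of what that can look like, here is a minimal sketch of a subclassed operator for a hypothetical fixed-width file format; the field layout and parsing are placeholders, and a real operator would add validation and error handling:

```python
# Hypothetical operator for a fixed-width format Airflow doesn't support natively.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class FixedWidthToCsvOperator(BaseOperator):
    @apply_defaults
    def __init__(self, source_path, dest_path, field_widths, *args, **kwargs):
        super(FixedWidthToCsvOperator, self).__init__(*args, **kwargs)
        self.source_path = source_path
        self.dest_path = dest_path
        self.field_widths = field_widths

    def execute(self, context):
        # The format-specific parsing lives here once, instead of being
        # copy-pasted into every DAG that touches this kind of file.
        with open(self.source_path) as src, open(self.dest_path, "w") as dest:
            for line in src:
                fields, pos = [], 0
                for width in self.field_widths:
                    fields.append(line[pos:pos + width].strip())
                    pos += width
                dest.write(",".join(fields) + "\n")
```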

Interpreting DAG Starting Times

Though Airflow has a quick learning curve, it also has some features that may make developers do a double take. Rather than labeling a DAG run with the time at which it actually started, Airflow labels it with the start time minus one execution cycle, i.e. the amount of time between DAG triggers. This means that DAG execution dates are recorded as having occurred in the past rather than at the time the run actually began. While this isn’t detrimental to Airflow’s functionality as a scheduler, it can cause confusion for developers unfamiliar with the interface.
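
A concrete example (with an illustrative DAG) makes the behavior easier to see:

```python
# For a daily DAG, the run that actually starts shortly after
# 2019-07-24T00:00 UTC is labeled with execution_date 2019-07-23T00:00 UTC,
# i.e. the start of the interval it covers, one cycle "in the past".
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="daily_example",
    start_date=datetime(2019, 7, 23),
    schedule_interval="@daily",
)
```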

Rescheduling Sensors

Sensors take up a worker slot for their whole run by default, which means they hold process threads until the task they are checking for completes. An unfortunate consequence of this default design is that sensors and tasks can end up in a deadlock: the task that the sensor is checking on might be waiting for threads to be freed, while the sensor is occupying the very thread resources that task needs to run. This can be resolved by setting the sensor to “reschedule” mode, which spins the sensor down completely and reschedules it to check for the task state at a later time.
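
A rough example of the fix is shown below. The sensor type, connection ID, and query are all illustrative; the relevant parts are mode="reschedule" (available in Airflow 1.10.2 and later) and the poke_interval, which together let the worker slot be released between checks:

```python
# Illustrative sensor in "reschedule" mode; names and SQL are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.sensors.sql_sensor import SqlSensor

dag = DAG("sensor_example", start_date=datetime(2019, 7, 1), schedule_interval="@daily")

wait_for_rows = SqlSensor(
    task_id="wait_for_rows",
    conn_id="warehouse",  # assumed connection ID
    sql="SELECT COUNT(*) FROM staging_bookings WHERE booking_date = '{{ ds }}'",
    mode="reschedule",    # release the worker slot between checks
    poke_interval=300,    # check every five minutes
    timeout=6 * 60 * 60,  # give up after six hours of waiting
    dag=dag,
)
```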

Cross-DAG communication

Though DAGs function relatively well when they are running isolated tasks, as soon as a DAG’s trigger is dependent on another DAG, unexpected problems can occur. We’ll discuss how cross-DAG communication works, some common issues you might encounter, and solutions to those problems.

Airflow has a built-in cross-DAG communication tool called xcom that is similar to a Redis store in that it stores information as key-value pairs. The keys are often ID values that originate directly from a DAG, and the entries are stored in a Postgres database. This design works well when tasks are implemented using normal DAGs; however, a major detriment of xcom is that it doesn’t allow for DAG-to-subdag communication and vice versa. To better understand why this is the case, we’ll need to delve into the technicalities of how subdags function.
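
For reference, this is what basic xcom usage looks like between two tasks in the same (normal) DAG; the task names and values here are made up:

```python
# Illustrative xcom push/pull between two tasks in one DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def push_batch_id(**context):
    # Store a value keyed by this DAG, task, and execution date.
    context["ti"].xcom_push(key="batch_id", value="batch-" + context["ds"])


def pull_batch_id(**context):
    # Retrieve the value pushed by the upstream task.
    batch_id = context["ti"].xcom_pull(task_ids="push_batch_id", key="batch_id")
    print("processing", batch_id)


dag = DAG("xcom_example", start_date=datetime(2019, 7, 1), schedule_interval="@daily")

push = PythonOperator(task_id="push_batch_id", python_callable=push_batch_id,
                      provide_context=True, dag=dag)
pull = PythonOperator(task_id="pull_batch_id", python_callable=pull_batch_id,
                      provide_context=True, dag=dag)
push >> pull
```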

A DAG can be divided into tasks, and it is possible for one of those tasks to be a DAG in and of itself; that type of task is called a subdag. In Airflow’s basic implementation, subdags have a limitation: xcom cannot receive messages from or pass messages to a parent DAG. Since subdags cannot access values from a normal DAG, they also can’t access xcom values that are stored using DAG IDs.

There are at least two potential solutions to this issue: Airflow can be extended to allow subdags to communicate with normal DAGs, or an external storage can be used to store artifacts.

DAG-to-DAG trigger communication through xcom also provides no traceability, so a more reliable way to trigger DAGs from other DAGs is to use task sensors. These task sensors sit and wait for the completion of a task in another DAG at a specific UTC date time. Though these sensors function better than xcom in some ways, they too have caveats to consider: sensors will only check for completion of a task at a specific UTC timestamp, and they cannot look for completion within a timespan.
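
A hedged sketch of that pattern is below. The DAG and task IDs are illustrative; the important detail is that the sensor looks for an upstream run at one exact execution date (here shifted by execution_delta), so the two schedules must line up precisely:

```python
# Illustrative task sensor waiting on another DAG's task; IDs are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task_sensor import ExternalTaskSensor

dag = DAG("downstream_dag", start_date=datetime(2019, 7, 1), schedule_interval="@daily")

wait_for_ingest = ExternalTaskSensor(
    task_id="wait_for_ingest",
    external_dag_id="ingest_dag",
    external_task_id="load_complete",
    execution_delta=timedelta(hours=2),  # upstream runs exactly two hours earlier
    mode="reschedule",
    poke_interval=300,
    dag=dag,
)
```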

Addressing Constant Reinitialization

Airflow DAGs have two phases in their lifecycle: initialization and execution. Though execution only occurs when the DAG is triggered, reinitialization occurs roughly every thirty seconds as the scheduler re-parses the DAG file, which results in any shared, generated variables being overwritten. To address this, you can store shared data in external sources, as long as the data is not dynamic.

For example, any attempt to record when a UUID was generated by a DAG will fail, because every time the reinitialization phase runs, the DAG will generate a new UUID. In addition, any DAGs that are supposed to share variables will become mismatched over time as the DAGs reinitialize. Instead of storing these values within the DAGs, the data and artifacts should be stored outside of Airflow.

You should not, however, use this method to store dynamic data since it will result in mismatches between the stored data and the data expected by the DAGs. For example, if the number of tasks run by a DAG is stored in an external database and that number changes while the tasks are running, Airflow will reinitialize the DAG according to the number stored in the database rather than the actual number of running tasks. This can result in unknown behavior during execution.
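
The UUID pitfall described above can be sketched in a few lines; the DAG and helper names are made up, but the pattern is the important part:

```python
# Illustrative reinitialization pitfall: top-level DAG-file code is re-executed
# every time the scheduler re-parses the file, so values generated there are
# not stable across the DAG's lifetime.
import uuid
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# BAD: regenerated on every re-parse (roughly every thirty seconds), so tasks
# that were supposed to share one identifier can see different values.
run_id = str(uuid.uuid4())


def do_work(**context):
    # BETTER: derive shared identifiers inside the task from stable, templated
    # values such as the execution date, or read them from an external store.
    stable_id = "run-" + context["ds"]
    print(stable_id)


dag = DAG("reinit_example", start_date=datetime(2019, 7, 1), schedule_interval="@daily")

work = PythonOperator(task_id="do_work", python_callable=do_work,
                      provide_context=True, dag=dag)
```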

When Not to Use Airflow: Distributed Computing

When you have a hammer, everything looks like a nail. Though Airflow is a useful tool for ETL processes, it’s easy to fall prey to the same mindset and to believe Airflow is the solution for everything.

While using Airflow, an important fact to keep in mind is that Airflow is a scheduler and is not designed for distributed computation. Airflow requires at least one worker to be up and running at any point in time, regardless of whether the worker is actually being used, and these idle worker costs can add up when you use Airflow for many tasks. We explored this problem in more detail in another blog post.

Conclusions and Recommendations

Airflow is an excellent scheduler to use for ETL tasks. It is free and one of the quickest ways to immediately implement a scheduler.

When using Airflow for complex tasks, make sure to put significant forethought into the design of any DAGs to ensure the tasks run as anticipated. Data should be set up to be idempotent in order to guarantee that jobs output the correct data and that backfilling works as expected. When Airflow is used in a data engineering capacity, we recommend following a standard Git workflow for managing DAG code; for automated CI/CD pipelines, normal code deployment paradigms are likely sufficient.

Our data engineering team works extensively with Airflow to implement ETL pipelines for airline data ingestion. We’re currently hiring.
