6 ways to write a dynamic Airflow DAG: Which one fits your use case?

Jun Wei Ng
3 min read · Mar 13, 2022


Airflow makes it easy to model a data processing pipeline as a Directed Acyclic Graph (DAG). These DAGs are made up of tasks, which take the form of operators or sensors. These tasks can even be custom-made to fit the requirements of the DAG. While most DAGs are static, meaning their flow does not change under varying conditions, some DAGs might need dynamism when processing data.

Consider the following example DAG. The source files might all be dropped in a central location, and the DAG is responsible for relocating them before performing the Extract-Transform-Load (ETL) pipeline.

As the sources are only determined at runtime, the DAG needs to dynamically create an ETL task group for each source present at runtime. There could even be no source files available on some days. So how do we make the DAG handle this dynamism?

There is one thing we need to get out of the way first: the definition of a dynamic workflow. A dynamic workflow could be viewed as one that changes behaviour on the fly. It could also be viewed as one that changes behaviour depending on the environment the DAG runs on. For all further discussion in this article, we will only consider the former, where we would like to achieve the outcome of dynamically changing the DAG workflow on the fly.

There are 6 ways we can wire dynamism into an Airflow DAG:

This article will not go in depth on the pros and cons of each of the above methods; each one links to a dedicated article with more details. Instead, this article provides a quick comparison of the methods. Note that the following discussion is based on Airflow version 2.

TL;DR

Here’s a rough summary of the pros and cons of each method:

Change on-the-fly: The ability to make changes to the dynamic configuration on demand.

Nested operators are a "maybe" because they can use any of the other five methods to achieve dynamism; if they use environment variables, they lose the ability to make changes on the fly.

Ease of debugging: The amount of effort required to find the root cause of issues when misconfiguration happens.

It basically boils down to how easily a developer can visualise the impact of a change to the dynamic configuration on the affected DAG.

Additional load on ops DB: Additional accesses on operational databases, such as the Airflow metadata database or an external database required to keep operations going.

This happens when database accesses are made in the top-level code of the dynamic Airflow DAG.
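A minimal sketch of why top-level access is costly — here `sqlite3` stands in for the metadata or external configuration database, and the table and column names are hypothetical. A query placed at the top level of a DAG file executes on every scheduler parse, not just on every DAG run:

```python
import sqlite3
from typing import List


def load_sources(conn: sqlite3.Connection) -> List[str]:
    """Fetch the dynamic configuration that drives DAG generation."""
    return [row[0] for row in conn.execute("SELECT name FROM sources")]


# In a DAG file, a top-level call like the one below runs every time the
# scheduler re-parses the file, so the database is queried far more often
# than the DAG actually runs:
#
#   conn = sqlite3.connect("config.db")
#   for source in load_sources(conn):
#       ...  # build one task group per source
```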

Responsiveness to changes: How quickly a change to the dynamic configuration gets parsed and serialised by the Airflow scheduler.

The Airflow scheduler parses the DAGs folder periodically to keep the DAGs up to date with changes to their workflow or schedule. The parsed DAGs are serialised into the metadata database. DAGs that directly rely on an external configuration to achieve a dynamic workflow need to wait for the Airflow scheduler to parse and serialise them before the changes take effect.
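For reference, the parse cadence is controlled by scheduler settings in airflow.cfg; the values below are the Airflow 2 defaults (a sketch — check your own deployment's configuration):

```ini
[scheduler]
# Minimum interval (seconds) between re-parses of a single DAG file.
min_file_process_interval = 30
# How often (seconds) the DAGs folder is scanned for new files.
dag_dir_list_interval = 300
```

These intervals bound how quickly a configuration change can propagate into the serialised DAG.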

There are many ways for us to create a dynamic Airflow DAG, but each has its own pros and cons. Make sure to weigh these trade-offs when choosing how to create your dynamic DAG, and do check out the articles above for more details on each method.

Happy coding! (:

Jun Wei Ng

Software developer @ Thoughtworks. Opinions are my own