Using environment variables to control a dynamic Airflow DAG

Jun Wei Ng
4 min read · Mar 14, 2022


Sometimes, the workflow, or data pipeline, that we are trying to model in an Airflow DAG is not static; it changes under varying conditions. For example, consider an Extract-Transform-Load (ETL) pipeline that extracts from a varying number of input sources. In these situations, it would be impractical to recreate the DAG each time the conditions change, as that would be highly manual and taxing for the team maintaining the Airflow DAGs. A better approach is to build dynamism into the DAG. What if you could make the DAG change depending on a variable?

In this article, we will explore using environment variables to make the DAG change its workflow based on such a variable. Note that the following discussion is based on Airflow version 2.

There is one thing we need to get out of the way first: the definition of a dynamic workflow. A dynamic workflow can be viewed as one that changes behaviour on the fly, or as one that changes behaviour depending on the environment the DAG runs in. For the rest of this article, we will only consider the former: dynamically changing the DAG's workflow on the fly.

Before we begin, note that using environment variables is not the only way to achieve a dynamic workflow, and it comes with its own set of pros and cons, which we shall dig into as we go along. There are other methods, each with its own set of pros and cons, that you can consider in place of environment variables, such as Airflow variables.

Here's an article summarising the comparison of this method against the alternatives.


The how

Using environment variables to achieve dynamic configuration of an Airflow DAG is very similar to how we use Airflow variables to do so.

Consider the following example workflow. The source files might all be dropped in a central location, and the DAG is responsible for relocating them before performing the ETL pipeline for each source.

As the sources are only determined at runtime, the DAG needs to dynamically create an ETL task group for each source present. On some days, there may even be no source files available at all.

We can use environment variables to create this dynamic workflow.
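Here is a minimal sketch of such a DAG. It assumes the list of sources arrives in a SOURCES environment variable holding a JSON-encoded list of strings; the dag_id, task names, and print placeholders are illustrative:

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

# Read the dynamic configuration from the environment. Defaulting to "[]"
# keeps the DAG parseable on days when the variable is unset and no source
# files are available.
sources = json.loads(os.getenv("SOURCES", "[]"))

with DAG(
    dag_id="dynamic_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One ETL task group per source present at parse time.
    for source in sources:
        with TaskGroup(group_id=f"etl_{source}"):
            extract = PythonOperator(
                task_id="extract",
                # Default argument pins the current source, avoiding the
                # late-binding pitfall of closures in a loop.
                python_callable=lambda s=source: print(f"extracting {s}"),
            )
            transform = PythonOperator(
                task_id="transform",
                python_callable=lambda s=source: print(f"transforming {s}"),
            )
            load = PythonOperator(
                task_id="load",
                python_callable=lambda s=source: print(f"loading {s}"),
            )
            extract >> transform >> load

Each source gets its own etl_<source> task group with an extract, transform, and load chain; with SOURCES unset, the DAG still parses fine but simply contains no tasks.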

Near the top of the file, we use the os.getenv function to retrieve the dynamic configuration in string form from the environment variables, and json.loads to convert that string into a list of sources.

For example, when we set the following environment variable:

SOURCES=["foo", "bar"]
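(In a POSIX shell, the value needs quoting to survive word splitting: export SOURCES='["foo", "bar"]'.) A quick check in a Python shell shows how the DAG file parses it; setting os.environ here merely stands in for the exported variable:

import json
import os

# Stand-in for the exported variable; os.getenv would return this string.
os.environ["SOURCES"] = '["foo", "bar"]'

print(json.loads(os.environ["SOURCES"]))  # ['foo', 'bar']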

We get a DAG with one ETL task group per source: etl_foo and etl_bar.

Benefits

The biggest benefit is that there is no additional load on any operational database. The retrieval of the dynamic configuration happens entirely on the machine that runs the Airflow scheduler process; no additional machine is involved in the retrieval.

In addition, changes to the dynamic configuration can be made easily, either via a deployment pipeline or via another Airflow DAG. This allows developers to make dynamic changes to the workflow quickly.

Drawbacks

The biggest drawback is the difficulty of changing the dynamic configuration. When the Airflow scheduler process starts, it inherits its set of environment variables from the operating system. Thus, any change to the dynamic configuration will most likely require a restart of the Airflow scheduler process. Depending on how your Airflow cluster is set up, this might be as easy as spinning the scheduler instances down and up again. But that means a little downtime for your other DAGs, as task scheduling is put on hold until the scheduler comes back up.

Secondly, the dynamic configuration can only be viewed via a separate platform, such as a shell on the Airflow scheduler instance. This makes it a little more troublesome to debug the DAG's dynamic behaviour after changes to the environment variable.

Lastly, dynamic changes might not be reflected instantaneously. Changes are only picked up by Airflow once the scheduler has parsed and serialised the DAG. In Airflow 2, the scheduler serialises the DAG and saves it into the metadata database; the webserver then retrieves the serialised DAG from the database and deserialises it. Only then does the DAG show up on the UI, along with any updates to its workflow or schedule. Then again, since the scheduler instance(s) need to be restarted anyway, and the scheduler parses the DAGs folder on start-up, this last drawback might be a moot point.

Conclusion

Although we can use environment variables to dynamically configure a DAG's workflow, this method has several drawbacks that make it less than ideal. It might find more application as a feature toggle for different environments, where the environment variables are static across DAG runs and vary across environments.
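For instance, a hypothetical per-environment toggle might look like the sketch below. ENABLE_AUDIT, the dag_id, and the tasks are made-up names; the variable would be set in each environment's deployment configuration and left unchanged between runs:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical toggle: ENABLE_AUDIT is set once per environment
# (e.g. only in production) and stays constant across DAG runs.
audit_enabled = os.getenv("ENABLE_AUDIT", "false").lower() == "true"

with DAG(
    dag_id="etl_with_toggle",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # The audit task only exists in environments where the toggle is on.
    if audit_enabled:
        audit = PythonOperator(task_id="audit", python_callable=lambda: print("audit"))
        load >> audit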

The code snippet used is also available in this GitHub repository.

Happy coding! (:
