Create a dynamic Airflow DAG using a YAML file (or any other flat file)

Jun Wei Ng
4 min read · Mar 14, 2022


Sometimes, the workflow, or data pipeline, that we are trying to model in an Airflow DAG is not static; it changes under varying conditions. For example, an Extract-Transform-Load (ETL) pipeline might extract from a varying number of input sources. In these situations, it would be impractical to recreate the DAG each time the condition changes, as that would be highly manual and taxing for the team maintaining the Airflow DAGs. A better way is to build dynamism into the DAG. What if you could make the DAG change depending on a variable?

In this article, we will explore using a structured data flat file to store the dynamic configuration as a variable to implement a dynamic workflow. Note that the following discussion is based on Airflow version 2.


Before we begin, note that using a structured data flat file is not the only way to achieve a dynamic workflow, and it comes with its own set of pros and cons, which we shall dive into as we go along. There are other methods, each with their own set of pros and cons, that you can consider in place of a structured data flat file.

Here’s an article summarising the comparison of this method against the alternatives.

The how

When using a structured data flat file, such as JSON or YAML, we can decide on a custom structure for our dynamic configuration.

Consider the following example workflow: the source files are all dropped in a central location, and the DAG is responsible for locating them before performing the Extract-Transform-Load (ETL) pipeline for each source.

As the sources are only determined at runtime, the DAG will need to dynamically create the ETL task groups for each source present during runtime. There could even be no source files available on some days.

We can set up our DAGs folder as such:

dags/
|- configs/
| |- sources.yaml
|- .airflowignore
|- etl_using_external_yaml_file_dag.py
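
Our example sources.yaml could, for instance, hold a simple list of source names (the structure and names here are made up for illustration; any schema you can parse will do):

sources:
  - source_a
  - source_b
  - source_c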

With the above project structure, we can retrieve our dynamic configuration from a YAML file in our DAG as such:
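
A minimal sketch of such a DAG, assuming the hypothetical sources.yaml structure shown earlier (the load_sources helper and the print-only extract/transform/load callables are placeholders for illustration):

import os
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

# Path of the YAML file, relative to this DAG file
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "configs", "sources.yaml")


def load_sources():
    # Parse the contents of the YAML file to get the list of sources;
    # fall back to an empty list if the file is missing or empty.
    if not os.path.exists(CONFIG_PATH):
        return []
    with open(CONFIG_PATH) as f:
        config = yaml.safe_load(f) or {}
    return config.get("sources", [])


with DAG(
    dag_id="etl_using_external_yaml_file_dag",
    schedule_interval="@daily",
    start_date=datetime(2022, 3, 1),
    catchup=False,
) as dag:
    for source in load_sources():
        # One ETL task group per source present at parse time
        with TaskGroup(group_id=f"etl_{source}"):
            extract = PythonOperator(
                task_id="extract",
                python_callable=lambda src=source: print(f"extracting {src}"),
            )
            transform = PythonOperator(
                task_id="transform",
                python_callable=lambda src=source: print(f"transforming {src}"),
            )
            load = PythonOperator(
                task_id="load",
                python_callable=lambda src=source: print(f"loading {src}"),
            )
            extract >> transform >> load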

In the load_sources helper, we parse the contents of the YAML file to get the list of sources, and the DAG then creates one ETL task group per source returned.

With our example sources.yaml file, the resulting DAG contains one ETL task group for each source listed in the file.

As the dynamic configuration now lives in a file that is stored on the same machine as the DAG files, we will need an external process if we want to make changes to the dynamic configuration. This could either be done directly in the file system by a developer manually, or via a deployment pipeline. In my opinion, these changes should not be done directly in the file system as that does not provide a change history.

Here is an example on how we can do the dynamic configuration changes using another Airflow DAG:
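
A sketch of what this could look like, assuming the new list of sources is passed in via the DAG run configuration (the dag_id and task details are illustrative, not taken from the original snippet):

import os
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.python import PythonOperator

# Same YAML file that the ETL DAG reads its configuration from
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "configs", "sources.yaml")


def update_sources(**context):
    # Take the new list of sources from the DAG run's "conf" payload and
    # overwrite the YAML file, e.g. when the DAG is triggered with
    # {"sources": ["source_a", "source_d"]}.
    new_sources = (context["dag_run"].conf or {}).get("sources", [])
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump({"sources": new_sources}, f)


with DAG(
    dag_id="update_sources_config_dag",
    schedule_interval=None,  # only triggered manually, never on a schedule
    start_date=datetime(2022, 3, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="update_sources", python_callable=update_sources)

Such a DAG could be triggered from the CLI with something like airflow dags trigger update_sources_config_dag --conf '{"sources": ["source_a", "source_d"]}'. Note that this sketch assumes the task runs on a machine that shares the DAGs folder with the scheduler (a single-node setup or a shared volume); otherwise the file being written would not be the one the scheduler reads.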

One good thing about using another DAG is that we kind of have a change history of the dynamic configuration. Of course, one could always make the manual change even with this DAG around, but that would be a violation of the process flow (user issue).

You might also have noticed the .airflowignore file in the DAGs folder. This file is necessary to let the Airflow scheduler know which files or folders to ignore when looking for Python files to parse for DAG updates. This will reduce DAG loading time and improve performance. In our example, our .airflowignore file will have the following content:

configs/.*

Benefits

The biggest benefit is that there is no additional load on any operational database. The retrieval of the dynamic configuration is executed purely on the machine that runs the Airflow scheduler process; no additional machine is required.

This method is also considered a best practice by Airflow when creating dynamic task workflows in a DAG.

Drawbacks

The biggest drawback from this method is that the flat file containing the dynamic configuration can only be viewed via a separate platform, such as the file system. This makes it a little more troublesome when it comes to debugging the dynamic behaviour of the DAG based on changes done to the flat file.

Lastly, dynamic changes might not be reflected instantaneously. These changes are only picked up once the scheduler has parsed and serialised the DAG. In Airflow v2, the scheduler serialises the DAG and saves it into the metadata database. The webserver then retrieves the serialised DAGs from the database and de-serialises them. These de-serialised DAGs then show up on the UI, along with any updates to their workflow or schedule. The frequency of updates depends on the min_file_process_interval setting of the scheduler.
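
For reference, this interval is configured in the [scheduler] section of airflow.cfg, or through the equivalent environment variable; in Airflow 2 the default is 30 seconds:

# airflow.cfg
[scheduler]
min_file_process_interval = 30

# or, equivalently, via an environment variable
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30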

Conclusion

Using a structured data flat file to store the dynamic configuration might seem like an easy way to implement a dynamic workflow, but it does come with its own drawbacks. So be sure to weigh these carefully before embarking on the journey.

The code snippet used is also available in this GitHub repository.

Happy coding! (:
