Generating Airflow DAGs using Jinja

Ali Masri
6 min read · Jul 13, 2022

Introduction

Apache Airflow is an open-source workflow management platform for data engineering pipelines. A pipeline in Airflow is known as a DAG, which is short for Directed Acyclic Graph. Learning how to write DAGs is an essential skill for data engineers, but in some cases we need a way to generate DAGs automatically with code. In this article, I will go over two methods for DAG generation and analyze the pros and cons of each.

Use Case

Let us consider the following use case. We have a list of cities, and we want to generate a weather report for each. If the city is European, we will save the generated report in a file; otherwise, we will just print it to the logs. Each report must live in its own DAG, because we want to be able to pause/resume each one on demand.

Weather Reporting

In this tutorial we will use http://wttr.in/, a simple service that generates an elegant weather report in textual format. Getting a report for a city is simple: all we need to do is pass the city name in the URL. Here is the output of http://wttr.in/chicago:

Solution Plan

We will keep it simple by creating a function that calls wttr.in and then, based on the city, either saves the response to a text file (if the city is European) or just prints it to the logs.
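
A minimal sketch of these helpers, assuming the requests library; the function names and the /tmp output path are illustrative:

```python
# weather_utils.py -- illustrative helpers; names and the /tmp output path are assumptions
import requests


def get_weather_report(city: str) -> str:
    """Fetch a plain-text weather report for a city from wttr.in."""
    response = requests.get(f"http://wttr.in/{city}")
    response.raise_for_status()
    return response.text


def write_report(city: str) -> None:
    """Save the report to a text file (European cities)."""
    with open(f"/tmp/{city}_report.txt", "w") as f:
        f.write(get_weather_report(city))


def print_report(city: str) -> None:
    """Print the report so it appears in the task logs (non-European cities)."""
    print(get_weather_report(city))
```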

DAG Generation

The Simple Way!

The easiest way to generate DAGs is to use a loop to create DAG instances with different parameters and task properties.

The following snippet shows how to implement this for our use case.
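
This is a minimal sketch, assuming Airflow 2.x; the city list, schedule, and the weather_utils module from the earlier sketch are illustrative:

```python
# dags/weather_dags.py -- loop-based generation: one file declares all DAGs
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from weather_utils import get_weather_report, print_report, write_report

# city -> is_european (illustrative list)
CITIES = {"london": True, "paris": True, "chicago": False}

for city, is_european in CITIES.items():
    with DAG(
        dag_id=f"{city}_weather_dag",
        start_date=datetime(2022, 7, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        get_report = PythonOperator(
            task_id="get_report",
            python_callable=get_weather_report,
            op_args=[city],
        )
        # The second task differs by region: write to a file or print to the logs
        report = PythonOperator(
            task_id="write_report" if is_european else "print_report",
            python_callable=write_report if is_european else print_report,
            op_args=[city],
        )
        get_report >> report

    # Expose each DAG under a unique module-level name so the scheduler discovers it
    globals()[f"{city}_weather_dag"] = dag
```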

And that is it! Here is the DAG list in the Airflow UI:

Airflow UI with generated DAGs

Now let us look at the difference between the London and Chicago DAGs.

London DAG

Notice the difference in the second task: Chicago has print_report, since it is non-European, while London has write_report instead.

Drawbacks

Even though the above method is clean and easy to follow, it suffers from two major problems.

The first problem is scalability, and it is tied to the way the Airflow scheduler works. On every scheduler heartbeat, Airflow parses all the Python files in the DAG directory and executes all top-level code, that is, everything that is not inside the execute method of an operator. With multiple schedulers, files can be distributed and parsed in parallel. But since our DAG generation lives in a single file, it will be processed by one scheduler, which runs the same generation code over and over on every heartbeat. Imagine how slow this gets when auto-generating a large number of DAGs. It can even break your DAG generation entirely if parsing the file exceeds Airflow's DAG file processing timeout (dagbag_import_timeout).

The second problem is maintainability. If you look at the code section for a given DAG, you will see the code that was used to generate the DAG, not the DAG's own code. This becomes more tedious the more complicated the generating code gets.

DAG code

The Better Way!

If we think about the best scenario for our use case, it would be the following:

  1. No or very minimal top-level code
  2. The ability to see each DAG’s code in the Code tab in the UI
  3. The ability to scale up without putting all the burden on one scheduler

A trivial solution is to write one DAG file per city by hand, but this is not what you came here for! Instead of repeating code, we will write code that generates a Python file containing the DAG code for each city.

The new plan is the following:

  1. Write a template file that contains the basic structure of the DAG code, with placeholders to be replaced at render time.
  2. Write code that:
    2.1. Loads the template file
    2.2. Iterates over the cities
    2.3. Renders the template by passing in the placeholder values
    2.4. Writes the output to a Python file
  3. Call the DAG generation code to generate our DAG Python files

Template

There are many ways to define a template. The idea is to write the DAG code with placeholders that will be replaced at render time with the relevant values. In this article, we will use Jinja to define our template.

Jinja is a fast, expressive, extensible templating engine. Special placeholders in the template allow writing code similar to Python syntax. Then the template is passed data to render the final document.

This is what the template looks like.

DAG template

Using Jinja templating, we can define a placeholder for the city name and, with Jinja conditionals, generate the Python file with only the code it needs.
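
A minimal sketch of such a template; the file name, schedule, and output path are illustrative. {{ city }} is the placeholder, and the {% if is_european %} blocks are the conditionals:

```python
# templates/dag_template.py.jinja2 -- rendered into a plain Python DAG file per city
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

CITY = "{{ city }}"  # placeholder filled in when the template is rendered


def get_weather_report():
    return requests.get(f"http://wttr.in/{CITY}").text


{% if is_european %}
def write_report():
    with open(f"/tmp/{CITY}_report.txt", "w") as f:
        f.write(get_weather_report())
{% else %}
def print_report():
    print(get_weather_report())
{% endif %}


with DAG(
    dag_id="{{ city }}_weather_dag",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    get_report = PythonOperator(task_id="get_report", python_callable=get_weather_report)
{% if is_european %}
    report = PythonOperator(task_id="write_report", python_callable=write_report)
{% else %}
    report = PythonOperator(task_id="print_report", python_callable=print_report)
{% endif %}
    get_report >> report
```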

DAG generation code

To fill the placeholders and generate the Python files, we will write code that:

  1. Loads the Jinja template
  2. Loops over the cities and renders the template, passing the city name and the is_european flag
  3. Saves each rendered template as a Python file

DAG code generation
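
A sketch of that script, assuming the template above lives in a templates/ folder and the output goes to dags/; trim_blocks and lstrip_blocks make Jinja swallow the {% ... %} tag lines so the rendered Python stays clean:

```python
# generate_dags.py -- renders one DAG file per city (paths and city list are illustrative)
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

# city -> is_european
CITIES = {"london": True, "paris": True, "chicago": False}

env = Environment(
    loader=FileSystemLoader("templates"),
    trim_blocks=True,    # drop the newline after {% ... %} tags
    lstrip_blocks=True,  # strip indentation before {% ... %} tags
)
template = env.get_template("dag_template.py.jinja2")

for city, is_european in CITIES.items():
    rendered = template.render(city=city, is_european=is_european)
    Path("dags", f"{city}_weather_dag.py").write_text(rendered)
```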

After running this code, three Python files are generated under your DAGs folder.

Auto generated files

Here is what the Chicago and Paris DAGs look like:

Auto-generated Chicago DAG

What is interesting about the generated code is that each DAG contains only the minimal code it needs to work. In the Chicago DAG, since the is_european flag is false, there is no need to define the write_report function; instead, only print_report is defined and called. It is the exact opposite for the Paris DAG.
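
Rendered with city="chicago" and is_european=False, the template sketch above would produce roughly:

```python
# dags/chicago_weather_dag.py -- the rendered output, with the conditional resolved
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

CITY = "chicago"


def get_weather_report():
    return requests.get(f"http://wttr.in/{CITY}").text


def print_report():
    print(get_weather_report())


with DAG(
    dag_id="chicago_weather_dag",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    get_report = PythonOperator(task_id="get_report", python_callable=get_weather_report)
    report = PythonOperator(task_id="print_report", python_callable=print_report)
    get_report >> report
```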

This is also exactly the code we see in the Airflow UI, without having to decipher generation code that is irrelevant to the city at hand.

Pros

The advantage of this approach is that it solves all the drawbacks of the first method we covered.

  1. There is no expensive top-level code; each file contains only a plain DAG definition
  2. If you navigate to the Code tab of each DAG, you will see only the DAG code, not the code that generated it
  3. You can edit each individual file, if necessary, without affecting the other DAGs
  4. You can create multiple templates and load the relevant one based on your input
  5. DAG files can now be distributed across multiple schedulers for parsing

But where should this code live, and when should it run?

There are multiple options. You can keep the generator in a separate folder so the scheduler does not try to parse it, and then run it in your Dockerfile, in your CI/CD pipeline, or on demand. It all depends on how you plan to deploy your solution.

Conclusion

In this article, we covered two ways of generating DAG files. The first method uses basic loops to create DAG objects, while the second is more advanced and uses the Jinja templating engine to generate Python files. With Jinja, we can write powerful templates that minimize redundant code while helping with scalability and maintainability at the same time.

The Jinja template we covered here is basic; with more knowledge of Jinja, you can go much further. Make sure to choose the right approach when generating your DAGs: sometimes the number of DAGs to generate is small and not worth a templating solution, but templates are the way to go when you need to scale.
