#2 Airflow in Production: a CI/CD approach for DAG deployment and infrastructure management

Vincent LEGENDRE · Published in hipay-tech · Jul 5, 2022 · 7 min read

In a previous post called #1 Airflow in Production: our 1st steps towards a Modern Data Stack, we explained our choice of Apache Airflow as our orchestration solution, some of its basic concepts and the benefits we expect from it, both in the short and long term. Need a quick reminder?

This new article covers how we continuously test, build and deploy Airflow software at HiPay. A number of choices were based on a technical context at a given time, and are likely to be reconsidered in the future.

Building quality software involves choosing appropriate tools and anticipating collaboration between teams

Sorry cloud lovers, we don’t use Cloud Composer (yet). Keep reading though, you might find some interesting information 😉

The first mission for Airflow at HiPay was moving data from on-premises PostgreSQL operational databases to BigQuery.

Due to HiPay’s network configuration (VPN) at the time and the fact that Google Cloud Platform’s public APIs were accessible from our servers, it was simpler to set up our Airflow instances on-premises.

Note: A specific article will discuss the benefits we gained from managing our Airflow instance by ourselves, on-premises, instead of using Cloud Composer.

Test, build, deploy software. Credits to flaticon.com

I. We aim for a production-ready orchestration solution

To get there, a non-exhaustive list of requirements must be met:

  • Continuous Integration: all new developments follow a common workflow (gitflow) and trigger appropriate code quality tests.
  • Continuous Delivery: manual actions are banished. All software is deployed by automated pipelines, driven by our version control system.
  • Development, Testing, Acceptance & Production (DTAP): three Airflow environments are used to validate and then run Airflow software: testing, acceptance and production. Development follows Test Driven Development and is done locally, using Docker containers and pytest fixtures (see the sketch after this list).
  • Reproducibility: it goes without saying, but it is essential for the DTAP logic to have any value.
  • Fault tolerance: two production instances, hosted in two different datacenters, run in parallel and are kept up to date during releases and hotfixes.
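To give an idea of what local Test Driven Development can look like, here is a minimal sketch of a DAG integrity test run with pytest inside a Docker container. The `dags/` path, fixture and test names are illustrative assumptions, not HiPay's actual code:

```python
# Minimal sketch of a local DAG integrity test run with pytest.
# The dags/ path, fixture and test names are illustrative, not HiPay's actual code.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Parse every DAG file once for the whole test session.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error or missing dependency in a DAG file shows up here.
    assert dag_bag.import_errors == {}


def test_dags_have_owner_and_tasks(dag_bag):
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} has no tasks"
        assert dag.default_args.get("owner"), f"{dag_id} has no owner"
```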

II. How we structure our Airflow code

Our Airflow software is made of both open-source packages AND internal developments

At HiPay, we separated Airflow infrastructure management and DAG code into two kinds of projects (i.e. different GitLab projects):

  1. A Core project, responsible for building our Airflow instances. It mainly consists of infrastructure setup, DAG deployment and Airflow upgrades.
  2. ELT projects, providing what we call “TaskGenerators”, to be integrated into Airflow DAGs.
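To make the idea concrete, here is a minimal sketch of what a TaskGenerator could look like on the ELT-project side. Class, function and task names are hypothetical, not HiPay's actual implementation:

```python
# Illustrative sketch of a TaskGenerator exposed by an ELT project (e.g. "project_a").
# Class, function and task names are hypothetical, not HiPay's actual code.
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup


class ProjectATaskGenerator:
    """Entry point the Core project imports to plug project_a into a DAG."""

    def create_task_group(self, group_id: str = "project_a") -> TaskGroup:
        # Must be called inside a DAG definition so the group attaches to that DAG.
        with TaskGroup(group_id=group_id) as group:
            extract = BashOperator(task_id="extract", bash_command="echo extract")
            load = BashOperator(task_id="load", bash_command="echo load")
            extract >> load
        return group
```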

Task groups are the key to the delivery independence of ELT projects

As the complexity of a data pipeline increases, the number of lines needed to implement it with Airflow explodes. Now imagine that different teams are working on the same DAG: a single Python file is not going to cut it at all.

Among the options available, we chose to use TaskGroups to solve this. Some very good posts cover TaskGroups concepts and implementation, especially this guide from Astronomer.

Without going into detail, the main benefit we achieved is that TaskGroups act as APIs between ELT projects. Thus, the complexity of the data pipeline is broken down into smaller parts, and the final DAG file is greatly simplified.

Assuming that the parameters at the DAG level do not change regularly, and that Project_A and Project_B keep their respective names and structures, the DAG Python file will not be subject to many changes.
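As a hedged illustration of such a DAG file on the Core-project side, assuming hypothetical `project_a` and `project_b` packages that each expose a TaskGenerator:

```python
# Illustrative DAG from the Core project composing two ELT projects through their
# TaskGenerators. Package and class names are assumptions, not HiPay's actual code.
import pendulum
from airflow import DAG

from project_a.generators import ProjectATaskGenerator  # hypothetical package
from project_b.generators import ProjectBTaskGenerator  # hypothetical package

with DAG(
    dag_id="daily_elt",
    start_date=pendulum.datetime(2022, 7, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    group_a = ProjectATaskGenerator().create_task_group(group_id="project_a")
    group_b = ProjectBTaskGenerator().create_task_group(group_id="project_b")

    # The only dependency managed at the DAG level: project_a runs before project_b.
    group_a >> group_b
```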

On the other hand, ELT projects evolve constantly. This is especially true when Python functions are called from PythonOperators.

III. CI/CD pipelines

Infrastructure: test and execute Ansible roles

Ansible is HiPay’s standard for on-premises infrastructure setup. It is a configuration management tool, with a procedural syntax (as opposed to the declarative syntax of Terraform). In a nutshell, Ansible roles can be seen as Shell scripts on steroids that can be run on remote hosts over SSH.

We implemented the configuration of our Airflow instances from scratch using Ansible:

  1. Unix users management (e.g. creating an airflow user and its home directory)
  2. Debian packages installation
  3. Python packages installation, including Airflow
  4. Files deployment (configuration, services, DAGs …)
  5. Airflow backend database configuration
  6. Restarting Airflow services when required
Extract of our Airflow configuration role

Even though a project to “aid in the development and testing of Ansible roles” — Molecule — exists, we found it difficult to fully test our roles without dedicated environments. Hence the role of our test instance, used during merge requests.

Ansible helps us maximize the reproducibility and idempotency of our deployments, while inventories allow us to implement the required differences between environments. Combined with Gitlab CI pipelines, it gives us a convenient way to achieve continuous delivery for the Airflow infrastructure.

Ansible enables us to implement a DTAP logic for Airflow infrastructure

Data pipelines: test, build and deploy Python packages

As of now, delivering ELT projects to Airflow instances translates into two kinds of actions:

  • Updating scripts: mainly Python, templated SQL and JSON files.
  • Updating environment variables. These variables are used within Airflow DAGs by operators. They were initially materialized as .env files and are being replaced by Airflow Variables.
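Here is a minimal sketch of what reading such configuration from Airflow Variables instead of a .env file could look like. The variable keys and defaults are illustrative, not HiPay's actual setup:

```python
# Minimal sketch of reading configuration from Airflow Variables instead of a .env
# file. Variable keys and defaults are illustrative, not HiPay's actual setup.
from airflow.models import Variable

# Plain string variable, with a default so local runs do not fail.
bq_dataset = Variable.get("bq_target_dataset", default_var="staging")

# JSON variable holding several settings for one ELT project.
project_a_conf = Variable.get("project_a_conf", deserialize_json=True, default_var={})

# Note: in real DAG files, Jinja templating ({{ var.value.bq_target_dataset }}) is
# preferable where possible, so values are resolved at runtime rather than at parse time.
```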

A variety of tools, such as rsync, can update scripts on a remote server, and Ansible integrates nicely with most of them. However, since we are delivering Python projects, it makes more sense to leverage Python packages to distribute our code.

Test, build, deploy Python Packages

To secure this package distribution, the first element we needed was a private Python package repository. GCP had our back here with Artifact Registry. This managed service made it easy to authenticate users and to give them the appropriate permissions.
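To make the packaging side concrete, here is a minimal, illustrative setup.py for an ELT project distributed as a private package. Project name, version and file layout are assumptions, not HiPay's actual code:

```python
# Minimal, illustrative setup.py for an ELT project published as a private package
# to a repository such as Artifact Registry. Names and paths are assumptions.
from setuptools import find_packages, setup

setup(
    name="project-a",
    version="1.2.0",
    packages=find_packages(exclude=["tests*"]),
    install_requires=[],  # project-specific runtime dependencies go here
    # Templated SQL and JSON files shipped alongside the Python code.
    package_data={"project_a": ["sql/*.sql", "config/*.json"]},
)
```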

Artifact Registry helps us store and secure private Python packages

During releases, we take advantage of our DTAP logic to validate the behaviour of an updated DAG in the acceptance environment before moving to production. Depending on the changes brought by a new release, this validation step can take from a few minutes to several days.

To sum it up, the full CI/CD workflow for an internal Airflow package looks like the following:

IV. Adaptation to change: what if?

What if we change our use of TaskGroups in the future?

The flexibility and delivery independence that TaskGroups offer us are convenient for now, but may not be enough when things get tough:

  • We may need to recreate task-level dependencies for performance reasons
  • We may find that delivery independence hurts the quality of an entire pipeline, and that TaskGenerator versions need to be controlled from the Core Airflow project

Instead of calling the .create_task_group method of TaskGenerators, we can call the task creation functions from the DAG and set all dependencies there. If the DAG file becomes too large, we can always split it into smaller parts, but this time within the Core project.
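A hedged sketch of that alternative, assuming the ELT projects also expose plain task factory functions (the package, function and task names below are hypothetical):

```python
# Hypothetical alternative: ELT projects expose plain task factory functions instead
# of a .create_task_group method, and the Core DAG wires task-level dependencies itself.
# Package, function and task names are assumptions, not HiPay's actual code.
import pendulum
from airflow import DAG

from project_a.tasks import create_extract_task, create_load_task  # hypothetical
from project_b.tasks import create_transform_task                  # hypothetical

with DAG(
    dag_id="daily_elt",
    start_date=pendulum.datetime(2022, 7, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_a = create_extract_task()
    load_a = create_load_task()
    transform_b = create_transform_task()

    # Task-level dependencies are now controlled from the Core project.
    extract_a >> load_a >> transform_b
```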

Regarding the technologies involved in Gitlab CI pipelines, building and deploying Python packages will be necessary as long as we use a Python orchestration solution.

What if we move to Cloud Composer?

As we aim to get the most out of GCP managed services, we might switch to Cloud Composer in the future. A few thoughts regarding this migration:

  • Infrastructure will be implemented using Terraform scripts
  • Airflow executor will change from LocalExecutor to KubernetesExecutor
  • PythonOperators should be replaced by KubernetesPodOperators (sketched after this list)
  • TaskGenerators will be distributed either using pip and Artifact Registry, or by uploading files to Cloud Storage buckets. Using a custom Docker image to run Cloud Composer does not appear to be supported at this time
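As a hedged sketch of that PythonOperator replacement, assuming a containerized ELT project (the image URL, namespace and task names are illustrative assumptions):

```python
# Hedged sketch of replacing a PythonOperator with a KubernetesPodOperator on
# Cloud Composer. Image URL, namespace and task names are illustrative assumptions.
import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="daily_elt_composer",
    start_date=pendulum.datetime(2022, 7, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_bq = KubernetesPodOperator(
        task_id="load_to_bq",
        name="load-to-bq",
        namespace="default",                                           # assumption
        image="europe-docker.pkg.dev/my-project/elt/project-a:1.2.0",  # hypothetical
        cmds=["python", "-m", "project_a.load"],
        get_logs=True,
    )
```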

Putting it all together, only our Core Airflow project should see major changes. Since the backbone of Cloud Composer is a Kubernetes cluster, we can expect Python ELT projects to be delivered as Docker images. However, since we have tried to use as few Python operators as possible from the beginning, such changes should be rare. Instead, we primarily rely on BigQuery and Google Dataflow to scale our data pipelines.

Thanks for reading! Give us a little 👏 if you found this useful, or leave a comment if you want more focus put on specific aspects in the future 😉

By the way, we are always looking for bright people to join us, so check out our open positions!
