#1 Airflow in Production: our 1st step towards a Modern Data Stack

Vincent LEGENDRE · Published in hipay-tech · Jun 15, 2022 · 8 min read

This article is the first in a mini-series sharing HiPay’s adoption and use of Airflow. It describes our approach to orchestrating modern data pipelines, and may read as basic considerations to Airflow experts.

TL;DR: while thinking about how to modernize our data stack, we found that starting with orchestration was very convenient. In this story, we share a few considerations on Airflow’s capabilities for modernizing data pipelines inside an established ecosystem, as a first step towards migrating to a Modern Data Stack.

I. A bit of context

HiPay engineers typically work across two environments on a daily basis:

  • On-premises: legacy HiPay core services
  • Cloud: new services, preferably cloud-native applications
HiPay’s ultra-macro infrastructure

Data pipelines at HiPay

Among HiPay tech tribes, Business Intelligence and Data Science teams are in charge of building solutions that leverage data for internal and external needs. Building these involves interacting with a diversity of components, located in both on-premises and cloud infrastructures.

If we combine this diversity of components with needs for observability, control, history catch-up, parallelization, security — to name but a few — any software engineer would quickly wish for an orchestration tool.

II. Hello Airflow!

“Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in 2014 as a solution to manage the company’s increasingly complex workflows.” — Wikipedia

Why Apache Airflow?

As new competitors announce themselves frequently, one can reasonably question the choice of any given technology. Here are several of the major arguments that made us pick Apache Airflow over Dagster or Prefect:

  • Open source, top-level Apache Software Foundation project, adopted at a global level. That alone pretty much says it: the solution has proved to be efficient, with an active community. It also means that we’ll be able to copy/paste stuff from StackOverflow — yay!
  • Written in Python. This is an important point for us: we already use a lot of Python in our projects, and keeping our technical debt under control is non-negotiable
  • HiPay chose Google Cloud Platform as its major Cloud Provider. Due to GCP’s efforts to make all of its services available through APIs, official Airflow operators are actively developed and maintained by Google Cloud engineers. This means that when we run entirely on GCP managed services, our orchestration flow will be just a sequence of API calls, with minimal internal development: massive good news regarding technical debt again. GCP also offers a managed version of Airflow, Cloud Composer, that we plan to use in the future (this point will be explained in a specific post)
Airflow is the most popular open source orchestration solution of 2022

What’s a DAG anyway?

Well, this is an acronym you will come across a number of times when reading about orchestration solutions in general, and especially about Airflow. So it’s best to understand this concept before moving on 😉.

First, a few links to save you some Google searches: Airflow concepts, Airflow tutorial.

“A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run” — Airflow Official Documentation

The DAG concept itself does not imply any particular coding consideration. In the end, the goal is simply to create an organized graph of tasks, and to execute that graph on a schedule.

Here is an example of a DAG that any data team could come up with during design sessions:

Hand drawn DAG, no Airflow involved

Airflow Operators

If linking blocks doesn’t seem so scary, coding the blocks themselves is another story. To achieve this goal without blowing up your technical debt level, Airflow introduces “operators”. An operator can be seen as a handler of a remote resource, to which a developer only needs to provide the required configuration. Here’s a list of the available operators.

As an example, the BigQueryInsertJobOperator makes it possible to schedule SQL queries in BigQuery with about ten lines of code, as sketched below.
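A hedged sketch, assuming the Google provider package is installed (dataset and table names are illustrative):

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Schedules a SQL query in BigQuery; the "configuration" field follows the
# BigQuery job API and is templated by Airflow ({{ ds }} is the run date).
aggregate_daily_payments = BigQueryInsertJobOperator(
    task_id="aggregate_daily_payments",
    configuration={
        "query": {
            "query": (
                "SELECT merchant_id, SUM(amount) AS total "
                "FROM `analytics.payments` "      # illustrative table
                "WHERE day = '{{ ds }}' "
                "GROUP BY merchant_id"
            ),
            "useLegacySql": False,
        }
    },
    location="EU",
)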

Authoring DAGs with Airflow

Writing Airflow DAGs comes down to three main development tasks:

  1. Create a DAG object. This is where DAG-level configuration is done: scheduling, default parameters, dag_id, etc.
  2. Pick appropriate operators and create tasks using them
  3. Create relationships between your tasks
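As a minimal sketch (dag_id, schedule and task names below are purely illustrative), those three steps could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# 1. Create the DAG object: DAG-level configuration lives here
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 2. Pick appropriate operators and create tasks from them
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")

    # 3. Create relationships between the tasks
    extract >> load >> transform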

If the DAG structure is valid and Airflow is properly installed, Airflow will be able to import and schedule it. Returning to the hand-drawn DAG above, here is its translation into Airflow operators:

Above DAG translation into Airflow operators

But… wait! Yes, you’re right: Google may not have made rocket launches available via its APIs yet. In other words, we need to be able to import third-party operators to develop our DAG.

Improving teamwork with Airflow

Developing Airflow DAGs mainly consists of creating tasks from available Python packages. This is where an orchestration solution written in Python reveals its might:

  1. Internal or external teams create and distribute Python packages
  2. DAG developers import operators from those packages to create tasks and link them together

By leveraging Python package distribution, separate teams can use Airflow to materialize their inter-dependencies with nothing but arrows between blocks. Simplicity at its best!
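As a hedged illustration, assume a Data Science team distributes a (hypothetical) hipay_scoring package containing its own operator; a DAG author can then combine it with an official provider operator and express the cross-team dependency with a single arrow:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from hipay_scoring.operators import ComputeRiskScoreOperator  # hypothetical internal package

prepare_data = BigQueryInsertJobOperator(
    task_id="prepare_data",
    configuration={"query": {"query": "SELECT ...", "useLegacySql": False}},  # illustrative query
)

score_transactions = ComputeRiskScoreOperator(task_id="score_transactions")  # maintained by another team

# The inter-team dependency is just an arrow between blocks
prepare_data >> score_transactions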

Airflow helps materializing cross-team collaboration

In addition to Python packages, technologies like Docker and managed cloud services allow developers to run just about anything from an Airflow DAG, and largely reduce the complexity of Python DevOps operations (Python library management, virtual environments, etc). HiPay’s way of deploying applications with Airflow will be the subject of a specific post.

Learning Airflow

This is an important point. Once you have read the basics in the Airflow documentation, you may feel a little lost when opening your favorite IDE. While there are a number of articles that will help you understand the basic concepts of Airflow, the learning curve can be steep if certain prerequisites are not met:

  • Coding Airflow DAGs is coding in Python. One therefore needs to master the basics of Python development, virtual environments, and package import and distribution before aiming to land Airflow projects in production
  • Coding Airflow DAGs is NOT coding the real tasks themselves. This may seem trivial but it is crucial. Example: to execute a SQL query from an Airflow DAG, all of the complexity resides in the SQL development and requires SQL skills. Airflow is just the tool used to define how and when that query runs, and to trigger its execution. This distinction is true for any Airflow task which requires custom application development (even for Python apps!)
  • Operating Airflow in production implies quality CI/CD pipelines, even when using Cloud Composer. Do not be fooled by short tutorials or videos: DevOps skills are as critical here as for any other software. Bash, Ansible, Docker, Terraform, GitOps, you name it
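On that last point, here is a minimal sketch of a DAG integrity test that a CI job could run before every deployment (the dags/ folder path is illustrative):

# test_dag_integrity.py: run by the CI pipeline (e.g. with pytest)
from airflow.models import DagBag


def test_all_dags_import_without_errors():
    """Fail the build if any DAG file cannot be parsed by Airflow."""
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)  # illustrative path
    assert dag_bag.import_errors == {}, f"Broken DAGs: {dag_bag.import_errors}"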

The first really important part of learning Airflow is assimilating its job design philosophy, in particular:

  • There must be no scheduling within tasks. The implications are quite broad here. As an example, any time-based ETL/ELT job should be a function accepting some date_min and date_max parameters, so it can be called from a DAG. Since tasks run within DAG runs, you have DAG run variables at your disposal, such as data_interval_start and data_interval_end (see the Airflow templates reference). Say goodbye to the good old loops like:
for day in list_of_days_to_extract:
    extract_stuff(day)

Note: the reverse is also true: no external action should be performed by your DAG’s top-level Python code outside of its tasks, since your DAG code gets executed at every heartbeat of Airflow’s scheduler.
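As a hedged sketch of the alternative (function, dag_id and parameter names are illustrative), the extraction becomes a parameterized task and Airflow injects the data interval at run time:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_stuff(date_min: str, date_max: str) -> None:
    """Extract exactly one data interval; the caller decides which one."""
    print(f"Extracting rows between {date_min} and {date_max}")


with DAG(dag_id="extract_example", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=True) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_stuff,
        op_kwargs={
            # Rendered per DAG run by Airflow: no loop, no scheduling inside the task
            "date_min": "{{ data_interval_start }}",
            "date_max": "{{ data_interval_end }}",
        },
    )

A history catch-up then becomes a matter of letting Airflow backfill the missed DAG runs, one interval per run.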

  • Functional programming, in order to practice functional data engineering in the end. Tasks should be calls to pure functions, no matter what technology is used under the hood. This translates into at least two principles that any task should abide by (see the sketch after this list):
    • Idempotency: executing a task with the same parameters multiple times produces exactly the same result. This one has amazing implications down the road
    • Atomicity: the task succeeds completely, or fails entirely (e.g. no intermediate write operations left behind in a SQL table by your ETL scripts)
  • No dependency on past executions. This in fact follows from idempotency. It’s also critical for leveraging Airflow’s task execution parallelism when performing large history catch-ups
  • Airflow’s creator Maxime Beauchemin explains a set of best practices and introduces Functional Data Engineering in this video.
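To make these principles concrete, here is a hedged BigQuery-flavoured sketch (project, dataset and table names are illustrative): the task rebuilds exactly one daily partition and overwrites it in a single job, so re-running it for the same interval always produces the same rows and never leaves a half-written table behind.

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

build_daily_revenue = BigQueryInsertJobOperator(
    task_id="build_daily_revenue",
    configuration={
        "query": {
            "query": (
                "SELECT merchant_id, SUM(amount) AS revenue "
                "FROM `raw.payments` WHERE day = '{{ ds }}' "  # illustrative source table
                "GROUP BY merchant_id"
            ),
            "useLegacySql": False,
            # Idempotency + atomicity: replace one whole partition per DAG run
            "destinationTable": {
                "projectId": "my-project",                     # illustrative
                "datasetId": "analytics",
                "tableId": "daily_revenue${{ ds_nodash }}",
            },
            "writeDisposition": "WRITE_TRUNCATE",
        }
    },
)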

Provided that tasks are properly designed, you can try to create your first DAGs and begin to answer operational needs. The learning curve should eventually shift towards improving DAGs and tasks (robustness, logging, performance, CI/CD pipelines…), and more importantly towards designing and implementing collaboration between separate teams within shared data pipelines.

Expected Airflow long-term learning curve

As with the adoption of any new technology, and even after a year and a half of heavy use, we expect to keep encountering obstacles and making mistakes. Nevertheless, it is undeniable that we are already seeing a sharp increase in the resilience and performance of our data infrastructure thanks to Airflow. This is a consequence of:

  • Centralized monitoring and control: all pipelines are observable from a single interface
  • Simple and reactive alerting in Slack when something goes wrong
  • A standard set of tools and practices for data pipeline authoring. Writing workflows in Python code makes them extremely flexible
  • Fast and robust development iterations thanks to the use of Airflow Operators
  • Parallelization of task execution, which allows us to easily perform large history catch-ups
  • Better secret management and security
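On the alerting point, a notification can be as simple as a task failure callback; here is a minimal sketch using a plain Slack incoming webhook (the URL is illustrative and would live in a secret store, and the official Slack provider is another option):

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # illustrative, stored as a secret in real life


def alert_slack_on_failure(context):
    """Airflow failure callback: post the failed task and DAG to a Slack channel."""
    ti = context["task_instance"]
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Task {ti.task_id} of DAG {ti.dag_id} failed"},
        timeout=10,
    )


# Attach the callback to every task of a DAG through default_args
default_args = {"on_failure_callback": alert_slack_on_failure}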

III. What’s coming next for us?

As we plan to expand the role of Airflow within our infrastructure, it is important that we remain open to opportunities to improve our stack and our practices. To that end, we will keep an eye on Airflow’s evolution, especially its unbundling and the evolution of its place in the Modern Data Stack.

Getting a robust Airflow instance up and running in production laid the foundation for migrating the rest of our data infrastructure. We are currently working on the ingestion and transformation layers. The cool part is that Airflow integrates seamlessly with the key players in both of these layers.

On a side note, Airflow helped us migrate large volumes of data incrementally from on-premises PostgreSQL databases to BigQuery. As all our new data pipelines are scheduled and monitored using Airflow, we have already avoided the hell of managing dozens of cron jobs.

Thanks for reading! Give us a little 👏 if you found this useful, or leave a comment if you want more focus put on specific aspects in the future 😉

By the way, we are always looking for bright people to join us, check out our open positions!
