Moving all our Airflow Architecture to Kubernetes

By Yse Wanono
Published in Inside Doctrine, May 9, 2022

This blog post explains how we sped up our daily Airflow tasks by moving them to Kubernetes.

Introduction

At Doctrine, our data pipeline runs on Airflow. Our main DAG, named high_priorities, has almost 200 tasks and starts every day at midnight, running for about 8 hours. This means that by 9 AM, when developers arrive at work, the new data has been ingested and cleaned and is ready for them to work on, and Doctrine’s customers have access to fresh information on our legal intelligence platform.

Our Historical Airflow Architecture

Historically, we started with a monolithic architecture (and a mono repository for our Python projects). We had one huge instance, named DataLoader, which ran our Airflow worker and therefore all our tasks. We quickly moved the Airflow webserver and scheduler to a separate instance.

Problems with this architecture:
- Single Point Of Failure; when the machine had a problem, the whole data processing pipeline was blocked.
- Lack of Auto-Scaling; we paid for a huge machine running 24 hours a day even though it was not in use all the time. Conversely, it was difficult to launch an occasional large batch of data processing, or to increase the machine's memory and CPU on demand.
- Maintenance and Security; updating the machine was difficult and time-consuming (security updates, upgrades of the Python version or of Python libraries such as Airflow).
- Lack of Consistency and Integration with our stack; the other applications at Doctrine run on a Kubernetes cluster. Running Airflow on Kubernetes as well keeps our stack consistent and lets Airflow integrate cleanly with the rest of it.

To overcome these limitations, we changed our architecture to rely on Kubernetes.

Kubernetes Cluster Autoscaler and KubernetesPodOperator

Porting all our tasks from PythonOperators running on a single worker to KubernetesPodOperators running on our Kubernetes cluster was a huge piece of work. We set up a Kubernetes cluster with Cluster Autoscaler turned on. The move also meant containerizing our data processing scripts (writing a Dockerfile and building an image that is pushed to a registry) and making sure that all our tasks are idempotent and independent of the local environment (no data written to local disk, for instance).
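To illustrate what "idempotent and independent of the local environment" can mean in practice, here is a minimal sketch, not our actual code. It assumes a task that cleans one day of rows and writes the result to remote object storage under a key derived from the run date, so a retry overwrites the same object instead of adding a new one. `InMemoryStore`, `clean_rows`, and the key layout are hypothetical names used only for this example; the in-memory store stands in for a real remote bucket.

```python
import json
from datetime import date


class InMemoryStore:
    """Hypothetical stand-in for remote object storage (e.g. a bucket)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, body):
        # Writing to an existing key replaces the previous object.
        self.objects[key] = body


def clean_rows(store, logical_date, raw_rows):
    """Clean one day of rows and write them to a key derived from the date.

    The output location depends only on the inputs, never on the pod's
    local disk or on how many times the task has already run.
    """
    cleaned = sorted({row.strip().lower() for row in raw_rows if row.strip()})
    key = f"cleaned/{logical_date.isoformat()}.json"
    store.put(key, json.dumps(cleaned))
    return key


store = InMemoryStore()
rows = ["  Ruling A ", "ruling a", "", "Ruling B"]

first_key = clean_rows(store, date(2022, 5, 9), rows)
second_key = clean_rows(store, date(2022, 5, 9), rows)  # simulated retry
```

Because the retry targets the same key with the same content, running the task once or several times leaves the store in the same state, which is what makes it safe to reschedule the pod on any node.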

Terraform and Helmify Airflow

This period was also an opportunity for us to improve our Infrastructure as Code. The Tech team was working on a project to manage all our infrastructure with Terraform, and Airflow was part of this project. We describe our technical challenges in 2021 in a separate post.
We used the Helm chart from the repository https://airflow.apache.org, whose source is in the official Airflow GitHub repository:

https://github.com/apache/airflow/blob/main/chart/Chart.yaml

We upgraded to Airflow 2 and moved from the CeleryExecutor to the KubernetesExecutor. We enabled PgBouncer to manage the connection pool to our Postgres database, and the number of connections per day dropped drastically.

We enabled gitSync to pull all our DAGs from our GitHub repository and configured a log sidecar with Amazon EFS in order to persist our logs. We already used Datadog to monitor our Kubernetes clusters and our running applications, so we could reuse our Datadog agent to monitor our Airflow tasks.
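As a rough sketch of how these choices could be expressed as overrides for the official chart, the snippet below builds the values as a Python dictionary and renders them as JSON, which Helm accepts in a values file because JSON is a subset of YAML. The keys `executor`, `pgbouncer.enabled`, `dags.gitSync.*`, and `logs.persistence.*` are taken from the official chart's values. The repository URL, branch, sub-path, and claim name are placeholders, and using an existing EFS-backed volume claim for logs is an assumption about one possible setup, not a description of our exact configuration.

```python
import json

# Placeholder values for illustration only; adapt them to your own setup.
values = {
    "executor": "KubernetesExecutor",
    "pgbouncer": {"enabled": True},
    "dags": {
        "gitSync": {
            "enabled": True,
            "repo": "https://github.com/example-org/example-dags.git",  # placeholder
            "branch": "main",  # placeholder
            "subPath": "dags",  # placeholder
        }
    },
    "logs": {
        "persistence": {
            "enabled": True,
            # Assumed: a pre-created PersistentVolumeClaim backed by EFS.
            "existingClaim": "airflow-logs-efs",  # placeholder
        }
    },
}


def render_values(overrides):
    """Serialize chart overrides as JSON text suitable for a values file."""
    return json.dumps(overrides, indent=2, sort_keys=True)


rendered = render_values(values)
```

The rendered text could then be saved as, say, values.json and passed to Helm with `-f values.json` after adding the chart repository with `helm repo add apache-airflow https://airflow.apache.org`.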

Conclusions

It’s now easy to manage, update and deploy Airflow on Kubernetes. We benefit from Kubernetes autoscaling (we scale daily from 1 node to 60 nodes when necessary) and monitor everything with Datadog, so we get alerts on Slack when tasks fail.

As for next steps, we are looking forward to trying the latest version, Airflow 2.3.0, and testing Dynamic Task Mapping, which should help us keep improving the scaling of our tasks. It lets a workflow create a number of tasks at runtime based on current data, rather than requiring the DAG author to know in advance how many tasks will be needed.
