Scaling Airflow on Kubernetes: lessons learned

Yannick Schini
Published in The Qonto Way
Nov 3, 2021 · 8 min read

Introduction

At Qonto, we used the summer months to migrate from our previous ETL, developed in-house in Python, to an Airflow-based stack hosted on an EKS cluster (i.e. a managed Kubernetes cluster in the AWS cloud).

We had built successful prototypes and run all the analyses to confirm that this change would help us scale the way we wanted. Our initial Kubernetes cluster had a minimum size of 2 nodes and could scale up to 5 nodes, each with 4 vCPUs and 16 GB of RAM. Airflow ran a handful of DAGs amounting to fewer than a hundred tasks. On the Airflow side, we were running the second major version, configured to use the Kubernetes executor. This was our starting point.

Our ambition was to port our ETL to Airflow, which we estimated at around 150 DAGs and 800 daily tasks, plus some hourly, weekly, and monthly DAGs and tasks. We prepared for this migration as well as we could; however, that didn’t prevent us from running into a few issues as the workload scheduled by Airflow grew. In this article, we’ll share some of the lessons we learned. Our Airflow instance now handles more than 250 DAGs and over 1,300 tasks every day on more than 15 nodes.

Our ambition is to keep on scaling: now that the Data Engineering team has a good command of this stack, we are thinking about opening up Airflow to other teams at Qonto (data science,…). This is only the beginning of Airflow at Qonto!

Resource management

Choose your limits wisely

When in doubt, set your limit high rather than low

Kubernetes can be merciless. A pod that overconsumes resources will get killed, and fast. Reading the following Airflow log message can sadly become routine in the morning:

[2021-09-17 13:31:17,870] {local_task_job.py:149} INFO - Task exited with return code Negsignal.SIGKILL

We quickly learned that finding the right value for the limit parameter of our pods is an iterative process: start with a high value and use your monitoring tools to track actual consumption. At Qonto, we use a Prometheus and Grafana stack for our monitoring needs.

Grafana dashboards will help you monitor your ETL

Once you have a clear picture of actual resource consumption, adjust the limit and request accordingly (but always keep some margin).

Also, be careful to take cyclical behavior into account. For example, we woke up one day to find most of our tasks killed by Kubernetes because we had set our limits according to a “normal” day’s consumption. However, on the first day of each month, we ingest almost 50% more data in some tables, which leads to increased memory consumption.

Requests and limits…

Make your requests and limits as close as possible

On the topic of memory consumption, we started by setting the pod limit really high (10 GB for a pod that used around 1.5 GB on average) combined with a low request (1 GB). This led to unreliable scaling behavior for our Kubernetes cluster.

The reason is that, when scheduling pods onto nodes, Kubernetes doesn’t know the “real” consumption of the pods it schedules, only their requests. For example, if you have a node with 16 GB of RAM, of which 12 are already used, and you want to schedule a pod with a request of 2 GB that can consume up to 5 GB, Kubernetes won’t see an issue.

Now imagine a batch of tasks starts, with a hundred pods to schedule onto nodes, all similar to the one described above (a request of 2 GB, consumption up to 5 GB). Can you guess what’s going to happen? Kubernetes is going to schedule 8 pods per node before scaling up, which makes sense given the information it has. However, in such a situation, our pods won’t be able to consume their "normal" 5 GB, which will lead to bottlenecks, failures, or at least delays.

In the pod definition templates, the request parameter should be kept as close as possible to the limit parameter. At Qonto, we now enforce a maximum ratio of 1.5 between the two: the request cannot be below two-thirds of the limit. For critical components such as Airflow’s schedulers and the webserver, we even set both parameters to the same value.
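
For illustration, here is what this rule could look like in the resources section of one of our pod template files. The values below are examples, not our actual configuration:

# Example container excerpt from a pod template file.
# The memory request is kept at two-thirds of the limit (a 1.5 ratio),
# so the scheduler's view of the pod stays close to its real consumption.
spec:
  containers:
    - name: base  # Airflow expects the main container of a pod template to be named "base"
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "3Gi"
          cpu: "1"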

Pod template files to the rescue

Use pod template files to optimize resource consumption

As you can imagine, some tasks require a lot more memory than others (based on the volume of data, velocity, variety, etc.). For example, a task that gets data from our production databases, modifies it, and loads it in our warehouse (ETL model) will use a lot more memory than a “sensor” task that only checks if/when a given table has been updated. It doesn’t make sense to have the same resource configuration for all of them.

In order to avoid filling our cluster with tasks that have big requests/limits but don’t actually use them, we use several pod template files. We have one pod template file adapted to larger tasks, and one for smaller tasks.

Those pod templates are then specified through the executor_config parameter that we pass to our operators. We specify default values that we can override for tasks that require specific resources.

DEFAULT_ARGS = {
    ...
    # Every task uses this pod template unless it overrides executor_config
    "executor_config": {"pod_template_file": "default-file-template.yaml"},
}

dag = DAG(
    ...
    default_args={**DEFAULT_ARGS, **DAG_PARAMS},
    ...
)

That way, we can have finer resource allocation and optimize our cluster’s resource efficiency (detailed information on this can be found in Airflow’s documentation).
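
For instance, a memory-hungry task can point to the larger template while every other task keeps the default. A minimal sketch, where the task and file names are hypothetical:

from airflow.operators.python import PythonOperator

def load_big_table():
    ...  # placeholder for the actual loading logic

load_big_table_task = PythonOperator(
    task_id="load_big_table",
    python_callable=load_big_table,
    # Override the default pod template for this task only
    executor_config={"pod_template_file": "large-task-template.yaml"},
    dag=dag,
)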

Resource configuration is not a “once-and-done” job

Monitor, and ideally alert, resource consumption continuously

Qonto is a company that grows quickly. Over a few months, we saw some of our tasks’ resource consumption grow at the same rate, which makes sense: the number of customers, of transactions, of payments, basically everything is increasing.

This means we have to measure and monitor this increase in resource consumption over time, so we can anticipate when tasks need additional resources before they actually hit their limits and get killed by Kubernetes. Again, we rely on our monitoring stack (Prometheus/Grafana) for that.
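
As a sketch of what such an alert can look like, here is a Prometheus alerting rule that fires when a container uses more than 90% of its memory limit. The metric names assume cAdvisor and kube-state-metrics are deployed; adjust them to your exporter versions:

groups:
  - name: airflow-resources
    rules:
      - alert: ContainerMemoryCloseToLimit
        # Memory working set compared to the configured limit, per container
        expr: |
          max by (namespace, pod, container) (container_memory_working_set_bytes{container!="", container!="POD"})
            /
          max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} in {{ $labels.pod }} is close to its memory limit"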

Know your monitoring stack

Be aware of your tools’ capabilities and limits

You should rely on your tools for efficient resource monitoring, and that means knowing their configuration and how they work!

Here’s a concrete example of a trap we fell into because we didn’t master our monitoring stack as well as we should have. Our Prometheus/Grafana stack has a polling rate of 30 seconds, a relatively standard value. Some of our tasks run for around 45 seconds, which means those tasks get only a single measurement of their resource usage (see illustration below).

Depending on what the task does, that single measurement might not be representative. In our case, the pod received a significant amount of data after the measurement, so its memory usage increased significantly between the Prometheus measurement and the end of the pod’s life. This led us to set the limit and request parameters too low, getting our pods killed and our ETL delayed!
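
For reference, this polling rate is the scrape_interval in the Prometheus configuration; a minimal sketch, assuming a standard prometheus.yml:

global:
  # With a 30s interval, a pod that lives for ~45s may be scraped only once,
  # so its peak memory usage can be missed entirely.
  scrape_interval: 30s
  evaluation_interval: 30s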

Running our ETL on Kubernetes

Cluster elasticity

Use an elastic cluster rather than a fixed-size one

As an ETL’s workload is by nature not constant over time (an ETL is a batch job), using an elastic cluster makes sense. That way, capacity can adapt: scaling up at peak workload times and scaling back down when the work is done, saving money.

Pod lifecycle

Tailor Kubernetes’ behavior to be compatible with an ETL

Even when there are no errors, Kubernetes can, and will, kill pods as part of its normal behavior. This can happen on several occasions, for example during a scale-down: when the Kubernetes autoscaler detects that the resource usage is low enough that one node can be turned off, it will drain all pods from the node before shutting the node down.

While this makes sense for webservers or other components with an indefinite lifetime, where several identical pods run in parallel, it is not well suited to batch jobs, where each task runs once and has a limited lifetime. It’s frustrating to see Kubernetes kill a 45-minute data loading task just two minutes before completion! And it’s even more frustrating when an Airflow bug prevents those tasks from being retried properly and you have to do it manually in the morning.

Adding the following annotation to your pod template file tells Kubernetes that your pods shouldn’t be killed under “normal” circumstances:

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

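In a pod template file, this annotation goes under the pod’s metadata; a minimal excerpt:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Tell the cluster autoscaler not to evict this pod when draining a node during scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
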
There are other sources of pod destruction; make sure to monitor them and deactivate them as they arise. For example, AWS EKS clusters have a feature, activated by default, called “AZRebalance”. When it detects an imbalance across your nodes’ Availability Zones, it turns a node off and replaces it with a similar node in another AZ. This leads to Kubernetes draining your pods from the node being shut down, even if they have the safe-to-evict annotation!

This feature can be deactivated through the suspended_processes parameter in the Terraform "eks-worker" module.

Conclusion

Using Airflow on Kubernetes can be complex (as we have shown), but it also brings many advantages:

  • Almost limitless scaling potential: Kubernetes has tremendous scaling capabilities, and while Airflow can introduce additional scaling limits, the Airflow team is hard at work improving those. For example, one of Airflow 1’s bottlenecks was the scheduler, which is why Airflow 2 introduced a horizontally scalable scheduler.
  • Total flexibility regarding the tasks you want to run: Kubernetes is natively container-oriented and can run anything, as long as it is containerized. Using pod-template files, you can easily use any Docker image for your tasks. No more conflicting dependency issues! You can run parts of your ETL in Python 3.6, Python 3.8, Go and even Bash scripts if you feel like it.
  • Flexible resource usage: Airflow integrates with Kubernetes to allow dynamic cluster scaling, which lets your cluster automatically scale up when a large number of tasks are to be run and scale down in slower times, saving money. You can set limits based on your constraints.
  • And of course, using Airflow rather than our own in-house scheduler eases recruiting and training since Airflow is now a standard in the Data world.

At Qonto, we firmly believe that the move to Airflow on Kubernetes is worth it, and we have ambitious plans to keep on scaling our Airflow instance to meet all our business needs.

We hope this article has helped you avoid some common traps when you take your first steps with Airflow on Kubernetes. As we plan to keep on growing, we also plan to keep on sharing, so stay tuned for more tips as we learn! And of course, don’t hesitate to reach out to us if you have questions or want to dig deeper!
