Airflow: Scaling with AWS EKS

Harsh Mishra
PlaySimple Engineering
Aug 10, 2022 · 7 min read

“You can focus on things that are barriers or you can focus on scaling the walls or redefining the problem” - Tim Cook

As an organisation, PlaySimple Games is growing significantly, and so is the data. The Data team at PlaySimple handles more than 150 DAGs and over 5,000 tasks every day on Airflow. As the saying goes, “with great scale comes great responsibility”.

As a team, one of the major challenges we face is providing infra that can handle the increasing workload while staying cost-efficient. To achieve this, we needed an infra that can scale up or down depending on the demand.

As our ambition is to always keep scaling… 🚀

Intending to scale our Airflow infra, we moved it to Kubernetes and later enabled auto-scaling on the cluster. This removed much of the manual effort that resource management otherwise needed, and it optimised our resource cost and usage by automatically scaling the cluster up and down in line with the demand on Airflow.

But it was not as easy as it sounds. Let me take you back in time…

Earlier, our services used to fail during peak demand because they did not have enough resources available to handle the spike. This would also bring Airflow down and thus affect other services as well. To solve this, we had to scale things manually, which was a time-consuming and expensive procedure. This is where auto-scaling comes into the picture: it is an automated process that adjusts capacity for predictable performance and costs.

Auto-scaling is one of Kubernetes' major features, as it simplifies resource management that would otherwise need a lot of human effort. It helps optimise resource cost and usage by automatically scaling a cluster up and down in line with the demand.

For example, if a service in production experiences a greater load during certain times of the day, Kubernetes can automatically increase the number of cluster nodes to handle that change, and then, when the load decreases, scale back down to fewer nodes, which saves on both resources and spending. Hence, it ensures that…

“WE ARE ALWAYS RUNNING” 🏃 🏃 🏃

Now, we were aware that we had to start scaling our EKS cluster but had no clue how to move forward. Scaling the Kubernetes cluster was not that easy. Specifying resource requests and limits for every task/pod on Kubernetes was one of the greatest nightmares.

What if anything breaks? What if we end up paying a lot more? How are we going to set limits for 5,000 Airflow tasks? What if we end up allocating way more resources than needed? What if...? How will we…? There were a lot of questions crossing our minds. A lot of work went in, but finally, we were able to achieve what we wanted.

We use Amazon Elastic Kubernetes Service (EKS) to run our Kubernetes applications. Let us now discuss in detail how we can scale EKS. But before deep-diving into it, let's understand what resource requests and limits in Kubernetes are.

What are Resource Requests and Limits in Kubernetes?

When specifying a pod in K8s, one can optionally specify how much of each resource it needs. The most common resources to specify are CPU and memory (RAM).

When a resource request is specified for a pod, Kubernetes uses this information to decide which node to schedule the pod on. When a limit is also set for a container, Kubernetes ensures that the running container is never allocated more resources than the set limit.

Fig. Resource Requests and Limits in K8s

Note: If a pod requests more resources than are available on an empty node, it will never get scheduled.
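To make this concrete, here is a minimal sketch of a pod with requests and limits, written with the Kubernetes Python client. The names and values here are illustrative placeholders, not taken from our setup:

```python
from kubernetes.client import models as k8s

# Illustrative values only: request 512Mi / 0.5 CPU, cap at 1Gi / 1 CPU.
resources = k8s.V1ResourceRequirements(
    requests={"memory": "512Mi", "cpu": "500m"},  # used by the scheduler to pick a node
    limits={"memory": "1Gi", "cpu": "1"},         # hard ceiling enforced at runtime
)

container = k8s.V1Container(name="app", image="busybox", resources=resources)
pod = k8s.V1Pod(
    metadata=k8s.V1ObjectMeta(name="demo-pod"),
    spec=k8s.V1PodSpec(containers=[container]),
)
```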

How did we do it?

Setting Up and Deploying the AWS Cluster Autoscaler

The very first step is to deploy the Cluster Autoscaler as a deployment on your Kubernetes cluster. The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in your cluster when pods fail to launch for lack of resources, or when nodes are underutilised and their pods can be rescheduled onto other nodes.

The setup is very easy and can be completed in a few steps. For more details refer to the following documentation: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html

Higher Minimum Worker Nodes on Cluster

Always set a higher minimum number of worker nodes for a node group in the EKS cluster when starting with auto-scaling. This will prevent your system from breaking in case your Cluster Autoscaler is not working properly or you are not yet sure how scaling in Kubernetes works. You can always go back and reduce this number once you are sure that the system is working properly.

Resource Addition to Airflow Tasks and Pods

Now comes the most important step, which is to add resource requests and limits to your Airflow tasks/pods. We have over 5,000 tasks in Airflow, and each of these tasks spins up a pod when running. Setting resource requests and limits for these was a bit challenging: we wanted to keep the requests and limits such that the majority of our tasks could run on the same resources, without hurting resource utilisation at the same time.

Reading the Memory Usage of Pods

We use Datadog to monitor the resources used by different pods. This data helped us a lot in finalising the values of memory requests and limits for the Airflow tasks.

Fig. Datadog metrics to monitor pod resource usage

We collected the past month's data and divided all the Airflow tasks into four buckets, namely: small, medium, large, and default, with a target of keeping most of the tasks in the default bucket.
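A rough sketch of what such a bucketing could look like is below; the bucket names follow the ones above, but the dictionary and the values in it are illustrative placeholders, not our production numbers:

```python
# Illustrative memory buckets derived from observed usage (placeholder values).
# Most tasks are expected to fit in the "default" bucket.
RESOURCE_BUCKETS = {
    "small":   {"request_memory": "256Mi", "limit_memory": "384Mi"},
    "default": {"request_memory": "512Mi", "limit_memory": "768Mi"},
    "medium":  {"request_memory": "1Gi",   "limit_memory": "1536Mi"},
    "large":   {"request_memory": "2Gi",   "limit_memory": "3Gi"},
}
```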

For starters, keep the requests as high as possible

Keeping requests and limits high at the start will make sure that your system does not break; otherwise, you might end up seeing errors like this on Airflow:

[2022-07-14, 23:02:53 IST] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL

Try keeping your requests and limits as close as possible.

Limits can be dangerous too. While setting limits, keep in mind that if you set them too low, tasks might end up getting killed very frequently.

In doubt? Follow the 1.5x rule: we used this rule to decide our requests and limits, i.e. the limit is roughly 1.5x the request, so the request should not be any less than two-thirds of the limit.
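As a quick sanity check of the arithmetic (the numbers are just an example, not values from our tasks):

```python
def limit_from_request(request_mb: int, factor: float = 1.5) -> int:
    """Derive a limit from a request using the 1.5x rule,
    which keeps the request at roughly two-thirds of the limit."""
    return int(request_mb * factor)

# A task observed to need ~400MB gets a 400MB request and a 600MB limit.
print(limit_from_request(400))  # 600
```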

Keep on iterating until you find the ideal resource values

Setting high requests and limits may lead to a lot of resource wastage. For example, if a task that needs 100MB is given 300MB, it wastes 200MB on the node. Therefore, the key is to keep tuning these values until you land on values that do not lead to resource wastage and high costs.

Assigning Resource Buckets to Airflow Tasks

Now it's time to set the limits for your Airflow tasks. As we know, some tasks take more resources than others, and hence we divided them into different buckets.

We assigned the default resources in the worker-pod-config file itself. This covered the resources for 80–90% of our tasks. For the remaining tasks, the executor_config parameter of the Airflow operator came to the rescue. We created an Executor Config Manager to return the executor config for a task based on its requirements. For example,
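We cannot reproduce our exact implementation here, but a minimal sketch of such a manager, assuming each task has been mapped to one of the buckets above, could look like this (the class and method names are hypothetical):

```python
from typing import Dict, Optional


class ExecutorConfigManager:
    """Hypothetical sketch: returns the executor_config for a task
    based on the resource bucket it has been assigned to."""

    def __init__(self, bucket_configs: Dict[str, dict], task_buckets: Dict[str, str]):
        self._bucket_configs = bucket_configs  # bucket name -> executor_config dict
        self._task_buckets = task_buckets      # task_id -> bucket name

    def get_config(self, task_id: str) -> Optional[dict]:
        bucket = self._task_buckets.get(task_id, "default")
        # "default" tasks simply fall back to the worker-pod-config file,
        # so no executor_config override is returned for them.
        if bucket == "default":
            return None
        return self._bucket_configs[bucket]
```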

The Airflow executor config looks something like this,
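The sketch below uses the Airflow 2.x pod_override style for the KubernetesExecutor; the resource values are placeholders:

```python
from kubernetes.client import models as k8s

executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the Airflow worker container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "500m", "memory": "512Mi"},
                        limits={"cpu": "1", "memory": "1Gi"},
                    ),
                )
            ]
        )
    )
}

# A task can then pick this up, e.g. (hypothetical task):
# PythonOperator(task_id="heavy_etl", python_callable=run_etl,
#                executor_config=executor_config)
```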

Here you can specify your Kubernetes requests and limits for resources like CPU and memory.

The executor config helped us to allocate resources to tasks more efficiently and cleanly.

It doesn’t end here…

Some other key things to keep in mind when autoscaling Kubernetes are:

Setting requests and limits is not a one-time task. You need to keep tuning these values for better resource utilisation and to ensure nothing in the system breaks. Also, a task may show different resource usage on different days; we need to keep all this in mind while setting the values.

Proper monitoring is the key to a stable system. We have set up monitoring on Datadog to keep a check on the number of nodes running in the EKS cluster, memory usage, CPU usage, and the resources used by each task, along with alerts for when the cluster scales up or down.

Conclusion

In this article, we discussed how we scaled our Airflow infra using cluster autoscaling in Kubernetes. Running Airflow on Kubernetes can look complex at first, but once implemented, it brings a lot of advantages in terms of flexibility in resource usage, easier scaling, better monitoring, and a robust infra.

We believe that migrating Airflow to Kubernetes was worth the effort, and we will keep making changes to the infra as and when required to keep it robust and reliable.

What Next?

We would love to hear from you; for any queries, you can post your comments below 👇 and show us some love ❤️.

Let’s always keep learning and growing!

Liked what you read? Interested in building large-scale infrastructure? Do check out our engineering openings @careers.playsimple.
