Why We’re Switching Off Airflow — Sort Of

Vincent Castaneda
FLYR Engineering
May 22, 2019

Context

At FLYR, we make heavy use of Apache Airflow for the ETL tasks in our data ingestion pipelines. Airflow gave us a quick and easy way to run scheduled tasks, which satisfied our ETL needs. We have a large number of Airflow jobs handling both data ingestion and data transformation to keep our warehouse and pipelines running smoothly. These ETL jobs run in a Kubernetes cluster with the Celery executor and a manually sized pool of supporting Celery workers.

In addition to a robust ETL scheduler, we required a system that could complete many large computational jobs in parallel at scheduled intervals for our Machine Learning (ML) workloads. Parts of our travel demand forecasting involve computational tasks that are trivially parallelizable across independent dimensions, such as day of travel and market, allowing high degrees of parallelism that we can exploit. The outputs of these independent steps are combined in later optimization steps. These data science and ML workloads have a task sequencing and fork-join flow that is representable as a Directed Acyclic Graph (DAG), which is the native Airflow model for representing ordered sequences of jobs. The system needed to stably run hundreds of simultaneous tasks to completion for both scheduled and ad-hoc computing.
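
To make this fork-join shape concrete, below is a minimal sketch of how such a DAG might be expressed in Airflow (written against the 1.10-era API). The market list and the forecast_market/combine_forecasts callables are hypothetical placeholders, not our actual pipeline code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical placeholders -- not FLYR's actual markets or pipeline code.
MARKETS = ["SFO-JFK", "LAX-ORD", "SEA-DEN"]


def forecast_market(market):
    """Independent, trivially parallelizable forecasting step for one market."""
    print("forecasting demand for %s" % market)


def combine_forecasts():
    """Join step that consumes the outputs of all per-market forecasts."""
    print("combining per-market forecasts")


dag = DAG(
    dag_id="demand_forecast_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

combine = PythonOperator(
    task_id="combine_forecasts",
    python_callable=combine_forecasts,
    dag=dag,
)

# Fan out into one task per market; every market task feeds the single
# combine task, giving the fork-join shape described above.
for market in MARKETS:
    forecast = PythonOperator(
        task_id="forecast_%s" % market.replace("-", "_"),
        python_callable=forecast_market,
        op_kwargs={"market": market},
        dag=dag,
    )
    forecast >> combine
```

In our real workloads the fan-out spans multiple independent dimensions, such as market and day of travel, so the number of parallel leaf tasks grows quickly.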

Given our successful use of Airflow for ETL work, we were excited to use the scheduler for these ML workloads, but we encountered numerous issues when using Airflow in a way it was not designed to function. Our ETL and ML DAGs differ in one significant way: the ML DAGs require a much larger degree of task parallelism. When we initially decided to reuse our existing ETL Airflow system for this parallel processing, it wasn't fully evident how much that task parallelism would impact Airflow's efficiency.

After several attempts to adapt Airflow to our needs, we have decided to move away from Airflow to a scheduling system that can handle this level of scale. In this blog post, we will go over the solutions we attempted and our thought process in deciding to move away from a familiar system.

Limitations of Celery Workers

Our first attempt at tackling the problem was to use the tools we were most familiar with at the time: an Airflow deployment in Kubernetes using the Celery executor and underlying Celery workers. With the Celery executor, a predefined number of Celery workers are created in the Kubernetes environment and wait for tasks to be assigned by the scheduler. After receiving a task, a Celery worker either computes it locally or acts as a monitoring agent that tracks the progress of work spun up on external Kubernetes pods.

This system has a major limitation. By design, all workers have to be preallocated before DAGs are run and tasks are assigned, which means the Celery executor has no simple auto-scaling option. As a result, the worker pool must always be sized for the most computationally demanding DAG, even though most other DAGs need only a fraction of those workers. In addition, that large pool of workers consumes costly computational resources whether or not the workers are actually running a task.

This lack of autoscaling resulted in heavy underutilization of our system on average. Because the system could never scale down to zero, we were burning money keeping resources allocated, which defeated one of the main purposes of running a system in Kubernetes.

Though our Celery setup fulfilled the parallelism and execution requirements for the tasks we wanted to run, and was great for spinning up work quickly, we recognized that it was not a viable long-term solution as we continued to scale.

Using the Kubernetes Executor

One potential solution to our problems was the promise of the Kubernetes executor being released for Airflow v1.10.3. Unlike the Celery executor, the Kubernetes executor doesn’t create worker pods until they are needed. When Airflow schedules tasks from the DAG, a Kubernetes executor will either execute the task locally or spin up a KubernetesPodOperator to do the heavy lifting for each task.
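
As a rough illustration (not our production code), a pod-backed task under this model might look like the following sketch; the image, command, and namespace are placeholder values.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="pod_operator_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

# Placeholder image and command. The work runs in its own Kubernetes pod,
# so compute is only consumed while the task is actually executing.
heavy_step = KubernetesPodOperator(
    task_id="forecast_heavy_lifting",
    name="forecast-heavy-lifting",
    namespace="default",
    image="python:3.7",
    cmds=["python", "-c"],
    arguments=["print('long-running forecast step')"],
    in_cluster=True,
    get_logs=True,
    is_delete_operator_pod=True,  # clean up the pod once the task finishes
    dag=dag,
)
```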

This approach seemed great! In small-scale tests, the scheduler and Kubernetes executor worked perfectly. During our proof-of-concept trials, 5–10 parallel tasks would spin up and run to completion without incident, the scheduled DAGs succeeded, and everything scaled down to zero upon task completion. We immediately moved on to tests of our system's intended use cases.

Problems arose when we attempted to scale the Kubernetes executor. Unfortunately, the Kubernetes executor is young compared to the Celery executor, and it fails when creating hundreds of parallel tasks. We saw tasks fail with no obvious cause, Kubernetes executor tasks report themselves as running for hours or days without completing, and database records drift from reality before our eyes. We scaled our database, restarted the scheduler after a fixed number of scheduled tasks (a well-known workaround within the Airflow community), and adjusted many different Airflow settings in search of a viable configuration, all to no avail. This was not a system that could handle the number of granular tasks we were throwing at it without a significant time investment in researching configurations and settings. It became clear to us that identifying the underlying issues would require more research for potentially little reward.

Before deciding whether the time investment would be worthwhile, we wanted to rule out our own code as the cause of the problems we were experiencing. We were scheduling time-consuming tasks (7+ minutes) within the KubernetesPodOperator, and each of these tasks made several external calls to Redis, APIs, and other data sources before completing. We wanted to confirm that the failures we were seeing were not due to some strange configuration or internal performance issue in that code. To test this, we simply ran large parallelism tests with Airflow's native BashOperator, where the only command given to each operator was to sleep for 5 minutes. We saw the same errors occur. This ruled out the theory that our code was causing the scaling issues, and we found ourselves back at square one.
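
The stress test itself is simple enough to reproduce with a sketch like the one below; the DAG name and task count are illustrative rather than the exact values we ran.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative task count; the point is many trivial tasks in parallel.
NUM_PARALLEL_TASKS = 200

dag = DAG(
    dag_id="parallelism_stress_test",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
)

# Independent tasks that only sleep, taking our own code out of the
# equation: any failures implicate the scheduler/executor, not the tasks.
for i in range(NUM_PARALLEL_TASKS):
    BashOperator(
        task_id="sleep_task_%d" % i,
        bash_command="sleep 300",  # 5 minutes, as described above
        dag=dag,
    )
```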

Given the resources and development time the Kubernetes executor would require to fulfill our needs, we decided to cut our losses short and try a different strategy.

Using a Native Kubernetes Scheduler

We found ourselves gravitating towards a solution that many other companies have independently arrived at: use Airflow for the ETL tasks it was designed for, and use a separate scheduler for larger, non-ETL processes.

As such, we are in the process of building a proof-of-concept system for our larger ML workloads using Argo, a Kubernetes-native scheduling technology. Because it does not involve complex interactions between a scheduler, executor, and database, Argo has allowed us to prototype large parallel operations quickly and effectively. We have been able to spin up hundreds of simultaneous tasks, and the deployment has been fault-tolerant during cluster upgrades, allowing interrupted tasks to resume seamlessly. The DAG definitions have also been as straightforward as they were in Airflow.
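
As a rough illustration of what those definitions look like, here is a minimal Argo Workflow sketch in Argo's native YAML manifest format; the names, image, market list, and parallelism cap are placeholder values, not our production workflow.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: demand-forecast-
spec:
  entrypoint: fan-out
  parallelism: 100              # cap on simultaneously running pods
  templates:
    - name: fan-out
      steps:
        - - name: forecast      # one pod per item, all run in parallel
            template: forecast-market
            arguments:
              parameters:
                - name: market
                  value: "{{item}}"
            withItems: ["SFO-JFK", "LAX-ORD", "SEA-DEN"]
    - name: forecast-market
      inputs:
        parameters:
          - name: market
      container:
        image: python:3.7       # placeholder image
        command: [python, -c]
        args: ["print('forecasting {{inputs.parameters.market}}')"]
```

The withItems fan-out plays the same role as the per-market tasks in an Airflow DAG, while spec.parallelism caps how many pods run at once.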

We plan to continue productizing this technology to meet our needs, and so far we are very happy with the initial results.

Conclusions

FLYR has decided to switch off Airflow for the tasks that the scheduler is not specialized to run. We are preserving our existing investment in Airflow by keeping our implementation for the ETL tasks the system is suited to, but we are exploring other solutions such as Argo to fulfill our ad-hoc computing requirements.

In fast-paced startup environments, it's often easy to rely on tools that act as quick band-aids rather than on solutions that require research and resources to implement initially but would be better in the long term. At FLYR, we strive to overcome the limitations of legacy systems, and our engineers are encouraged to use the most appropriate tool for each job, rather than being limited to the systems we already have in place.

— — — — — — —

FLYR is always looking for more people to join our team! Take a look at the positions we are currently hiring for here.
