Adobe Experience Platform Insights on Achieving High Scale Using Apache Airflow
Many enterprises are challenged with the need to expand their systems to handle more, and increasingly complex, workflows while delivering the expected high availability and performance. For example, scale is often a limiting factor for companies seeking to capitalize on the benefits of machine learning, where predictive models often require large data sets and the scale-out capabilities of a cluster to train.
At Adobe Experience Platform, our developers continue to explore new ways to achieve high scale with low latency while supporting an ever-growing number of concurrent workflows to ensure the best customer experience. This post will describe how Adobe has accomplished this with the use of Apache Airflow and Kubernetes (K8S) and provide a model to help developers exponentially increase the scale of their workflows to improve performance while giving them the ability to manage latency at high scale.
Our objective: Achieve high scale while reducing latency and processing time
Internally, Adobe provides its platform orchestration service to our development teams to simplify the authoring and management of multi-step workflows (DAGs). This service, which provides RESTful JSON-based APIs for workflow creation (including triggers), had a couple of key requirements:
- Support thousands of concurrently running workflow tasks per client.
- Have bounded scheduling latency within some predefined threshold.
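As an illustration, a workflow-creation request to a service like this might carry a JSON body along the following lines. The field names here are hypothetical; the article does not document the actual API schema.

```python
import json

# Hypothetical payload for a workflow-creation call; field names are
# illustrative only, not Adobe's actual API schema.
workflow_request = {
    "name": "daily-model-training",
    "trigger": {"type": "schedule", "cron": "0 2 * * *"},
    "tasks": [
        {"id": "ingest", "type": "http", "dependsOn": []},
        {"id": "train", "type": "http", "dependsOn": ["ingest"]},
        {"id": "publish", "type": "http", "dependsOn": ["train"]},
    ],
}

# The service would translate a body like this into an Airflow DAG
# behind the scenes, so users never touch Airflow constructs directly.
body = json.dumps(workflow_request)
```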
Our workflow service is built on top of Apache Airflow, which allows us to orchestrate complex computational workflows and data processing pipelines. Apache Airflow, which is based on Python, is abstracted from our end users so they don’t need to understand Airflow’s constructs or its functionalities.
Prior to implementing Apache Airflow for our workflow service, we did some initial benchmarking, which helped us to identify a couple of challenges we would face in trying to use it to achieve a high scale:
- Apache Airflow tasks are resource-intensive
- Workflow scheduling latency increases as the number of workflows grow
In one of our tests, we ran about 90 concurrent tasks on 16 GB, eight-core machines using Apache Airflow's Local Executor mode. We also observed that the Airflow scheduler processes workflow files (DAGs) in a long-running loop.
To calculate the approximate time required by the scheduler to process all workflow files (DAG), we used the following equation:
- ((number of DAGs)/maximum threads) x (scheduler loop time) = total processing time
In our tests, we found that for 3,200 DAGs, 8 maximum threads, and a scheduler loop time of approximately 1.1 seconds, the scheduler takes about 440 seconds (roughly 7.3 minutes) to process all workflow files.
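Plugging the benchmark numbers into that formula, as a quick sanity check using only the values stated above:

```python
def scheduler_pass_time(num_dags, max_threads, loop_time_s):
    """Approximate time for the scheduler to process all DAG files once:
    ((number of DAGs) / maximum threads) x (scheduler loop time)."""
    return (num_dags / max_threads) * loop_time_s

# Values from our benchmark: 3,200 DAGs, 8 threads, ~1.1 s loop time.
total_s = scheduler_pass_time(3200, 8, 1.1)  # 440.0 seconds, ~7.3 minutes
```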
These results clearly illustrated that a single instance of Apache Airflow wouldn’t be capable of meeting our requirements for scale.
Adding support for multiple Apache Airflow clusters
To meet our requirements, we added support for multiple Apache Airflow clusters to our orchestration service. As before, our platform orchestration service provides an abstraction over Apache Airflow; we added sharding logic to partition clients' workflows across multiple Airflow clusters, and we enforced the constraints necessary to keep workflow scheduling latency within our predefined threshold.
In this model, each Apache Airflow setup is an independent cluster with its own metastore, server, scheduler, DAG store, etc. With this design, we are able to scale horizontally, easily adding clusters to scale up as needed. With a single-cluster architecture, we can run hundreds of concurrent tasks. With multi-cluster, we can run thousands.
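A minimal sketch of what such sharding logic can look like, assuming a stable hash of the client ID picks the cluster. The cluster names are placeholders, and our actual partitioning also accounts for cluster load and latency constraints, which this example omits.

```python
import hashlib

# Placeholder cluster names; each entry is an independent Airflow cluster
# with its own metastore, server, scheduler, and DAG store.
AIRFLOW_CLUSTERS = ["airflow-cluster-0", "airflow-cluster-1", "airflow-cluster-2"]

def cluster_for_client(client_id: str) -> str:
    """Deterministically map a client to one Airflow cluster.

    Using a stable hash (rather than Python's randomized hash()) means the
    same client always lands on the same cluster across processes.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    index = int(digest, 16) % len(AIRFLOW_CLUSTERS)
    return AIRFLOW_CLUSTERS[index]
```

Adding a cluster to the list rebalances future assignments, which is one reason real sharding schemes often layer consistent hashing or an explicit client-to-cluster mapping table on top of this idea.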
Migrating our workflows to Kubernetes Executor
When Apache Airflow released version 1.10 with support for the K8S Executor, we decided to explore how we might use it with Airflow to achieve high scale. The K8S Executor creates one K8S worker pod for each DAG task. This one-to-one mapping between worker pods and DAG tasks allows us to scale worker pods horizontally within our workflow orchestration.
Our initial benchmarking results indicated that we could not only run hundreds of concurrent tasks but also achieve better resource utilization through autoscaling on a K8S cluster. Based on these results, we shifted our workload from our existing systems to K8S, migrating our Airflow clusters one by one from Local Executor mode to K8S Executor mode.
To facilitate this migration, we upgraded Apache Airflow from version 1.9 to version 1.10.2 and contributed a few patches to Apache Airflow's K8S Executor:
- AIRFLOW-3516: Improve K8S pod creation rate
- AIRFLOW-3917: Support running the scheduler outside of the K8S cluster
- AIRFLOW-4598: Fix task-retry handling in the K8S Executor
- AIRFLOW-4218: Support passing HTTP args to the K8S Executor
- AIRFLOW-4123: Improve exception handling in the K8S Executor
In addition, migrating our Airflow clusters one at a time meant we had Airflow clusters running on different versions simultaneously. So, we also had to enhance our workflow service orchestration to support heterogeneous Airflow clusters.
Adding support for rescheduling mode
As a workflow service provider, we run a variety of tasks. The majority of these tasks are based on HTTP operators, where a job is submitted to a downstream service and the task must wait for its completion, such as a compute task that submits a Spark job to a compute engine. Once the job is submitted, these tasks keep polling the job status and mark themselves as succeeded or failed based on the result. The lifecycle of these tasks can vary from a few minutes to a few days.
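The submit-then-poll pattern these operators follow can be sketched in plain Python. The `submit` and `get_status` callables are placeholders for the HTTP calls to the downstream service, and the status strings are illustrative:

```python
import time

def run_remote_job(submit, get_status, poll_interval_s=30, sleep=time.sleep):
    """Submit a job to a downstream service, then poll until it finishes.

    `submit` returns a job ID; `get_status` maps that ID to a status string.
    The `sleep` parameter is injectable so the loop can be tested without
    real waiting.
    """
    job_id = submit()
    while True:
        status = get_status(job_id)
        if status in ("SUCCEEDED", "FAILED"):
            return status  # the Airflow task marks itself accordingly
        sleep(poll_interval_s)  # this waiting phase can last minutes to days
```

It is exactly this long waiting phase, where the task holds a scheduling slot while doing almost nothing, that reschedule mode (below) is designed to address.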
With version 1.10.2, Apache Airflow supports a “reschedule” mode for its tasks, which helps reduce the compute resources consumed by long-running tasks that spend most of their time waiting. In reschedule mode, a running task can be preempted and moved into an “up-for-reschedule” state (as opposed to an “up-for-retry” state). This causes the task to release its scheduling slot, which the scheduler can then use to run other tasks. When the reschedule interval expires, the scheduler moves the “up-for-reschedule” task back into the running state. By enhancing our Apache Airflow operators to support “reschedule” mode, we were able to run approximately 800 tasks with only 50 slots.
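Back-of-the-envelope math shows why this works: in reschedule mode, a waiting task occupies a slot only for the duration of each status check, not for its whole lifetime. Assuming, purely for illustration, a 1-second check once per minute (these numbers are ours, not from the production setup):

```python
def avg_slots_needed(num_tasks, poke_duration_s, poke_interval_s):
    """Average slots occupied when waiting tasks release their slot
    between status checks, assuming checks are spread evenly in time."""
    return num_tasks * poke_duration_s / poke_interval_s

# 800 waiting tasks, each holding a slot ~1 s out of every 60 s:
slots = avg_slots_needed(800, 1, 60)  # ~13.3 slots on average
```

Even with bursty check times, the average stays well under the 50 slots mentioned above, whereas without reschedule mode all 800 tasks would hold slots simultaneously.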
Our use cases
At this time, most of the divisions within Adobe that were previously using our platform orchestration services have now been migrated to K8S and are seeing improved performance at a higher scale. We are currently exploring other use cases within Adobe to improve the services we provide our clients through Adobe Experience Platform. Below are a few examples of how we have implemented this solution:
- Data Science Workspace: With the complex and iterative nature of the predictive models used in machine learning and the ever-increasing volumes of data required to ensure accurate results, machine learning implementations require an architecture capable of flexibly and efficiently supporting high scale data flows.
- Query Service: Adobe Experience Platform Query Service was a natural fit for this solution. With Adobe Experience Platform’s innovations, it was clear that a scalable solution would be needed to handle the ever-increasing number of queries received by the system while still providing the best customer experience.
- Maintaining Compliance: Complying with GDPR requirements involves several steps and conditional-logic dependencies, which together result in thousands of requests to an Adobe Experience Platform site at any given time. Handling these requests requires a scalable solution that supports not only the load but also the conditional logic, which adds a significant level of complexity to the workflow.
A few remaining challenges
Our developers continue to work on this model. Keeping scheduling latency (the gap between when a task is scheduled to run and when it actually executes) within an acceptable threshold remains a challenge.
Another challenge is that while the platform orchestration service uses abstraction to shield end users from Apache Airflow, that abstraction can also make it difficult for end users to identify and isolate problems when a workflow fails, because the logs become intermixed. We are continuing to explore how we might give users a way to do this kind of isolation. In the meantime, our platform orchestration service provides a pre-constructed logs endpoint that helps users fetch only their own task's logs.
And of course, there is always room for more optimization. K8S is a distributed system, and every running workflow task sends heartbeats. While you can have thousands of tasks running at one time, each of those tasks opens a MySQL connection, which can cause scale problems on the MySQL side, such as connections getting dropped or MySQL crashing. So, we are evaluating the relative benefits that relational database management systems such as MySQL and PostgreSQL may provide to better manage task heartbeats.
Added benefits for the developer community
We believe in giving back to the open source community. As a result of our work with Apache Airflow and K8S, we identified and reported back to the Airflow community five new bugs and contributed eight improvements. We also found a number of scale-related issues with Azure's infrastructure, including four new bugs, and filed 19 feature requests, which were fixed or implemented after we communicated them to Microsoft. While Adobe's goals for this work are driven by a desire to better serve our customers, our efforts to achieve high scale support continuous improvement for the benefit of all developers using open source tools to meet increasingly complex computing needs.
Prior to adding K8S to our workflow service, we started with a scale of approximately 80 concurrent tasks. Now, with these changes, we can run thousands — and with better control over scheduling latency.
To date, we have integrated this model with four Adobe data services, which combined, have more than 3,750 workflows in production, more than 1,100 of which are recurring workflows, running more than 10,000 executions per week. Using our platform orchestration service in this way provides a better experience for our clients, allowing us to work at a high scale to do more in less time with fewer resources.
Follow the Adobe Tech Blog for more developer stories and resources, and check out Adobe Developers on Twitter for the latest news and developer products. Sign up here for future Adobe Experience Platform Meetups.