Meet MaaT: Alibaba’s DAG-based Distributed Task Scheduler
Learn how Alibaba is ensuring cross-platform efficiency with a new kind of enabling platform
This article is part of the Search AIOps mini series.
As an elaborate ecosystem of services and platforms, the Alibaba Group’s network architecture presents operation challenges uniquely its own. In such an environment, performing specific functions asynchronously and in keeping with specified procedures requires deploying multiple cooperative subsystems capable of operating in an extremely deep network structure. For example, the go-live procedure of an application needs to call the configuration synchronization module, monitor module, resource update module, smoke module, and engine creation module successively, which in turn necessitates branch judgment, context transfer, and failure retry throughout the procedure.
To meet these needs, the Alibaba Group has recently invested in the development of enabling platforms designed to support complex services where a single system will not suffice. Specifically, Alibaba has turned to its own MaaT as a tool for centrally managing numerous procedure-oriented tasks while maintaining each task node in its own container. This enables tasks to run in a distributed manner, ensuring that procedures run with steady efficiency.
Today we look more closely at the architecture of MaaT and its core related components to provide a general introduction to its principles and mechanism.
What is MaaT?
MaaT is a procedure dispatching system based on the open source project Airflow, which allows users to customize flow nodes. With it, procedures can be set to go off at a user-specified time (supporting crontab format) or triggered by the user manually.
In MaaT, all nodes run on Hippo in a distributed manner and are dispatched by Drogo. MaaT allows users to create their own dispatch nodes and execution nodes in order to achieve resource isolation and to configure the operating environment or the number of copies of their own execution nodes.
The following figure shows an example of a dispatching procedure for an individual task:
Why Deploy MaaT?
Project development often presents a need for procedure-oriented and timed dispatches, such as go-live procedures or timed analysis of tasks. However, attempts to develop a procedure dispatching system and to access the group’s workflow inevitably present a number of problems.
First, the service code and dispatch code are heavily coupled. Modifying the procedure requires intrusion into the code level, where the pub of the service code impacts dispatches. Second, it is difficult to manage and trace these dispatching tasks without a unified management system. Lastly, complex procedures with multiple branches and context transfer scenarios are not well supported, and the lack of visual UI makes it unfriendly to users.
Technical Model Selection
The dispatch of timed tasks and procedure tasks is a common requirement in products from within the Alibaba group such as D2 and workflow, as well as open source products like Airflow and Quartz.
D2 is a set of procedure dispatching systems based on ODPS that carries the dispatching tasks generated by the group on the basis of ODPS data output. It allows users to customize scripts, timed task triggering, and manual triggering (ways of supplementing operation), and is suitable for task procedure dispatches based on data states (such as executing task procedures based on data output). It is specially maintained by Alibaba’s D2 team and features a well-designed UI.
Nevertheless, D2 has a number of shortcomings. DAG (directed acyclic graph) dispatch takes place on a broad scale, and each node and topological relationship that needs to run each day is calculated based on the previous day’s global topological relationships. Thus, in theory, the newly created and modified task can only take effect on the next day and would need to be done in a compensation-operation manner in order to take effect immediately. There are often changes to tasks (such as task configuration or dispatching time) in the services, or scenarios involving manual triggering of dispatches (such as the pub procedure or full back up procedure, which may occur at any time). Using D2 makes services inflexible, and under the circumstances exceeds the applicable scenario range for D2. Further, D2 does not support the transfer of the procedure context, while context transfer is relatively robust in relevant services. Often, the previous node outputs a certain value, and the next node needs to use it.
Lastly, D2 lacks support for the search ecosystem. The entire underlying architecture of the Search Technology department has its own set of ecosystems, such as dispatch (comprised of Hippo and Drago) and alarm (consisting in Kmon). With D2, users cannot fully enjoy the benefits of the search technology ecosystem, while it meanwhile causes problems for subsequent unitized deployment.
Group workflow is a general dispatching engine for the group approval procedure. The approval procedure for many products is based on group workflows, and it can also be used as a simple task dispatching procedure. Alibaba has also used group workflow as a dispatching engine for procedure tasks prior to the deployment of MaaT. It allows manual triggering, supports calling external systems in HSF mode, and sustains context transfer. However, it involves complicated configuration and has limited support for calling external systems.
Quartz is a Java-based open source dispatching framework. It supports distributed dispatches, task persistence, and timed tasks, but does not support procedure dispatches. Further, task configuration needs to be coupled in the dispatch system, while the hot loading of tasks needs modification.
Airflow, an open source project, is a distributed procedure dispatching system presenting numerous advantages. With it, the service code is decoupled from the dispatching system, and the procedure code for each service is described by a separate Python script, which defines procedure-oriented nodes to execute the service logic and to support the hot loading of tasks. With this, the crontab timed task format is fully supported and can be used to specify when a task is done. It supports complex branch conditions and sets a trigger time for each node, such as executing when the parent nodes all succeed, or when any parent node succeeds. It also offers a complete UI that visualizes the status and history of all tasks, and relies solely on DB and rabbitmq with fewer external dependencies, making it easier to build.
Some questions have been raised about the comparison between Luigi and Airflow. Both are based on a task dispatching system of pipline with similar functions. As a latecomer, Airflow surpasses early starters and competitors.
The following is a comparison of similar products:
After a period of research, Alibaba chose Airflow as the prototype from which to develop a distributed task dispatching system. With comprehensive functions, it meets basic service needs and can expand functions with comparative ease. Airflow is less externally dependent, making it easier to connect with the search ecosystem.
Issues with native Airflow
Airflow can solve many of the problems present in procedure dispatch, but the direct implementation of native Airflow for production nevertheless presents a number of problems.
Native Airflow supports distributed dispatches, but it cannot be deployed directly on Drogo due to its dependency on the local state. Its lack of a proper means of monitoring it requires combining it with Kmon to improve monitoring and alarm facilities, and without a user-friendly approach to editing, users need to know more about the principles and details of Airflow to operate it successfully. When a large number of tasks are running, the performance of the dispatch drops dramatically. Lastly, native Airflow presents some bugs in distributed mode.
The following image provides a high-level view of MaaT’s architecture:
Applications can be created with MaaT based on any procedure dispatch or timed triggering requirements. MaaT provides visual editing pages and abundant APIs. Users can easily create editing procedure templates and set up complex branch logic, while MaaT will determine the flow path for the procedure according to the state of the runtime during dispatches.
Application scenarios currently connected to MaaT include Tisplus, Hawkeye, Kmon, capacity platform, offline component platform, and Opensearch.
Management in native Airflow is relatively simple, as it is based on the Python script dispatch that describes the task procedure DAG. Thus, users need to learn about Airflow principles to create, update, and execute tasks. Maintenance can only be done based on files, which is quite inconvenient. For this reason, MaaT incorporates a management system layer in its outer layer — MaaT Console — to reduce the costs of operation, maintenance, and user learning.
The following figure shows the operation interface for MaaT’s management system Aflow:
In task procedure dispatch scenarios, it is common for procedures for executing different tasks to be more or less the same, with only individual parameters differing. Therefore, MaaT introduces a task procedure based on template management. The user defines a running template for the procedure in the template management layer and uses variables to represent the undetermined portion. When a specific task is generated, it is rendered by specific variables and templates. When a template is modified, it can be validated for all tasks that depend on it.
Template management presets several task nodes, and users can freely choose different task nodes to assemble the template procedure.
Application management is used to manage all specific procedure dispatching tasks, including the templates used by tasks, the values of variables, alarm information, and the task-triggered crontab. After creating the application through the template, the user can continue to maintain the running of the tasks through application management.
Tasks running on MaaT belong to different applications, and the operating environments for different applications vary widely. Moreover, different applications will ideally achieve cluster isolation. To do this, MaaT provides management of the queue. As a result, task nodes for the specified queue are dispatched to machines of the corresponding queue, which will only run task nodes of the specified queue.
In addition, the queue can also specify concurrency, indicating how many tasks are running at the same time on the current queue to ensure that tasks running simultaneously on the machine will not cause excessive load and that tasks beyond concurrency will be suspended until the resources are released.
The MaaT core module completes the entire procedure of task scheduling. Each node of the core module runs independently on the machine, without mutual dependence at startup. The data state is saved via DB and the messages are distributed through MQ or FaaS.
Web API service
Web API service provides a wealth of API with external interaction, including task addition, deletion and modification, historical task status, task status modification, task triggering, and task retry.
Further, it also completes the web display function provided by native Airflow.
The scheduler plays a key role in MaaT, determining when a task is dispatched to run and which nodes can be executed when running a task. The node determined for execution is sent to the worker by the scheduler via MQ or FaaS.
As the number of tasks increases, the load for a single scheduler gradually becomes too high and causes the dispatching period to extend. To alleviate pressure on the scheduler, MaaT splits the scheduler according to services. Tasks for different services are dispatched by an independent scheduler and sent to the designated worker.
Native Airflow has a low dispatching logic throughput. When the number of tasks increases, the dispatch period will become quite long, and the scheduler faced with a high number of tasks will delay by about one minute. Referring to the latest implementation, we have optimized the native dispatch logic, split the previously blocked dispatching method into multiple process pools, and completed the production > commit > polling operations of the executable tasks asynchronously. Following pressure testing, the original dispatch period of 30 to 40 seconds was reduced to about 5 seconds.
Worker is a role that executes tasks. It accepts the task issued by the scheduler and executes the specific tasks described in the node onto the worker. The worker is usually deployed with copies, and the task is on any equivalent worker machine. When worker resources are insufficient, the task can be dynamically expanded.
Differing basic environments are required by the diverse range of queue tasks, such as Python, Java, Hadoop, and zk, and there are differences in the configuration of the startup parameters for the worker roles in different queues. Therefore, when the workers of different queues start, they will be deployed and installed according to the resources described in the configuration.
After the task is completed on the worker, DB will be written back. The scheduler will continue to dispatch the task after detecting changes in the current task status.
The task dispatch layer sends tasks that the scheduler must schedule to a specified worker. MaaT uses both native Celery+Rabbitmq and search ecology-based FaaS to dispatch tasks.
Celery + RabbitMQ
Native Airflow uses Celery + RabbitMQ to dispatch messages from scheduler to worker.
Scheduler sends to-run tasks to MQ, which includes the corresponding queue information for the tasks. When the worker gets information from MQ, only corresponding queue tasks are obtained and pulled to the corresponding workers for execution. MQ is realized using RabbitMQ in MaaT. Similar to other roles, MQ is also independently deployed.
The Celery + Rabbitmq model continues tasks and task status in the MQ. The performance of memory queues can satisfy demand in most scenarios. However, MaaT deployment is based on two-layer dispatcher Drogo and requires all deployment nodes to be stateless. MQ fails this requirement, as it stores message status. This is why we have chosen the search ecology-based FaaS frame to replace Celery + RabbitMQ.
FaaS (Function as a Service) is a ServerLess frame realized on the basis of the search ecology, with MaaT as its executor.
All MaaT tasks are abstracted into functions. When a task is executed, MaaT calls the corresponding function; after the task is completed, MaaT returns to the task status, at which point initial linkage between MaaT and FaaS is complete. In the future, further optimization is expected based on FaaS. For instance, diversified ways of executing tasks can transform lightweight tasks into functions, and key tasks into services; dynamic adjustment of task resources even allows for resource distribution to tasks currently being executed, and for the immediate release of such resources once execution completes.
For MaaT, FaaS supports task distribution from producers to consumers, building messages in MQ, and providing task status interfaces. Additionally, FaaS itself ensures that messages are not lost, and is able to automatically expand and contract according to consumer load.
DB, which uses group IDB, is responsible for the longevity of MaaT information, including task information, task running records, task running status, node running records, node running status, and more.
OSS is a component that manages the risk of machine migration in launching Drogo, as no logs can be stored locally. The running logs of all nodes are stored on OSS, and viewing logs requires accessing OSS.
Kmon is responsible for monitoring the running status for clusters and sounding alarms upon task failure.
Lastly, Drogo completes dockerized deployment of all MaaT nodes.
Advantages of the MaaT Platform
Visual edit and universal node types
MaaT offers Aflow, a management platform that enables users to conveniently edit flow nodes and manage all templates and tasks, as referenced previously in the Management Layer section of this article.
In addition, MaaT also provides diversified types of universal nodes. Native Airflow supports a variety of nodes that can execute different types of tasks. In interactions with users, MaaT can seal nodes according to users’ habits and demands and also develop new node types to meet most users’ needs, as follows below.
Bash nodes are nodes that directly execute basic bash operations on workers. Bash nodes usually rely on other resources and are thus used infrequently.
Http nodes support http calls. In scheduling, http requests are sent to trigger other systems. These nodes also provide polled http interfaces. When triggered, they poll whether other systems have run successfully, continuing to run only if they have done so.
Bash nodes with resources, different from ordinary bash nodes, come with some resources (such as jar packages, bash scripts, and Python scripts). Running these nodes first downloads resources locally and then executes bash scripts.
Branch nodes decide which branches’ divided nodes go through based on the previous node running results or initially transferred parameters.
MaaT service provides multiple roles, all requesting different running environments. Maintaining these environments can become a nightmare for operating staff, in which case launching Hippo becomes the best choice for MaaT operations. As a management platform based on two-layer dispatch service, Drogo makes it possible to deploy all MaaT nodes to Hippo.
Specifically, Drogo-based deployment has the following advantages.
Drogo supports adding new nodes at a low cost. Before Drogo, resources needed to be prepared for running each time new nodes were demanded, and deployment scripts also needed to be written. Moreover, the operation teams themselves have to maintain the scripts for each node. After launching Drogo, all such deployment information could be stored on Drogo. As new nodes come in, copying and tweaking similar information from previous deployments is enough to get the job done.
Drogo offers easy expansion. Before Drogo, expansion of high-level tasks required preparing machines, setting up the environment, and debugging running parameters, which altogether could take a half to a whole day. After Drogo was launched, adjusting the number of copies was enough to automatically expand those tasks.
Drogo also effectively prevents service disturbances caused by machine migration. Before launching Drogo, expansion had to be done dependently on another machine once the original machine became faulty. For some nodes that run just one machine, operators could do little more than pray their machines would hold out. By contrast, Drogo automatically assigns a new machine to take over service when a machine migrates. For services that can be disturbed, there is no concern that services might become unavailable after the machine fails, even in single-node deployment scenarios.
The following figure shows MaaT roles currently deployed on Drogo:
Since some nodes in native Airflow are stateful and rely on part of local files, machine migration will make these nodes stateless. MaaT was thus tweaked to ensure machine migration would not lead to this result as follows.
Whereas previous scheduling relied on local Python DAG files that would be lost due to machine migration, the change was made to store all DAG files in DB with the saved information scheduling, ensuring that DAG information would not be lost after machine migration.
Due to their reliance on local files, both the web service and the scheduler read and write the same DAG files. As a result, native Airflow’s roles scheduler and web service must be bound together to run. Scheduling with DB information instead allows the web service and scheduler to be deployed separately.
Lastly, since the original log files are stored locally, machine migration can result in their loss. After reconstruction, log files were stored in the far end of OSS, where they will be read each time.
Management by cluster
To isolate different tasks, MaaT expands on the cluster management functions of Airflow’s native queue management for various tasks. Some types of tasks can create their own schedulers and workers, and when creating an application, a specified scheduler can be used to schedule or run on a specified worker. (If none is specified, the default scheduler and worker will be responsible for scheduling.)
Cluster configuration parameters include worker deployment configuration, the number of workers, cluster concurrency, and the scheduler.
Worker deployment configuration pertains to the resources that this worker relies on. When Drogo starts, resources required while tasks are running will be deployed to the worker machine. When the machine migrates, this deployment configuration will be used to redistribute resources.
The number of workers parameter simply controls the number of worker roles.
Cluster concurrency controls the number of concurrencies running in the cluster to prevent overburdening downstream systems due to too many tasks being run.
Currently, each cluster has only one scheduler, though subsequent rebuilds will support multiple scheduler nodes.
Monitors and Alarms
Platform monitor and alarm
To gather a sense of how the platform is running, MaaT reports metrics to Kmon at every critical step of each node. Development staff will get alarms about abnormal metrics in time. Such metrics can also be used to judge load on the current clusters, and optimize nodes with high loads.
For specific tasks, MaaT supports alarms for any abnormal running of task nodes, such as abnormal node running, failure to run tasks on time, and task time-out, among others. Users can set up alarm conditions and alarm receivers on the management platform.
Current State of the Platform
MaaT is a DAG-based task scheduling system that serves internal Alibaba Group services, as well as numerous cloud application scenarios.
For search-enabling platform Tisplus, it schedules online procedures for Tisplus and other tasks that need timed triggering.
For Hawkeye, it schedules analysis tasks on a timed basis.
For Kmon, a search monitoring platform, MaaT supports monitoring and hosting services and alarm handling procedures.
Torch, a search capacity estimation platform, relies on MaaT for management of capacity estimation procedures.
In Bahamut, an offline search platform, MaaT supports release procedures and full back up procedures for offline component platforms.
In Opensearch it supports offline tasks for a number of algorithm scenarios, and in Tpp it recommends procedure scheduling tasks for recommended scenarios.
MaaT online cluster tasks execution
As of August 13, 2018, MaaT schedules an average of over 3000 tasks daily and runs an average of more than 24,000 tasks daily.
As more application scenarios are switched into MaaT, platform capacity will be evaluated further under stricter test conditions.
With more businesses and data onboard, MaaT will also need further enhancement to its user experience and reliability. This should include further combination with Aflow for one-stop creation, configuration, and deployment of clusters on the management platform. It will also need to provide more alarm options as part of a stronger error reporting mechanism. Beyond this, some scheduling defects will arise as tasks increase, requiring further optimization, as well as in-depth cooperation with FaaS to create separate services for various tasks and to reduce resource consumption.