System Outage?! DolphinScheduler High Availability and Failover Mechanism to the Rescue

Published in

ILLUMINATION’S MIRROR

5 min readMay 19, 2023

Written by Ruan Wenjun
High availability is one of the key features of Apache DolphinScheduler. It avoids single-point failures through redundancy, and all components naturally support horizontal scaling. However, redundancy alone is not enough. When a node in the system goes down, there needs to be a failover mechanism that can automatically transfer the workload from the failed node to a new node to achieve high availability.

DolphinScheduler Architecture Overview

Apache DolphinScheduler is a distributed and scalable workflow orchestration and scheduling system. Its core architecture consists of three main components: APIServer, Master, and Worker.

The APIServer receives all user operation requests, the Master handles workflow orchestration and scheduling, and the Worker executes tasks within the workflow. The entire system utilizes service discovery through a registration center and persists metadata in a database.

The lifecycle of a workflow execution is as follows: Created in the APIServer and metadata is persisted in the database. A command to trigger workflow execution is generated manually or scheduled, and written to the database. The Master consumes commands from the database, executes the workflow, and dispatches tasks to the Worker for execution. Once the entire workflow execution is complete, the Master ends the workflow execution.

DolphinScheduler Cluster High Availability

High availability (HA) is a crucial consideration in distributed systems. HA refers to the system’s ability to provide services for a significant amount of time, with only a minimal percentage of time being unavailable due to failures.

To ensure high availability, a design principle in the architecture is to avoid single points of failure through redundancy. A single point of failure occurs when a component of the system has a single instance, and the failure of that instance renders the entire system unavailable.

Apache DolphinScheduler addresses this by employing redundancy. In DolphinScheduler, all components naturally support horizontal scaling.

API-Server High Availability

For the API-Server, since it is a stateless service, ensuring high availability is straightforward. By deploying multiple instances of the API-Server and registering them under the same gateway, they can collectively provide services externally.

Master High Availability

The Master is the core component in DolphinScheduler responsible for workflow processing. Its availability directly impacts the overall system stability.

Unlike the API-Server, the Master actively consumes workflows from the database instead of passively receiving external requests. At any given moment, only one Master can process a workflow. Therefore, when horizontally scaling the Master, additional considerations arise.

A simple solution is to adopt an active-standby approach. Multiple instances of the Master service are deployed, but only one instance is active and handling requests externally. The remaining Master instances remain in standby mode. In the event of the active Master’s failure, a new active Master is elected from the standby instances to take over.

While this approach is simple to implement and effectively resolves the single point of failure issue, it limits the workflow processing throughput of the entire cluster since only one Master can work at a time. In DolphinScheduler, this would impact the scheduling of workflows.

To address this, DolphinScheduler utilizes sharding to evenly distribute the workflow metadata across all Masters. Specifically, the commands generated by workflows are evenly distributed among all Masters based on their IDs, allowing all Masters to work simultaneously without interfering with each other.

Master uses the registration center to monitor the status of other centralized Master’s nodes. However, due to potential inconsistencies in the timing of metadata updates when nodes go online or offline, additional safeguards are implemented using database transactions. This ensures that the notification of metadata changes from the Master to all other Masters is consistent and guarantees that the same Command is processed only once.

Worker High Availability

Worker, as a task execution component in DolphinScheduler, is easily scalable. Its design allows it to passively receive tasks distributed by the Master, without actively fetching tasks from the database. Therefore, after horizontal scaling, the Worker only needs to register with the registration center. The Master will then be able to perceive the metadata changes of the Worker through the registration center.

Failover Implementation in DolphinScheduler

Redundancy alone is not sufficient to ensure system reliability. In the event of a node failure in the system, it is essential to have a failover mechanism that can automatically transfer the ongoing tasks from the failed node to a new node for execution. In DolphinScheduler, all failover operations are handled by the Master.

The Master monitors the health status of all Masters and Workers in the registration center. When a node goes offline, all Masters receive the event notification about the node’s failure and subsequently execute the fault tolerance logic.

The determination of which Master performs the failover operation is achieved through a competition using distributed locks.

During the execution of the failover operation, different fault tolerance strategies are employed depending on the type of component, either Master or Worker. In the case of a Master failover, all surviving Masters compete for the distributed lock to decide which one will perform the failover operation. The Master that acquires the lock will query the database to retrieve the running workflow instances on the failed node and generate failover requests for them. In the case of Worker failover, all Masters check if there are any tasks running on the specific Worker in memory, and if so, they trigger the task failover logic.

A particular scenario occurs when all Masters in the cluster experience failures, leaving no Master available to execute failover logic. However, when the cluster recovers and Masters are restarted, they will perform the failover logic during the Master startup process.

Join the Community

There are many ways to participate and contribute to the DolphinScheduler community, including:

Documents, translation, Q&A, tests, codes, articles, keynote speeches, etc.

We assume the first PR (document, code) to contribute to be simple and should be used to familiarize yourself with the submission process and community collaboration style.

So the community has compiled the following list of issues suitable for novices: https://github.com/apache/dolphinscheduler/contribute

List of non-newbie issues:

https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22+

How to contribute:

Https://github.com/apache/dolphinscheduler/blob/8944fdc62295883b0fa46b137ba8aee4fde9711a/docs/docs/en/contribute/join/contribute.md

GitHub Code Repository: https://github.com/apache/dolphinscheduler

Official Website:https://dolphinscheduler.apache.org/

Mail List:dev@dolphinscheduler@apache.org

Twitter:@DolphinSchedule

YouTube:https://www.youtube.com/@apachedolphinscheduler

Slack:https://s.apache.org/dolphinscheduler-slack

Contributor Guide:https://dolphinscheduler.apache.org/en-us/community/index.html

Your Star for the project is essential, don’t hesitate to lighten a Star for Apache DolphinScheduler ❤️