How Does China Unicom Upgrade and Adapt Apache DolphinScheduler to Handle 70,000+ Tasks and 80+ Worker Nodes?

Oct 20, 2022

Hi everyone, I am Wu Liu from China Unicom. I am honored to be here at ApacheCon Asia 2022. The topic I am sharing today is the application and practice of Apache DolphinScheduler at China Unicom. My presentation is divided into five parts:

Background
Enhancements in ease of use
Lightweight requirements
3rd-party system integration
Introduction to the features of the Community Version 3.0

1 Background

We chose Apache DolphinScheduler for the following reasons. The company currently runs many big data scheduling tasks and data warehouse computing tasks, such as Spark, Flink, and Shell tasks, and had been using commercial scheduling tools for them. Given the volume of these big data tasks and our goal of reducing costs, we decided to replace the commercial scheduling tool with Apache DolphinScheduler, which we selected for its high availability, scalability, and stability.

Deploying the new Apache DolphinScheduler scheduling tool on the existing system posed a number of challenges.

The first was to unite existing scheduling servers and evaluate existing task deployments; the second was to evaluate how easily existing tasks could be migrated; the third was to do some re-development on top of Apache DolphinScheduler to stay compatible with features of the commercial scheduling tool; and the last was to keep current production tasks in step with the continuous iteration of Apache DolphinScheduler versions.

We launched the first version in 2020, which handles more than 10,000 workflows and 70,000+ tasks across more than 80 worker nodes. We carried out re-development on system permissions, node enhancement, and worker grouping, and we maintain these differences while continuously upgrading in step with the latest community releases.

2 Enhancements In Ease Of Use

In the statistics section of the system home page, we have improved the global figures: the total number of tasks, the total number of workflows, the number of abnormal workflows, the total number of task nodes, and the number of abnormal task nodes. We also provide real-time statistics for tasks run within the last 24 hours.

In the statistical analysis module, the top 10 projects by task node instances are shown first, and each project shows the percentage of successful and failed runs, making it easy for operations staff to monitor. On the right-hand side, the T+1 workflow run time distribution is shown.

In the real-time statistics section, we provide real-time statistics and presentation of task run data for each worker node. The framework uses StatsD + InfluxDB + Grafana to enhance the tracking of running task instances on both master and worker nodes. Specifically, the StatsD statistics interface is integrated into the master and worker code, the services send their statistics to InfluxDB at runtime, Grafana reads the InfluxDB data, and the system UI finally fetches the Grafana data through an API to display it.
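As a rough illustration of the first hop in that pipeline, here is a minimal sketch of how a service could emit metrics in the plain-text StatsD line protocol over UDP. The metric names, the host, and the helper functions are hypothetical and are not taken from our code; only the `name:value|type` line format and the default StatsD UDP port 8125 are standard.

```python
import socket

# Metric types from the StatsD line protocol: counter, gauge, timer.
SUPPORTED_TYPES = {"c", "g", "ms"}


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric as a StatsD line, e.g. 'tasks.running:12|g'."""
    if metric_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; no reply is expected from the StatsD daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names for a single worker node.
running = statsd_line("worker.node01.running_tasks", 12, "g")
finished = statsd_line("worker.node01.task_finished", 1, "c")
```

Calling `send_metric(running)` would push the gauge to a StatsD daemon on the local machine. Because the transport is UDP, a missing or slow metrics backend does not block the master or worker that reports the metric.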

With the real-time master and worker statistics, we can easily understand the task running load and dynamically adjust the worker groups. The real-time statistics framework can also be used to further monitor master and worker memory, CPU, and network IO, making it easy for users to select servers for specific tasks.

We have also developed an online and offline service management module. With more than 80 servers online, managing the services manually is difficult, so we built a single service management page for all master and worker services.

For master management, we use a status update mechanism, which includes the following operations:

Offline, which loads the corresponding offline node address in ZooKeeper
Refresh, which checks the number of tasks and machine status
Restart, which means restarting the service
Rollback, where the service is rolled back to a previous version if there is a problem

The most important one is the offline operation. It gets the list of online servers from ZooKeeper, marks the service to be taken offline, compares the list of online servers with the list of servers marked offline, and then performs subsequent operations on the offline servers, such as a restart or a service upgrade.

When the user clicks update, the service is marked offline and the number of tasks still running on it is checked; the update is triggered once that number reaches 0. The service status change and the current version are also recorded. If the automatic service update fails, the service can be restarted or rolled back manually.
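The comparison step in the offline process can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the `ServerState` class and `plan_offline` function are hypothetical names, and in practice the online list would come from ZooKeeper rather than a Python list.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    address: str        # e.g. "10.0.0.1:5678" (hypothetical)
    running_tasks: int  # tasks still executing on this server


def plan_offline(online: list[ServerState], marked_offline: set[str]) -> dict[str, list[str]]:
    """Compare online servers with those marked offline.

    Servers marked offline with zero running tasks are ready for the
    follow-up operation (restart, upgrade); the rest are still draining.
    """
    ready, draining = [], []
    for server in online:
        if server.address not in marked_offline:
            continue  # not marked, keeps serving normally
        if server.running_tasks == 0:
            ready.append(server.address)
        else:
            draining.append(server.address)
    return {"ready": ready, "draining": draining}


online_servers = [
    ServerState("10.0.0.1:5678", 0),
    ServerState("10.0.0.2:5678", 3),
    ServerState("10.0.0.3:5678", 0),
]
plan = plan_offline(online_servers, {"10.0.0.1:5678", "10.0.0.2:5678"})
```

In this example the first server can be updated right away, the second has to wait until its three tasks finish, and the third is untouched because it was never marked offline.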

Worker management is similar to the master management module. It supports differentiation when upgrading worker versions and allows cross-version release of worker services.

Resource pool management records the version, the time, and the location of the upgrade packages in the database before each upgrade, and distributes the resources to all our servers via the Netty service. To catch any anomalies that occur during the upgrade, we also provide an anomaly detection function. It scans the database for the processing status of the corresponding upgrade process, and the scan can be triggered either automatically or manually.

3 Lightweight Requirements

01 Background of the Lightweight Requirements

As many of our internal systems need scheduling, we previously used lightweight scheduling components such as XXL-job to implement business scheduling logic.

We wanted to find out whether we could merge XXL-job’s lightweight framework with Apache DolphinScheduler’s framework, i.e. integrate and unify the two.

The framework of XXL-job is outlined below.

02 Lightweight Requirements Framework Merging

We started the transformation by having the XXL-job executor module register itself in the Apache DolphinScheduler database table at startup and communicate with the master via Netty. At the same time, it had to stay compatible with the protocols the XXL-job service uses to respond to requests. When a service registers, the master monitor needs to distinguish between an XXL-job service registration and one of its own worker services.

The master distributes XXL-job tasks in the same way as normal tasks, and the API layer can also read XXL-job task logs so that feedback is unified in Apache DolphinScheduler.

4 3rd-Party System Integration

The need for 3rd-party integration is similar to framework integration. We currently have a number of systems, such as data quality and data collection, that each have their own internal scheduling. To take advantage of the high availability of Apache DolphinScheduler, we specify uniform parameter standards and a uniform format for returned results. With these in place, we can support the definition of workflow nodes for third-party extensions and implement third-party system integration components in the master.
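To make the idea of a uniform parameter standard and result format more concrete, here is a minimal sketch of what such a contract could look like when exchanged as JSON. All field names, the status values, and the helper functions are assumptions made for illustration; they are not the actual format used in our integration.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical set of statuses a third-party system may report back.
ALLOWED_STATUSES = {"SUCCESS", "FAILURE", "RUNNING"}


@dataclass
class ThirdPartyRequest:
    system: str                                  # e.g. "data-quality"
    task_code: str                               # identifier of the workflow node
    params: dict = field(default_factory=dict)   # system-specific parameters


@dataclass
class ThirdPartyResult:
    task_code: str
    status: str
    message: str = ""


def to_json(request: ThirdPartyRequest) -> str:
    """Serialize a request in the agreed uniform format."""
    return json.dumps(asdict(request), sort_keys=True)


def parse_result(payload: str) -> ThirdPartyResult:
    """Parse and validate a result returned by a third-party system."""
    data = json.loads(payload)
    result = ThirdPartyResult(
        task_code=data["task_code"],
        status=data["status"],
        message=data.get("message", ""),
    )
    if result.status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {result.status}")
    return result


request = ThirdPartyRequest("data-quality", "task-1", {"table": "orders"})
result = parse_result('{"task_code": "task-1", "status": "SUCCESS"}')
```

Validating the returned status in one place means the master only has to understand a single result shape, no matter which external system produced it.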

During registration, the master transmits the master address and the ZooKeeper address to the 3rd-party system, which receives these parameters and writes the scheduling-related parameters to the configuration hub.

On the dedicated process configuration page, you can select the data quality system to be linked, fill in the corresponding parameters, and save them. More systems will be added later, including BI systems and metadata management systems.

5 Introduction To The Features Of Community Version 3.0

Version 3.0 has a number of improvements and enhancements compared to previous versions, including a reworked UI that is more responsive and stable, as well as:

AWS Support
Service splitting
Native support for data quality
Task groups, with the ability to control task concurrency and specify priorities within groups
Customized time zones
Task definition lists
New alert-type support
New Python API features
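To illustrate the task group item in the list above, here is a small conceptual sketch of limiting concurrency within a group and starting waiting tasks by priority. It is not DolphinScheduler's implementation: the `TaskGroup` class is hypothetical, and the sketch assumes that a larger number means a higher priority.

```python
import heapq
from itertools import count


class TaskGroup:
    """Conceptual model: at most `capacity` tasks of a group run at once."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.running = set()
        self._waiting = []     # heap of (-priority, sequence, task)
        self._seq = count()    # tie-breaker: first submitted, first started

    def submit(self, task: str, priority: int = 0) -> bool:
        """Start the task if a slot is free, otherwise queue it.

        Returns True if the task started immediately.
        """
        if len(self.running) < self.capacity:
            self.running.add(task)
            return True
        heapq.heappush(self._waiting, (-priority, next(self._seq), task))
        return False

    def finish(self, task: str):
        """Release the slot and start the highest-priority waiting task.

        Returns the task that was started, or None if nothing was waiting.
        """
        self.running.discard(task)
        if self._waiting and len(self.running) < self.capacity:
            _, _, next_task = heapq.heappop(self._waiting)
            self.running.add(next_task)
            return next_task
        return None


group = TaskGroup(capacity=1)
started_a = group.submit("a")
started_b = group.submit("b", priority=1)
started_c = group.submit("c", priority=5)
after_a = group.finish("a")
```

With a capacity of 1, task "a" starts at once while "b" and "c" wait; when "a" finishes, "c" starts before "b" because it was submitted with the higher priority.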

This version also includes Flink task types, supporting Flink SQL tasks, as well as support for Zeppelin task types, Jupyter task types, and other feature enhancements and optimizations.

Since the latest version supports Jupyter tasks, let me use that feature to explain how to implement an extension to the Jupyter component.

First, add the construction of the Jupyter execution command to the Jupyter plugin by implementing a Jupyter parameter class that is used to splice shell commands. Then validate the parameters in the JupyterTask class and execute the resulting shell command.
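The parameter class and command splicing described above can be sketched as follows. The plugin itself is Java; this Python version only illustrates the idea. It assumes the notebook is executed through the papermill command-line tool, and the class name, field names, and `build_command` helper are hypothetical.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class JupyterParams:
    input_note_path: str
    output_note_path: str
    kernel: str = ""
    parameters: dict = field(default_factory=dict)

    def validate(self) -> None:
        """Parameter check performed before any command is built."""
        if not self.input_note_path or not self.output_note_path:
            raise ValueError("input and output notebook paths are required")


def build_command(params: JupyterParams) -> str:
    """Splice the shell command that runs the notebook via papermill."""
    params.validate()
    args = ["papermill", params.input_note_path, params.output_note_path]
    if params.kernel:
        args += ["-k", params.kernel]          # kernel to execute with
    for name, value in params.parameters.items():
        args += ["-p", name, str(value)]       # notebook parameter
    # shlex.join quotes each argument so values with spaces stay intact.
    return shlex.join(args)


command = build_command(
    JupyterParams("in.ipynb", "out.ipynb", kernel="python3", parameters={"day": "2022-10-20"})
)
```

Quoting every argument before joining matters here, because the parameter values come from user input on the task form and end up in a shell command.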

We have also modified the front-end page to support more execution modes, such as Docker and Kubernetes options, and added support for these execution modes in JupyterTask.

That concludes my introduction to Apache DolphinScheduler 3.0 and everything I have to share today. Thanks for your time.

Join the Community

There are many ways to participate and contribute to the DolphinScheduler community, including:

Documents, translation, Q&A, tests, codes, articles, keynote speeches, etc.

We assume the first PR (document or code) you contribute will be simple; use it to familiarize yourself with the submission process and the community's collaboration style.

So the community has compiled the following list of issues suitable for novices: https://github.com/apache/dolphinscheduler/issues/5689

List of non-newbie issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A%22volunteer+wanted%22

How to contribute:

https://dolphinscheduler.apache.org/en-us/docs/dev/user_doc/contribute/join/contribute.html

GitHub Code Repository: https://github.com/apache/dolphinscheduler

Official Website: https://dolphinscheduler.apache.org/

Mail List: dev@dolphinscheduler.apache.org

Twitter: @DolphinSchedule

YouTube: https://www.youtube.com/channel/UCmrPmeE7dVqo8DYhSLHa0vA

Slack: https://s.apache.org/dolphinscheduler-slack

Contributor Guide: https://dolphinscheduler.apache.org/en-us/community/index.html

Your Star for the project is important; don’t hesitate to light up a Star for Apache DolphinScheduler ❤️