Trino swimming with the DolphinScheduler

Apache DolphinScheduler
CodeX
Published in
5 min readFeb 28, 2023

https://www.youtube.com/watch?v=96ghGpnQ7IQ

Video sections

Hosts

Guests

What is workflow orchestration?

Workflow orchestration refers to the process of coordinating and automating complex sequences of operations known as workflows, consisting of multiple interdependent tasks. This involves designing and defining the workflow, scheduling and executing the tasks, monitoring the progress and outcomes, and handling any errors or exceptions that may arise. In the context of Trino, the tasks are typically the processing of SQL queries on one or more Trino cluster and other related systems to create a data pipeline or similar automation.

Why do we need a workflow orchestration tool for building a data lake?

Building a data lake can involve many complex and interdependent data processing tasks, which can be challenging to manage and scale without a workflow orchestration tool. Sometimes we can consider tools like Trino at the center of the universe, and perhaps it would be easier to schedule SQL queries with a much simpler tool. Most companies, however, require a larger variety of tasks to build a data lake that interoperates on more than just running SQL on Trino. Even if you primarily run Trino SQL scripts to run these jobs, it is better to have an orchestration tool instead of managing all processes manually.

What is Apache DolphinScheduler?

Apache DolphinScheduler is an open-source, distributed workflow scheduling platform designed to manage and execute batch jobs, data pipelines, and ETL processes. DolphinScheduler enables users to create and manage consecutive jobs that run easily, including support for different types of tasks, such as SQL statements, shell scripts, Spark jobs, Kubernetes deployments, and many others. In short, it’s a powerful and user-friendly workflow orchestration platform that enables users to automate and manage their complex data processing tasks.

Read this blog on Trino and Apache DolphinScheduler to find out more.

Does DolphinScheduler have any computing engine or storage layer?

DolphinScheduler is a powerful tool for managing and orchestrating data processing workflows across a range of computing engines and storage systems, but it does not provide its own computing or storage capabilities.

What are the differences from other workflow orchestration systems?

Both Dolphin Scheduler and Airflow are designed to be scalable and highly available to support large-scale distributed environments.

Airflow supports a wide range of third-party integrations, including popular data processing frameworks such as Trino, Spark, and Flink, as well as cloud services such as AWS and Google Cloud. Dolphin Scheduler supports a similar range of data processing frameworks and tools. This makes both platforms suitable for managing diverse data processing tasks.

DolphinScheduler project believes that future data governance belongs to data engineers and consumers alike and should not be centralized to a single team. Product-focused engineering teams should have access to data and be able to orchestrate workflows without the need for extensive coding skills. DolphinScheduler uses a drag-and-drop web UI to create and manages workflows, while also providing programmatic access using tools like Python SDK and Open API.

A positive feature of DolphinScheduler supporting users outside the data team through a UI is that it offers robust security features. This includes authentication, authorization, and data encryption, to ensure that users’ data and workflows are protected.

How does DolphinScheduler deal with failures?

Failure is an inevitable aspect of data workflow orchestration. The merits of many of these orchestration tools come from how well they aid users in responding to failures by monitoring health and notifying users when things go wrong.

Does DolphinScheduler have an alarm mechanism itself?

Apache DolphinScheduler supports user notifications as part of a workflow. This mechanism is designed to help users monitor and manage their workflows more effectively and respond quickly to any issues.

These alerts can be configured to notify users via email, SMS, or other communication channels, and can include details such as the name of the workflow, the name of the failed task, and the error message or stack trace associated with the failure.

In addition to these configurable alerts, DolphinScheduler provides a dashboard for monitoring the status and progress of workflows and tasks. It includes real-time updates and visualizations of workflow performance and status. The dashboard helps users quickly identify any issues or bottlenecks in their workflows and take corrective action as needed.

Demo of the episode: Creating a simple Trino workflow in DolphinScheduler

For this episodes’ demo, we look at creating a workflow consisting of a Trino query execution managed by a workflow in DolphinScheduler.

Run the demo by following the steps listed.

Find out more about DolphinScheduler

📌📌Welcome to fill out this survey to give your feedback on your user experience or just your ideas about Apache DolphinScheduler:)

https://www.surveymonkey.com/r/7CHHWGW

--

--

Apache DolphinScheduler
CodeX
Writer for

A distributed and easy-to-extend visual workflow scheduler system