Is DataOps just about automating data pipeline orchestration? Not exactly!

Apache DolphinScheduler · Published in CodeX · Mar 10, 2023

We often hear people say that DataOps is just about automating data pipeline orchestration — using a workflow tool to execute a directed acyclic graph (DAG). Many enterprises already use orchestration tools such as DolphinScheduler or Airflow and believe that this covers DataOps. While automated orchestration and scheduling are indeed critical elements of DataOps, its significance goes far beyond data pipeline orchestration.

What are orchestration and scheduling?

Many engineers spend most of their time on inefficient manual process changes and troubleshooting, while data scientists manually edit CSV files. Data teams can use automation for scheduling and orchestration to free themselves from the inefficient and monotonous (albeit necessary) parts of data work.

Analytical development and data operations workflows can be viewed as a series of steps represented by a directed acyclic graph (DAG). Each node in the DAG represents a step in the process, such as data cleansing, ETL, or running models. Workflow tools run these steps under automated control; steps can run sequentially, in parallel, or conditionally.
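To make this concrete, here is a minimal sketch of such a DAG written with pydolphinscheduler, the Python API for Apache DolphinScheduler. The workflow and task names are made up, and module paths and parameters (such as the tenant) vary by version, so treat it as a sketch rather than a drop-in example:

```python
# Minimal DAG sketch with pydolphinscheduler (module paths vary by version).
from pydolphinscheduler.core.process_definition import ProcessDefinition
from pydolphinscheduler.tasks.shell import Shell

with ProcessDefinition(name="daily_sales_etl", tenant="tenant_exists") as workflow:
    extract = Shell(name="extract", command="echo 'pull raw sales data'")
    clean = Shell(name="clean", command="echo 'cleanse and validate'")
    transform = Shell(name="transform", command="echo 'build analytics tables'")
    publish = Shell(name="publish", command="echo 'refresh reports and dashboards'")

    # Dependencies define the DAG: extract -> clean -> transform -> publish.
    extract >> clean >> transform >> publish

    # Register the workflow with the DolphinScheduler server.
    workflow.submit()
```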

There are many data pipeline orchestration tools available that can manage processes such as data ingestion, cleansing, ETL, and data publishing. Additionally, some DevOps tools focus on coordinating development activities, such as collaborative development environments. Most enterprises use multiple data platforms, tools, languages, and workflows to deploy data processing and analysis.

DataOps unifies these complex systems, data, and processes into a coherent pipeline, where orchestration and scheduling automatically execute tasks in both the data value pipeline and the business innovation pipeline. In reality, these two main pipelines are composed of countless smaller pipelines shaped by different workflows, roles, and job content, but for simplicity we will discuss an abstract value pipeline and innovation pipeline.

Value and innovation pipeline

Data value pipeline orchestration

The data value pipeline extracts value from data. Data enters the pipeline and goes through a series of stages — acquisition, transformation, processing, analysis, and visualization and reporting. When data leaves the pipeline in a useful analytical form, it creates value for the organization.

In most enterprises, the data value pipeline is not just a DAG — it is a DAG of DAGs. The following diagram shows the various teams within an organization that provide data value to data consumers. Each team uses different tools. The toolchain may include one or more orchestration tools. DataOps orchestrates and schedules these underlying tools through a metadata-based process, i.e., orchestrating a DAG of DAGs.
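As a rough sketch of the idea, a meta-orchestration layer sitting above the per-team tools might look like the following. The team names and trigger functions are hypothetical placeholders; in practice each would invoke the REST API or CLI of that team's own orchestrator:

```python
# Hypothetical "DAG of DAGs": a parent pipeline that triggers each team's own pipeline.
from concurrent.futures import ThreadPoolExecutor


def trigger_ingestion_pipeline():
    # Placeholder: would start the ingestion team's DAG via its orchestrator's API.
    print("ingestion pipeline finished")


def trigger_warehouse_pipeline():
    # Placeholder: would start the warehouse team's ETL DAG.
    print("warehouse pipeline finished")


def trigger_data_science_pipeline():
    # Placeholder: would start the data science team's model-refresh DAG.
    print("data science pipeline finished")


def run_dag_of_dags():
    # Ingestion must finish first; the downstream teams' DAGs can then run in parallel.
    trigger_ingestion_pipeline()
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(trigger_warehouse_pipeline),
                   pool.submit(trigger_data_science_pipeline)]
        for future in futures:
            future.result()


if __name__ == "__main__":
    run_dag_of_dags()
```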

Business innovation pipeline orchestration

Automating only the data operations pipeline is not enough. DataOps also needs to manage the business innovation pipeline, which enhances and expands analytics by implementing new ideas that generate analytical insights; it is essentially the development process and workflow for new analytics. DataOps orchestrates the business innovation pipeline based on the DevOps model of continuous deployment.

As shown in the following diagram, each team in the data organization that conducts analysis and innovation has its own workflow, reflecting its unique structure. This includes self-service users who are both consumers and producers of analytics. Like the data value pipeline, the business innovation pipeline is not just a DAG; it is a DAG of DAGs.
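As a hedged illustration of this continuous-deployment idea, the sketch below shows one possible promotion gate: new analytics code reaches the production scheduler only after its automated tests pass. The test command and deployment step are placeholders, not a prescribed setup:

```python
# Hypothetical continuous-deployment gate for new analytics code.
import subprocess
import sys


def run_tests() -> bool:
    # Run the project's automated test suite (pytest is assumed here).
    result = subprocess.run(["pytest", "tests/"], capture_output=True, text=True)
    print(result.stdout)
    return result.returncode == 0


def deploy_to_production() -> None:
    # Placeholder: in practice this would submit the new workflow definitions
    # to the production scheduler via its API or a deployment tool.
    print("deploying new analytics workflows to production")


if __name__ == "__main__":
    if not run_tests():
        sys.exit("tests failed; the change does not reach production")
    deploy_to_production()
```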

DataOps depends on orchestration and scheduling, but that's not enough!

Automated orchestration and scheduling are a key part of any DataOps implementation, but by themselves they cannot provide the full capabilities of DataOps. For example, data processing can be fully automated, yet without testing and process controls, data and code errors can still propagate into analytics with serious consequences. Data teams that are constantly fighting fires cannot reach maximum productivity. Comprehensive DataOps therefore requires several key methods and processes in addition to automated data processing.
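To show what such a control could look like in practice, here is a minimal sketch of an automated data test that halts a pipeline step before bad records propagate downstream. The column names and example batch are invented for illustration:

```python
# Hypothetical data test: stop the pipeline step if incoming records look broken.
import sys


def validate(records):
    """Return a list of error messages; an empty list means the batch passes."""
    errors = []
    if not records:
        errors.append("batch is empty")
    for i, row in enumerate(records):
        if row.get("order_id") is None:
            errors.append(f"row {i}: missing order_id")
        if row.get("amount", 0) < 0:
            errors.append(f"row {i}: negative amount")
    return errors


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 42.0},
        {"order_id": None, "amount": -5.0},  # caught by the checks above
    ]
    problems = validate(batch)
    if problems:
        # Failing loudly keeps bad data out of downstream analytics.
        sys.exit("data test failed:\n" + "\n".join(problems))
    print("batch passed; continue the pipeline")
```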

DataOps operates on data end to end

When we design for the specific requirements of enterprise data orchestration, we must always keep in mind why DataOps exists: to operate and control data end to end and to maximize its value. Drawing on Gartner®’s summary, let’s look at the key elements of data operations:

  • Process Control — In DataOps, automated testing and statistical process control run at every step of the data pipeline, filtering out data errors that would otherwise disrupt analysis and generate unplanned work that hurts productivity (see the sketch after this list).
  • Change Management — DataOps is concerned with tracking, updating, synchronizing, integrating, and maintaining the code, files, and functional components that drive the data analytics pipeline.
  • Parallel development — DataOps organizes and divides the stages of data development so that team members can work together efficiently without resource conflicts.
  • Virtualize technical environments — DataOps virtualizes technical environments to isolate development from production. Virtualization can allow business innovation to flow more easily through the development process and into production environments quickly. When needed, data analysts can quickly spin up a development environment that includes the required tools, security access, data, and code.
  • Reuse — DataOps enables a reuse model, standardizes widely used functional and analytical components, and simplifies migration between virtual environments.
  • Responsiveness and flexibility — DataOps designs data analysis pipelines to adapt to different runtime situations. This flexibility enables analytics to better respond to an organization’s needs and changing priorities.
  • Rapid change — DataOps architects the technology environment to achieve the shortest possible development cycle times while meeting the requirements of data consumers. DataOps is designed around change: its architecture treats the ability to change data processing dynamically as a core idea rather than a remedial fix made after the fact.
  • Team Alignment — DataOps aligns tasks, roles, and workflows to break down barriers between disparate data and business teams so they can work better together.
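For example, a simple statistical process control check might compare a pipeline metric against control limits derived from its recent history. The numbers below are made up purely for illustration:

```python
# Hypothetical statistical process control check on a daily row count.
import statistics


def within_control_limits(history, today, sigmas=3.0):
    """Flag today's value if it falls outside mean +/- sigmas * stdev of the history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (mean - sigmas * stdev) <= today <= (mean + sigmas * stdev)


if __name__ == "__main__":
    recent_row_counts = [10120, 9980, 10240, 10050, 10110]  # invented history
    todays_row_count = 4300  # a sudden drop like this deserves investigation

    if within_control_limits(recent_row_counts, todays_row_count):
        print("row count looks normal; continue the pipeline")
    else:
        print("row count is outside control limits; alert the team and pause publishing")
```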

Summary

DataOps is not a standalone tool but a set of toolkits and methodologies that serve as an architectural framework, helping users manage the planning, development, testing, deployment, and maintenance of data processing and operations. DataOps should improve the way existing tools are used and increase collaboration efficiency. Its features, processes, and methodologies can be brought together into a comprehensive platform that gives adopting organizations a controlled yet flexible environment for data professionals, providing powerful support for realizing data value and business innovation. That is the real point of DataOps.

📌📌 You are welcome to fill out this survey to give feedback on your user experience, or just to share your ideas about Apache DolphinScheduler :)

https://www.surveymonkey.com/r/7CHHWGW
