Data Orchestration — A Primer
Data scientists and data engineers are responsible for authoring data pipelines and workflows. Historically, individuals wrote cron jobs to orchestrate data, but today data orchestration frameworks allow them to programmatically author, schedule, and monitor data pipelines. Over the past few years, we have seen the emergence of numerous data orchestration frameworks, and we believe data orchestration is a core component of the modern data stack.
Until recently, data teams used cron to schedule data jobs. However, as data teams wrote more cron jobs, their growing number and complexity became hard to manage. First, managing dependencies between jobs was difficult. Second, failure handling and alerting had to be managed by the job itself, so the job or an on-call engineer had to handle retries and upstream failures, a pain. Finally, for retrospection, teams had to manually sift through logs to check how a job performed on a certain day, a time sink. Data orchestration solutions emerged in response to these challenges.
Data orchestration solutions have a few components that are useful to define. The first is an operator, the fundamental unit of abstraction for defining a task. Once an operator is instantiated with specific arguments, it becomes a task. Each task consumes inputs and produces outputs; these data artifacts can be files, services, or in-memory data structures.
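The operator-to-task distinction can be sketched in plain Python. This is a minimal illustration, not any real framework's API; `ShellOperator` and its arguments are hypothetical names.

```python
# Minimal sketch of the operator/task distinction in plain Python.
# ShellOperator is a hypothetical name, not any real framework's API.

class ShellOperator:
    """An operator: a reusable template that defines one kind of work."""

    def __init__(self, task_id, command):
        # Instantiating the operator with concrete arguments makes a task.
        self.task_id = task_id
        self.command = command

    def execute(self, inputs=None):
        # A task consumes inputs and produces an output artifact, which
        # here is just a string (in practice: a file, table, or object).
        return f"artifact-from-{self.task_id}"

# Two distinct tasks built from the same operator, differing only in arguments.
extract = ShellOperator(task_id="extract", command="curl https://example.com/data")
load = ShellOperator(task_id="load", command="psql -c 'COPY ...'")
artifact = extract.execute()
```

The point of the split is reuse: the operator captures the mechanics of a kind of work once, while each task pins down the specific arguments for one step of the pipeline.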
Data orchestration solutions center around the concept of a DAG (directed acyclic graph): a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Each node in the DAG represents a step or task in the process. Typically, a DAG is defined by a Python script, which represents the DAG's structure (tasks and dependencies) as code. A simple DAG could consist of three tasks: 1, 2, and 3. Tasks can be triggered by different mechanisms, such as sequence (one task after another), elapsed time between tasks, or a specific time (e.g. 8 AM). For example, a DAG can say that task 1 has to run successfully before task 2 or 3. The orchestrator has a scheduler that coordinates the different tasks, executors that start tasks, and a metadata store for saving the state of the pipeline.
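The three-task DAG above can be sketched with the standard library alone (a toy scheduler, not a real orchestrator; the task bodies are placeholders):

```python
# Toy DAG runner for the example: task "1" must succeed before "2" and "3".
# The standard library's graphlib handles the topological ordering a
# real scheduler would compute.
from graphlib import TopologicalSorter

def task_1(): return "raw data"
def task_2(): return "metrics"
def task_3(): return "report"

tasks = {"1": task_1, "2": task_2, "3": task_3}
# Dependencies: each key maps to the set of tasks it depends on.
dag = {"1": set(), "2": {"1"}, "3": {"1"}}

# static_order() yields every task only after its dependencies.
run_order = list(TopologicalSorter(dag).static_order())
results = {name: tasks[name]() for name in run_order}
```

Here `run_order` always places "1" before "2" and "3"; a production scheduler does the same ordering but adds retries, state persistence, and parallel executors.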
Data orchestration solutions can power many processes including but not limited to 1) cleansing, organizing, and publishing data into a data warehouse, 2) computing business metrics, 3) applying rules to target and engage users through email campaigns, 4) maintaining data infrastructure like database scrapes, and 5) running a TensorFlow task to train a machine learning model.
The first generation of data orchestration solutions, like Luigi and Airflow, established a standard mechanism for building workflows. Airflow became wildly popular because its user interface and programming model are simple and straightforward. Prior to Airflow, most schedulers used a Domain Specific Language (DSL), requiring workflows to be written in JSON or YAML. Airflow DAGs are written in Python, the main language used by data scientists, democratizing the ability to create DAGs. Users consistently tell us they like Airflow's approachable UI. Airflow is mostly used for ETL and Hadoop jobs.
The first generation of solutions focused on being task-driven, decoupling task management from the task process. They have limited knowledge of the data a task is processing; the logic amounts to "if task X is complete, trigger task Y." We have now entered the second generation of data orchestration solutions, like Prefect, Flyte, and Dagster. These second-generation solutions focus on being data-driven, such that they know the type of data that will be transformed and how it will be manipulated. They have data awareness, can perform tests on data artifacts, and version not just the code but the artifacts as well.
With data-driven orchestration, when dataset X becomes available, it triggers task Y, which generates an artifact that triggers downstream effects. Data-driven approaches can be active or passive. Active entails actively passing data between steps and systems. For example, Flyte automatically understands tasks and the flow of data between them as described using a variety of Flyte DSLs like Flytekit. FlytePropeller then schedules these tasks one after the other and passes data between them as they complete. It also intelligently understands how the data splits across different tasks. This ability to do state transfer is an advancement over first-generation mechanisms like Airflow's XCom.
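The active pattern — the orchestrator itself ferrying each task's output to the next — can be sketched in a few lines of plain Python. This is a hand-rolled illustration in the spirit of Flyte's typed data flow or Airflow's XCom, not either system's actual API:

```python
# Sketch of "active" data passing: the orchestrator stores each task's
# output artifact and hands it to downstream tasks as they complete.
# (Hypothetical toy code, not Flyte or Airflow API.)

def extract():
    return [1, 2, 3, 4]

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    return sum(rows)

# The "orchestrator": a store of named artifacts threaded between steps.
artifacts = {}
artifacts["extract"] = extract()
artifacts["transform"] = transform(artifacts["extract"])
artifacts["load"] = load(artifacts["transform"])
```

Because the orchestrator owns the artifact store, it knows what each task produced; that visibility is what lets data-driven systems validate and version artifacts, not just task states.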
Passive data-driven orchestration means waiting for an event outside the system to occur before triggering a task. Orchestration systems are moving toward this ability by creating an event stream of all task completions, workflow completions, and data arrivals to automatically trigger tasks. We have heard this is particularly useful for continuous model retraining. It is possible to approximate this with Airflow by periodically running a job that polls for data from an upstream job at a known location. However, this is error prone: if the data is delayed past the polling window, the job has to end anyway and is recorded as a failure.
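The failure mode of polling can be made concrete with a toy sensor (hypothetical code; real sensors in Airflow and elsewhere are richer, but the timeout behavior is the same in spirit):

```python
# Toy "sensor": poll for data a fixed number of times, then give up.
# Illustrates why polling is error prone: late data becomes a spurious
# failure even though nothing is actually wrong upstream.

def poll_for_data(data_available, max_attempts=3):
    """Return True if data arrives within the polling window."""
    for attempt in range(max_attempts):
        if data_available(attempt):
            return True
    # The job has to end eventually, so delayed data reads as failure.
    return False

# Data arriving on the second poll succeeds...
on_time = poll_for_data(lambda attempt: attempt >= 1)
# ...but data delayed past the window is marked a failure.
delayed = poll_for_data(lambda attempt: attempt >= 5)
```

A true event-driven trigger inverts this: the arrival of the data itself starts the task, so there is no window to miss.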
While moving to be data-driven is the main step forward for data orchestration solutions, these newer offerings also try to differentiate along a few additional vectors. Some of the solutions focus on ML workflow orchestration. Flyte is a full-fledged ML orchestrator, with an SDK instantiation for ML and declarative definitions for ML artifacts like datasets, features, models, and model parameters. Additionally, most of the newer solutions support parameterization, which lets you apply dynamic values to the data you import and the data you generate as part of job execution; this is helpful for machine learning. Being data-aware also helps support testing within the pipeline: Dagster, for example, allows schemas to be defined for data artifacts, supporting type checking, and integrates with Great Expectations for data validation. We believe these solutions can also provide data lineage as part of the core offering.
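The idea of declaring a schema for an artifact and checking it between tasks can be sketched as follows. This is a hand-rolled illustration of the concept; Dagster's type system and Great Expectations offer far richer versions, and `PredictionRow` and `validate` are hypothetical names:

```python
# Sketch of a data-aware check: an artifact schema plus a validation
# step that fails the pipeline early if the artifact violates it.
# (Hypothetical toy code, not Dagster or Great Expectations API.)
from dataclasses import dataclass

@dataclass
class PredictionRow:
    user_id: int
    score: float  # expected to be a probability in [0, 1]

def validate(rows):
    """Type-check and range-check an artifact before downstream tasks run."""
    for row in rows:
        if not isinstance(row.user_id, int):
            raise TypeError(f"user_id must be int, got {row.user_id!r}")
        if not 0.0 <= row.score <= 1.0:
            raise ValueError(f"score out of range: {row.score}")
    return rows

good = validate([PredictionRow(user_id=1, score=0.87)])
```

Because the orchestrator sees the artifact, a bad batch fails at the validation step with a precise error, instead of silently corrupting every downstream table.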
Interestingly, workflow solutions that come from the CI/CD space, like Tekton and Argo, are starting to be applied to data orchestration. For example, Kubeflow Pipelines leverages Argo, an open-source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo is agnostic to ML definitions or artifacts; it simply lets a user chain together analytics jobs and run them on a schedule. Similar to Flyte, Argo differentiates itself by being Kubernetes-native.
Data orchestration solutions are becoming a core piece of the data stack. We are excited to watch as the ecosystem evolves, with the new players taking a more data-centric approach. If you or someone you know is working on an orchestration startup or adjacent offering, we would love to hear from you. Comment below or email us at email@example.com.