Data Orchestration: A Conductor for the Modern Data Platform

Mike Shwe
44 min read · Sep 22, 2023


Preface: I wrote this article recently, while I was a product manager at Astronomer, to explain why a data-driven organization requires more than workflow management. In order to reliably and consistently increase the availability of trusted data to a broad group of business users, you’ll need data orchestration — read on to see why.

Abstract

Data-driven companies rely on a continuous flow of data across their organizations to guide decisions critical to their business and internal operations. To enable data-driven decision-making, organizations should adopt data orchestration, a set of practices to produce and consume data products at scale. Connecting data assets can be challenging when there’s a proliferation of multi-cloud, multi-vendor data stacks — with regulatory pressures to control access to data in the face of a need to broaden the audience of both data producers and consumers. We discuss the complicated state of data in a modern company as a catalyst for a new type of data platform for data orchestration that allows these disparate technologies to work harmoniously.

At the core of a data orchestration platform lies a workflow management system for dependably handling tasks that move and transform data so that it can ultimately be used to generate artifacts to inform business decisions, including dashboards, reports, and machine-learning (ML) predictions. However, a data-driven organization requires capabilities beyond workflow management, in order to increase the availability of trusted data to a broad base of business users beyond the centralized information technology group.

1 Introduction: Every company is a data company

To be competitive, companies need to use data to drive every facet of their businesses. Numerous industry analysts have declared that “Every company is a data company” [25] [29] [27]. Business leaders need data to inform strategic decision making. Algorithms need quality data for ML in consumer and industrial settings. The ubiquity of online consumer experiences presents a rich set of opportunities for ML algorithms to tailor experiences and advertisements to customers in real time — many of these algorithms having insatiable data needs [17]. Manufacturing and industrial ML applications require a rich flow of data to monitor processes, issue alerts, and schedule maintenance [35] [30]. A data-driven organization realizes benefits that range from improved daily operations [20] [13] [2] to discovery of new business opportunities [38] [7].

While data-driven decisions offer significant opportunities, managing the data to inform these decisions on a continuous basis is a complex undertaking. First, you need to extract data from a diverse array of internal and external systems. After you load the data into data warehouses and analytic engines, you then need to transform the data for it to be useful. These transformations include cleaning data to correct or delete errors, deduplicating redundant information, aggregating the data along similar temporal or geographic dimensions, anonymizing data to protect personally identifiable information, and integrating data from different sources.

To complicate matters, you need to automate the data workflows that perform these operations to run continually and reliably, in the face of dependencies among data processing tasks, unforeseen delays, and errors or changes in source data. You have to provide the proper level of access to analysts and ML algorithms who consume the data, to keep your organization compliant with government regulations and to ensure the data usage doesn’t violate any privacy concerns. Plus, you’re constantly adding data sources, running analytics, and managing costs.

Ultimately, you want to provide data as a product so that it can be shared, discovered, explored, understood, consumed, and secured throughout an organization, transcending organizational boundaries [15]. While we don’t cover the full scope of productizing data in this paper, we see data orchestration as essential to creating data as a product. De Bruin et al [9] describe data orchestration as “the deliberate course of action to produce and consume data products at scale. It is the logistics of data-driven decision-making.”

We suggest data orchestration as a foundation for organizations to become data driven. In the next section, we examine several key challenges organizations face when managing their data. Then, in the Opportunity section we prescribe elements of data orchestration to address these obstacles. Finally, we highlight areas of active research on data orchestration to improve the availability and reliability of data throughout an organization.

As a side note, in this paper we discuss data orchestration for batch jobs instead of for streaming or real-time data pipelines. Batch processes are the common style for building data pipelines at most organizations. Most of the concepts we describe in this paper apply directly to or can be adapted for the case of streaming data.

2 Motivation: A data-driven company faces challenging data requirements

To illustrate the challenges in becoming a data-driven organization, we will use an example based on the experiences of a real financial services company. We’ll refer to this company as WealthServe. WealthServe handles the needs of its private clients through wealth management advisors, who provide investment recommendations to their clients. The wealth management advisors are compensated largely based on the trading volume of their clients. To inform clients about investment opportunities, the advisors read and recommend investment articles to their clients, who in turn may invest a portion of their portfolio under management. WealthServe uses an ML algorithm to help its advisors recommend articles to clients, based on their investment history, investment goals, funds under management, and personal demographics. In essence, this investment recommendation algorithm is similar to the more common product-recommendation scenario in a consumer retail setting. We depict the high-level steps to building and deploying the investment article recommendation algorithm in Figure 1.

With this example data and ML workflow in mind, we’ll now examine several common characteristics of data stacks in companies. As a side note, per common parlance, we’ll use the terms data stack and data architecture interchangeably. Similarly, we’ll use the terms data workflow and data pipeline interchangeably, to refer to a series of data processing steps — often involving ingesting, moving, transforming, storing, and consuming data across various components of an organization’s data architecture (aka data stack).

Figure 1: The major phases of a machine learning pipeline. Note that in a larger organization, distinct teams are typically responsible for different parts of the pipeline, as depicted in this diagram.

2.1 Heterogeneous technical stacks prevail

Data stacks in today’s companies are diverse. Two primary factors contribute to this heterogeneity: the rise of anything as a service and corporate mergers and acquisitions.

The ubiquity of cloud computing provides corporate information technology (IT) with a panoply of options in infrastructure, platform, and software as a service [12]. Part of the appeal of anything as a service (XaaS) via the cloud is that IT departments can get up and running much more quickly on XaaS versus software installed on premise. Moreover, functional departments within a corporation can often go live on software as a service with little to no involvement from IT. With this reduction in friction to implement XaaS and the variety of options available, corporate IT and functional divisions cherry-pick options from multiple cloud offerings [19] [37].

Despite the rise in XaaS, many companies will retain some of their technology as on-premise installations. Common reasons for on-premise applications include legacy applications and security concerns with cloud offerings for certain types of applications. In particular, organizations in highly regulated industries like finance are slower to adopt cloud-based applications due to regulatory requirements.

Corporate mergers and acquisitions (M&A) are another frequent source of heterogeneous computing environments. When two companies merge, it’s likely that there will be differences along several dimensions of their technical stacks, since there are over a dozen major public cloud providers to choose from at present — each with their own version of file storage and data warehouses — plus an abundance of other cloud-based data infrastructure providers. The parties to an M&A transaction will also have their own on-premise infrastructure and applications.

Figure 2: A combined BI and ML workflow for financial investment recommendations. The workflow spans multiple cloud vendors, operational systems, and cloud storage systems. (Not shown are downstream steps for deploying and maintaining the ML model.)

As an example of integrating multiple, heterogeneous systems to serve a business, consider the situation at WealthServe for investment recommendations. We depicted in Figure 1 a simplified architecture, combining a business-intelligence (BI) workflow with an ML workflow. In Figure 2, we dive a level deeper into this workflow. As with many other financial institutions and modern organizations, the heterogeneity in data processing systems arises from a combination of choosing best-of-breed infrastructure, legacy infrastructure, and M&A activity.

We see a multi-cloud workflow, involving cloud storage from multiple vendors (Amazon S3, Microsoft Azure), a data warehouse from yet another vendor (Google BigQuery), model training from Google Vertex AI, analytics via Sigma, notifications via Slack, and finally model storage back into Amazon S3. Bornstein et al [10] provide a more in-depth treatment of BI and ML data architectures employing multiple vendors.

We see a number of dependencies between steps in this data workflow, depicted by the directional arrows. Namely, a step at the head of an arrow can commence only when all of the steps at the tails of its incoming arrows have completed. For example, the transformation step in BigQuery depicted in Figure 2 depends on two upstream steps: (1) formatting and cleansing trade data in Spark and (2) joining and filtering customer data in Snowflake. When we are integrating data from multiple vendors, we should expect to see a rich set of dependencies in data workflows.

2.2 Continuous business processes require continuous integration

Data-driven organizations use data for their ML algorithms and to generate analytics for human decision making. Both scenarios are best served with continually refreshed data. However, building and maintaining continuous data pipelines can be challenging.

In our example workflow from WealthServe, we are transforming data for both analysis and ML. With this automated workflow, we are able to power a live dashboard that can show the trading patterns of different types of customers. We can also manually generate one-off reports from the data in our data warehouse. However, there are significant drawbacks with one-off analysis:

  • Data freshness: Data is constantly changing. As a result, the insights and analysis based on it can quickly become outdated. When we regularly update a dashboard through an automated process, we ensure that the data remains useful and actionable: that decisions are made using the most up-to-date information available.
  • Repeated effort: Manually re-creating a data-driven analysis is not only cumbersome and time-consuming, but it can also be difficult to reproduce the exact same data transformations in successive versions of the analysis, potentially leading to spurious conclusions.

The most valuable data analytics rely on fresh, continuously updated data. Many data processing solutions in place fail to meet the bar for reliable, fresh data. In fact, data analysts report that the data they rely on is often out of date, unavailable, available only intermittently, or incorrect [28]. When there are problems with accessing trusted, fresh data, data analysts will resort to using either stale data or cobbling together the data they need on a one-off basis.

ML models for in-product recommendations, such as recommendations for on-line media or consumer products, often need to be retrained frequently, requiring data and ML pipelines to be run in production on an ongoing basis [16]. Daily retraining of models is fairly common, due to data drift or updates to data. For example, a large on-line media company retrains thousands of models daily to recommend relevant articles and products to readers.

Whether data and model freshness requirements emanate from analytic or operational needs, when there are many systems involved in data and ML pipelines, there are more opportunities for errors in the pipelines. Something will inevitably go wrong. Data from an external input might be missing or incorrect, processes might take longer than expected, or a recent code change could have negative consequences. In addition, over the lifetime of the pipelines, there will be changes to components of the pipelines. For example, data sources may change, and business requirements for outputs such as dashboards will evolve. Whenever any of these components change, there is risk of introducing an error into newly deployed code. Building a continuous data pipeline that connects disparate data sources is a non-trivial task. Because of the unavoidable changes to the data or pipeline that will happen over time, maintaining the pipeline will consume significant effort as well.

2.3 Decision makers need to be able to trust their data

Each participant in the process of producing analyses or ML predictions needs to be able to trust that the data that they are using is correct. There are three features that a data architecture should have to ensure that data workflows are creating correct data:

  • Impact analysis: Proactively, we want a development environment in which we can test changes to a data pipeline before we roll them out to production.
  • Detection: We want to be able to measure elements of data reliability, including data quality (the absence of errors in our data) and data freshness. When our systems detect deficiencies in reliability, we need to be alerted in real time.
  • Troubleshooting: We need tools to determine the root cause of an error. What happened upstream of this data that caused the problem?

If we are able to implement all of these features, we should be able to increase the availability of quality data that both the producers and consumers of data can trust. We can define a metric for availability of quality data:

DataAvailability ∝ 1 / (FrequencyOfIssues ⋅ AverageDurationOfIssues)

Accordingly, we can increase data availability by reducing both variables in the denominator. We can minimize frequency of issues in the first place with our proactive, preventative measure: impact analysis. And, we can reduce the average duration of issues with better detection and troubleshooting tools. In the remainder of this section, we describe each of these three methods in more detail.
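As a rough illustration of how this metric behaves (the incident counts and durations below are hypothetical numbers for illustration, not measurements from the paper), here is a minimal sketch in Python:

```python
def data_availability_score(issues_per_month: float, avg_duration_hours: float) -> float:
    """Relative availability score, proportional to
    1 / (FrequencyOfIssues * AverageDurationOfIssues)."""
    return 1.0 / (issues_per_month * avg_duration_hours)

# Hypothetical numbers: halving either the frequency of issues (impact analysis)
# or their average duration (detection and troubleshooting) doubles the score.
baseline = data_availability_score(issues_per_month=8, avg_duration_hours=6)
fewer_issues = data_availability_score(issues_per_month=4, avg_duration_hours=6)
shorter_issues = data_availability_score(issues_per_month=8, avg_duration_hours=3)
print(baseline, fewer_issues, shorter_issues)
```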

To minimize data errors when they make changes to a data pipeline, data practitioners need to be able to assess the impact of changes they’re about to make — to confirm that they’re not going to break some downstream process. They need insight into what the expected downstream effects will be. Better yet, they need staging environments to empirically verify those expected changes. One of the challenges with testing code deployments across multiple environments is configuring integrations with various elements of the data stack so that each environment is a valid test of the code change, yet short of a deployment to production. For example, in a staging environment, we need to set up connections to schemas in data warehouses that are separate from the production instances yet hold comparable data volumes.

Data consumers want to know that there are consistency checks running on their data continuously to identify not only blatant errors with missing data or incorrect computations, but also more subtle drifts over time. For the WealthServe example, a data drift might occur with how frequently clients of varying net worth are investing in a particular opportunity. After we detect a drift, we’ll need to determine whether the drift is due to a true underlying change in client investment behavior. For example, interest rates might have decreased, allowing clients of lower net worth to be more risk-seeking in their investments. Or, do we have a problem with our data pipelines?
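As a minimal sketch of the kind of continuous consistency check described above (the metric, thresholds, and numbers are illustrative assumptions, not WealthServe’s actual logic), a scheduled task might compare the current week’s investment rate for a net-worth tier against its trailing history:

```python
import statistics

def detect_drift(weekly_rates: list[float], current_rate: float, z_threshold: float = 3.0) -> bool:
    """Flag potential drift when the current weekly investment rate for a
    net-worth tier deviates from its trailing history by more than
    z_threshold standard deviations."""
    mean = statistics.mean(weekly_rates)
    stdev = statistics.stdev(weekly_rates)
    if stdev == 0:
        return current_rate != mean
    return abs(current_rate - mean) / stdev > z_threshold

# Trailing 12 weeks of investment rates for one tier (hypothetical numbers).
history = [0.042, 0.039, 0.041, 0.044, 0.040, 0.043, 0.038, 0.041, 0.042, 0.040, 0.039, 0.043]
if detect_drift(history, current_rate=0.061):
    # Alert DataOps; the drift may reflect a real behavior change (e.g., a rate
    # cut) or a fault in the pipeline, so a human needs to investigate.
    print("Potential drift detected: investigate upstream data and transformations")
```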

Troubleshooting capabilities need to provide a global, consistent view of each process that changes the data. By consistent, we mean that the information about the data, such as when it was first created and its change history, should be the same for all data regardless of which systems have changed it. Hellerstein et al [19] describe this consistent view as politically neutral, to connote that it is agnostic with respect to the vendor of a particular component in the data pipeline. Similarly, Shankar and Parameswaran [34] describe a system for ML pipeline observability that is vendor agnostic, so that users can avoid vendor lock-in.

Trust in data can become even more elusive when we consider that a modern, data-driven organization encourages decentralization of how data is produced and consumed [14]. When data production is decentralized, we need to ensure that all of the data producers are following our organizational best practices for creating quality data and quickly detecting and resolving any data anomalies.

In many cases, organizations implement this type of decentralization to improve the responsiveness of data operations, to move data closer to those who understand the data best, and to avoid the bottlenecks and delays associated with enterprise data warehouses. Zhamak Dehghani coined the term data mesh to capture the move towards decentralization [15]. We discuss the needs and implications of decentralized approaches to producing data in the next section.

2.4 Companies need to provide greater access to data in the face of increased regulation

There is a tension between centralizing data production and consumption in a dedicated IT team and decentralizing this responsibility to the business units best equipped to understand the data. One of the trade-offs here is that in a centralized model, it is more straightforward to comply with data governance policies imposed by industry and governmental standards. In this section, we describe an emerging model for increasing access to data.

In a truly data-driven organization, everyone needs access to data or its derivative products, including reports, dashboards, and predictions. Whereas processing and consumption of data used to be controlled tightly by a central IT authority, more and more organizations are adopting a decentralized data governance approach. Wende [40] discusses the benefits of a decentralized approach, where business units that have a better understanding of the data are better equipped to control access to the data and highlight deficiencies in data and its derivative products such as analyses and reports. Jardim and Carreras describe the hub and spoke model as a compromise between a centralized and decentralized data management architecture [11]. In the hub and spoke model, the central data team owns the standards for the data infrastructure, tooling, and process, allowing business users to extend the core data assets to build their own data products. For WealthServe, these business users include analytic engineers that are part of the wealth management team, in addition to the wealth management advisors themselves.

In other words, organizations need to enable people with varying levels of technical ability within multiple business units to both produce and consume data artifacts used throughout the organization. Lebens et al describe the “Rise of the Citizen Developer” [23], which ameliorates the traditional IT bottleneck by providing low-code and no-code authoring tools to business users outside of the IT organization. The role of any central data team shifts from creating individual data pipelines to platform development: from the centralized production of data assets (Figure 3a) to the development of a centralized platform that many teams use to produce data assets. The hub and spoke model (Figure 3b) is an example of a data management architecture in which business users can create data assets for their teams and others.

Figure 3: (a) In a monolithic, domain-agnostic data architecture, there is a centralized data repository for the entire company, where only technical users create schema and populate data tables. Business users make requests to IT to load new data that they need, often finding the panoply of data tables difficult to navigate. (b) Compare to an exemplar of a more decentralized data-management strategy, the hub and spoke model. As the hub, the central IT team owns the data platform, tools, and processes that business teams (the spokes) use to build their own data products.

The trend to decentralize is potentially at odds with the increased need for data governance: managing the availability, usability, and security of data. Regulations that drive increased governance include SOC 2, GDPR, and BCBS-239. Organizations need to create uniform policies on how data is used, in order to comply with these regulations. In turn, it is incumbent upon their data systems to enforce these data policies on an ongoing basis.

2.5 Non-technical staff require tools to extract value from data

While data engineers, analytic engineers, and some data scientists will be skilled in Python and other traditional programming languages, SQL analysts and data analysts are often comfortable with only SQL. In order to scale the production and consumption of data in an organization, we need to provide tools for these analysts. However, there can be hurdles to overcome with putting the data transformations created by these tools into production.

A typical ratio of data engineers to data scientists and SQL analysts is 1 to 5 [18]. In order to empower this formidable population to write their own data transformations to extract value from raw data, the data engineer needs to provide data scientists and SQL analysts with low-code tools for data transformations [39]. Moreover, data analysts need to be self-empowered to extract and load data that they need. In a recent survey of data analysts [28], 60% of participants report that data engineering resources were often unavailable. When these data engineering bottlenecks occur, a data analyst should be able to continue with their project with the proper tools to access data.

Low-code authoring tools often take the form of UI-driven interfaces or domain-specific languages that abstract away complexities inherent in general-purpose programming languages. Notebook interfaces, including the open-source Jupyter project, are probably the most popular UI for lowering barriers to entry in coding. Over the years, there has been increased demand to put the code from notebooks — designed originally for prototyping code — into production [21]. Sahu [32] and Cardel [24] cite several limitations of notebooks as problematic: limited scalability due to static allocation of memory, and a lack of code versioning, unit testing, dependency management among cells, and continuous integration and continuous deployment (CI/CD). CI/CD is a process in which developer changes are automatically tested, merged into a shared repository, and deployed into a production environment.

Although there are tools such as Papermill [5] to help run notebooks in production environments, there remain shortcomings with these tools such as code versioning and unit testing [36]. Moreover, when something goes wrong with a notebook in a production environment, it can be difficult to track down the source of the problem, especially if the notebook contains a large number of cells or relies on external dependencies. Often, data engineers will take the prototype code from SQL analysts and data scientists and re-write it from the ground up.

2.6 Data consumers need to know which data products to use

Organizations can create data products through a variety of data management architectures, including a hub and spoke model. Consumers of data in those organizations need to know which of the published data products they should be using. How do I find the relevant data products? Are the semantics of the data aligned with my needs? Is the data being regularly refreshed? What is the quality of this data? What are the service-level agreements (SLAs) for data refreshes and data fixes when a problem occurs? Do I have the proper permissions to access this data?

Conversely, producers of data products need to be able to publish the relevant metadata along with the data products to facilitate consumption. They need to be able to permission the data products so that only consumers with the proper roles are able to access them. They also need to be able to track usage of data products so they can notify consumers ahead of time about upcoming changes to the data products, such as schema changes.

2.7 Data engineers need to embrace software development best practices

Test-driven development. Code reviews. CI/CD. Staging and production environments. Modular, reusable code. These are all best practices of software engineering. The benefits include higher-quality, more maintainable code with decreased risk of system errors upon code deployment. Embodying workflows as code unlocks these benefits to the world of data workflows.

Infrastructure as code (IaC) allows engineers in development operations (DevOps) to follow universally adopted practices for software development [31]. In the world of data engineering, the analog to DevOps is data operations (DataOps), which unites software best practices with data operations. Just as IaC facilitates good software practices in DevOps, workflow as code does the same for DataOps. With human-readable scripts that embody how data should be moved and transformed, DataOps can similarly use standard software source-control systems and code review processes [8]. Moreover, workflows specified as code can encapsulate arbitrary levels of abstraction, leading to more modular, maintainable code. Consider the drag-and-drop interfaces for Extract, Transform, and Load (ETL) popularized in the 1990s, such as Informatica and IBM Cognos. Compared to workflow as code, these ETL UIs are naturally limited by the expressiveness of their graphical UI primitives.

When workflows are embodied as code instead of proprietary binary formats, the workflows enable automated testing and documentation. DataOps engineers can create unit tests in their code, using standard unit-testing practices, such as Python assert [26]. The unit tests also document what the code is supposed to do. DataOps engineers can also increase the maintainability of their workflows as code by adding explicit documentation to the code, another best practice in software engineering.
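As a minimal sketch of such a unit test (the transformation and field names are illustrative, not drawn from the paper), a deduplication step can be exercised with plain Python asserts that double as documentation of the expected behavior:

```python
def deduplicate_trades(trades: list[dict]) -> list[dict]:
    """Keep the latest record per trade_id, assuming records carry an
    increasing 'version' field."""
    latest: dict[str, dict] = {}
    for trade in trades:
        current = latest.get(trade["trade_id"])
        if current is None or trade["version"] > current["version"]:
            latest[trade["trade_id"]] = trade
    return list(latest.values())

def test_deduplicate_trades():
    raw = [
        {"trade_id": "t1", "version": 1, "amount": 100},
        {"trade_id": "t1", "version": 2, "amount": 110},
        {"trade_id": "t2", "version": 1, "amount": 50},
    ]
    result = deduplicate_trades(raw)
    assert len(result) == 2
    assert {t["amount"] for t in result} == {110, 50}

test_deduplicate_trades()
```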

3 Opportunity: Data-driven organizations need a new type of data platform

In the previous section, we described seven challenges that a company confronts to become a data-driven company — to produce and consume data products throughout the organization at scale. In this section, we describe a holistic approach to addressing these requirements: a data orchestration platform as a foundational layer of the modern data stack. The cultural pressure to drive business strategy with data means that IT organizations must provide access to a company’s data assets for a broad set of personas and skills. But in light of increasingly complex modern data stacks, the proliferation of cloud-based technologies, and the demands of new government regulations, this increase in access must be done with coordination and oversight — the mission of data orchestration.

Briefly, a modern data orchestration platform increases availability of trusted data in complex data stacks by enabling data practitioners of all levels of technical ability to build, run, and observe data pipelines as code.

We’ll dive into key elements of a data orchestration platform (Figure 4), explaining how they address the challenges we described above. Many of the elements address several of the challenges, while many of the challenges are addressed by more than one component of the data orchestration platform, resulting in a matrix that we depict later in a Recap section of the paper.

As the title of this article suggests, one can think of a data orchestration platform as being the conductor of data-related activities across an organization. Each of the components in a modern data platform plays a vital role in isolation — for example, modern cloud-based data warehouses provide robust, scalable data storage and access. Yet, the true value of a modern data platform comes alive only once all of the disparate components are working together, with data orchestration providing the coordination. Similarly, in a symphonic orchestra, the conductor brings the full value of each of the orchestral sections to life.

3.1 Running complex workflows reliably

The most vital piece of a data orchestration platform is workflow management: Without it, there is no data orchestration. A workflow management system allows you to specify, schedule, and run a sequence of data operations as tasks in a workflow while managing dependencies among tasks and workflows.

Figure 4: At the core of a data orchestration platform lies a workflow management system robustly scheduling data jobs with data ingestions from operational and data systems. To ensure that data consumers can trust their data, observability needs to be integrated tightly with workflow management. Further, to maximize the value of the data workflows across the business, the data orchestration platform adds critical services: data-pipeline development tools, management of multiple workflow deployments, integration libraries, software development lifecycle tools, and observability features.
Figure 5: A DAG that corresponds to the WealthServe workflow depicted in the Motivation section. Tasks appear in the boxes, with dependencies depicted as directional links. (This DAG is simplified for the purposes of illustrating task dependencies. In actuality, the workflow might be codified as multiple, dependent DAGs.)

3.1.1 Managing dependencies

As we depicted in the WealthServe example workflow earlier, data workflows in a heterogeneous IT environment are rich in dependencies even within a single workflow: we need to integrate data from multiple sources, often through multiple steps of integration. Each of these integration steps can have multiple upstream dependencies, all of which need to be satisfied for a task to start.

Modern workflow management systems, such as Apache Airflow [1], often use an abstraction called a directed acyclic graph (DAG) to represent a workflow. Nodes represent tasks, while directed links among the nodes represent dependencies among the tasks. We can represent the WealthServe workflow with the DAG in Figure 5.
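As a minimal sketch of how a slice of this workflow could be codified (assuming a recent Airflow release with the TaskFlow API; task names and bodies are placeholders, not WealthServe’s actual code), the BigQuery transformation declares its dependency on both upstream steps:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def wealthserve_recommendations():
    @task
    def clean_trades_spark():
        # Placeholder: format and cleanse trade data in Spark.
        ...

    @task
    def join_customers_snowflake():
        # Placeholder: join and filter customer data in Snowflake.
        ...

    @task
    def transform_bigquery(trades, customers):
        # Placeholder: integrate both upstream datasets in BigQuery.
        ...

    @task
    def train_recommendation_model(features):
        # Placeholder: train the article-recommendation model.
        ...

    # Dependencies mirror the arrows in Figure 5: the transformation starts
    # only after both upstream tasks complete.
    features = transform_bigquery(clean_trades_spark(), join_customers_snowflake())
    train_recommendation_model(features)

wealthserve_recommendations()
```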

When we use DAGs to represent workflows, each containing dependencies among tasks, we will inevitably end up with a collection of DAGs. A workflow can in turn be dependent on another workflow. For example, a workflow might produce a customer dataset that is used by many downstream workflows, for marketing, finance, and product analytics. Or, an ML workflow might depend on upstream ETL flows.

3.1.2 Scheduling jobs to deliver data on time

At the core of a workflow management system is a job scheduler that executes DAGs and tasks, respecting dependencies among tasks and among workflows. A job scheduler should have the following capabilities:

  • Triggering DAGs using a variety of options: based on time, based on other events completing, and based on changes to upstream data (e.g., arrival of new data, updates to existing data)
  • Managing retries of tasks, to increase the robustness of a workflow in the face of intermittent failures in underlying infrastructure
  • User interfaces to monitor workflow execution
  • Running independent workflows in parallel
  • Considering job priorities when running multiple DAGs simultaneously
  • Alerts to notify operational staff of exceptions or SLA violations
  • Execution logs for debugging and compliance
  • Allocating computational resources — processing and memory — according to the needs of each task
  • Ability to process data in workflows retroactively, also known as catchup or backfill

Above all, the scheduler needs to be robust, since it is coordinating the execution of all the workflows under management. Like an air-traffic control system for commercial aviation, if it goes down, it needs to recover quickly and gracefully, catching up all the workflows to the same state as if there were no outage.

Compare the expected capabilities of a modern workflow management system with the features of a simple scheduler like cron, a Unix utility that allows you to schedule tasks to be executed automatically at a specified time or interval. While cron jobs can be useful for automating simple tasks such as running backups, sending emails, or performing system maintenance, there are several potential problems that can arise when using cron jobs for data workflows. For example, a cron job will likely fail if upstream dependency conditions are not met or if it is scheduled to run at the same time as another cron job. Cron jobs also don’t recover from failure gracefully: they don’t facilitate backfills of data in the case of pipeline failure — i.e., catching the data pipeline up to a state as if the pipeline didn’t fail.
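As a minimal sketch of how several of the scheduler capabilities listed above surface in DAG code (assuming a recent Airflow 2.x release; the schedule, retry, and SLA values are illustrative), a daily ingestion DAG might be configured as follows:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_trade_ingest",
    schedule="0 2 * * *",            # time-based trigger: every day at 02:00
    start_date=datetime(2023, 1, 1),
    catchup=True,                    # backfill missed runs after an outage
    default_args={
        "retries": 3,                          # retry transient failures
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2),             # alert if a task runs past its SLA
    },
) as dag:
    # Placeholder tasks; real operators would ingest trades and notify consumers.
    ingest = EmptyOperator(task_id="ingest_trades")
    notify = EmptyOperator(task_id="notify_downstream")
    ingest >> notify
```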

3.1.3 Keeping track of workflow runs

As the scheduler component executes workflows, it stores a history of these runs in the metadata database. Through the workflow management UI, we can see a history of the workflow runs, including which tasks succeeded and which tasks failed. The metadata database can also store information about user logins and permissions as well as connection configurations to external data sources.

3.1.4 Specifying workflows as code

To encourage best practices in software development during the workflow development process, a workflow management tool should store workflows as human-readable code, instead of in a proprietary binary format. This codification will need to include the actual data manipulation tasks in the workflow, dependencies among these tasks, dependencies among workflows, and time-based schedules for triggering workflows. With workflow as code, just as with infrastructure as code, we can manage versions and code reviews using standard source control systems. With version control, multiple engineers can collaborate on the same code, viewing a complete history of changes to a workflow, with the ability to roll back to the last-known good version of a workflow in circumstances where they introduce errors into a workflow.

In addition to these benefits that accrue from the transparent nature of code, workflow as code brings with it several other benefits over workflows created with drag-and-drop tools that store workflows in proprietary binary formats. In particular, for the open-source Apache Airflow project [43], workflow as code brings these advantages:

Figure 6: (a) Without RBAC, we assign data access privileges directly from a person to a data pipeline. Because of the complexity involved here, this process will likely result in under- or over-provisioning access, or both. (b) With RBAC, we assign each person to one or more roles, and then we assign read and write permissions between a role and a data pipeline. With the RBAC layer, we can reuse permissioning for different roles, leading to greater reproducibility, facilitating compliance with data governance policies.
  • Dynamic: Data pipelines can adapt to runtime conditions, running tasks in parallel. For example, they can create an instance of a task for each file in cloud storage, wherein each task loads one of the files into a table in a data warehouse (see the sketch after this list).
  • Extensible: New classes can inherit from existing Python classes, extending and reusing the base functionality from the existing ones, leading to more modular, maintainable code.
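As a sketch of the dynamic behavior referenced above (assuming a recent Airflow release with dynamic task mapping; the file names are illustrative), a pipeline can fan out one load task per file discovered at runtime:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def load_daily_exports():
    @task
    def list_files() -> list[str]:
        # Placeholder: in practice, list the objects in a cloud storage bucket.
        return ["exports/trades_01.csv", "exports/trades_02.csv"]

    @task
    def load_file(path: str):
        # Placeholder: load one file into a data warehouse table.
        print(f"loading {path}")

    # One task instance is created per file, and the instances run in parallel.
    load_file.expand(path=list_files())

load_daily_exports()
```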

3.2 Orchestrating and securing workflows across heterogeneous data stacks

Given that modern data stacks lean toward heterogeneity, favoring best of breed over vendor lock-in, a data orchestration platform must similarly be vendor agnostic. Moreover, data orchestration should be able to coordinate data across public clouds and on-premise assets.

Though some organizations will prefer the orchestration platform itself to exist in the cloud, others may require it to reside within their on-premise systems. In either case, the orchestration tool will need to manage security issues across the data infrastructure involved in the orchestrated workflows. In particular, orchestration should provide:

  • Access control: Protect against both automated and manual attacks, providing strong authentication options including single sign-on (e.g., Google) and federated authentication (e.g., Okta).
  • Data security: Encrypt data in transit and at rest using strong ciphers (e.g., AES-256)
  • Infrastructure and network security: Connect securely to external data services and major public clouds using their recommended encryption methods. Log and alert DataOps upon detection of network security threats.

Access control entails not only safeguarding against unauthorized access, but also facilitating proper access. Orchestration should employ role-based access control (RBAC) to groups of workflow deployments, so that exactly the right users are able to view and modify the workflow deployments that they need to (Figure 6).

Most organizations have a data governance policy in place to ensure that data is managed correctly throughout the enterprise. Typical regulations that influence data governance include System and Organization Controls (SOC) [6] and General Data Protection Regulation (GDPR) [3]. To conform with these regulations, a DataOps team needs to demonstrate that it has designed and implemented data access controls to protect personal and private data. In the absence of a centralized administrative console, implementing and verifying data access policies becomes challenging, because DataOps needs to check deployments individually for compliance, generating distinct reports from each deployment separately. Accordingly, a key component of data orchestration is for the orchestration platform itself to be compliant with industry-standard data governance policies — and to provide centralized security policy management across workflow deployments.

3.3 Integrating with a rich collection of infrastructure and applications

The power of workflow management is its ability to manage complex dependencies in data pipelines: from a rich set of data sources through data warehouses and other data infrastructure for data transformation, ultimately delivering the data to analytics, ML pipelines, or operational systems. To be truly useful then, the workflow management system needs to have a library of built-in integrations to a wide variety of XaaS and on-premise applications and data infrastructure. In this section, we describe methods for building an extensive, extensible collection of integrations.

Creating a healthy ecosystem of integrations for data orchestration is a formidable task. It would be a considerable investment for any single company to build and maintain all of the integrations, since there are countless commercial applications and data infrastructure products to integrate with. To scale this effort, we can co-opt the third-party vendors of the applications and infrastructure to help build these integrations — since it’s generally in their best interest for their products to function well within a larger data ecosystem. In addition, we can build an open-source community around these integrations to welcome an even larger number of developers to contribute to a broad, healthy collection of integrations.

Beyond supporting common business applications and data infrastructure, integrations should support technical and business use cases. One can think of each integration, such as listing the files in cloud storage, as a building block for a larger workflow. The integrations need to be created in concert with each other, so that they can combine together in a workflow or series of workflows to support entire use cases. A common technical use case involves writing data to a data warehouse table, checking the quality of the data, then publishing the data for consumers to use. Common business use cases include fraud detection and training predictive ML models.

Integrations to similar systems should have similar interfaces. For example, the API to integrate with cloud data warehouses, such as Snowflake, Google BigQuery, and Amazon Redshift, should be uniform. Uniform interfaces simplify the process of writing code in the first place, since you needn’t learn an API signature every time you’re working with a different underlying data warehouse, for example. Moreover, uniform interfaces minimize changes to code when shifting to different platforms. For example, if you’re prototyping with different data warehouses or migrating pipelines between data warehouses, you should need to modify only the connection information for the data warehouses.

A potential drawback of uniform interfaces is that they might prevent the workflow management tool from taking advantage of features specific to a particular vendor’s offering. However, when there are commonalities among at least some of the vendor offerings, the API can still provide a uniform interface to abstract away the idiosyncrasies of the vendor tools. For example, an API for loading data from certain types of cloud storage to a data warehouse table can use native path optimizations in the data warehouse that load data directly from file storage into data tables, bypassing intermediate processing in the workflow engine [46].
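As a sketch of what a uniform loading interface might look like (this is a hypothetical API for illustration, not any particular vendor’s or project’s), the same call works against different warehouses, with vendor-specific optimizations hidden behind the dispatch:

```python
from dataclasses import dataclass

@dataclass
class Connection:
    """Hypothetical connection descriptor resolved by the orchestration platform."""
    conn_id: str
    warehouse_type: str  # e.g., "snowflake", "bigquery", "redshift"

def load_file(source_path: str, target_table: str, conn: Connection) -> None:
    """Uniform interface: load a file into a warehouse table, preferring the
    warehouse's native bulk-load path when one exists."""
    if conn.warehouse_type == "snowflake":
        _copy_into(source_path, target_table, conn)       # e.g., a COPY INTO path
    elif conn.warehouse_type == "bigquery":
        _load_job(source_path, target_table, conn)        # e.g., a load-job path
    else:
        _generic_insert(source_path, target_table, conn)  # fallback path

def _copy_into(source_path, target_table, conn): ...
def _load_job(source_path, target_table, conn): ...
def _generic_insert(source_path, target_table, conn): ...

# Migrating the pipeline to a different warehouse changes only the connection:
load_file("gs://raw/trades.parquet", "analytics.trades", Connection("warehouse", "bigquery"))
load_file("s3://raw/trades.parquet", "analytics.trades", Connection("warehouse", "snowflake"))
```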

The workflow management system needs to have an extensible means for integrating with custom, in-house applications, such as the transactional trading system in our WealthServe example. The integration library can achieve this extensibility through class inheritance, so that any developer can inherit core capabilities from base classes as they build a new class for their custom application. For example, developers can create custom operators in Airflow by extending the BaseOperator class [47].
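As a minimal sketch of this extension mechanism (BaseOperator and execute are standard Airflow constructs; the trading-system logic and parameters are placeholders invented for illustration), a custom operator for WealthServe’s in-house trading system might look like:

```python
from airflow.models.baseoperator import BaseOperator

class TradeSystemToS3Operator(BaseOperator):
    """Hypothetical operator that exports records from an in-house trading
    system and writes them to cloud storage."""

    def __init__(self, *, trade_date: str, s3_path: str, **kwargs):
        super().__init__(**kwargs)
        self.trade_date = trade_date
        self.s3_path = s3_path

    def execute(self, context):
        # Placeholder: call the internal trading API, then upload the
        # extracted records to self.s3_path.
        self.log.info("Exporting trades for %s to %s", self.trade_date, self.s3_path)
```

Instantiated inside a DAG, such an operator behaves like any built-in task: it can be scheduled, retried, and tracked by the same workflow engine.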

3.4 Tracking data along its journey

Once our data pipelines go into production, we need a suite of capabilities to increase our trust in the data — knowing how the data was created, changed, and used over time, as described by Hellerstein et al [15]. We want to ensure that data is correctly produced and derived from its upstream sources. When data is missing or incorrect, we’ll want to do forensic analysis to find the root cause of the problem. Data observability is the term used to describe these capabilities. In this section, we examine key elements of data observability, through questions and scenarios it helps us work through.

3.4.1 Determining data quality

Data observability helps us answer questions about the freshness and correctness of our data:

  • Is the data fresh? Is the data updated soon after its upstream sources are updated?
  • Is the data correct? Do our metrics on data quality match our expectations? For our WealthServe example, our data quality metrics might include checking whether values are within certain expected bounds, such as total number of weekly trades for all clients.
  • Is the data derived from the correct sources? For example, what upstream data fields were used to calculate a client’s net worth? Is bad data being filtered out of these calculations?

3.4.2 Establishing the root cause of a data problem

Once we identify a problem in our data, we need to troubleshoot the source.

  • Data availability: What is preventing a transformation from completing successfully? For example, an upstream source might have changed its schema to be incompatible with our transformation logic.
  • Data freshness: Why is it taking so long for a sequence of transformations to finish processing? Is there an upstream process that has failed?
  • Data correctness: Where is invalid data coming from? Do we have errors in our transformation logic or are the data sources introducing the errors?

3.4.3 Reporting for compliance

  • How is data derived? For example, can you prove that you’re using only the intended, official data for a specific calculation? Or are you using a dataset that’s incomplete or improperly vetted?
  • Is data processed from only compliant sources and through only compliant processes?
  • Is the data going to only those systems where it is supposed to go? Are we adhering to privacy rules, per user consent?
  • Has the data been accessed by only those who are supposed to access it?

Lineage refers to the record of the history and relationships of data elements as they are transformed, combined, and moved through a system. Lineage enables many of the capabilities of data observability we described. In particular, lineage gives us traceability (e.g., “How was this metric calculated?”), debuggability to trace an error back to its source, auditability to provide a record of the data transformations that have taken place, and improved performance by identifying bottlenecks in the workflow. The lineage function needs to connect to every operational data service in your technical ecosystem: all the operational systems in your ETL, business intelligence, and ML pipelines. It needs to record events as they happen, so that you can be alerted to potential data corruption issues in a timely manner: when tasks have failed, a dataset becomes stale, or the shape of data changes outside of acceptable ranges.

Data practitioners use lineage to understand data complexities through a visual map. Lineage needs to provide timely alerts to potential data issues, with enough actionable context for quick isolation and remediation. The visual map should show all underlying transformations that the workflow scheduler is managing, whether they are Spark jobs or SQL queries. The map should also provide lineage information across workflows that are dependent on each other, so that we can identify problems in upstream workflows once we detect an issue in our data.

Figure 7: Data lineage needs to record metadata at the start and end of all tasks in a data workflow so that the metadata can be used to troubleshoot workflows in flight. Meanwhile, we can monitor data quality at the end of the workflow to detect when we need to enter a troubleshooting phase.

Data lineage needs to be embedded into the workflow engine itself, so that there is constant transparency within workflows in flight. Suppose an active workflow is taking longer than expected, or is throwing warnings or less-than-fatal error messages. With observability running in the workflow engine, the data engineer can see which tasks have completed successfully as context for which tasks are stalled or potentially at risk: they can see timestamps for all tasks as well as the data inputs and outputs for all tasks, while the workflow is in flight. By comparison, if lineage is implemented outside of the workflow engine, such that the most atomic unit of visibility is an entire workflow instead of a task, a data engineer will have visibility into errors only after an entire workflow has finished successfully or failed.

Lineage can be implemented at the level of an entire dataset or more fine-grained, at the level of each partition or even each field in the table. While column-level lineage requires more computational and storage resources, as well as a deeper level of integration with underlying infrastructure, it can be much more useful in a forensic investigation when a field value is incorrect, since you can see exactly which fields upstream were used to compute the erroneous value. Similarly, we can see the specific downstream fields that a given field affects, so you can track down what other analytic artifacts, such as dashboards or ML datasets, will need to be revised. Because of the increased forensic abilities that field-level lineage provides, it’s no surprise that many regulatory regimes require it as part of their standards: GDPR, HIPAA, CCPA, BCBS, and PCI.

Lineage works best when we can create an end-to-end record of any changes to data from its original source up to and including the point where it is consumed by analytic and operational systems. That is, lineage should follow the data into those analytic and operational systems, so that we can, for example, trace lineage from raw data in a Parquet file to metrics on a chart in a dashboard. In all likelihood, an end-to-end data journey will traverse systems from multiple vendors, as we described in the Motivation section. So, we need an open standard for these vendors to write their metadata to. A data orchestration platform likewise needs to embrace this open standard — reading from the metadata deposited by other vendors and writing to the repository via the workflow management system.

Both the third-party vendors and the data orchestration platform have a vital role in creating a rich lineage record. The vendors of each component in a data workflow will have the best access to the details of the data transformation in their system — for example, a database will know all the details of how a table was produced from upstream tables. On the other hand, only the data orchestration layer will have access to the larger context in which the data transformation was done. The orchestration layer can track all of the changes to the workflow over time (through our CI/CD process), which we can use to troubleshoot our current data pipeline issue.

OpenLineage, led by the Linux Foundation, provides an open standard that vendors and the data orchestration platform can write their metadata to [4]. OpenLineage provides an API for capturing data-changing events, a metadata repository reference implementation (Marquez), libraries in common languages for querying the repository, and pre-built integrations with common data pipeline tools.
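As a rough sketch of the kind of run event a pipeline component can emit (the field names follow the OpenLineage run-event structure; the endpoint path assumes a local Marquez instance, and the namespaces, job, and dataset names are illustrative):

```python
import uuid
import datetime
import requests

# Assumed local Marquez endpoint that accepts OpenLineage run events as JSON.
LINEAGE_URL = "http://localhost:5000/api/v1/lineage"

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
    "producer": "https://example.com/wealthserve-orchestrator",  # illustrative producer URI
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "wealthserve", "name": "transform_trades_bigquery"},
    "inputs": [{"namespace": "s3://wealthserve-raw", "name": "trades/2023-09-22"}],
    "outputs": [{"namespace": "bigquery", "name": "analytics.trades_clean"}],
}

requests.post(LINEAGE_URL, json=event, timeout=10)
```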

The DataOps team will typically own data workflows during the entire time that the workflows are running in production: from the moment the workflow is pushed into production using CI/CD practices, the DataOps team is on the hook for the data SLA negotiated with the consumers of the workflow’s data products. After the DataOps team is alerted to a data issue, it needs to troubleshoot the root cause. The lineage capabilities of a data orchestration platform provide the foundation for this troubleshooting operation.

The orchestration platform can connect data quality checks to operational lineage to enable precise troubleshooting (Figure 7). In other words, the orchestration platform helps us segue from monitoring to observability. Once we’re alerted to statistical anomalies, we can utilize the platform’s observability capabilities to ensure that there hasn’t been an error in the data or its processing in our data pipelines.

We can use our observability capabilities to determine if our data is correct. If there are inconsistencies, then we can view the nature of the change — e.g., did the problem arise slowly over time or happen abruptly? We can use the timing and nature of the change to connect the data quality issue to some upstream change: a change to our data transformation or a fault in our source data. Ultimately, data drift errors caused by faulty data pipelines can be subtle and difficult to detect with any statistical methods without yielding a high false-positive rate [33]. In such cases, one needs to fall back to strong lineage capabilities in the data pipeline, to be able to do detailed inspection of data at each stage of the pipeline.

3.5 Enabling data practitioners of all technical abilities

In a data-driven organization, we want to maximize the benefit of data insights by enabling as many publishers and consumers of data as possible: data democratization. Technical abilities for working with data vary widely, from data engineers to data scientists and ML engineers to data analysts and business analysts. A data orchestration platform should provide tools for each of these types of data practitioners to build and run their data pipelines, while providing the organization at large with the ability to implement a robust process for deploying code to production.

Enter development platforms of all varieties: from low-code interfaces to libraries or software development kits (SDKs) for traditional programming languages. Some users will prefer graphical user interfaces, while others will hold on to their favorite code editors. There are pros and cons to each of these, as we described in the low-code section above. The best data orchestration platforms will provide options for all manner of developers, including citizen developers. Each of these options should generate human-readable workflow as code for the workflow engine to execute, instead of only machine-friendly, proprietary binaries that aren’t amenable to traditional source control systems. Ideally, the platform will also include business analysts as producers and consumers of data, with interfaces that embrace the spreadsheets they are most comfortable with.

The data orchestration platform should embrace third-party development tools, in addition to providing some of its own. As we described in the Motivation section, it can be challenging to put code into production when that code originates from traditional notebook interfaces like Jupyter notebooks. A data orchestration platform can overcome many of these challenges by creating development UIs with certain constraints that facilitate putting the code into production. Consider some examples of these constraints that a development UI could enforce:

  • Affordances to explicitly specify dependencies among cells, for cases when the dependencies don’t immediately follow from the order that the cells appear in the notebook.
  • Ability to specify connections to cloud storage and data warehouses, such that these connections are used by the rest of the orchestration platform.
  • Constraints on the SQL that the developer can run, by using forms to specify the SQL. For example, we might want to enforce or encourage developers who are creating tables to provide comments at the table and column level to feed a data dictionary. In turn, consumers of the data table will be able to have greater insights into the semantics and quality of the data product.

3.6 Managing the lifecycle of data workflows

A data orchestration platform should allow us to apply principles of software development life cycle (SDLC) management to data workflows, where the workflows are embodied in code. We want to use CI/CD tools to automate the SDLC process. Through native or integrated capabilities, the platform should allow the workflow development team to manage multiple environments (e.g., development and production), so that code can be developed, tested, and reviewed in a methodical, repeatable process.

The data orchestration platform should enable a variety of workflow authoring tools: from traditional general programming languages such as Python with SDK libraries to interactive development environments such as notebooks. In Figure 8a, we depict a process in which the workflow developer chooses from this suite of workflow authoring tools, testing their code using a local instance of the workflow management tool and a sandbox database instance. Then, they store their code in the dev branch of the source code repository, using the CI/CD pipeline to test and deploy the code to the dev workflow instance. Once the new code is reviewed, tested, and merged into the main branch, the CI/CD pipeline deploys the code to the production environment.

Figure 8: A data orchestration layer allows the workflow author and DataOps team to manage the development of a workflow from its inception to deployment in a production environment. (a) A configuration for a single team and single environment. (b) A configuration for multiple teams — Client Services and Marketing from our WealthServe example — and multiple environments to provide for safe promotion of code.

Throughout this process, the orchestration layer needs to keep track of the connection configurations to the data services involved in the data workflows: typically file stores, data warehouses, and analytic engines. These connection configurations contain account credentials and other data service parameters, such as username, password, schema, and database. The workflow author, in conjunction with DataOps, will need to use the orchestration layer to configure distinct connections for the development and production environments.
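
To make this concrete, here is a minimal sketch, again assuming Apache Airflow [1], of how workflow code can stay identical across environments while each deployment resolves a named connection to its own credentials, for example through Airflow’s AIRFLOW_CONN_<ID> environment variables or a secrets backend. The connection name and URIs below are hypothetical.

```python
# The task code references the connection by name only; each deployment supplies
# its own credentials, e.g. via environment variables (values are hypothetical):
#
#   dev deployment:  AIRFLOW_CONN_ANALYTICS_WAREHOUSE=postgres://dev_user:***@dev-db:5432/sandbox
#   prod deployment: AIRFLOW_CONN_ANALYTICS_WAREHOUSE=postgres://svc_etl:***@prod-db:5432/analytics
from airflow.hooks.base import BaseHook


def describe_warehouse_target() -> None:
    # Resolved per deployment at runtime; the workflow code itself never changes.
    conn = BaseHook.get_connection("analytics_warehouse")
    print(f"Writing to {conn.host}/{conn.schema} as {conn.login}")
```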

Overall, it is the responsibility of the orchestration layer to manage all of these development environments and coordinate the movement of workflows and configurations across them. For example, the code that the developer runs in their authoring tool’s local workflow management environment should be the same code that runs in the development and production environments, with only changes to the connection configurations for the data stores. A change to the code in the development environment should trigger a series of code reviews, tests, and deployments so that the code is deployed to production in an automated manner.

3.7 Managing workflow operations and access across multiple deployments

In the simplest deployment of a workflow engine with a small set of users, managing operations and access is straightforward: we need only a single username and password, and we can use the built-in capabilities of the workflow engine to monitor how workflows are progressing and ensure that none has failed.

However, for most real-world applications, we need a more elaborate configuration: multiple deployments of the workflow engine, for different stages of code development and for different sets of users, grouped either by technical skill (e.g., data engineers versus data scientists) or by business group.

A data orchestration platform allows us to manage a multitude of workflow management deployments with centralized administration in a cloud-based control plane. The DataOps team should be able to choose from a library of configurations, or templates, to provision deployments methodically. Moreover, DataOps should have a single view of all deployments, where they can track health across the entire landscape of workflow engine deployments, reducing missed service level agreements (SLAs) by viewing the history and current status of workflows running in every deployment. DataOps should also be able to trace the end-to-end journey of each dataset using the platform’s observability features, remedying data quality issues early and increasing data availability and data trustworthiness over time.

3.8 Facilitating discovery of data products

A data orchestration platform should integrate with third-party data catalog products. These products allow data consumers to find data products within their organizations that fit their business needs. Data catalogs allow for discovery and understanding of data products by displaying metadata about each data product, such as how often it is updated, table- and field-level lineage, and descriptions. All of this discovery is governed by access controls, so that people can discover only the data they have permission to see.

The role of the data orchestration layer here is to facilitate collection of the correct information for the data catalog, keeping the metadata up to date over the lifetime of the data product. For example, the workflow authoring tools in the orchestration layer can encourage or even enforce data schema documentation, which the data catalog displays. Orchestration should also collect lineage information that can likewise be displayed in the data catalog.

4 Recap: a data orchestration platform serves the needs of a modern, data-driven organization

To become a data-driven company, you need to embrace using data as a substantive input to your business decision-making processes, to your internal operations, and to your user-facing products. This change sets a high bar for continuous delivery of quality data throughout your organization. However, there’s no shortage of technical challenges: integrating heterogeneous data sources, providing the right data access to the right users, securing data against nefarious activity, empowering non-technical users to produce and consume data products, minimizing downtime in data workflows, catching errors, and enabling troubleshooting when problems arise.

While workflow management lies at the core of data orchestration, it is not enough. To enable production and consumption of data products at scale, you’ll need a means to manage multiple workflow instances, to govern role-based access across those instances, and to gain fine-grained visibility into all of the tasks running in those workflows. The larger the company, the more complicated its data needs will be, further driving the imperative to implement data orchestration. In Figure 9, we summarize how the elements of a data orchestration platform address the challenges of modern companies.

In a companion paper to this one, de Bruin et al. [9] describe a maturity model and adoption horizons for organizations to implement data orchestration. They contend that as organizations grow, the imperative for data orchestration also grows, to the extent that large organizations that fail to adopt data orchestration will actually regress in their ability to use data. The preferred model is for organizations to progress through three data orchestration horizons that decentralize the ownership of data from their IT organization: starting with centralized ownership, then moving to domain ownership, and finally to self-service. A data orchestration platform should support all of these adoption horizons. In particular, to support self-service, the platform needs to empower citizen developers to produce and consume data products.

5 Possibilities for enhanced data orchestration

Data orchestration platforms are already operational in hundreds of organizations. The orchestration platforms in place are either built internally or implemented via commercial products like Astro [49]. In either case, we contend that the capabilities described in the Opportunity section are critical, achievable elements of the orchestration layer.

Where can data orchestration go from here? We suggest two promising areas for further development.

5.1 Automated pipeline repair

Currently, even with the best forensic tools (centralized administration and observability), humans are required to identify and fix the root cause of a data outage. Workflow engines themselves are likely to have basic mechanisms for managing certain types of errors. For example, the workflow can retry failed tasks using an exponential backoff strategy, with increasingly longer intervals between retries. A data pipeline author can also explicitly encode error-handling logic into the pipeline, by triggering tasks based on the failure of others, similar to what they would do with a try-catch block in a traditional programming language.
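
As an illustration of these built-in mechanisms, here is a minimal sketch, assuming Apache Airflow [1], of a pipeline that retries failed tasks with exponential backoff and triggers an explicit alerting task when an upstream task fails, the analogue of a try-catch block. The DAG id and task logic are hypothetical.

```python
# Sketch of basic, engine-level error handling: retries with exponential backoff
# plus a task that runs only when something upstream fails. Task bodies are stubs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="resilient_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=1),
        "retry_exponential_backoff": True,   # delay roughly doubles between attempts
        "max_retry_delay": timedelta(minutes=30),
    },
) as dag:
    load = PythonOperator(task_id="load_source_data", python_callable=lambda: None)
    transform = PythonOperator(task_id="transform_data", python_callable=lambda: None)

    # Explicit error handling encoded in the pipeline: runs only if at least one
    # directly upstream task has failed.
    alert_on_failure = PythonOperator(
        task_id="alert_on_failure",
        python_callable=lambda: print("Notify the on-call data engineer"),
        trigger_rule=TriggerRule.ONE_FAILED,
    )

    load >> transform
    [load, transform] >> alert_on_failure
```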

Ideally, we’d like to specify the SLA for a data pipeline and have it dynamically reconfigure itself to meet that SLA. In our WealthServe example, suppose we have an SLA for the data transformations in Spark to be completed by a certain time every day. If our orchestration layer detects that an upstream delay will cause these transformations to be delayed, the orchestration layer can proactively allocate more Spark computing resources to complete these transformations on time.

In another type of dynamic workflow repair, the orchestration layer can detect that incorrect data is being produced at a certain stage of the workflow, using the quality checks and data drift monitoring that we described earlier. The orchestration layer would then need to re-run the upstream tasks and verify that the previously erroneous stage now passes its data quality checks. Then, with its knowledge of dependencies within and across workflows, the orchestration layer can re-run exactly those workflows, and portions of workflows, that are downstream of the newly repaired task.

5.2 Code generation to facilitate workflow development

There have recently been significant advances in using machine-learned language models to generate natural language text and code in traditional programming languages, including SQL [4,27].

Data practitioners from a broad range of technical skill sets can benefit from code-suggestion techniques. Skilled programmers can increase their coding productivity by minimizing the time spent researching coding issues and examples on public forums such as Stack Overflow: they can simply enter a description of the code that they want in natural language, supplementing the description with a sample function header. Instead of links to related articles, they get actual code suggestions. By contrast, citizen developers, even those with little SQL experience, can use the ML-based suggestions to develop prototypes of their business requirements. Instead of specifying their data requirements to a data engineer, the citizen developer can enter a natural-language specification in the code editor of their choice to elicit suggestions from the code-suggestion module.
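
As a hypothetical illustration (not the output of any particular model), a citizen developer working in our WealthServe example might write only the natural-language comment and function header below, and a code-suggestion tool could propose the SQL body. The table and column names are invented for the example.

```python
# Hypothetical prompt-and-suggestion exchange inside a workflow authoring tool.
# The developer supplies the comment and the function signature; the SQL body is
# the kind of suggestion a code-generation model might return.

# Prompt: "Return the ten clients with the largest total assets under management
# this quarter, along with their advisor's name."
def top_clients_by_aum_sql() -> str:
    return """
        SELECT c.client_name,
               a.advisor_name,
               SUM(p.market_value_usd) AS total_aum
        FROM positions p
        JOIN clients  c ON c.client_id  = p.client_id
        JOIN advisors a ON a.advisor_id = c.advisor_id
        WHERE p.as_of_date >= DATE_TRUNC('quarter', CURRENT_DATE)
        GROUP BY c.client_name, a.advisor_name
        ORDER BY total_aum DESC
        LIMIT 10
    """
```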

There are known limitations to how well these ML models can do in areas of limited context, e.g., when you’re starting from a blank slate, entering some comments for what you want your code to do [6,37]. With limited context, the models might suggest nonsensical code, erroneous code, or code that is cloned from a specific training example rather than generated collectively from multiple training examples. A data orchestration platform, properly integrated with a code recommendation system, can provide the needed context to the recommendation system in the form of related workflows, since the orchestration platform has access to and context for how various workflows are related across the enterprise.

6 A world without data orchestration

We can demonstrate the need for data orchestration by examining a counterfactual scenario. Consider a large company attempting to be data driven using a hub-and-spoke model for data management in the absence of data orchestration. Suppose that this company also has a workflow management system in place, like Airflow. Without development tools for business users and data scientists, only data engineers can write data workflows. Citizen developers and data scientists either create prototypes or write descriptions of the data workflows that they need. After their requests sit in a long queue waiting to be translated by an overworked cadre of data engineers, there’s extensive back-and-forth communication and iteration between the workflow requester and the data engineer to arrive at the data workflow that the requester actually needs.

In fact, we’re following an anti-pattern. Magnusson explains why “engineers shouldn’t write ETL,” in part because of this unscalable backlog issue [22]. To make matters worse, we’re in a race to the bottom, where our most talented data engineers leave because they want to set up scalable data platforms, tools, and processes, as we depicted in Figure 8. Only the mediocre ones will be left to create data workflows based on someone else’s specifications.

We’ve also got some serious data governance issues. Without RBAC, we’re finding that some data consumers can access information they’re not supposed to, while others can’t access information that they actually need. Although we’ve created multiple deployments of Airflow, we’re finding it impossible to continually enforce a uniform data access policy across the deployments. We have no central mechanism for capturing data access audit information, so whenever we have to perform an audit, it’s a manual nightmare.

When our data pipelines break, we routinely violate the SLAs that we’ve established with our business users who consume the data. When our business users tell us the dreaded “This number doesn’t look right,” we’re caught flat-footed because we have no data quality checks in place to warn us that our data pipelines have been in error for more than a month. We take weeks to diagnose the source of the problem because we lack field-level lineage that explains the relationship of upstream data sources to our faulty data. And then we need a few more iterations to fix all of the errors downstream of our data problem because we don’t have a clear map of what data workflows and data fields are affected.

We’re spiraling downward. As our business users request more access to data, we’re having to lock it down further to align with our data governance policies. Many of our talented data engineers have left the company, and our beleaguered remaining force is busy trying to troubleshoot data problems. Their backlog for building data workflows grows daily. The tension in IT is palpable. It appears that the only way out of this conundrum is to take a step back and implement a data orchestration platform.

Acknowledgements

I wish to thank the following colleagues from Astronomer for their comments and contributions to this story: Ash Berlin-Taylor, Bolke de Bruin, Pete DeJoy, Andrew Godwin, Steven Hillion, Vikram Koka, Julien Le Dem, Roger Magoulas, Kaxil Naik, Scott Yara, and Steve Zhang. My thanks also to Natacha Crooks from University of California, Berkeley.

References

[1] [n.d.]. Apache Airflow. https://airflow.apache.org/.
[2] [n.d.]. Coca-cola is using AI to put some fizz in its vending machines. https://foodandbeverage.wbresearch.com/blog/coca-cola-artificial intelligence-ai-omnichannel-strategy.
[3] [n.d.]. Complete guide to GDPR compliance. https://gdpr.eu/.
[4] [n.d.]. OpenLineage. https://openlineage.io/.
[5] [n.d.]. papermill: Parameterize, execute, and analyze notebooks.
[6] [n.d.]. System and Organization Controls: SOC Suite of Services. https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/sorhome.
[7] 2022. How does Starbucks use machine learning (ML)? https://dev.to/mage_ai/how-does-starbucks-use-machine-learning-ml-1aml.
[8] Alumni Network. 2017. The rise of the Data Engineer. https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/.
[9] Bolke de Bruin and Tim Daniels. 2023. Enterprise Data Orchestration. In International Conference on Information Systems (ICIS) 2023.
[10] Matt Bornstein, Jennifer Li, and Martin Casado. 2020. Emerging architectures for modern data infrastructure. https://a16z.com/2020/10/15/emerging-architectures-for-modern-data-infrastructure/.
[11] Connor Carreras and Cesar Jardim. 2021. The hub and spoke model: How Alteryx uses Designer Cloud for product analytics. https://www.trifacta.com/blog/the-hub-and-spoke-model/.
[12] Stephen P Crago and John Paul Walters. 2015. Heterogeneous Cloud Computing: The Way Forward. Computer 48, 1 (Jan. 2015), 59–61.
[13] Maren David Dangut, Ian K Jennions, Steve King, and Zakwan Skaf. 2022. A rare failure detection model for aircraft predictive maintenance using a deep hybrid learning approach. Neural Comput. Appl. (March 2022).
[14] Zhamak Dehghani. 2019. How to move beyond a monolithic data lake to a distributed data mesh. MartinFowler.com 20 (2019).
[15] Zhamak Dehghani. 2022. Data Mesh: Delivering Data-driven Value at Scale. O’Reilly Media, Incorporated.
[16] Emeli Dral and Elena Samuylova. 2022. To retrain, or not to retrain? Let’s get analytical about ML model updates. https://www.evidentlyai.com/blog/retrain-or-not-retrain.
[17] Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intell. Syst. 24, 2 (March 2009), 8–12.
[18] Tristan Handy. 2022. Hiring a Data Engineer. https://www.getdbt.com/data-teams/hiring-data-engineer/.
[19] Joseph M Hellerstein, Vikram Sreekanti, Joseph E Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, Mark Donsky, Gabe Fierro, Chang She, Carl Steinbach, Venkat Subramanian, and Eric Sun. 2017. Ground: A Data Context Service. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=fd51bef5d4c12bb1fa49457e5a44fef0b8bc1295.
[20] Vance Hilderman. 2022. AI in the sky: How artificial intelligence and aviation are working together. https://interactive.aviationtoday.com/avionicsmagazine/may-june-2022/ai-in-the-sky-how-artificial-intelligence-and-aviation-are-working-together/.
[21] David Johnston. 2020. Don’t put data science notebooks into production. https://martinfowler.com/articles/productize-data-sci-notebooks.html.
[22] Jeff Magnusson. 2016. Engineers shouldn’t write ETL: A guide to building a high functioning data science department. https://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/.
[23] Mary Lebens, Roger Finnegan, Steve Sorsen, and Jinal Shah. 2021. Rise of the Citizen Developer. Muma Business Review 5, 12 (Dec. 2021).
[24] Barry McCardel. 2022. Notebooks weren’t built for the modern data stack. https://hex.tech/blog/notebooks-modern-data-stack/.
[25] Ron Miller. 2014. Actually, Every Company Is a Big Data Company. https://techcrunch.com/2014/05/22/actually-every-company-is-a-big-data-company/.
[26] Jun Wei Ng. 2022. Writing unit tests for an Airflow DAG. https://medium.com/@jw_ng/writing-unit-tests-for-an-airflow-dag-78f738fe6bfc.
[27] Amir Orad. 2020. Why Every Company Is A Data Company. https://www.forbes.com/sites/forbestechcouncil/2020/02/14/why-every-company-is-a-data-company/.
[28] Ross Perich. 2020. 2020 trends in data analytics. https://www.fivetran.com/blog/analyst-survey.
[29] Foster Provost and Tom Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. In Big Data. Vol. 1. Mary Ann Liebert, Inc., 51–59.
[30] Venkatesh Rajagopalan and Arun Subramaniyan. 2018. Predicting known unknowns with TensorFlow probability — industrial AI, part 2. https://blog.tensorflow.org/2018/12/predicting-known-unknowns-with-tensorflow-probability-part2.html.
[31] Chris Riley. 2015. Version your infrastructure. https://devops.com/version-your-infrastructure/.
[32] Prabhat Kumar Sahu. 2022. Should you use jupyter notebooks in production? https://neptune.ai/blog/should-you-use-jupyter-notebooks-in-production.
[33] Shreya Shankar, Rolando Garcia, Joseph M Hellerstein, and Aditya G Parameswaran. 2022. Operationalizing Machine Learning: An Interview Study. (Sept. 2022). arXiv:2209.09125 [cs.SE]
[34] Shreya Shankar and Aditya Parameswaran. 2021. Towards Observability for Production Machine Learning Pipelines. (Aug. 2021). arXiv:2108.13557 [cs.SE]
[35] Arun Subramaniyan. 2018. Industrial AI: BHGE’s physics-based, probabilistic deep learning using TensorFlow probability — part 1. https://blog.tensorflow.org/2018/10/industrial-ai-bhges-physics-based.html.
[36] Google Cloud Tech. 2019. Reusable Execution in Production Using Papermill (Google Cloud AI Huddle).
[37] Aaron Tilley. 2021. Battle for the cloud, once Amazon vs. Microsoft, now has many fronts. Wall St. J. (East Ed) (July 2021).
[38] T Vafeiadis, K I Diamantaras, G Sarigiannidis, and K Ch Chatzisavvas. 2015. A comparison of machine learning techniques for customer churn prediction. Simulation Modelling Practice and Theory 55 (June 2015), 1–9.
[39] Mikio Braun Webb, Ben Lorica. 2022. 2022 Trends in Data, Machine Learning & AI. Technical Report. Gradient Flow.
[40] Kristin Wende. 2007. A model for data governance — organising accountabilities for data quality management. ACIS 2007 Proceedings.
