Data contracts: The missing foundation

Tom Baeyens
18 min read · Mar 24, 2023

This article shows how data contracts solve some of the crucial problems that data teams struggle with today, how they can be introduced without friction into existing data infrastructures, and how this new foundation simplifies the integration of the many tools in the data stack.

Problem #1: No-one seems to care about encapsulation in data

In software engineering, encapsulation is the proven mechanism to divide and conquer larger software systems, and it goes without saying that without encapsulation you get unmaintainable spaghetti code. Now whilst there are some major differences between software engineering and data engineering, they're similar enough that many of the best practices that originated in software engineering are extremely relevant and beneficial for data engineering. In data engineering, I believe data contracts are the missing link to introduce encapsulation and create more transparency between data producers and data consumers. In this article, I aim to discuss the value of data contracts, address the key reservations that people have expressed, and project how contracts will drive a profound change in how we manage and configure tools in the data stack.

Data contracts are the missing component in an organization's data architecture. To create high-quality, reliable data systems, you need to be able to both test contracts and observe data pipelines. Data observability is a reactive approach that enables data engineers to detect and diagnose data issues once they have occurred. Data contracts are a proactive, preventative approach that focuses on reducing the number of data issues as early as possible, before they can take place. Data issues that don't happen don't need to be diagnosed, and that is what I would call a big win. The pieces are starting to fit together for contracts to become foundational to the analytical data architecture.

Too often we simply accept that preventing data issues is impossible and focus our efforts on diagnosing issues faster instead. Tristan Handy, Founder and CEO of dbt Labs, recently wrote a thought-provoking article on the subject, titled 'Interfaces and Breaking Stuff' ( https://roundup.getdbt.com/p/interfaces-and-breaking-stuff ). I particularly love this statement:

“Don’t make changes that break downstream stuff. Does that feel as painfully obvious to you as it does to me? Rather than building systems that detect and alert on breakages, build systems that don’t break.”

Yes it does, Tristan. It feels painfully obvious to me too, and it prompted me to ask: "Why is this not yet happening in data engineering?" We can take so much guidance, best practice, and inspiration from software engineering to design, develop, and maintain better pipelines. We are experiencing too many blockages and leaks and not enough provisioning for a reliable supply of good data.

And so, is it not feasible to simply avoid making upstream changes that will inevitably break stuff downstream? I know that sounds flippant, but up to now that has not been the default approach in data pipelines. Instead we see exports of complete databases, scripts that export data without ownership, and purchased (third-party, external) data without proper documentation or guarantees of data quality.

A common source of breaking changes is ingesting the internals of operational systems, often through a plain database export. The team managing the operational system will not realize, nor understand, the impact on the analytics team that uses the data for insights and reports when it changes its internal database model. At that point, your dependency on the data you consume breaks. That's why I think applying the basic principle of encapsulation, as used in software engineering, will be a major shift towards solving this problem. Petr Janda, in his article 'The art of drawing lines', describes the "encapsulation of cohesive logic" as the main driver of architecture; data design 101 is encapsulating what changes from what doesn't.

It’s still common in data products to rely on internal data from other systems. That breaks encapsulation and is a frequent cause of data issues.

Andrew Jones, Principal Engineer at GoCardless, makes the same observation about breaking changes and the need for encapsulation in his article 'Improving Data Quality with Data Contracts':

“You need to stop using the internal data models and build explicit interfaces between the data generators and the data consumers — much like an API.”

Consuming data should only be done through a clear interface (API) that is provided and guaranteed by the team producing the data.

Here’s an excerpt from another blog written by Tristan, ‘Coalesce. Data Contracts. The Semantic Layer’:

“…never sync data directly from a database. Instead, build an API for any data that you want to sync into your data warehouse and extract all data via this API.”

Every data ingestion and transformation step is a software component and hence it needs to be ‘treated as a product’, which includes proper encapsulation. Whilst it may not always be possible to just prevent analytical data from getting broken, encapsulation does provide a good frame of reference to start addressing the problem and avoiding spaghetti — or clogged — pipelines. Look for the handovers between teams to find the most vulnerable dependencies on internals that should be replaced with dependencies on clear APIs.

All this aligns with the groundwork laid out by Zhamak Dehghani in her work on the principles and logical architecture of Data Mesh, which recognized that a paradigm shift was needed in how we manage and decentralize data at scale, following the same principles as APIs in microservices software engineering. One of the four principles of data mesh is "domain-oriented decentralized data ownership", enabling data practitioners to create, share, discover, and use data. In my opinion, it's this principle that has set data contracts into motion, with the need to ensure that collaboration and handovers between domains can exist and function.

Every ingestion or transformation step should be considered a software component. It must depend on other components (software or data) only through clear contracts. It must hide its internals from data consumers. And it must publish contracts for its outputs.

Problem #2: The broken type system

The second problem that I believe data contracts address is a shortcoming of storage engines: their type systems are far too basic for today's analytical data needs. We are still specifying the number of decimal places in numerics and the maximum number of characters in a VARCHAR as if we were still programming in the seventies.

To make the point, imagine columns 'length' and 'width' in a dataset 'PRODUCTS'. In the storage engine, these will probably be numeric types. But are they expressed in centimeters, millimeters, inches, or miles? That is something you have to communicate to the data consumer entirely outside of the storage type system.

The type systems of analytical storage engines don't include proper capabilities for semantics and validity checking

All of the innovation in storage engines has been focused on increasing speed and improving scale. None so far has tackled the problem of a broken type system. In contrast, consider the job of a developer coding in an object-oriented programming language, where significant effort is spent building composite data structures and enforcing data constraints. Programming languages like Java or Python have a proper type system. Developers would create a 'Distance' type made up of a numeric value and a unit of measure, and that Distance would be used to compose a larger Product class. The lack of such a type system in storage engines prevents them from becoming the basis for good data management workflows.
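To make that concrete, here is a minimal Python sketch (all names are mine, purely illustrative) of the kind of composite type a developer would build in application code, and which no storage engine lets you express:

```python
from dataclasses import dataclass
from enum import Enum


class LengthUnit(Enum):
    MILLIMETER = 0.1   # conversion factor to centimeters
    CENTIMETER = 1.0
    INCH = 2.54


@dataclass(frozen=True)
class Distance:
    """A numeric value that always carries its unit of measure."""
    value: float
    unit: LengthUnit

    def __post_init__(self):
        if self.value < 0:
            raise ValueError("a distance cannot be negative")

    def to_centimeters(self) -> float:
        return self.value * self.unit.value


@dataclass
class Product:
    """A larger structure composed from the Distance type."""
    name: str
    length: Distance
    width: Distance


desk = Product("desk", Distance(120, LengthUnit.CENTIMETER), Distance(30, LengthUnit.INCH))
print(desk.width.to_centimeters())  # 76.2: the unit travels with the value
```

The unit of measure and the validity rule are part of the type itself, which is exactly what a 'length' NUMBER column in a warehouse cannot tell you.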

With all the new cloud databases focused on speed and scale, I have no idea why no one has yet built a database with a composable type system. Please add a comment if I've missed this and someone actually has 🙂

This extra metadata is crucial for consumers who want to evaluate the data or start using it. It should be exposed in discovery tools, just as it is today. But the system of record where this information is maintained should move from a catalog to a data contract file managed by the producer team in a git repo.

The API for data

In the last couple of months, a lot has been written on data contracts, but it's not yet very concrete. This article aims to take a small step towards clarifying the current fluffy situation around data contracts. Benn Stancil has navigated his way through the hype and many conversations, and he found an interesting way to describe it in 'Fine, let's talk about data contracts':

“My initial reaction to data contracts was the same as my reaction to the data mesh. Both struck me as a kind of Rorschach proposition: Something defined well enough that we can all sense its shape, but abstract enough that we can also project our own opinions on top of it.”

Can a data contract do for data what APIs did for software? Let's take it back to software engineering. In software engineering, an API contains all of the details a developer needs to start using the functionality of a component. An API is a description maintained by the producer that hides the internals and states how the component can be used by the consumer. In data engineering, for datasets like tables in databases and warehouses, or queues in streaming systems, things are a bit different, and up until now it hasn't been obvious how to specify an API. But there is a way to specify data APIs, and it's based on the database or warehouse protocol, along with the schema of the dataset.

When consuming functionality in software engineering, using an API is common practice. In data, there is no common way yet to define an API for consuming datasets.

Before diving into the contents of a contract, I want to argue that data contracts must be managed as code. I agree with the notion of shift-left and that the workflows to update contracts should appeal to engineers. Engineers in data producer teams will have to take ownership of maintaining the contracts and ensuring that they are in sync with the actual data. Those engineers want to work with code: it's natural for them and easy to version control, to set up new environments for CI/CD and production, and so on. All of this implies that operational data infrastructure is best managed by engineers as code. Contracts therefore have a huge opportunity to become the central place where engineers configure the aspects they need to manage across different tools and use cases.

Now let’s go back to defining the data contract as the API for data and let’s approach it from the perspective of a data consumer. To start with, when consumers search for data, what is the minimum that they need to know and understand about the data to determine that it is fit for their purpose and their specific use case?

I think the right way to look at what should go in or out of a contract is to start from the absolute minimal contract requirements, and on that basis add a flexible extension mechanism so other configurations and metadata can be layered into the contract.

The bare minimum that a data contract must include is access details, dataset name, and schema, as I think that is the least amount of information required in order to be able to start using the data on your own.

Minimum technical content of a data contract

Access details: This represents the precise location of the dataset. It includes the connection details to the database, warehouse, or similar system, with variables to be used for credentials. The specification of the storage engine is important as it implies the access protocol. While a reference to the storage engine is often just a logical name to accommodate staging environments, I think contracts should primarily target the production environment, and that there should be a separate mechanism (inside or outside the contract) to override variations of the contract for staging environments.

Dataset name: Logical names may be necessary for discovery, but when using the dataset, for instance to build a report, the consumer needs to know the exact name as it is known in the storage engine.

Schema: These are the columns and the structure of the data, which must be expressed in the type system of the storage engine. If you want to use the data from a SQL engine, for instance in a query, you need to know the exact SQL types of the columns.

From a technical perspective, access details, dataset name, and schema are sufficient to access the data. I want to highlight the clear line between this minimally required technical information and the other content in the contract file. The technical access details will always be required for any tool to work with the data. All other configurations and information added to the contract serve a specific use case, workflow, or tool, and relate to how a data team organizes its workflows. For one company's data management workflow, descriptions and tags may be required; in other companies that may be different.
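As a sketch, the technical core of a contract file can be that small. The key names below are illustrative rather than any particular tool's syntax; the snippet assumes the pyyaml package and simply writes the contract the way an engineer would commit it to git:

```python
import yaml  # pyyaml

# Hypothetical minimal contract: access details, dataset name, and schema.
# Credentials never go in the file; they are injected through variables.
contract = {
    "dataset": "CUSTOMERS",
    "access": {
        "type": "snowflake",              # implies the access protocol
        "account": "${SNOWFLAKE_ACCOUNT}",
        "database": "ANALYTICS",
        "schema": "PROD",
        "username": "${SNOWFLAKE_USERNAME}",
        "password": "${SNOWFLAKE_PASSWORD}",
    },
    # The schema, expressed in the storage engine's own type system.
    "columns": [
        {"name": "customer_id", "type": "VARCHAR"},
        {"name": "country", "type": "VARCHAR"},
        {"name": "created_at", "type": "TIMESTAMP"},
    ],
}

with open("customers.contract.yml", "w") as f:
    yaml.safe_dump(contract, f, sort_keys=False)
```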

From a workflow perspective, the notion of an ‘API for data’ is tightly coupled to the discovery use case. Discovery tools need to present all information for consumers to find and start using the data themselves.

Currently it's common for discovery tools to ingest metadata from the storage engines and present that to their users. To me, after adopting contracts, it makes sense to feed discovery tools from contracts rather than from the storage engine's metadata. It's only the 'APIs for data', carefully crafted and validated by producers, that should go into a discovery tool, instead of connecting the discovery tool directly to a storage engine, which typically also exposes all the internal datasets that are not intended for consumers.

Ideally discovery tools only publish contracts curated by the data producer

Chad Sanderson and David Jayatillake argue that semantics should also be included in contracts. Semantics is information about how to interpret data, for example descriptions and units of measure. I agree that, after the metadata necessary to access the dataset, semantics is the next vital piece of information needed to start using the data without unexpected surprises (think of NASA losing a spacecraft due to a metric conversion mistake!).

I agree with Chad and David that, from a discovery perspective, it's essential to require semantics in contracts. Still, in transition situations contracts could be used in other scenarios, where for instance semantics are stored in the discovery tool's database, not yet linked to the contracts, and the contracts themselves are only used for quality. Of course, in the ideal scenario the same contracts would be used for both discovery and data quality, but the point is that, apart from the data access and schema details, the other contract information depends on how companies use tools across the data stack to implement their use cases and workflows.

So rather than focusing on what should go in or out of a contract, I think the key is to ensure that the technical access details are required and that a generic extension mechanism allows extra configurations and information to be added for specific tools and use cases.
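To illustrate the extension mechanism, here is how a semantics layer could sit next to the technical core of a contract, picking up the 'length' and 'width' example from earlier. The attribute names are purely illustrative:

```python
import yaml  # pyyaml

# Hypothetical semantic attributes layered onto the schema section of a contract.
# Nothing here is needed to access the data; it serves discovery and correct use.
semantics_yaml = """
dataset: PRODUCTS
columns:
  - name: length
    type: NUMBER
    unit: centimeter          # the unit of measure the storage type cannot carry
    description: Product length, packaging included
  - name: width
    type: NUMBER
    unit: centimeter
    description: Product width, packaging included
"""

columns = yaml.safe_load(semantics_yaml)["columns"]
print(columns[0]["unit"])  # -> centimeter
```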

As it stands today, catalogs and discovery tools are already broadly implemented. Chances are small that discovery is going to be the driver for companies to start adopting data contracts. That's different for data quality: a lot of companies are still looking to get started with or improve their data quality strategy. So we think it's most likely that companies will adopt data contracts to implement data quality.

Quality as the first use case

Data quality is usually the very first use case for data contracts because authoring checks and schema declaratively as code fits with the engineering workflows. In this section we start by looking at how data quality can be layered into contracts and then look ahead to how other data management workflows can be handled as well.

To see how data quality can be layered on top of a contract, we can consider these conceptual categories of data quality configuration: the schema check, extra data quality checks, and automated observability monitoring.

Content sections in a data contract related to data quality

The schema is already part of the essential technical contract information, and it can be leveraged for quality checking to verify that the expected schema is still in place. A new section with quality checks can be added to a contract so that engineers can configure and manage the quality checks as code. Automated observability monitoring configurations can also be added as a separate section; those can be very minimal, for instance a simple boolean flag to activate automated monitoring, optionally with flags on important columns to fine-tune the defaults. The main point is that for various data quality use cases, some information, like the schema, is already present, and other configurations, like quality checks, can be added to the contract.

Here's an example of what a data contract could look like with data quality checks and observability configurations:

Example data contract
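Sketched in text (the key names and check syntax are illustrative, not any specific tool's format), such a contract could look like this:

```python
import yaml  # pyyaml

contract_yaml = """
dataset: CUSTOMERS
owner: team-customers@example.com
access:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT}
  database: ANALYTICS
  schema: PROD
columns:
  - name: customer_id
    type: VARCHAR
    checks: [not_null, unique]
  - name: country
    type: VARCHAR
  - name: created_at
    type: TIMESTAMP
checks:                         # dataset-level data quality checks
  - row_count > 0
  - duplicate_count(customer_id) = 0
observability:
  automated_monitoring: true    # simple flag to activate anomaly monitoring
  monitored_columns: [created_at]
"""

contract = yaml.safe_load(contract_yaml)
print(f"{len(contract['columns'])} columns, {len(contract['checks'])} dataset checks")
```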

For the data quality use case, let's walk through the steps. First, contract verification needs to be set up to protect data consumers from changes in the data that break their usage. Contract verification implies that the schema is checked and that the data quality checks are evaluated. Notifications are configured for the cases where the data doesn't match the contract.

Contract verification should run either when new data arrives or when the contract changes. Verifying a contract when new data arrives is usually done in the orchestration pipeline itself: after the pipeline has produced new data, an extra step runs contract verification.

The contract will most likely change through git workflows like branching and pull requests (PRs). So to cover the contract-update event, contract verification is added to the CI/CD pipeline of the contract (and transformation pipeline) repo.

Contract verification reads the contract and verifies that the schema and data quality checks pass on the current data
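A minimal sketch of such a verification step, assuming a contract file like the one above, the pyyaml package, and a DB-API connection to a warehouse that exposes information_schema (the helper and its simplifications are mine):

```python
import os
import yaml       # pyyaml
import psycopg2   # any DB-API driver; Postgres is only an example


def verify_contract(conn, contract: dict) -> list[str]:
    """Compare the actual dataset against the contract and return failure messages."""
    failures = []
    cur = conn.cursor()

    # 1. Schema check: every contracted column must still exist.
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE lower(table_name) = lower(%s)",
        (contract["dataset"],),
    )
    actual = {name.lower(): dtype for name, dtype in cur.fetchall()}
    for col in contract["columns"]:
        if col["name"].lower() not in actual:
            failures.append(f"missing column: {col['name']}")
        # A real tool would also map and compare engine-specific type names here.

    # 2. Data quality checks: only a row-count check is implemented in this sketch.
    if "row_count > 0" in contract.get("checks", []):
        cur.execute(f"SELECT COUNT(*) FROM {contract['dataset']}")
        if cur.fetchone()[0] == 0:
            failures.append("check failed: row_count > 0")

    return failures


if __name__ == "__main__":
    # Run as the last step of the pipeline, or in CI when the contract changes.
    with open("customers.contract.yml") as f:
        contract = yaml.safe_load(f)
    conn = psycopg2.connect(os.environ["WAREHOUSE_DSN"])  # hypothetical env variable
    problems = verify_contract(conn, contract)
    if problems:
        raise SystemExit("contract verification failed:\n" + "\n".join(problems))
```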

Ensuring that contracts are verified every time the data or a contract potentially changes means you can trust the contract to be an accurate description of your data. As the business changes, so does the shape of your data, and relying on people's discipline to keep contracts in sync with the storage engine is not enough. Without automated verification, trust in the data contract gets lost, and with it all of its potential value. But with contract verification in place, contracts can be a declarative foundation that goes beyond the "API for data" principle.

Building data management on contracts

Many data management workflows and automations can be implemented with contracts as the foundation. Just as checks were added to the contract for data quality, other configurations and metadata can be added to control other workflows like discovery, access, privacy, and retention. A data contract is the perfect place to specify ownership, retention, access policies, and so on.

Looking at the newer set of data management tools, they adopt the principle of shift-left: the trend that moves data management interactions from web-based UIs to tools managed as code by the engineers on the data producer side. You'll see a lot of these tools being configured with configuration or YAML files.

In this context, we see data contracts taking a central place from which all of those tools can be configured. This will lead to a massive simplification of managing the large variety of data tools.

Contracts become the central UX through which engineers can configure their aspects for many tools in the data stack

Now you can see that with contracts we've created a declarative and extensible way for engineers to manage their parts of the workflow. Contracts form the foundation for many of the data management workflows. This approach therefore has huge potential to simplify the integration of the different tools in the data stack.
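A sketch of that pattern: one contract file, with each tool reading only the section it owns. The section names and hand-offs below are illustrative of the idea, not a real integration:

```python
import yaml  # pyyaml

contract_yaml = """
dataset: ORDERS
owner: team-orders@example.com
access_policy:
  readers: [analytics, finance]
retention:
  delete_after_days: 365
discovery:
  tags: [core, customer-facing]
checks:
  - row_count > 0
"""
contract = yaml.safe_load(contract_yaml)

# Each tool in the stack picks up the section it cares about.
handlers = {
    "checks": lambda cfg: print("quality tool schedules", len(cfg), "checks"),
    "access_policy": lambda cfg: print("access tool grants read to", cfg["readers"]),
    "retention": lambda cfg: print("retention job keeps", cfg["delete_after_days"], "days"),
    "discovery": lambda cfg: print("catalog publishes tags", cfg["tags"]),
}
for section, handle in handlers.items():
    if section in contract:
        handle(contract[section])
```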

Frictionless introduction of contracts

Of course we cannot assume that companies are going to redesign their data architectures on top of data contracts. So it must be possible to phase data contracts in step by step.

For starters, let’s assume you have an existing data infrastructure that is running. Introducing contracts can be done step by step. Identify a small scope like one dataset, one transformation or one ingestion pipeline.

A good rule of thumb is to focus on the analytical teams. There, look for the datasets that are shared across teams and that are critical to the business's day-to-day operations. Those are the ones that require contracts, as one team should never rely on the internals of a component owned by another team. Here's how to prioritize the creation of contracts for existing datasets:

  1. Take a consumer-first approach. Start with datasets on the consumer side and work your way upstream to the source.
  2. Next, focus on the datasets that are handovers between different teams. That’s important for encapsulation.
  3. Gradually work your way upstream to cover all datasets including ingestion.

What about datasets that an analytics team depends on and for which we can't rely on the producer to create a data contract? This might be the situation when you purchase data from an external vendor, or when an upstream data team does not yet have the capacity to produce contracts. In that case it's the responsibility of the consumer to set up contracts and verification (see below). But it definitely makes sense to push that responsibility as far upstream, towards the producer, as possible.

Next create contracts for those datasets and ensure contract verification. Your pipelines will continue to run as before but now with data quality monitoring enabled.

Once contract verification is monitoring data quality, you may wish to reconsider how you connect the datasets to your discovery tool. That tool is probably connected directly to the storage engine to extract the metadata from there. But once you have reliable contracts in place, you could switch to ingesting the contract information instead. That has the extra benefit that much more information can be ingested in one go.

Active sync

You may be asking yourself: how much work is it to manage these contracts manually in git? Data engineers are proficient enough to get this done, but there is extra tooling that can help, and we've called this notion 'active sync'. To fully appreciate what active sync does, let's review the sources of changes to contracts. The foundation is a copy of the storage engine's schema; you could consider the storage engine's metadata as the source of truth and hence as a source of contract changes. Separately, there is all the other information and configuration in contracts, for all kinds of data management workflows, that is managed by engineers.

An active sync tool can help engineers keep the contract in sync with schema changes, while the engineers manage the other configurations in the contract

The maintenance of the schema information is a great candidate for automation. Engineers update transformations that change the schema of the dataset, and contract verification in the CI/CD flow will fail. Since we want contracts to be an accurate description of the data, we want every schema change to be pushed to the contract in git; no manual decision is needed. Those schema changes can be pushed to the contract on the PR, or on a branch, as plain commits.

When adding new columns, you may have to consider when the updated contract reaches production contract verification. A new column may have to be marked as 'anticipated' if publishing the contract is not tied to the pipeline code in the same repository.

For removed columns, it always makes sense to mark them as 'removed' instead of simply deleting the column from the contract. It even makes sense to push this kind of deprecation and removal into discovery tools.
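A sketch of how an active sync step could apply those two rules, with hypothetical 'anticipated' and 'removed' status flags:

```python
def sync_schema(contract: dict, actual_columns: dict[str, str]) -> dict:
    """Reconcile the contract's schema with the schema found in the storage engine."""
    contracted = {c["name"]: c for c in contract["columns"]}

    # Columns that appeared in the storage engine but are not in the contract yet.
    for name, dtype in actual_columns.items():
        if name not in contracted:
            contract["columns"].append({"name": name, "type": dtype, "status": "anticipated"})

    # Columns that disappeared: flag them instead of silently dropping them.
    for name, col in contracted.items():
        if name not in actual_columns:
            col["status"] = "removed"

    return contract


contract = {"dataset": "CUSTOMERS", "columns": [
    {"name": "customer_id", "type": "VARCHAR"},
    {"name": "fax_number", "type": "VARCHAR"},
]}
actual = {"customer_id": "VARCHAR", "country": "VARCHAR"}  # as read from the engine
print(sync_schema(contract, actual)["columns"])
```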

Another use for the sync tool is simplifying the initial contract. Often, engineers will already have a version of the dataset in staging or production. The active sync tool can extract the metadata from the storage engine and generate the initial contract file, removing the burden of crafting these first contracts by hand. The basics of the contract, such as access details, dataset name, and schema, can simply be generated. Once generated, the contract is ready to be pushed and managed in git.

A contract sync tool can also simplify getting started by generating the initial version of a contract
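A sketch of that bootstrap step, assuming a DB-API connection to the warehouse and the hypothetical key names used in the earlier contract sketches:

```python
import os
import yaml       # pyyaml
import psycopg2   # any DB-API driver; Postgres is only an example


def bootstrap_contract(conn, dataset: str) -> str:
    """Generate an initial contract from the storage engine's own metadata."""
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE lower(table_name) = lower(%s) ORDER BY ordinal_position",
        (dataset,),
    )
    contract = {
        "dataset": dataset,
        "access": {"type": "postgres", "dsn": "${WAREHOUSE_DSN}"},
        "columns": [{"name": name, "type": dtype.upper()} for name, dtype in cur.fetchall()],
    }
    return yaml.safe_dump(contract, sort_keys=False)


if __name__ == "__main__":
    conn = psycopg2.connect(os.environ["WAREHOUSE_DSN"])   # hypothetical env variable
    with open("customers.contract.yml", "w") as f:
        f.write(bootstrap_contract(conn, "CUSTOMERS"))
```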

Conclusion

We believe that contracts will lead to a paradigm shift in data. I hope to have shown how data contracts can become the foundation for many of the important data management workflows. They can be introduced step by step, without requiring changes to the existing data architecture, and they fit the data engineers' way of working (shift-left). Contracts also have the power to prevent a lot of data issues by improving the workflows and communication between producers and consumers. Furthermore, the common foundation will make it easier to integrate the large variety of data tools. As data contracts get adopted, analytics teams will be able to handle more data in production and serve the business better with their analytical needs.


Tom Baeyens

CTO & Co-founder of Soda, the leading modern data quality platform. Now building data contracts to empower everyone to share and use reliable, high-quality data.