Data contracts

A data counterpart of software engineering best practices

Wannes Rosiers
Mar 17, 2023

I have been lucky enough to work at a smaller Belgian company that was part of a larger European group. As it represented only 0.25% of the group’s revenue, it wanted to act as a lab environment to stay relevant. As such, I got the very exciting assignment to lead a team building an entirely new data landscape from scratch. The ambition was to connect that landscape to the new operational system being built in parallel out of open-source CRM, billing and payment systems.

That’s where the excitement turned into madness. The systems were tightly coupled: the databases were integrated directly, without an API layer in between. If chills ran down your spine upon reading “tightly coupled” and “without an API layer” in one sentence, you were not wrong. A direct consequence of this architecture was that every deployment involved all systems.

Photo by Valery Fedotov on Unsplash — Tightly coupled systems: one tiny change might collapse the entire system

The same holds true for the connection to your data landscape. For years, data landscapes have been tightly coupled to the operational systems, lacking the counterpart of an API layer to separate them and stabilize the connection. Every deployment of the monolithic collection of open-source systems disrupted the data model of at least one of them, and because of that direct dependency, all data pipelines broke down.

Image by author — Data landscape breaking down by operational deployments

Adopting software engineering best practices

The term API, or Application Programming Interface, was introduced in the 1960s, and the underlying concept dates back even to the 1940s. It is used for the actual interface, the API contract, as well as for the API specification, which describes how to build or use the interface. The main purpose of APIs is to hide the internals of a system, exposing only what is relevant to the API consumer, and keeping that consistent even when internal details of the system change. This allows software developers to freely change those internals while offering stability to those who depend on their services.
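
To make that concrete, here is a minimal sketch in Python (all names are hypothetical, not from a real system): the public function is the stable interface, while everything behind it is an internal detail the developer may change freely.

```python
# Sketch: the exposed function is the contract; internals stay hidden.

def get_customer(customer_id: str) -> dict:
    """Public interface: consumers can rely on this shape staying stable."""
    row = _fetch_from_storage(customer_id)  # internal detail, free to change
    return {"id": row["id"], "email": row["email"]}  # expose only what is relevant

def _fetch_from_storage(customer_id: str) -> dict:
    # Today an in-memory lookup, tomorrow a database query.
    # Consumers of get_customer never notice the difference.
    return {"id": customer_id, "email": "jane@example.com", "internal_score": 0.87}
```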

Such a self-describing and immutable interface must sound like music to the ears of data engineers who have suffered the same data landscape breakdowns I did. Luckily, this part of software engineering best practices is finally being adopted by the data world! It is called a data contract.

Contracts as interface

An API contract is something that both the API provider and the API consumer can agree upon. It is a shared understanding of what the capabilities of a digital interface are. Despite the name, there is little to no enforcement of API contracts: it’s up to the API producer to live up to the contract. API contracts therefore don’t mitigate disruptive changes, but they at least introduce a process for communicating about them.

Photo by Romain Dancre on Unsplash — A data contract, an official agreement

Just like an API contract, a data contract is an agreement between a data producer and a data consumer. It refers to the management and intended usage of data between different teams or even different organizations. It aims to create an immutable, self-describing, reliable and stable interface to your data. As such, it ensures reliable, high-quality data that can be trusted by all parties involved.

The adoption of data product thinking and the creation of domain-bounded data products have led to more data handovers. Data contracts mature these handovers. When applied to the handover from the operational to the analytical data plane, data contracts have a direct impact on the stability of your entire data ecosystem.

Contracts — a semantic discussion

As mentioned in the introduction, the term API is used both for the actual interface and for the specification. During some interesting discussions I had about data contracts, it became clear that the same holds for data contracts, yet we often overlook that we use the term for two different things.

More and more is being published about data contracts, and more and more elements are being named as crucial parts of a data contract. Relevant metadata like an owner, quality metrics, access policies or a full SLA have all been considered part of the data contract.

I agree that all this metadata is relevant, but even though I might be heading into a semantic discussion: I don’t agree that all of it belongs in the actual data contract. It doesn’t describe the actual interface, although it might increase trust in that interface.

Let’s return to the initial purpose of a data contract: hide the internals of the system, expose only what is relevant to the consumer, and keep that consistent over time. What you minimally need for that is a description and type of the available fields, as well as the expected values. Nothing more.
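
As a sketch of how small that minimal contract can be (the dataset and field names below are hypothetical), a plain Python structure already covers field descriptions, types and expected values:

```python
# A minimal data contract: field names, types, descriptions
# and expected values. Nothing more. All names are hypothetical.
orders_contract = {
    "dataset": "orders",
    "version": "1.0.0",
    "fields": {
        "order_id": {"type": "string", "description": "Unique order identifier"},
        "amount": {"type": "decimal", "description": "Order total, non-negative"},
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code",
            "allowed_values": ["EUR", "USD", "GBP"],
        },
        "created_at": {"type": "timestamp", "description": "Order creation time (UTC)"},
    },
}
```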

Photo by Nick Noel on Unsplash — A purse, a beautiful package to hide messy internals

Let me take access policies as an example. For APIs, the API token is not part of the API contract; even the description of how to obtain a token is not part of it. The same holds for a data contract: having access policies in place is important, but the policies themselves are not part of the contract. Compare it to a rental agreement: the key to your house is not part of your rental contract. It is handed to you upon signing the contract. More about the link between data access management and data contracts can be found in another blog of mine.

Contract enforcement

Living up to the API contract has always been the responsibility of the API producer. Recently, with the emergence of microservices, APIs have risen in popularity, which has led to the introduction of contract-driven development: a defined API contract allows API consumers to start developing before the actual API is built. As a result, one can also computationally check that API producers live up to the contract.
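
As an illustration, a consumer could write a test like the sketch below against the agreed contract before the producer’s API even exists. This is a hedged example using the jsonschema Python library; the endpoint and fields are hypothetical.

```python
import jsonschema  # pip install jsonschema

# The agreed contract, expressed as JSON Schema.
customer_contract = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["id", "email"],
}

def test_producer_lives_up_to_the_contract():
    # In a real contract test this response would come from the
    # producer's API; here it is stubbed until that API exists.
    response = {"id": "c-123", "email": "jane@example.com"}
    jsonschema.validate(instance=response, schema=customer_contract)  # raises on violation
```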

Again, these software engineering best practices are moving to the data world: more and more vendors claim to offer a data contract solution. Fair to say, most of them are still in a very early phase. Some even go beyond monitoring adherence to the contract and aim to computationally enforce it.

This enforcement can happen on two different levels: either by integrating with the deployment pipeline and only allowing agreed-upon contracts to be deployed, or by monitoring the actual data flowing through the contract. When integrating with the deployment pipeline, you aim to prevent data contract changes that have not been explicitly confirmed. When monitoring the data flows, you aim to prevent data in a non-agreed format from flowing through your data landscape.
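
A rough sketch of both levels, reusing the hypothetical contract structure from earlier (not a real vendor implementation):

```python
# Level 1: deployment-time check — block contract changes that were
# not explicitly agreed upon (here: removed fields or changed types).
def breaking_changes(old: dict, new: dict) -> list:
    issues = []
    for name, spec in old["fields"].items():
        if name not in new["fields"]:
            issues.append(f"field removed: {name}")
        elif new["fields"][name]["type"] != spec["type"]:
            issues.append(f"type changed: {name}")
    return issues

# Level 2: runtime check — flag records that do not match the
# agreed format before they flow further through the landscape.
def violations(record: dict, contract: dict) -> list:
    issues = []
    for name, spec in contract["fields"].items():
        if name not in record:
            issues.append(f"missing field: {name}")
        elif spec.get("allowed_values") and record[name] not in spec["allowed_values"]:
            issues.append(f"unexpected value for {name}: {record[name]}")
    return issues
```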

Photo by krakenimages on Unsplash — Successful communication and collaboration often beat enforcement

I do not believe either of them is the way to go. These approaches are too rigid, blocking your company from being agile. As a data contract is an agreement between data producer and data consumer, I believe you should trust the agreement process: producers and consumers should communicate when adaptations are required. By blocking adaptations, you might introduce bigger problems. If data is blocked from being published via a data contract, that might violate the freshness SLA in the data contract specification. And blocking deployments might result in lengthy development cycles when a change is actually required, such as removing a field, or at least making it nullable, when that field has been removed in the source system.

The future of data contracts

Data contracts are here to stay. One can no longer imagine software development without APIs, and the same will become true for data engineering and data contracts. But as we are still very early in the rise of data contracts, solutions to assist data engineers will pop up and disappear again, leading to a more common definition of the concept. I am very much looking forward to this era of increased data stability!
