A Data Contract: What Do We Really Need to Have?
Data is an essential driver of business decisions. However, without defined governance and access control, data stays locked away because nobody trusts it. Trust is at the heart of data sharing, and it has to be justified.
To organize and justify trust among the parties involved, various types of contracts, including data contracts (DCs), are needed so that any party can refer to them at any moment.
Data contracts have been a hot topic in the data community since 2022. When PayPal open-sourced its data contract template, many teams began referring to it and trying to implement it in their organizations.
But what are the real benefits of data contracts, and what key components should an organization have in place?
Also, who owns data contracts? Data providers? Data owners? Data stewards?
Data contract definition
In simple terms, a data contract is an agreement between data producers and data consumers, expressed as machine-readable code. According to authors such as Andrew Jones, a data contract can be written in various formats, such as Protobuf (common among Kafka users), custom JSON, YAML, or JSON Schema. Whichever format is chosen, it should be the same across the whole chain, from data producers to data consumers, so that all parties work from one representation of the agreed contract.
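To make "machine-readable" a bit more concrete, here is a minimal sketch in Python. It assumes a custom JSON contract for a hypothetical `orders` dataset; the field names, the type names, and the `violations` helper are my own illustration, not part of any standard. The point is only that producer and consumer can both load the same file and check records against it.

```python
import json

# Hypothetical contract, serialized as JSON so both parties read the same document.
CONTRACT_JSON = """
{
  "dataset": "orders",
  "fields": {
    "order_id": "string",
    "amount": "number",
    "currency": "string"
  }
}
"""

# Mapping from contract type names to Python types (an assumption of this sketch).
TYPE_MAP = {"string": str, "number": (int, float)}


def load_contract(text: str) -> dict:
    """Parse the JSON contract into a dictionary."""
    return json.loads(text)


def violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for name, type_name in contract["fields"].items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            problems.append(f"wrong type for {name}: expected {type_name}")
    return problems


contract = load_contract(CONTRACT_JSON)
good = {"order_id": "A-1", "amount": 19.99, "currency": "EUR"}
bad = {"order_id": "A-2", "amount": "19.99"}

print(violations(good, contract))  # no violations
print(violations(bad, contract))   # wrong type for amount, missing currency
```

The producer could run the same check before publishing and the consumer before reading, which is the practical meaning of using one format across the chain.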
Key components
PayPal’s data contract template lists the requirements and components of a data contract (DC) and offers guidelines, with extensive examples, for taking the first steps with data contracts in an organization. Source: Open Data Contract Standard.
Implementing every component and setting of the PayPal template from the start might be difficult. Therefore, in my opinion, we can scale down to a minimal set of key components as a starting point, keeping in mind that data contracts evolve over time and can be extended.
Data Contract Key Components:
- Schema: Describes the structure of the data (for example, the columns and types of a table in a data warehouse) and serves as the interface that consumers build against.
- Data Owner: The owner of the data should be known as a contact person to manage the data. Unowned data is unmanaged and often unusable.
- Version: Lets the data contract evolve under appropriate change control, which I consider the most valuable part of data contracts.
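As a rough illustration of how these three components can work together, the sketch below models a contract with a schema, an owner, and a version, and applies one possible change-control rule. Everything here is my own assumption: the `DataContract` class, the semantic-versioning convention (MAJOR.MINOR.PATCH), and the rule that removing a column or changing its type requires a major version bump. It is not taken from the PayPal template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    version: str  # "MAJOR.MINOR.PATCH" (assumed convention)
    owner: str    # contact person or team for the data
    schema: dict  # column name -> type name


def is_breaking(old: DataContract, new: DataContract) -> bool:
    """A change is breaking if an existing column was removed or changed type."""
    for column, col_type in old.schema.items():
        if new.schema.get(column) != col_type:
            return True
    return False


def major(version: str) -> int:
    """Extract the major number from a semantic version string."""
    return int(version.split(".")[0])


def change_allowed(old: DataContract, new: DataContract) -> bool:
    """Require an owner, and a major version bump for breaking changes."""
    if not new.owner.strip():
        return False  # unowned data is unmanaged
    if is_breaking(old, new):
        return major(new.version) > major(old.version)
    return new.version != old.version


v1 = DataContract("1.0.0", "orders-team@example.com",
                  {"order_id": "string", "amount": "number"})
added_column = DataContract("1.1.0", "orders-team@example.com",
                            {"order_id": "string", "amount": "number",
                             "currency": "string"})
dropped_column = DataContract("1.2.0", "orders-team@example.com",
                              {"order_id": "string"})
dropped_with_bump = DataContract("2.0.0", "orders-team@example.com",
                                 {"order_id": "string"})

print(change_allowed(v1, added_column))       # additive change, new version
print(change_allowed(v1, dropped_column))     # breaking change without major bump
print(change_allowed(v1, dropped_with_bump))  # breaking change with major bump
```

A check like this could run in a pull request whenever the contract file changes, so that consumers are never surprised by a removed column under an unchanged major version.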
Benefits of Data Contracts
- Data contracts give data producers and engineers the autonomy to define their own settings and rules for their data.
- Data contracts make data collection, sharing, and reuse easier.
- Data contracts accelerate processes and break down data silos.
- Data contracts organize trust and help data producers/owners have control over their data.
Who owns data contracts?
Data contracts are owned by data providers and help them stay in control of their data, in line with data sovereignty within federated domains.
The term “data contract” refers to a specification that is usually owned by the data provider and thus does not align with a “contract” in a legal sense as a mutual agreement between two parties. The term “contract” may be somewhat misleading, but it is how it is used by the industry. The mutual agreement between one data provider and one data consumer is the “data usage agreement” that refers to a data contract. Data usage agreements have a defined lifecycle, start/end date, and help the data provider to track who accesses their data and for which purposes.
Source: Data Contract Specification | Data contracts bring data providers and data consumers together.
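The distinction between a data contract and a data usage agreement can be sketched in a few lines. The example below assumes a hypothetical `DataUsageAgreement` record that points to a contract by an identifier and carries a consumer, a purpose, and a start and end date; the class, the field names, and the sample teams are my own illustration and not the format used by the Data Contract Specification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataUsageAgreement:
    contract_id: str  # the data contract this agreement refers to
    consumer: str
    purpose: str
    start_date: date
    end_date: date

    def is_active(self, on: date) -> bool:
        """Active from the start date through the end date, inclusive."""
        return self.start_date <= on <= self.end_date


def who_has_access(agreements, contract_id: str, on: date) -> dict:
    """Map each consumer with an active agreement on the given day to its purpose."""
    return {
        a.consumer: a.purpose
        for a in agreements
        if a.contract_id == contract_id and a.is_active(on)
    }


agreements = [
    DataUsageAgreement("orders", "finance-team", "monthly revenue reporting",
                       date(2024, 1, 1), date(2024, 12, 31)),
    DataUsageAgreement("orders", "ml-team", "demand forecasting",
                       date(2024, 6, 1), date(2024, 8, 31)),
]

print(who_has_access(agreements, "orders", date(2024, 7, 15)))
print(who_has_access(agreements, "orders", date(2024, 10, 1)))
```

One contract can have many agreements, one per consumer, which is how the provider keeps track of who uses the data and why.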
Useful Links and tools
- Data Contract Specification | Data contracts bring data providers and data consumers together.
- GitHub — datacontract/datacontract-cli: CLI to manage your datacontract.yaml files
- Data Mesh Manager (datamesh-manager.com)
- Data Contract Editor
This is just the beginning
This article represents my personal view on data contracts: what they are, why they are important, and how to define them. In upcoming articles, I will explain how to apply the key components described above in a use case and enforce them at runtime.
I believe that learning from each other is the best way to improve our knowledge, so any feedback is welcome.