Data Contracts: Revolutionizing Data Management and Governance in the Tech Industry

Daniel Hoyos
Blue Orange Digital
Dec 7, 2023

In today’s tech world, effective data management and governance are critical challenges for organizational growth and expansion. Traditional approaches often create a disconnect between those who produce data and those who consume it, resulting in inefficiencies, increased costs, and, in many cases, missed time-to-market targets. Data Contracts offer a practical solution to this problem. They are agreements that clearly define the structure, format, and use of data, aiming to improve communication and reduce misunderstandings between data teams. This article explores how Data Contracts streamline data governance and management, focusing on the tangible benefits they bring to businesses by maximizing their impact in today’s complex data ecosystems.

The Problem

The central issue with the traditional centralized data warehouse model was the barrier it created between those who produce data and those who use it. Although data governance and management programs aimed to connect these groups, they rarely succeeded in promoting real interaction. Data warehouse teams often ended up as middlemen, which led to significant communication challenges. As a result, data was frequently structured poorly: schemas were misaligned with their intended use cases and failed to fully meet consumers’ needs, forcing analytics and machine learning teams into significant reformatting and manipulation. These issues snowballed, contributing to inefficiencies, making companies slow to adapt, adding unnecessary complexity, and driving up the cost of scaling operations.

To bring the underlying issue home, Chad Sanderson describes the Garbage In, Garbage Out cycle, which exemplifies this snowball effect:

1. Databases are treated as nonconsensual APIs

2. With no contract in place, databases can change at any time

3. Producers have no idea how their data is being used downstream

4. Cloud platforms (e.g., Snowflake) are not treated as production systems

5. Datasets break as changes are made upstream

6. Data Engineers inevitably must step in to fix the mess

7. Data Engineers begin getting treated as middlemen

8. Technical debt builds up rapidly — a refactor is the only way out

9. Teams argue for better ownership and a ‘single throat to choke’

10. Critical Production systems in the cloud (ML/Finance) fail

11. Blatant Sev1s impact the bottom line, while invisible errors go undetected

12. The data becomes untrustworthy

13. Big corporations begin throwing people at the problem

14. Everyone else faces an endless uphill battle

At this stage, data cannot answer the most basic business questions. There are layers upon layers of features built on top of untrustworthy data that nobody owns, and hence the business can neither scale nor move at the speed it requires.

The Concept

Data contracts address the complexities inherent in producing operational data and making it available to downstream teams for the creation of impactful business data products. These contracts serve as a foundational element in clarifying data ownership and maintenance responsibilities, establishing who is accountable for data upkeep so that the data remains trustworthy and secure.

Furthermore, data producers gain greater awareness of downstream usage by jointly developing, with the data consumer, a well-defined quality specification for how the data should be structured, versioned, managed, and maintained. This, in turn, brings a clear understanding of how changes to the data might affect downstream applications.

Data contracts work like APIs for data: they are agreements between data producers and consumers that define the structure, format, quality, and terms of use for data exchange, ensuring reliable and consistent delivery to data consumer teams such as analytics and ML. They improve data quality and promote efficient data governance, especially in complex data ecosystems.
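To make the idea concrete, here is a minimal sketch in Python of what such an agreement might capture. The DataContract class, the orders example, and every field name are illustrative assumptions for this post, not the API of any particular tool:

```python
from dataclasses import dataclass

# A minimal, illustrative data contract: the schema plus the
# quality and usage terms agreed between producer and consumer.
@dataclass(frozen=True)
class DataContract:
    name: str
    version: str
    owner: str                     # accountable producer team
    schema: dict[str, type]        # column name -> expected type
    freshness_sla_hours: int       # how stale the data may become
    allowed_uses: tuple[str, ...]  # agreed terms of use

def validate_record(contract: DataContract, record: dict) -> list[str]:
    """Return a list of violations for one record; empty means compliant."""
    violations = []
    for column, expected_type in contract.schema.items():
        if column not in record:
            violations.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            violations.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    return violations

# Hypothetical contract for an orders feed.
orders_contract = DataContract(
    name="orders",
    version="1.0.0",
    owner="checkout-team",
    schema={"order_id": str, "amount_usd": float, "created_at": str},
    freshness_sla_hours=24,
    allowed_uses=("analytics", "ml_training"),
)

print(validate_record(orders_contract, {"order_id": "A1", "amount_usd": "9.99"}))
# ['amount_usd: expected float, got str', 'missing column: created_at']
```

In practice, the same information often lives in a YAML or JSON spec checked into version control, so both parties can review and version it like any other code.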

When a change to a contract is necessary, it initiates a discussion similar to a Pull Request (PR). The Data Producer team acts as the initiator, proposing the change, while the Data Consumer team takes on the role of reviewer, and a negotiation begins. Once both parties reach an agreement, the Data Producer updates the PR so that it addresses the Consumer’s requests while also meeting operational requirements. A revised contract is then established, incorporating the agreed-upon changes. This allows the Data Consumer to anticipate the impending changes and prepare for them effectively.
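One way to give that review step teeth is a check, run in CI, that compares a proposed schema against the current contract and blocks the merge until every flagged change has consumer sign-off. A minimal sketch, reusing the hypothetical schema shape from the previous example:

```python
def breaking_changes(old_schema: dict[str, type],
                     new_schema: dict[str, type]) -> list[str]:
    """Flag schema changes that would break downstream consumers."""
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] is not old_type:
            problems.append(
                f"retyped column: {column} "
                f"({old_type.__name__} -> {new_schema[column].__name__})"
            )
    return problems  # new columns are additive, hence allowed silently

v1 = {"order_id": str, "amount_usd": float, "created_at": str}
v2 = {"order_id": str, "amount_usd": int}  # proposed change under review

for problem in breaking_changes(v1, v2):
    print("needs consumer sign-off:", problem)
# needs consumer sign-off: retyped column: amount_usd (float -> int)
# needs consumer sign-off: removed column: created_at
```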

Having a contract prompts mandatory, proactive conversations between the involved parties before meaningful changes can be made to the data. This fosters an environment of ownership and collaboration.

The Caveats

While this approach sounds promising, does it mean that all problems are instantly resolved? The simple answer is no.

Applying quality, management, and governance universally is not feasible, and it’s impossible to satisfy every data consumer at all times. Data producer teams face the substantial challenge of adhering to contract specifications and Service Level Agreements (SLAs). However, it’s crucial to develop these contracts with a focus on the consumer, especially for data products that have significant business impact and require stringent data quality and governance.

The Pareto principle offers valuable guidance in decision-making, particularly in determining which data products to support. We should ask: which 20% of data products contribute to 80% of the business impact? These high-impact products are the ones that require robust quality, management, and governance to ensure the data remains trustworthy.
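As an illustration, once each data product has an impact estimate, that cut can be computed mechanically. The products and dollar figures below are invented for the example:

```python
# Hypothetical impact scores, e.g., revenue attributed to each data product.
impact = {
    "churn_model": 400_000,
    "exec_dashboard": 250_000,
    "ad_hoc_reports": 40_000,
    "legacy_extract": 10_000,
}

def pareto_cut(impact: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Smallest set of products, by impact, covering `threshold` of the total."""
    total = sum(impact.values())
    covered, selected = 0.0, []
    for product, value in sorted(impact.items(), key=lambda kv: -kv[1]):
        selected.append(product)
        covered += value
        if covered / total >= threshold:
            break
    return selected

print(pareto_cut(impact))  # ['churn_model', 'exec_dashboard']
```

These are the products that would get contracts, SLAs, and strict governance first; the long tail can stay lightweight.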

Building data contracts into a data product’s life cycle achieves this: it fosters growth and offers the flexibility to accommodate change, with constraints focused on preserving business value.

The Data Product Life Cycle

Every data product requires a process of experimentation, involving trial and error, and essentially, a proof of concept. This is crucial for organizations to determine whether the proposed ideas will positively impact the business as anticipated. The process should unfold as follows: The data producer team develops a production pipeline tailored to the needs of the data consumer, who is working on an innovative data product. During this phase, the data consumer explores and iterates on the data, culminating in the creation of a Minimum Viable Product (MVP) that could be a report, dashboard, or ML model. The MVP’s role is to validate the proposed value (the specifics of which will be discussed in another blog post). Only if the MVP successfully demonstrates its value should a contract be established to support the specific use case.

The cycle of development should, and indeed must, continue as long as each iteration demonstrably adds business value. With each cycle proving its worth, iterations and modifications to the contract are permissible to accommodate and support the evolving functionalities of the data product.

The Market Reaction

The promise of solving so many of these issues has earned Data Contracts widespread attention in the data industry.

A testament to the growing interest in data contracts is the $7 million investment in September 2023 in Gable.ai, the startup led by CEO Chad Sanderson, a prominent figure in the field. Sanderson’s influence in the industry is evident: he has cultivated a LinkedIn following of approximately 75,000 since early 2022 and established a popular Substack newsletter, Data Products, which boasts over 13,000 subscribers. Gable.ai has been at the forefront of advocating for data quality and governance, with a particular emphasis on data contracts, and many in the industry are eagerly awaiting its beta release, hoping to integrate a built-in data contract solution into their systems.

dbt Labs has established itself as a prominent advocate of Data Mesh, a decentralized management architecture built around domain-specific data. Their commitment to this approach was evident at Coalesce 2023, where they actively promoted Data Mesh and its associated features. A key development in this area is the ‘model contract’, dbt’s take on the data contract. Discussion and development around the concept have been ongoing since early 2023, as evidenced by activity in the dbt-core GitHub repository. Built-in functionality for enforcing data contracts was initially integrated in dbt-core v1.5. Since then, it has not only become a core feature but has also seen significant refinements and enhancements, as seen in the latest v1.7.

It’s interesting to observe that dbt Labs’ involvement goes beyond integrating this new data contract feature into their platform, which helps them maintain their leading position in data management, quality, and governance. The founders of dbt Labs have gone a step further by becoming angel investors in Gable.ai. This move hints at a strategic collaboration reminiscent of early-stage partnerships like that of Microsoft and OpenAI. While it’s still early to predict the full scope of this partnership, it certainly opens up intriguing possibilities in the data management landscape.

Closing Thoughts

The emergence and adoption of data contracts in the data industry mark a pivotal shift towards more efficient and reliable data management practices. However, they are not a panacea for all data management challenges. Implementing data contracts requires a balanced approach, respecting the Pareto principle and focusing on the most impactful data products. Quality, management, and governance remain central to this process, ensuring that the most critical data receives the attention it deserves.

While data contracts represent a significant advancement in addressing the complexities of modern data ecosystems, their success hinges on a nuanced, consumer-centric approach: one that prioritizes impactful data products and fosters a culture of collaboration and continuous improvement with data producers. As the industry evolves, so must our strategies, to ensure that we are not just managing data but empowering businesses to harness its true potential.
