Data Contract 101

Jean-Georges Perrin
ProfitOptics
7 min read · Sep 10, 2023


A quick and not-so-dirty introduction to data contracts

Data Contract
A data contract acts as an agreement between multiple parties; specifically, a data producer and its consumer(s).

OK, that’s it; my job is done. You now know everything about data contracts. Not quite, I guess… In this article, I will dig a little deeper. You will learn about use cases, why data contracts build trust, and how our industry got here. I will then talk about open standards and conclude with some examples.

A quick story about data scientists

Imagine that Beth, a data scientist in your organization, wants to access some applicants’ data from the company’s HR package. As she wants to run models, accessing data via API would be too slow and not resource-efficient. She needs a data pipeline that will extract the data from the HR solution, transform it into something she can use, and load it into, let’s say, a lakehouse. So far, nothing surprising, right?

Let's help Beth, our data scientist, find, access, and use data in a more efficient way.

However, as she looks at the data, she sees multiple emails and phone numbers. The transformation process anonymized the applicants’ names, as expected. But where does she find information about the datasets she needs: which fields are anonymized, when the data is available and updated, and more?

She can definitely find the information in a wiki, Confluence, SharePoint, or another system. Still, we all know that documentation is a pain to maintain, usually falls behind, and, when it comes to measuring service levels, is quasi-nonexistent.

That’s where the data contract comes in. As you will see in the rest of this article, it is much more than a documentation tool.

So, what’s a data contract?

As you saw from Beth’s story, the data contract:

  • Creates a link between data producers and data consumers.
  • Creates a link between a logical representation of the data and its physical implementation.
  • Describes “meta meta” data: rules, quality, and behavior (yes, there are two metas in this sentence).

Let’s dive in!

Data contracts build trust

Data consumers, like Beth, are doubtful about the stability of the data they find, and — very often — the first thing they do is make a copy of the data for themselves. One of their reasons is that they don't know if the data will be there tomorrow… or the day after.

To create trust, the data producer or the data owner needs to make a promise and show that they can keep it.

History

The term “data contract” may be relatively new (although it creates much confusion with cell phone contracts). Still, the concept behind it and its usage are not new.

In the late 80s and 90s, CASE tools (computer-aided software engineering) used enriched metadata from relational databases to generate static code and build business applications. Popular tools like Visual Basic (as early as its version 3) already had such features.

In the early 2000s, new frameworks like Awoma ThinStructure or JBoss Hibernate leveraged dynamically generated and enhanced database schemas to help build data-related applications (as in, is there such a thing as a non-data-related application?).

The rise of Big Data and the dump-all-you-want-in-my-data-lake period paused the evolution of those tools: the focus on centralizing everything created never-ending and constantly over-budget projects. The 2010s showed us incredibly complex systems with limited results.

More and more companies are switching to a federated (and not decentralized) organizational model. Consumer teams on the ground (on the factory floor) are no longer passive participants; they share feedback, define local policies, make better decisions based on better metadata, and more. Is it working? It’s too soon to tell, but I don’t see major flaws so far.

As I wrote, data contracts (or whatever their names are) can empower teams, enhance overall governance, and decrease the time to market new applications.

Requirements & implementation

I have seen and heard about the many forms and shapes a data contract can take: some companies write a formal contract, signatures at the bottom included; others use free-flow Excel sheets, Word documents, or Python code… In all honesty, I don’t know which one of those formats is the worst.

For me, a data contract should be:

  • Easily read, interpreted, and enhanced by a computer.
  • Easily read, interpreted, and enhanced by a human (they still matter).
  • Version controlled.

YAML, a common file format used in software engineering, fits those criteria.

Benefits of an open standard

Of course, having a file format is not enough. The content of this file and its structure matter. If three vendors defined three different versions of HTML, would the web have been such a success?

Regarding data contracts, the need for standardization is even stronger. As you will see in the next section, a data contract has different parts fulfilling different needs: data quality can be evaluated by one vendor, another could focus on the history of stakeholders, a third will analyze service-level objectives (SLO)… Human creativity becomes the limit.

If we had disparate standards, a lot of information would be lost, and time would be wasted.

That’s the reason why PayPal originally open-sourced its data contract template and why a non-profit, user-driven organization like the AIDA User Group assumed responsibility for continuing to develop, nurture, and foster the standard. Companies like ProfitOptics also understand the benefits and are helping customers implement and deploy data contracts.

As it is an open standard, please join the working group. Contact us via GitHub.

Data contracts help communication. Communication avoids conflicts. Conflicts are bad. Data contracts are good.

Architecture & implementation

Data contracts follow a standard called the Open Data Contract Standard (ODCS). Its current version is v2.2, with a lot of work going into v2.3, which will be backward compatible. ODCS embraces other standards and best practices like semver (Semantic Versioning), the Kubernetes naming conventions for YAML, and even idempotency.
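To make the Kubernetes and semver influences concrete, here is a minimal sketch of a contract header. The field names are borrowed from the ODCS v2.2 template and the values are invented; check the standard itself for the authoritative spelling.

```yaml
# Illustrative ODCS-style header (field names based on the v2.2 template)
kind: DataContract
apiVersion: v2.2.0   # version of the standard this contract follows
uuid: 53581432-6c55-4ba2-a65f-72344a91553a   # stable identifier, enabling idempotent updates
version: 1.1.0       # version of this contract, following semver
status: current
```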

The contract covers eight categories. Let’s go through them.

Demographics

This section contains general information about the contract, like its name, domain, and version, with room for additional information.
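As a hedged sketch (field names based on the ODCS v2.2 template, values invented for Beth’s HR scenario), a demographics block could look like this:

```yaml
# Illustrative demographics block; field names follow the ODCS v2.2 template
datasetDomain: hr                      # logical domain of the data
quantumName: Applicants data quantum   # name of the data product
userConsumptionMode: analytical        # how consumers are expected to use it
version: 1.0.0                         # version of this contract (semver)
status: current
description: Anonymized applicant data for data science use cases.
```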

Dataset & schema

This section describes the dataset and the schema of the data contract. It is the foundation for the data quality rules, which I detail in the next section. A data contract focuses on a single dataset with several tables (and, obviously, columns).

Data quality

This category describes data quality rules & parameters. They are tightly linked to the schema defined in the dataset & schema section. Check out my 2018 piece on Data Quality.
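For instance, a completeness rule attached to a column might be declared as below. This is a sketch: the rule identifiers and scheduling fields follow the ODCS v2.2 template, but the specific names and values are illustrative, not prescribed by the standard.

```yaml
# Illustrative data quality rule attached to a column (ODCS v2.2-style)
- column: applicant_email
  logicalType: string
  physicalType: varchar(255)
  isNullable: false
  quality:
    - templateName: NullCheck            # illustrative rule template
      description: applicant_email must never be null
      dimension: completeness            # data quality dimension
      severity: error
      businessImpact: operational
      scheduleCronExpression: 0 20 * * * # evaluate daily at 20:00
```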

Pricing

This section explains pricing if/when you bill your customer for using this data product. Pricing is currently experimental.
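Since pricing is experimental, the block is small. A sketch, with field names based on the ODCS v2.2 template and invented values:

```yaml
# Illustrative pricing block (experimental in ODCS v2.2)
price:
  priceAmount: 9.95     # cost per unit
  priceCurrency: USD
  priceUnit: megabyte   # billing unit for this data product
```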

Stakeholders

This important part lists the stakeholders and the history of their relationship with this data contract.
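The history aspect shows up as in/out dates and replacement links. A sketch, using field names from the ODCS v2.2 template with invented usernames:

```yaml
# Illustrative stakeholder history (ODCS v2.2-style field names)
stakeholders:
  - username: ceastwood
    role: Data Scientist
    dateIn: 2022-08-02
    dateOut: 2022-10-01
    replacedByUsername: mhopper   # who took over the role
  - username: mhopper
    role: Data Scientist
    dateIn: 2022-10-01            # still active: no dateOut
```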

Roles

This section lists the roles that a consumer may need to access the dataset depending on the type of access they require.
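A role entry typically names the role, the type of access, and who approves requests. A sketch (field names based on the ODCS v2.2 template; the role and approver values are invented):

```yaml
# Illustrative role definitions (ODCS v2.2-style field names)
roles:
  - role: hr_reader            # read-only access to anonymized data
    access: read
    firstLevelApprovers: Reporting Manager
  - role: hr_reader_pii        # access including sensitive fields
    access: read
    firstLevelApprovers: Reporting Manager
    secondLevelApprovers: Privacy Office
```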

Service-level agreement

This section describes the service-level agreements (SLA). Data SLAs are unfortunately not documented enough just yet. Stay tuned for more.
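SLAs in the contract are expressed as measurable properties tied to an element of the dataset. A sketch, with property names based on the ODCS v2.2 template and invented values; this is the kind of promise that would answer Beth’s questions about freshness and availability:

```yaml
# Illustrative SLA block (ODCS v2.2-style field names)
slaDefaultElement: tbl.txn_ref_dt   # column the SLAs refer to by default
slaProperties:
  - property: latency     # data is at most 4 hours behind the source
    value: 4
    unit: h
  - property: frequency   # refreshed once a day
    value: 1
    unit: d
  - property: retention   # data is kept for 3 years
    value: 3
    unit: y
```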

Custom properties

This section covers custom & other properties in a data contract using a list of key/value pairs. This structure offers flexibility without requiring the creation of a new template version whenever someone needs additional properties.
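The key/value structure is deliberately open-ended. A sketch (the `customProperties` field name follows the ODCS v2.2 template; the properties themselves are invented):

```yaml
# Illustrative custom properties: any key/value pairs your organization needs
customProperties:
  - property: refRulesetName        # e.g., link to an internal ruleset
    value: gcsc.ruleset.name
  - property: dataRetentionPolicy   # e.g., an internal policy identifier
    value: hr-policy-17
```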

Figure 1 shows the eight categories, as well as the stakeholders. In a future article, I will focus on the stakeholders, tools, and integration.

The eight categories of a data contract and its stakeholders (Source: ODCS)

Examples

I know it’s been a lot of theory so far, but you have stuck with me. Let’s have a look at a few examples.

A column in a table

In this first example, you can see that the data contract defines a column called "txn_ref_dt" coming from a table called "tbl".

As you can see, the contract details the column's logical and physical types. In this example, they are the same, but it will not always be the case.

- table: tbl
  description: Provides core payment metrics
  dataGranularity: Aggregation on columns txn_ref_dt, pmt_txn_id
  columns:
    - column: txn_ref_dt
      businessName: Transaction reference date
      logicalType: date
      physicalType: date
      description: Reference date for the transaction. Use this date in reports
        and aggregation rather than txn_mystical_dt, as it is slightly too mystical.
      sampleValues:
        - 2022-10-03
        - 2025-01-28

Authoritative definitions

Another key feature of the data contract is its ability to play well with others. The notion of authoritative definition plays a critical role here. In the following example, the column "rcvr_cntry_code" is defined in Collibra as a specific asset. As this column results from a transformation, the reference implementation is on GitHub, and the contract user can find out all about it. Knowing the authorities is one of the keys to computational data governance.

- table: tbl
  columns:
    - column: rcvr_cntry_code
      businessName: Receiver country code
      logicalType: string
      physicalType: varchar(2)
      authoritativeDefinitions:
        - url: https://collibra.com/asset/748f-71a5-4ab1-bda4-8c25
          type: Business definition
        - url: https://github.com/myorg/myrepo
          type: Reference implementation

Going forward

I hope I convinced you of the importance of data contracts. They quantify trust, provide great documentation, can evolve, and so much more.

What's next for the data contract: Stay tuned! Subscribe to this publication and follow me. We have a lot more coming.

What's next for you: Experiment, build your own data contract on a simple dataset, understand its benefits, and never forget that ProfitOptics can help.

More resources

· Implementing Data Mesh, Perrin & Broda, O’Reilly, 2023/2024.

· Driving Data Quality with Data Contracts, Jones, Packt, 2024.



#Knowledge = 𝑓 ( ∑(#SmallData, #BigData), #DataScience U #AI, #Software ). Lifetime #IBMChampion. #KeepLearning. @ http://jgp.ai