“Walk the talk” principle of data contracts

--

About data contracts

According to Monte Carlo, a system provider, data contracts might cover things like:

  • What data is being extracted
  • Ingestion type and frequency
  • Details of data ownership/ingestion, whether individual or team
  • Levels of data access required
  • Information relating to security and governance (e.g. anonymization)
  • How it impacts any system(s) that ingestion might impact

Monte Carlo also defines a data contract as:

“an agreement between a service provider and data consumers. It refers to the management and intended usage of data between different organizations, or sometimes within a single company.”

The purpose of a data contract, according to Monte Carlo, is “to ensure reliable and high-quality data that can be trusted by all parties involved.”

Data Mesh Manager is a tool created by the developers of the Data Contract Specification. On the tool’s site, they define a data contract as follows:

“A data contract defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers. A data contract is implemented by a data product’s output port or other data technologies. Data contracts can also be used for the input port to specify the expectations of data dependencies and verify given guarantees.”

Bitol, a project under the Linux Foundation, is an attempt to standardize data contracts. According to the project, a data contract is:

“an agreement between a data producer and its consumers. It not only describes the data but also its expected behavior by advertising the data quality rules the data needs to obey, as well as service levels, stakeholders, roles, and pricing.”

Walk the talk

The saying “if you’re going to talk the talk, you’ve got to walk the walk” is a modern version of old sayings such as “actions speak louder than words” and “practice what you preach.” Another early form of the expression was “walk it like you talk it.” Many people now condense this to “walk the talk,” which makes a sort of sense (act on your speech).

I propose that this is also one of the fundamentals of data contracts. Instead of just promising the customer that we will meet the requirements, we also provide proof.

The illustration below describes the fundamental differences between the “talk” and the “walk”. The driver of the “talk” is the customer’s needs, while the driver of the “walk” is the data provider, whose task is to provide data that matches those needs. The focus of the “talk” is on defining requirements, and its nature is administrative. On the “walk” side, the focus is on technology that enables data to flow, monitors quality, and provides proof of quality. The purpose of the “talk” is to declare: to make statements that result from customer negotiations. The purpose of the “walk” is to enable the provider to live up to those statements. The function of the “talk” is to express commitment, while the “walk” acts as a technical agreement that validates the commitment. The result is high trust in the data.

It is worth mentioning that trust in data is not built on data quality alone, although data quality has been the (valid) starting point of data contracts. Compliance with regulations such as the GDPR in Europe and the CCPA in California, among other global data protection laws, is vital for legal data handling practices and for trust. Data security is another element in creating trust, as we consumers want to be sure the data has not been tampered with. Maintaining high security is essential for building and preserving stakeholder confidence in data protection. Even the provider organization affects trust: would you trust the data if the provider were found guilty of violating privacy laws or committing fraud, or were simply on the verge of bankruptcy? Admittedly, not all aspects of creating trust are easy to validate or even measure, so let’s put that discussion on hold for now. Instead, let us focus on something that can be done with a moderate amount of work, and first discuss a bit more what talking the talk and walking the walk mean.

Talk the talk — business requirements

In the process of crafting a data contract, certain requirements are agreed upon. These most likely include the data quality metrics to apply and the threshold values to meet. Other items can be, for example, SLA related. This is the “talk the talk”. The result of the talk is a set of declarations with which the customer is satisfied and to which the provider is committed. This is also the state of “data contract at rest”.

When the data under the contract is taken into use, the customer wants to be sure of the quality level (not just data quality, but quality in a broader sense). As a provider, we also want to be sure we are fulfilling our promises, or rather our commitments (the talk). This is when we need to “walk the walk”.

Walk the walk — monitoring and proof

Walk the walk goes hand in hand with “talk the talk”. As a data provider, we have agreed to provide data whose data quality and service quality satisfy the customer and are good enough for their business purposes. As part of the data contract, we also define how to monitor and validate the contract content and commitment. Everything as Code now seems to be a prominent approach to apply, well, everywhere. In the data contract, the “everything as code” part is the set of measuring rules provided. More details about the “as code” approach can be found in my previous posts (“Data QoS as Code” fitted into the Open Data Product Specification trial and Data Quality of Service (DataQoS) as Code - Reusable Component).

These rules are compatible with tools out of the box and can be injected directly into, for example, data quality monitoring tools such as Soda and Monte Carlo. Likewise, the SLA-related parts are defined as rules to be used as input for service quality monitoring services. The moment the data contract is taken into use at the system level, its state changes and the “data contract is in motion”. It is no longer a document left in a file folder or a CMS, like the previously used data provisioning agreements and the like. Tempting as it may be to think otherwise, a data contract alone is not enough. We need legal agreements as well for risk management purposes. Those legally binding agreements could be dynamically generated, with some of the content coming from the data contract. But let’s return now to the data contract and not mix in the legal side too much.

The dualistic approach described above, having both requirements and measurement methods, is not yet visible in data contract standards, but it is visible in, for example, the service level standard OpenSLO. In that model, both the threshold values (objectives, as they call them) and the rules to validate the objectives are defined in the same SLO contract.
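
To make the idea of keeping the objective and its validation rule together more concrete, here is a minimal Python sketch. It is only loosely inspired by the model described above and does not follow the actual OpenSLO schema; the class name, the counter names, and the numbers are hypothetical, and treating a period without traffic as "met" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """One objective: the agreed target plus the rule that measures it."""
    name: str
    target: float          # agreed in the talks, e.g. 0.99 means 99 %
    good_events_key: str   # hypothetical counter names in monitoring data
    total_events_key: str

    def measure(self, counters: dict) -> float:
        """Validation rule: ratio of good events to total events."""
        total = counters[self.total_events_key]
        if total == 0:
            return 1.0  # assumption: no traffic is not counted as a breach
        return counters[self.good_events_key] / total

    def is_met(self, counters: dict) -> bool:
        """Compare the measured value with the committed target."""
        return self.measure(counters) >= self.target


availability = ServiceLevelObjective(
    name="api-availability",
    target=0.99,
    good_events_key="successful_requests",
    total_events_key="all_requests",
)

counters = {"successful_requests": 9_950, "all_requests": 10_000}
print(availability.measure(counters), availability.is_met(counters))
```

The target comes from the talk, the `measure` method is the walk, and both live in the same object, so the commitment cannot drift away from the way it is checked.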

Walk the talk in data contracts can be defined as:

Defining minimum threshold levels for agreed metrics between stakeholders (commitment) along with related tool-compatible executable rules (validation) as one machine-readable file in order to create trust in the data and between the stakeholders.
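
As a rough illustration of this definition, the sketch below pairs each committed threshold with an executable rule and returns the measured values as proof. It is a simplified stand-in under my own assumptions, not the format of any data contract standard: the contract layout, the metric names, and the sample rows are all hypothetical, and the rules are plain Python functions rather than rules for a real monitoring tool.

```python
def completeness(rows, column):
    """Share of rows where `column` is present and not None, in percent."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return 100.0 * filled / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among the non-null values, in percent."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    return 100.0 * len(set(values)) / len(values)


# Executable rules (validation), looked up by the metric name in the contract.
RULES = {"completeness": completeness, "uniqueness": uniqueness}

# Hypothetical contract: every metric carries its agreed threshold (commitment).
contract = {
    "dataset": "customers",
    "quality": [
        {"metric": "completeness", "column": "email", "min_percent": 75.0},
        {"metric": "uniqueness", "column": "id", "min_percent": 100.0},
    ],
}


def validate(contract, rows):
    """Run every rule and return the evidence next to the commitment."""
    report = []
    for item in contract["quality"]:
        measured = RULES[item["metric"]](rows, item["column"])
        report.append({
            "metric": item["metric"],
            "column": item["column"],
            "committed": item["min_percent"],
            "measured": measured,
            "passed": measured >= item["min_percent"],
        })
    return report


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4, "email": "d@example.com"},
]
for line in validate(contract, rows):
    print(line)
```

The report is the "proof" part: it shows the committed value and the measured value side by side instead of only stating that the requirement is met.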

Now let’s expand the above illustration of the data contract to include some of the elements mentioned above. In the illustration below, the yellow boxes are items belonging to the “walk the walk” space. The left column is the “talk the talk” space.

We have already discussed the top four rows of the model, so let’s focus on the rows below them. Items in the yellow “walk the walk” space are the “as code” part of the contract, while the left side is fixed in the talks.

On the business side, data needs and purpose are described in the negotiations. Often this is also needed in the legal agreements, which we discussed already. On the other side (“as code”), we need a technical description of the data to be included and also the means to validate it. Data quality indicators and their threshold levels are agreed in talks around the table, and then the methods to validate them are defined as code on the “walk the walk” side. The same goes for the SLA part. The SLA can contain different kinds of indicators in different data contracts. Machine-compatible rules as code are again defined to validate the agreed objective values. Next, the access needs are agreed in the talks and then written in code format to enable access and also to validate it. As the last item, I added pricing-related information. Pricing and related restrictions on usage are agreed, and after that the rules to monitor and govern the usage are written as code as part of the data contract.
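
The access and pricing rows can be sketched in the same spirit. The example below is only an illustration under my own assumptions: the roles, the quota, the prices, and the overage pricing model are hypothetical and not taken from any specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageTerms:
    """Hypothetical access and pricing terms agreed in the talks."""
    allowed_roles: frozenset
    included_requests: int   # requests covered by the base price
    base_price: float        # price per billing period
    overage_price: float     # price per request above the included amount


def can_access(terms: UsageTerms, role: str) -> bool:
    """Access rule: only roles named in the contract may read the data."""
    return role in terms.allowed_roles


def period_charge(terms: UsageTerms, requests_made: int) -> float:
    """Pricing rule: base price plus a charge for requests above the quota."""
    overage = max(0, requests_made - terms.included_requests)
    return terms.base_price + overage * terms.overage_price


terms = UsageTerms(
    allowed_roles=frozenset({"analyst", "data-scientist"}),
    included_requests=1_000,
    base_price=100.0,
    overage_price=0.5,
)

print(can_access(terms, "analyst"), can_access(terms, "intern"))
print(period_charge(terms, 800), period_charge(terms, 1_200))
```

The values in `UsageTerms` are the talk, and the two functions are the walk: one enables and validates access, the other monitors usage against the agreed pricing.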

In the end, the resulting data contract is a combination of talk the talk and walk the walk, all in one YAML file. Threshold values and multiple other static values are taken from the talks and merged with the “as code” part of the contract.
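
The merge step can be sketched as follows. This is a minimal example under stated assumptions: the rule syntax is illustrative and not tied to any particular monitoring tool, the names and values are hypothetical, and the result is printed as JSON only to stay within the Python standard library (the contract itself would be a YAML file as described above).

```python
import json
from string import Template

# Values agreed in the talks (hypothetical numbers).
talk_values = {"min_completeness": 95, "max_delay_hours": 24}

# "As code" part with placeholders; the rule syntax is illustrative only.
rule_templates = {
    "quality": "completeness_percent(email) >= $min_completeness",
    "sla": "hours_since_last_update <= $max_delay_hours",
}


def build_contract(name, talk_values, rule_templates):
    """Merge agreed values and rule templates into one document."""
    rules = {
        section: Template(text).substitute(talk_values)
        for section, text in rule_templates.items()
    }
    return {"name": name, "commitments": dict(talk_values), "rules": rules}


contract = build_contract("customers", talk_values, rule_templates)
print(json.dumps(contract, indent=2))
```

`Template.substitute` raises a `KeyError` when a rule refers to a value that was never agreed, so a rule without a matching commitment is caught when the contract is built.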

At this point we are ready to walk the talk. Also, as mentioned, a legally binding document is often needed for risk management purposes, especially when data crosses borders such as company borders. That document is either generated or otherwise created, and some of its content is taken from the data contract document.

--


Jarkko Moilanen (PhD)
Exploring the Frontier of Data Products

Open Data Product Specification igniter and maintainer (Linux Foundation project). Author of business-oriented data economy books. AI/ Product Lead professional