A Technical Guide to Data Contracts: from conceptualisation to implementation

Ugo Ciracì
Agile Lab Engineering
18 min read · Apr 22, 2024

Data consumption can be regulated through data contracts: they embed all the information needed to guarantee data consumers a good data experience while letting data producers control the proper usage of the exposed data.

Data contracts have great potential from a data governance perspective because they regulate data producers and consumers through a supplier/customer relationship.

Concepts and relationships

A data model represents the meaning, structure, relationships, constraints, and attributes of the data stored in a database or information system.
A data contract is an agreement for data consumption.
A data interface is the means for data consumption and it hides implementation details and provides guarantees to consumers.
A data property defines a specific attribute of the data interface.
A data guarantee is an expectation on a data property.
A Service Level Agreement (SLA) is a set of data guarantees.
A data consumer is an identifiable agent, individual, or persona consuming data from a data interface.
A data subscription is a data consumer's declaration of the data contract SLA it signs off on and of its intended usage of the data.
A data producer is an individual or team dedicated to curated data generation. A data producer must generate data in compliance with at least one data contract.

Data Contract Concepts and Relationships

Data Interfaces

A data interface is any means to serve data. Web APIs, RPCs, file-based exchange, and any other convenient choice are valid possibilities.

Data Properties

A data property is an attribute of the data interface.
Examples of data properties are as follows:

  • Ownership: data owner, data team, etc.
  • Schema: Columns (Name, Type), Primary Key
  • Partitions
  • Data Quality: Timeliness, Validity, Integrity, Completeness, Consistency, Accuracy
  • Data Service: Availability, Throughput, Latency
  • Data Governance: Classification, Categorization, Data Privacy
  • Physical Address, Protocol, Format
  • Data Security on Storage and Transfer
  • Cost Unit Metrics
Data Contract Example

Guarantees

A guarantee is an expectation on a data property.
For instance:

  • Timeliness < 12h
  • Availability > 99.9%
  • Completeness > 90%

Guarantees are meant to satisfy a certain class of consumers.

Service Level Agreement

A Service Level Agreement (SLA) is a set of guarantees that outlines the level of service expected by a data consumer from a data interface.
A data producer is accountable for the SLA declared.
A single data interface may be connected to several SLAs.

For instance, a data API could sustain a certain number of API calls per minute (SLO), but that does not mean it should promise the same level of service to all data consumers. It could allocate a limited, unguaranteed throughput to data consumers who don't pay or need less, and dedicate guaranteed throughput to mission-critical or paying data consumers.

Thus, SLAs are attached to data consumers through a data subscription related to a specific data interface. Each SLA contains a set of guarantees with proper limits (thresholds, auto-scaling, availability, etc.).
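
As a minimal sketch, two SLAs for the same data interface could be declared as metadata along these lines (the interface name, field names, and values are illustrative, not taken from any standard):

# Hypothetical SLA declarations attached to a single data API
interface: customer-api
SLAs:
  basic:
    default: true               # applied when a subscription does not pick an SLA
    throughput: best-effort     # limited, unguaranteed API calls per minute
    availability: 99.0          # percent, not formally guaranteed
  premium:
    throughput: 600 per minute  # guaranteed API calls per minute
    availability: 99.9          # percent, formally guaranteed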

Data Subscription

A data subscription finalizes a data contract from the consumer's perspective.

It contains a reference to the data interface and a specific service level agreement.

Intention of usage

A data subscription also contains the intended usage of the data.

For instance, consider the marketing domain willing to consume customers' data for internal analytics. Those analytics are not intended to be used for marketing campaigns and are not formally subject to the customer's data privacy consent related to marketing activities.

Nevertheless, the marketing domain would like to consume such data without restrictions due to data privacy regulations.

Usually, these kinds of exceptions are regulated through emails or are difficult to automate. A data subscription can be subject to orchestrated approval, yet it enables automation because it is represented as metadata, exactly like the rest of the information in a data contract.

The intention of usage may contain declarations like:

  • Frequency of data consumption
  • Default column and row filtering
  • Data privacy utilization

and any other behavior of the data consumer that is relevant for the data producer in order to regulate data consumption.
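
As a sketch, and following the marketing example above, the intention of usage could be declared in the subscription metadata as follows (all keys and values are illustrative):

# Hypothetical intention-of-usage block inside a data subscription
usage:
  frequency: 1d                     # how often data will be consumed
  filter: "country = 'IT'"          # default row filtering
  columns: [customer_id, segment]   # default column filtering
  purpose: internal-analytics       # not used for marketing campaigns
  requiresMarketingConsent: false   # data privacy utilization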

Negotiation

A data contract is an agreement between a producer (supplier) and multiple consumers (customers).

Publish

A data producer is responsible for publishing the data interface (technical specification) and a set of available SLAs on specific guarantees to a data contract store.

Subscribe

A data consumer is responsible for publishing the data subscription to the data contract store where it declares the requested data interface at a specified SLA (unless a default exists).

A data subscription may be subjected to approval by the data producer depending on the data access level of the consumer and its intention of usage.

Approval

The approval can be automatically granted or may need manual intervention based on the specific use case, environment, business process, etc. The approval comes from the validation of governance policies established at the company level (see later).
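
For illustration, the outcome of the approval could be recorded on the subscription itself (field names are assumptions):

# Hypothetical approval record attached to a data subscription
approval:
  status: approved                  # pending | approved | rejected
  grantedBy: policy-engine          # or a named approver for manual interventions
  validatedPolicies:
    - intention-of-usage-declared
  timestamp: 2024-04-22T10:15:00Z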

Data Contract Publishing and Data Subscription Approval

Versioning

A data contract should be versioned. Versioning implies that changes to the data product must be available for inspection by data consumers. A data subscription refers to a specific version of the data contract. This mechanism allows change management to be applied safely and keeps data interfaces stable for data consumers.

A data contract can change without affecting data consumers. For instance, enhancing an SLA by increasing the data availability cannot negatively affect data consumers. Such improvements can be considered forward-compatible. Removing a column from the schema, changing a column type, or changing the protocol used for data transfer, instead, can significantly impact data consumers. For that reason, such changes must be treated as breaking changes.

Another relevant case of breaking change happens whenever the data model of a data contract changes. A change in the data model can change the meaning of a field that keeps the same name and type. For instance, suppose the marketing domain changes the meaning of the data model for a marketing qualified lead (MQL). The result is different data behind the same field names: the data model has changed drastically even if the data interface is technically the same.

Another example is as follows. Due to a dispute between the national railroads and the rail transport companies, a law was enacted clarifying the area within which a train can be considered to have arrived at the station even if it is neither stationary nor ready for passengers to disembark.

In this case, the arrival time of a train does not change in any physical schema (at the data interface level), but the meaning of the column changes while the column keeps its previous name in the data model.

These breaking changes at the data model level should imply a breaking change of the data contract; otherwise, data consumers reading from the data interface may not clearly understand the difference in meaning (reflected in the arrival time values) between rows recorded before and after the data model change.

Versioning moves through environments (dev, test, prod) as usual.
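
A minimal sketch of how versioning information could appear on a data contract (fields and version numbers are illustrative):

# Hypothetical versioning metadata of a data contract
version: 2.0.0             # major bump: breaking change at the data model level
supersedes: 1.1.0
compatibility: breaking
changelog:
  - "2.0.0: arrival time follows the new legal definition of train arrival"
  - "1.1.0: availability SLA raised from 99.0% to 99.9% (forward-compatible)"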

Data Model and Data Contracts Versioning

Lifecycle

A data contract should have a managed lifecycle from ideation to retirement. The lifecycle depends on the data organization, business, and processes that a company would like to address. This section provides an example of lifecycle stages.

Design

A data contract only represents an agreement. This agreement may be defined even before the underlying data is implemented.

Prototype

A data contract could be made available to data consumers without a real implementation, exposing instead a live data interface with sample data.

This allows exposing the data product before a real implementation exists, in order to explore interest in the data model and the usability of the data interface (market exploration).

Live

A data contract is backed by a real implementation of the data interface.
Data guarantees are also implemented (data quality, security, availability, etc.).

Deprecate

A data contract marked as deprecated provides limited support and will soon be retired. Data consumers are invited to move to newer versions of the data contract.

Retire

A data contract has been retired. No newer versions will be delivered. Data consumers cannot consume any data from it.

The lifecycle of the data contract should also be subject to data governance. For instance, a company could decide to deprecate data interfaces not compatible with open standards to reduce vendor lock-in. Data contracts enable automated data governance and are particularly useful for defining customer/supplier relationships in this case.
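
As an illustration, the lifecycle stage could be tracked as metadata on the contract itself (field names and dates are assumptions):

# Hypothetical lifecycle metadata of a data contract
lifecycle:
  stage: deprecated                   # design | prototype | live | deprecated | retired
  deprecatedSince: 2024-03-01
  plannedRetirement: 2024-09-01
  successor: my-data-contract/2.0.0   # consumers are invited to move here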

Lineage

Data contracts regulate producers and consumers through a pub/sub pattern. It is possible to create a lineage based on data contracts.

This requires an additional set of relationships:

  • The data model dependency shows how a data model depends on one or many other data models
  • The data interface dependency links a specific data interface with the data interfaces of the respective data models it depends on.

A data contract container (see later) defines the technical boundary to keep these relationships clear.

A graph of dependencies between producers and consumers has many benefits from a data value chain perspective, for instance:

  • Clear ownership: data contracts determine the e2e ownership of the data;
  • Data quality impact: SLAs on data quality allow the investigation of the impact of poor data injected into the data value chain;
  • Data “cost” chain: data subscriptions are associated with a cost of consumption. Downstream data contracts accumulate chargeback costs from upstream data contracts depending on their subscribed SLAs.
  • Auditing: since data subscription allows data consumers to explicitly declare the intention of usage, it is possible to inspect the actual usage of the data produced from a specific data subscription.

The data contract lineage opens many opportunities for automated governance and is key to capturing a complete view of the graph (network effect, data value chain, etc.).
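
A sketch of how these dependency relationships could be declared to build the lineage (all names are invented for illustration):

# Hypothetical dependency declarations enabling data contract lineage
dataModel: marketing.mql
dependsOn:
  dataModels:                  # data model dependencies
    - sales.customer
    - web.sessions
  dataInterfaces:              # data interface dependencies (via subscriptions)
    - sales.customer-api/v1
    - web.sessions-table/v2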

Semantic lineage (data models) vs data contract lineage

Governance

Governance means making decisions on how to regulate data contracts.

Policies

A governance committee emits policies to regulate a certain behavior, for example:

  • A data subscription can be automatically approved if it declares the intention of usage;
  • A data contract can be published only if it declares at least schema, data quality, and semantic classification;
  • A data contract must publish at least a default SLA;
  • A data subscription can be approved only if its intention of usage matches the data availability declared in the data contract. This validation could be delegated to a specific engine;
  • A data contract must provide a cost estimation of data consumption based on the intention of usage (estimated chargeback). For instance, a data subscription schedules a workload to consume data from a data interface three times a day, requiring a full extraction with no filter applied. Depending on the data interface, the cost of serving the data requested by the subscription can differ. If the data interface is a view on a PostgreSQL installed in IaaS, a full extraction requires PostgreSQL to serve the data, and filters could help save memory utilization at the cost of a small computation: in this case, cost estimation is challenging. In the case of a file-based data interface (plain parquet on cloud object storage), the cost of a full extraction without filtering depends on the consumer workload: the chargeback is independent of the data producer and can be estimated from the cloud service spending;
  • A data subscription could be denied if it generates unwanted circular dependencies on the basis of the lineage information;
  • Data availability validation. For instance, for a stateless streaming data pipeline regulated by data contracts, the availability of a downstream data interface cannot be greater than the availability of the upstream data interfaces.

Policies can regulate:

  • Data contracts
  • Data subscriptions
  • Data contract lineage
  • Data models

and anything that implies a relationship between data producers and consumers.

Since data contracts regulate the relationship between producers and consumers, they are intrinsically a governance tool.
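
Since policies themselves are metadata, the governance committee could catalog the examples above along these lines (a purely illustrative sketch, not an existing policy format):

# Hypothetical catalog of governance policies
policies:
  - id: require-intention-of-usage
    appliesTo: data-subscription
    rule: intention of usage must be declared for automatic approval
    enforcement: automatic
  - id: minimum-contract-content
    appliesTo: data-contract
    rule: schema, data quality and semantic classification must be declared
    enforcement: blocking            # publication is rejected otherwise
  - id: no-circular-dependencies
    appliesTo: data-contract-lineage
    rule: a subscription must not introduce cycles in the lineage graph
    enforcement: manual-review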

Governance Driven Development

Data contracts and subscriptions must be designed and implemented, like everything else in data management.

They should evolve with agility to allow the sustainable onboarding of new data contract and subscription features, keeping a good time to market and control over data consumption.

This development implies familiarity with metadata design and management.

A general approach to deal with data contract design is governance-driven development (GDD), which consists of the following cycle:

  • Concern identification: identify the governance issue;
  • Information definition: determine which piece of information resolves this issue;
  • Metadata coding: implement machine-readable metadata to code such information.

For instance, what’s the concern? Data consumers cannot find anyone responsible for data quality. What information can provide an answer? We should enrich data with ownership. How can I code ownership? I could code ownership through the employeeID of the data owner. Data owners are assigned an LDAP group.
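
Following the example above, the resulting ownership metadata could be coded like this (the employeeID, group, and contact address are hypothetical):

# Hypothetical machine-readable ownership metadata produced by one GDD iteration
ownership:
  dataOwner: "004212"                                       # employeeID of the data owner
  ownerGroup: cn=data-owners,ou=groups,dc=mycompany,dc=com  # LDAP group assigned to data owners
  contact: dataquality@mycompany.com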

GDD has a broader impact than data contracts only. In fact, anything that belongs to data governance should be developed through this approach.

Policy Management System

A data contract store contains data contracts and subscriptions to be governed. Having a data contract in place does not imply any governance in place. Data contracts can be governed through the following elements:

  • A policy store: a place where to store policies;
  • A policy engine able to ingest policies and deny/allow an action;
  • Data contract connectors. There may exist multiple data contract stores or formats to be managed by a policy engine. For instance, data APIs could be expressed through Open API while other data contracts could rely on a company's custom data contract specification. Unfortunately, there are dozens or hundreds of data contract specifications that are self-proclaimed standards but have neither community, popularity, nor adequate adoption to be so.
  • Metadata/data connectors: the policy engine must validate policies against facts. Facts can be gathered from metadata or data sources. For instance: the last computed data quality metrics, last measured availability, organization units in LDAP to check ownership and user categorization, etc. Other important sources of metadata that matter are lineage and versioning information. The lineage can enable great control over the chain of dependencies while versioning can allow for fine-grained control over backward and forward compatibility of data contracts and subscriptions.
  • BPM integration: big enterprises could need the integration of complex business processes or manual approvals to grant authorizations. This is usually due to a strongly siloed organization (segregation of duties) rather than technical limitations.
  • Scheduler: necessary to plan policy execution;
  • Event-based triggering: the policy engine must be able to validate policies against certain events. Exposing APIs to trigger and orchestrate policy execution is usually necessary for ordinary enterprise-grade scenarios.

Data Contract Container

A live data contract is backed by a running implementation of the data interface. A data interface is fed by data generated by any workload.

Data contracts do not define which data architecture is necessary to deliver data through a specific interface. A data contract is meant to regulate supplier/customer relationships. For this reason, in terms of data contracts, any implementation backing a data interface is just a detail.

Nevertheless, a real implementation is necessary, and this is why it makes sense to define an architecture to contain a data contract. A data contract container (DCC) is a logical unit that embodies all the necessary physical pieces to implement a data contract. A DCC is context-specific: the specific architecture, archetype generalization, specific components, and functional boundaries of an architecture container depend on the organization, data management practice, data modeling management, etc.

Examples of DCCs are data products, whatever meaning the reader wants to attach to the term (I do not enter the data-as-a-product vs data product debate), and this applies to any data practice (data mesh, data lake, lakehouse, DWH, etc.). Whenever there is a data interface to be guaranteed through an SLA, a data contract makes sense, and the architecture container is the place where to implement it.

Datasets in a data lake

A dataset in a data lake represents raw/basic data cataloged and available for further usage. Well-governed data lakes account for data access management, classification, and data quality. A dataset must be provided by a data interface. Ingestion, cleaning, and any form of normalization to produce a dataset belong to a data contract container. Any guarantee necessary on top of a dataset should be represented in a data contract.

Data lakehouse

A lakehouse organizes data into a layered architecture and each layer addresses a specific refinement of the data. Regardless of the architecture chosen to implement the specific lakehouse, each dataset delivered at any layer must be consumed through a regulated data interface. For instance, a medallion architecture may want to prevent final users from consuming the bronze layer for any reason. The gold layer (aka serving layer) would need fine-grained control over data access, data privacy regulations, and so on. A data contract container accounts for any ETL, aggregation, and data processing workload that builds a dataset at any layer to be served exclusively through a regulated data interface (data contract).

Data Mesh — data as a product

Data as a product in the data mesh context is defined as an architectural quantum. It can contain multiple output ports.

This definition maps to an architecture container owning multiple output ports and a single data model by which it is possible to define a clear bounded context. A single data model should not be shared by multiple data products.

A data product must provide a set of affordances to data consumers such as discoverability, addressability, trustworthiness, security, interoperability, and business significance. Thus, we can map affordances to guarantees and model output ports as data contracts.

Data Sharing

Data sharing enables different parties to exchange data within the same organization or among several organizations. This data practice opens immense opportunities and concerns at the same time. Data contracts perfectly fit this purpose since they can regulate the exchange of data through interfaces and subscriptions with a clear intention of usage.

A dummy implementation

This section describes a rough idea of an implementation to bring the reader closer to the concept.

Data Contract Format

YAML is human-friendly and machine-readable, broadly adopted, and can be integrated with any technology. This is the format to start building a data contract with.

Besides syntactic validation of the YAML content, any other logical or semantic validation can be considered a policy related to the data contract format.

The data contract contains:

  • Data interface: a table reference (schema and partitions) within a Hive-based metastore;
  • Properties: Data quality metrics and rules to be applied to the data interface.
# data_contract.yaml
---
interface:
  owner: johndoe@mycompany.com
  type: hive
  database: mydatalake
  table: mytable
  schema:
    name: string
    amount: number
    category: string
  properties:
    dataquality:
      rule1: "category in ['categoryA', 'categoryB']"
      rule2: "amount > 0"
  SLAs:
    basic:
      default: true
      frequency: 1h
      rules:
        rule1: 90
        rule2: 100

Data Subscription Format

Similarly, a data consumer subscription can be stored as a YAML file.

#subscriptions.yaml
---
owner: maryamgutkowski@mycompany.com
interfaces:
  mydatalake:
    mytable:
      SLA: basic
      usage:
        frequency: 6h
        filter: "category = 'categoryA'"

Data Contract Store

A git account is fair enough: put data contracts and subscriptions within a git repository. A proper branching and tagging strategy can easily bring the value of data contract versioning.

Data Contract Container Deployment

A deployment pipeline attached to the git repository of a data contract container can be responsible for the following idempotent actions:

  • Preparation of the object store
  • Hive-Metastore configuration of the table
  • Data Quality APIs published to the API store and a service handling the following standard endpoints:
GET {env}.mydatalake/mytable/{version}/dataquality/rule1
PUT {env}.mydatalake/mytable/{version}/dataquality/rule1?coverage={number}
GET {env}.mydatalake/mytable/{version}/dataquality/rule2
PUT {env}.mydatalake/mytable/{version}/dataquality/rule2?coverage={number}
  • Ingestion Workload to extract data from one or multiple data sources, transform and load data to the Hive table.
  • Data Quality Workload to compute metrics as described in the data contract. This workload updates the DQ metrics calling the data quality service through the HTTP method PUT.

This article addresses only the relationship between data consumers and data producers, hiding the details of the implementation.
A data consumer only cares about the technical specification for data consumption and the data semantics, while a data producer wants to make sure that a consumer is making proper usage of the data interface.

Data Contract Deployment

A data contract deployment pipeline sets the policy engine to check the validity of the guarantees declared in the SLAs.

The policy engine sets a data quality policy check scheduled with a 1h frequency. The policy execution checks data quality by calling the DQ service at the endpoints corresponding to the HTTP GET method of the respective DQ rules.
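
For instance, the scheduled check could be configured with something like the following sketch (the configuration format is invented for illustration; the endpoints and thresholds come from the data contract above):

# Hypothetical policy-engine configuration produced by the data contract deployment
policy: dataquality-sla-check
schedule: "0 * * * *"            # every hour, as declared by the basic SLA frequency
checks:
  - rule: rule1
    endpoint: "GET {env}.mydatalake/mytable/{version}/dataquality/rule1"
    threshold: 90                # SLAs.basic.rules.rule1
  - rule: rule2
    endpoint: "GET {env}.mydatalake/mytable/{version}/dataquality/rule2"
    threshold: 100               # SLAs.basic.rules.rule2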

Data Subscription Deployment

A data subscription deployment pipeline sets the policy engine to check the validity of the intended usage.

The policy engine sets a policy to check whether the data consumption frequency exceeds the declared one.

In this case, there is no need for approval, and access to the data interface is automatically granted.

At the same time, the ingestion workload must update a log of data consumption to let the policy engine check the frequency.
How to manage workloads is out of the scope of this article.
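
The frequency check could be configured along similar lines (again an illustrative sketch, assuming a consumption log as the source of facts):

# Hypothetical policy-engine configuration produced by the data subscription deployment
policy: subscription-usage-check
subscription: maryamgutkowski@mycompany.com/mydatalake/mytable
facts: consumption-log            # updated by the workload serving the data interface
checks:
  - declaredFrequency: 6h         # from the usage block of the subscription
    onViolation: notify-producer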

Policies

Policies can be written in any language. This section reports an example with CUE/Go.

The following CUE snippet represents the schemas for the data contract and the data quality service response.

// data_contract.cue
DataContract: {
    interface: {
        owner:    string
        type:     string
        database: string
        table:    string
        schema: {[string]: string}
        properties: {
            dataquality: {
                rule1: string
                rule2: string
            }
        }
        SLAs: {
            basic: {
                default:   bool
                frequency: string
                rules: {
                    rule1: int
                    rule2: int
                }
            }
        }
    }
}

// Define the schema for the HTTP response of the data quality service
CoverageResponse: {
    coverage: int
}

The following Go snippet checks the validity of the data quality guarantee for rule1:

// dataquality_policy_rule1.go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"

    "cuelang.org/go/cue"
    "cuelang.org/go/cue/cuecontext"
    "cuelang.org/go/cue/load"
    cueyaml "cuelang.org/go/encoding/yaml"
)

// CoverageResponse mirrors the CUE schema of the data quality service response.
type CoverageResponse struct {
    Coverage int64 `json:"coverage"`
}

func main() {
    ctx := cuecontext.New()

    // Load and build the CUE schema
    insts := load.Instances([]string{"data_contract.cue"}, nil)
    if insts[0].Err != nil {
        log.Fatal(insts[0].Err)
    }
    schema := ctx.BuildInstance(insts[0])
    if schema.Err() != nil {
        log.Fatal(schema.Err())
    }

    // Read the YAML data contract
    yamlFile, err := os.ReadFile("data_contract.yaml")
    if err != nil {
        log.Fatalf("Error reading YAML file: %v", err)
    }

    // Validate the YAML data against the DataContract schema
    if err := cueyaml.Validate(yamlFile, schema.LookupPath(cue.ParsePath("DataContract"))); err != nil {
        log.Fatalf("YAML validation error: %v", err)
    }

    // Build a CUE value from the YAML to navigate the contract
    contractFile, err := cueyaml.Extract("data_contract.yaml", yamlFile)
    if err != nil {
        log.Fatalf("Error extracting YAML: %v", err)
    }
    contract := ctx.BuildFile(contractFile)

    // Read the rule1 threshold declared in the basic SLA
    rule1, err := contract.LookupPath(cue.ParsePath("interface.SLAs.basic.rules.rule1")).Int64()
    if err != nil {
        log.Fatalf("Invalid contract structure: rule1 threshold not found: %v", err)
    }

    // Make an HTTP GET request to the data quality endpoint
    // ({env} and {version} are placeholders to be resolved by the policy engine)
    resp, err := http.Get("https://{env}.mydatalake/mytable/{version}/dataquality/rule1")
    if err != nil {
        log.Fatalf("Error making HTTP request: %v", err)
    }
    defer resp.Body.Close()

    // Parse the response body
    var response CoverageResponse
    if err := json.NewDecoder(resp.Body).Decode(&response); err != nil {
        log.Fatalf("Error decoding JSON response: %v", err)
    }

    // Compare the declared threshold with the measured coverage
    if rule1 > response.Coverage {
        fmt.Println("Rule1 threshold is greater than the measured coverage: guarantee not met.")
    } else {
        fmt.Println("Rule1 threshold is met by the measured coverage.")
    }
}

Policies can be stored within a git repository and should follow software development best practices including versioning and testing.

Policy engine

A rudimentary policy engine can be assembled from the following elements:

  • CUE/Go run-time environment
  • A scheduler like crontab
  • An event-based mechanism to run CUE/Go remotely (for instance, ssh to a remote machine running CUE/Go).

A policy engine can act at design time (data contract deployment), run-time (scheduled or event-based), or deployment time (data contract container deployment).

Design time. At design time we can check the validity of the contract, formal backward and forward compatibility, versioning, etc. This happens when deploying the data contract. If the validation is successful, run-time rules can be configured.

Deployment time. At deployment time, we are deploying the data contract container. This allows us to check that the implementation of the data contract matches its declaration.

Run time. At run-time, the policy engine executes policy validation running scheduled executions or responding to specific events.
For instance, data quality SLAs can be validated every hour (as described in this example), or this validation can be linked to an event signalling the successful termination of the data quality job, perhaps through a webhook or any sort of callback mechanism.
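
The binding between policies and phases could itself be configured, for instance along these invented lines on top of a crontab/ssh setup:

# Hypothetical binding of policies to the policy-engine phases
phases:
  designTime:                      # runs when the data contract is deployed
    - contract-schema-validation
    - compatibility-check
  deploymentTime:                  # runs when the data contract container is deployed
    - implementation-matches-declaration
  runTime:
    - policy: dataquality-sla-check
      trigger:
        schedule: "0 * * * *"      # crontab entry
    - policy: dataquality-sla-check
      trigger:
        event: dq-job-success      # webhook/callback from the data quality workload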

Final Thoughts

This article presents the data contract as a technical concept, with a linkage to the semantics given by the relationship between a data interface and its corresponding data model. As we have seen, a data model can be consumed by multiple data contracts. This interpretation is a choice; other options are possible that change the balance between semantics and technical boundaries. For instance, you could consider a data contract able to contain multiple data interfaces. In that case, the data contract could match the semantic boundaries and keep together all the data interfaces sharing the same data model. This choice is legitimate and makes sense: revisit this article under that particular lens and its storytelling changes significantly. Nevertheless, the working principles for data contracts stay.

The terminology used within this article deviates from the common literature on purpose. This is because there are overlaps with common concepts but there are also many novelties that I didn’t want to mix with the rest. Thus, this language sets a mindset. I’ll leave the reader to reconcile those concepts with the rest.

This article didn’t mention data product descriptors or similar artifacts since these concepts overlap but differ from data contracts. Thus, you will find some pieces missing, like how to manage workloads. I considered it out of scope for this article but feel free to ask questions and I’ll provide some guidance afterward.

Finally, I’ve introduced how to link the data model to data contracts and how to manage data contracts as a bidirectional negotiation, that is, data consumers and producers can expect some guarantees from each other.

DISCLAIMER. The dummy implementation is not delivered anywhere to my customers and it’s not recommended for production environments.
Of course, things are always much more complex. The purpose of this example is solely to let you touch these concepts with your own hands.


“Once you figure this out, young Bill, you will be well on your way toward understanding the Three Ways” From: Kim, Gene. “The Phoenix Project“