Data Contracts — Unlocking Data Harmony

Mennan Sevim
Published in DataBulls
8 min read, Aug 20, 2024

— a digital handshake

In software development, communication and data integrity are critical, and data contracts have become an important way to protect both. This article covers what a data contract is, what it is not, and how it is used in the software world.

A data contract is a formal agreement between the users of an originating system and the data engineering team tasked with extracting data for a data pipeline. This data is subsequently loaded into a designated repository, like a data warehouse, where it undergoes further transformations to align with the end user's specific needs.

In addition to specifying data structures, the data contract defines how the data will be shared and used. Note, however, that the contract is not a database schema or an object-relational map. It describes the format in which data moves and is shared, but it does not go into the details of storage.

In simple terms, the data contract sits in the layer between the data producer and the data consumer, as follows.

In the diagram above, the producer may be using version 2 of the data contract, yet the application in which the consumer resides expects version 1 of the data contract. In this case, the consumer will use a migration rule to transform the data so that it conforms to version 1.
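
A consumer-side migration rule of the kind described here might be sketched as follows. The field names (`first_name`, `last_name`, `full_name`) and the `contract_version` key are hypothetical and chosen only for illustration, since the article does not say what changed between the two versions.

```python
from typing import Any


def downgrade_v2_to_v1(record: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical migration rule: reshape a v2 record into the v1 layout.

    Assumed difference (illustrative only): v2 splits the name into
    first_name and last_name, while v1 expects a single full_name.
    """
    dropped = ("first_name", "last_name", "contract_version")
    migrated = {key: value for key, value in record.items() if key not in dropped}
    migrated["full_name"] = f"{record['first_name']} {record['last_name']}"
    migrated["contract_version"] = 1
    return migrated


def read_as_v1(record: dict[str, Any]) -> dict[str, Any]:
    """Consumer entry point: accept v1 unchanged, migrate v2 down to v1."""
    version = record.get("contract_version")
    if version == 1:
        return record
    if version == 2:
        return downgrade_v2_to_v1(record)
    raise ValueError(f"unsupported contract version: {version!r}")


v2_record = {"contract_version": 2, "id": 7, "first_name": "Ada", "last_name": "Lovelace"}
v1_record = read_as_v1(v2_record)
```

Keeping the rule on the consumer side means the producer can move to version 2 without waiting for every consumer to upgrade.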

A data contract is:

Formal Agreement: A formal agreement between the parties that produce data and the parties that consume it.

Structural Definition: It outlines the structures and types of data.

Usage Specification: Defines how the data will be shared and utilized.

Not a Database Schema: It is distinct from a database schema and does not serve as an object-relational map.

Focus on Movement and Sharing: Emphasizes the format for data movement and sharing without delving into storage details.

Data Governance: Includes guidelines for governance, such as data stewardship, ownership, and compliance with relevant regulations. It also defines responsibilities and accountability for data handling.

Data Lifecycle Management: Outlines the data lifecycle, including creation, modification, storage, archiving, and deletion. It specifies data retention periods and data disposal procedures in line with regulatory and business requirements.

Why are Data Contracts so important?

Data contracts are critical in data management and governance for several reasons. Here’s a detailed explanation of their importance:

Data Consistency and Accuracy:

Data contracts are like a set of rules that everyone in the data world follows. Both the folks who create data (producers) and those who use it (consumers) stick to these rules. This helps keep data organized and makes sure everyone understands it the same way.

By having these agreed-upon rules, we significantly reduce the chances of mistakes, confusion, or things not making sense in the flow of data. Think of it as a guide that says what kinds of data are allowed, how they should look, and any restrictions they might have. This guide is crucial for making sure the data we work with is of good quality and doesn’t get messed up.

Precision in Validation:

One of the great aspects of data contracts is their meticulous attention to detail. They precisely define acceptable data types, formatting requirements, and any applicable constraints. This thoroughness ensures that only compliant data is admitted.

Think of the data contract as a vigilant gatekeeper that checks each piece of data before entry. Incorrect or anomalous data is rejected at the door instead of spreading through the data environment, so the data stays reliable and trustworthy.

Safeguarding Data Quality:

Think of data contracts as the guardians of data integrity. They establish clear standards for data appearance and content, which is crucial for maintaining robust and dependable data.

Adhering to these standards ensures our data remains accurate, reliable, and primed for informed decision-making. Simply put, data contracts act like bodyguards for our data, ensuring it remains high-quality and consistent. They are essential for any organization aiming for precise, reliable, and high-quality data.

What Constitutes Data Contracts?

A data contract encompasses not only basic understandings of usage, ownership, and origin but also specific agreements regarding:

  1. Schema
  2. Semantics
  3. Service Level Agreements (SLA)
  4. Metadata (Data Governance)

Let’s delve into each aspect.

Schema

A schema is a set of rules and constraints applied to the attributes and/or columns of a structured dataset. It supports data processing and analysis by telling consumers exactly what shape the data has.

A schema outlines the names, data types, and necessity of attributes. It can also specify the format, length, and acceptable value range for columns.

Schemas are subject to change as data sources evolve or business requirements shift. For instance, a team might switch a numeric identifier to a UUID to prevent integer overflow errors, or remove redundant or dormant columns.

An illustration is a JSON schema for a ‘Person’ business entity.
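
As a rough sketch of what such a schema could look like, the block below writes a JSON Schema for a 'Person' entity as a Python dictionary and checks records against it. The attribute names and limits (`id`, `name`, `age`, the 100-character and 0 to 150 bounds) are assumptions made for this example, and the `validate` helper is a deliberately tiny checker that handles only the keywords used here, not a full JSON Schema validator.

```python
# Hypothetical JSON Schema for a 'Person' entity. The attribute names and
# limits are illustrative assumptions, not taken from the article.
PERSON_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Person",
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string", "maxLength": 100},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
    "required": ["id", "name"],
}

# Maps JSON Schema type names to the Python types that satisfy them.
_PYTHON_TYPES = {"string": str, "integer": int, "object": dict}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the record conforms.

    Covers only: required, type, maxLength, minimum, maximum.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required attribute: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        expected = _PYTHON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "maxLength" in rules and len(value) > rules["maxLength"]:
            errors.append(f"{name}: longer than {rules['maxLength']}")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{name}: above maximum {rules['maximum']}")
    return errors
```

The schema captures the three things mentioned above: attribute names, data types, and whether each attribute is required, plus length and value-range constraints.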

Semantics

Semantics focus on encapsulating the rules specific to each business domain. These rules encompass:

  • State transitions of business entities during their lifecycle (e.g., in e-commerce orders, the fulfillment date cannot precede the order date).
  • Relationships among business entities (e.g., a user in a user dataset can have multiple postal addresses but only one email id).
  • Business conditions (e.g., if a transaction’s fraud score isn’t null, the payout must be 0).
  • Deviations from the norm (e.g., the permissible percentage threshold above the average value).
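
Each of the four kinds of rule above can be expressed as a small predicate. The sketch below is one possible encoding; the function names and parameters are assumptions made for illustration, and "only one email id" is read here as "at most one".

```python
from datetime import date
from typing import Optional


def check_order_dates(order_date: date, fulfillment_date: Optional[date]) -> bool:
    """State transition: the fulfillment date cannot precede the order date."""
    return fulfillment_date is None or fulfillment_date >= order_date


def check_single_email(emails: list[str]) -> bool:
    """Relationship: many postal addresses are fine, but at most one email id."""
    return len(emails) <= 1


def check_fraud_payout(fraud_score: Optional[float], payout: float) -> bool:
    """Business condition: a non-null fraud score forces a payout of 0."""
    return fraud_score is None or payout == 0


def check_deviation(value: float, average: float, max_pct_above: float) -> bool:
    """Deviation: value may exceed the average by at most max_pct_above percent."""
    return value <= average * (1 + max_pct_above / 100)
```

Unlike schema checks, these rules need business context: nothing about the type `date` says that one date must come before another.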

Service Level Agreements (SLA)

SLAs are pledges regarding the availability and timeliness of data in a data product. They guide the design of data consumption pipelines.

Given the constant updates in data products, SLAs might include:

  • The expected time for new data arrival in the product.
  • Maximum delay tolerance for real-time data streams and late-arriving events.
  • Metrics like Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR).
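
The SLA items above translate into simple calculations. The following sketch assumes outages are recorded as (failure start, recovery time) pairs within an observation window; the function names and that record layout are illustrative choices, not part of any particular tool.

```python
from datetime import datetime, timedelta


def is_fresh(last_arrival: datetime, now: datetime, max_delay: timedelta) -> bool:
    """True when the newest data arrived within the agreed maximum delay."""
    return now - last_arrival <= max_delay


def mtbf_and_mttr(
    outages: list[tuple[datetime, datetime]],
    window_start: datetime,
    window_end: datetime,
) -> tuple[timedelta, timedelta]:
    """Compute (MTBF, MTTR) over an observation window.

    MTTR = total downtime / number of failures
    MTBF = total uptime / number of failures
    """
    if not outages:
        raise ValueError("no failures recorded; MTBF and MTTR are undefined")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(outages), downtime / len(outages)


# Example: two outages (2 h and 4 h) in a 10-day window.
outages = [
    (datetime(2024, 1, 2, 0), datetime(2024, 1, 2, 2)),
    (datetime(2024, 1, 5, 0), datetime(2024, 1, 5, 4)),
]
mtbf, mttr = mtbf_and_mttr(outages, datetime(2024, 1, 1), datetime(2024, 1, 11))
```

With 240 hours in the window and 6 hours of downtime, this gives an MTBF of 117 hours and an MTTR of 3 hours.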

Metadata (Data Governance)

Data governance in a data contract spells out security and privacy constraints and helps keep your data products compliant.

For instance, attributes marked for pseudonymization or masking come with limits on how they may be used. Likewise, the handling of Personally Identifiable Information (PII) must comply with data privacy and protection laws and standards such as GDPR, HIPAA, and PCI DSS.

Common data governance elements include:

  • User roles with data product access.
  • Access time limits for a data product.
  • Columns with restricted access or visibility.
  • Columns containing sensitive information.
  • Representation of sensitive data within the dataset.
  • Additional metadata like data contract version, data owners’ names, and contact information.
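
To make these elements concrete, the sketch below applies a small piece of governance metadata when a row is read. The roles (`analyst`, `steward`), the column names, and the layout of the `GOVERNANCE` dictionary are all hypothetical. The truncated SHA-256 hash stands in for pseudonymization only for brevity; a real deployment would more likely use a keyed hash or a tokenization service.

```python
import hashlib

# Hypothetical governance metadata; roles and column names are assumptions.
GOVERNANCE = {
    "roles_with_access": {"analyst", "steward"},
    # Restricted column -> roles that are allowed to see it.
    "restricted_columns": {"phone_number": {"steward"}},
    # Sensitive columns are pseudonymized for every reader.
    "sensitive_columns": {"email"},
}


def pseudonymize(value: str) -> str:
    """Replace a value with a short, stable token (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_governance(row: dict, role: str) -> dict:
    """Return the view of a row that the given role is allowed to see."""
    if role not in GOVERNANCE["roles_with_access"]:
        raise PermissionError(f"role {role!r} has no access to this data product")
    visible = {}
    for column, value in row.items():
        allowed_roles = GOVERNANCE["restricted_columns"].get(column)
        if allowed_roles is not None and role not in allowed_roles:
            continue  # column is hidden from this role
        if column in GOVERNANCE["sensitive_columns"]:
            value = pseudonymize(value)
        visible[column] = value
    return visible


row = {"order_id": 1, "email": "ada@example.com", "phone_number": "555-0100"}
analyst_view = apply_governance(row, "analyst")
steward_view = apply_governance(row, "steward")
```

Here the analyst sees the order id and a pseudonymized email but no phone number, while the steward also sees the phone number.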

Implementation

Data Types

The following data types are supported for model fields and definitions:

  • Unicode character sequence: string, text, varchar
  • Any numeric type, either integers or floating point numbers: number, decimal, numeric
  • 32-bit signed integer: int, integer
  • 64-bit signed integer: long, bigint
  • Single precision (32-bit) IEEE 754 floating-point number: float
  • Double precision (64-bit) IEEE 754 floating-point number: double
  • Boolean (true/false) value: boolean
  • Timestamp with timezone: timestamp, timestamp_tz
  • Timestamp with no timezone: timestamp_ntz
  • Date with no time information: date
  • Array: array
  • Sequence of 8-bit unsigned bytes: bytes
  • Complex type: object, record, struct
  • No value: null
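
A tool that validates values against these type names needs some mapping from each name to a runtime type. The sketch below shows one way this could look in Python. The type names come from the list above, but the Python types assigned to them and the `conforms` helper are assumptions made for illustration.

```python
from datetime import date, datetime
from decimal import Decimal

# Groups of type names from the list above, each paired with the Python
# types this sketch accepts for it. The Python side is an assumption.
TYPE_GROUPS = [
    (("string", "text", "varchar"), (str,)),
    (("number", "decimal", "numeric"), (int, float, Decimal)),
    (("int", "integer", "long", "bigint"), (int,)),
    (("float", "double"), (float,)),
    (("boolean",), (bool,)),
    (("timestamp", "timestamp_tz", "timestamp_ntz"), (datetime,)),
    (("date",), (date,)),
    (("array",), (list,)),
    (("bytes",), (bytes,)),
    (("object", "record", "struct"), (dict,)),
    (("null",), (type(None),)),
]
ACCEPTED = {name: types for names, types in TYPE_GROUPS for name in names}

# Bit widths for the signed integer types.
INT_BITS = {"int": 32, "integer": 32, "long": 64, "bigint": 64}


def conforms(value, type_name: str) -> bool:
    """Check whether a Python value fits the named contract type."""
    types = ACCEPTED[type_name]
    # bool is a subclass of int, so keep it out of the numeric types.
    if isinstance(value, bool) and bool not in types:
        return False
    if not isinstance(value, types):
        return False
    # datetime is a subclass of date; a plain date must carry no time.
    if type_name == "date" and isinstance(value, datetime):
        return False
    if type_name in ("timestamp", "timestamp_tz"):
        return value.tzinfo is not None
    if type_name == "timestamp_ntz":
        return value.tzinfo is None
    bits = INT_BITS.get(type_name)
    if bits is not None:
        return -(2 ** (bits - 1)) <= value < 2 ** (bits - 1)
    return True
```

For example, `conforms(2**31, "int")` is false because the value does not fit in a 32-bit signed integer, while `conforms(2**31, "long")` is true.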

Specification Extensions

While the Data Contract Specification tries to accommodate most use cases, additional data can be added to extend the specification at certain points.

Custom fields can be added with any name. The value can be null, a primitive, an array, or an object.

Design Principles

The Data Contract Specification follows these design principles:

  • A free, open, and open-sourced standard
  • Follow OpenAPI and AsyncAPI conventions so that it feels immediately familiar
  • Support contract-first approaches
  • Support code-first approaches
  • Support tooling by being machine-readable

Tooling

  • Data Contract CLI is a free CLI tool to help you create, develop, and maintain your data contracts.
  • Data Contract Studio is a free web tool to develop and share data contracts.
  • Data Mesh Manager is a commercial tool to manage data products and data contracts. It supports the data contract specification and allows the user to import or export data contracts using this specification.

Data contract defined in Data Mesh Manager

Here is an example of a data contract defined in YAML format, covering schema, semantics, SLA, and governance-related checks.

A data contract defines the structure, format, semantics, quality, and terms of use for exchanging data between a data provider and their consumers.

Let’s walk through this configuration:

Data Contract Details:

  • The dataset includes all order-created events with personally identifiable information (PII) removed.
  • It’s owned by a company or entity named “Checkout”.
  • The data contract version is 1.0.0.

Usage Terms:

  • Users are limited to a maximum of 10 queries per day.
  • The dataset is not suited for real-time use cases.
  • Use of the dataset is billed at $1,000 per month, with a notice period of 3 months.

Data Model:

  • The dataset consists of an “Orders” table with the following fields: order_id, customer_id, email, phone_number, order_date, and order_total.
  • All PII within these fields is masked.

Schema:

  • The schema is provided in SQL Data Definition Language (DDL), outlining the data types and comments for each field in the “Orders” table.

Example Data:

  • Sample records are shown with masked email and phone number fields, alongside dates and order totals.

Quality Checks:

  • A series of SQL queries is provided to check the data for quality issues, such as:
      • Null values in critical columns
      • Duplicate order IDs
      • Invalid email addresses
      • Negative values in order_total
      • Order dates that fall outside the range of January 1, 2020, to December 31, 2023
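
The quality checks listed above could be written along the following lines. This is a sketch that runs against an in-memory SQLite table rather than the contract's actual queries. The sample rows are made up, the choice of `order_id` and `customer_id` as the "critical" columns is an assumption, and the email test is a loose pattern match, not full address validation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id TEXT, customer_id TEXT, email TEXT,
        phone_number TEXT, order_date TEXT, order_total REAL
    )
    """
)
# Made-up sample rows; the second and third rows contain deliberate problems.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("o-1", "c-1", "a***@example.com", "***-1234", "2021-03-01", 20.0),
        ("o-1", "c-2", "not-an-email", "***-5678", "2019-12-31", -5.0),
        (None, "c-3", "b***@example.com", "***-9012", "2022-07-15", 12.5),
    ],
)

# Each query counts violations; a count of 0 means the check passes.
CHECKS = {
    "null_critical_columns":
        "SELECT COUNT(*) FROM orders "
        "WHERE order_id IS NULL OR customer_id IS NULL",
    "duplicate_order_ids":
        "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
        "WHERE order_id IS NOT NULL GROUP BY order_id HAVING COUNT(*) > 1)",
    "invalid_emails":
        "SELECT COUNT(*) FROM orders WHERE email NOT LIKE '%_@_%._%'",
    "negative_order_totals":
        "SELECT COUNT(*) FROM orders WHERE order_total < 0",
    "order_dates_out_of_range":
        "SELECT COUNT(*) FROM orders "
        "WHERE order_date NOT BETWEEN '2020-01-01' AND '2023-12-31'",
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
failed = [name for name, violations in results.items() if violations > 0]
```

With the sample rows above, every check reports one violation, so all five names end up in `failed`. Storing dates as ISO 8601 strings keeps the `BETWEEN` comparison correct in SQLite, which compares them as text.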
