Automated Data Governance with Data Contracts

Daniel Ward
Mesh-AI Technology & Engineering
6 min read · Jun 5, 2023

Introduction

Data governance refers to the practice of managing your data assets through policies and processes in order to maximise value and minimise risk. Although it is often seen as a time-consuming task, we have been exploring how to simplify data governance through automation powered by data contracts. Because we define data contracts as both descriptive and prescriptive documents, they can power the validation of datasets and data products. This enables business professionals to implement data governance without needing technical knowledge or time-consuming manual metadata collection and entry.

What is data governance?

Let’s dive into what we mean by data governance, as it often means different things to different people. In our case, organisational data governance ensures data is well looked after: easy to find, easy to use, well documented, and accountable through built-in ownership.

An easy way to think about data governance is as a library. We need a system that:

  • keeps track of all of our books (our data).
  • provides a way to find books (data).
  • tells us where and how to retrieve them.
  • has a system that allows us to check out (read) the book (data) we want.
  • keeps track of who has checked out (has access to) what.

In our data world, this typically means maintaining a data catalogue, assigning data owners, capturing data access, and monitoring whether what we have tracked is accurate.

Getting this right has many benefits:

  • Your teams get access to the right data faster.
  • You have a clear understanding of who has access to what.
  • You can view your entire data estate at once to see if you have the data you need.
  • You know who to contact to get access or ask questions.

On data quality, an issue for many organisations: getting this right gives you confidence that your data is as you expect it and meets the agreed definition across your business, avoiding unwanted surprises downstream.

What’s the problem?

To maintain these benefits, your teams must continually update your governance implementation to reflect your data estate, which will evolve over time since:

  • data ownership may change over time.
  • your data may change, whether through new capabilities or schema drift.
  • new data will appear and old data may be deleted.
  • new regulatory or organisational standards will arise.

Keeping up with these changes is time-consuming. Your teams become a bottleneck for data governance, as every step requires manual implementation. This is a fundamental problem for your product teams, who must maintain the governance implementation while also building their data products. Your rate of product development inevitably decreases and your governance may become inaccurate.

What’s the solution?

At Mesh-AI, we have been leading the charge on using data contracts as a reliable method to describe and guarantee data products. Extending this, we are implementing automated data governance solutions powered by these same data contracts. By introducing automation, governance can be guaranteed at all stages regardless of changes to the data, definitions, or the regulations surrounding it.

Our solution is underpinned by data contracts. Data contracts can be defined in various ways, but we define ours clearly:

“A document that not only describes the data, but guarantees it, through both testing and reporting on whether the data meets the definition”

In essence, a data contract tells you everything you need to know about the data and, as a consequence of existing, gives you confidence the data meets that definition. They are created against data products to detail:

  • Ownership and Relevant Data Experts
  • Purpose
  • Datasets (including technical definition and individual descriptions)
  • Data quality rules for each independent dataset
  • How to interact with the data product
  • Service Level Agreements for change requests and fixes
  • Methods for requesting & implementing changes
  • A reference to who is reliant on the data product

You can consider data contracts to be a Terms of Service agreement for your data products, written by the Data Product Owner and their team, which you as a data consumer agree to. As a result, the Data Product Owner guarantees to provide what you signed up for in the way they expect you to use it.

With this document in place, we use a central set of automation tools to read the contract, assess the data it is written against, and confirm whether the data matches the contract’s definitions. The tooling also informs Data Product Owners and Data Consumers of data quality issues, keeps the data catalogue up to date, and continually enforces central regulatory definitions.

What does this look like in practice?

There are two core components: the Data Contract and the Data Governance Automation code. For each, a central data governance team defines the standards and methods so that data product teams can easily govern their data products.

1. Creating the data contract

We create the data contract as a simple JSON or YAML document, which humans can easily read and write and computers can easily parse.

The data contract must include data quality rules that describe the limits of the data, so the data can be assessed against them. Whilst you could code these rules yourself, we have found using the Great Expectations library to be the simplest way to achieve this functionality.

{
  "metadata": {
    "contract_template_version": 1.0,
    "contract_version": 1.0,
    "created": "30/05/2023",
    "last_updated": "02/06/2023",
    "updated_by": "dan.ward@mesh-ai.com"
  },
  "contract": {
    "data_product_definition": {
      "name": "Example Data Product",
      "description": "An example data product",
      "data_product_owner": "dan.ward@mesh-ai.com",
      "data_experts": [],
      "domain": "ExampleDomain",
      "endpoints": [
        {
          "system_type": "azure_datalake_gen2_path",
          "connection_address": "https://example.blob.core.windows.net/",
          "connection_root": "test-123",
          "connection_schema": "",
          "connection_object": "something.csv"
        }
      ]
    },
    "datasets": [
      {
        "name": "Something csv",
        "inclusion": "Example data only",
        "granularity": "One record per individual",
        "maintenance_of_history": "As long as this is being used for demonstration purposes",
        "refresh_cadence": {
          "human_readable": "This dataset is updated manually and not refreshed on any cadence",
          "period": "Other"
        },
        "connection": {
          "system_type": "azure_datalake_gen2_path",
          "connection_address": "https://example.blob.core.windows.net/",
          "connection_root": "test-123",
          "connection_schema": "",
          "connection_object": "something.csv"
        },
        "schema": {
          "schema_type": "gx",
          "schema_feed": {
            "name": "example.blob",
            "system_type": "lfs",
            "lfs_path": "great_expectations/expectations/example/blob.json"
          }
        }
      }
    ],
    "change_management": {
      "management_detail": "Contact the data product owner with any change requests",
      "sla_working_days": 15
    }
  }
}

An example of a data contract. In this implementation, the quality rules are kept separate in a Great Expectations ‘Expectation Suite’ file, which is referenced in the main data contract.
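The Expectation Suite itself is another JSON document. Below is a minimal sketch of what the referenced great_expectations/expectations/example/blob.json file might contain; the column names and thresholds here are illustrative rather than taken from a real contract.

{
  "expectation_suite_name": "example.blob",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": { "column": "individual_id" },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "individual_id" },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": { "column": "age", "min_value": 0, "max_value": 120 },
      "meta": {}
    }
  ],
  "meta": {}
}

Each expectation maps to a concrete check Great Expectations can run against the dataset, so business rules such as “age must be between 0 and 120” live alongside the structural schema.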

2. Creating the application

Next, we create an application which consumes a data contract to perform two actions:

  1. Use the contract definitions to assess the data against the quality rules.
  2. Populate the data catalogue with the contract definitions through APIs.

Coding such a solution is non-trivial, requiring engineers who are skilled in programming and have a solid understanding of cloud solution development. Nonetheless, we have found it achievable with Python and Azure Functions, resulting in a highly cost-efficient, serverless solution. Developing this tool centrally removes the burden on your data product teams to implement their own data governance approaches, limiting repeated work, wasted effort and the need for widespread technical expertise. In general, centralising the data automation reduces the overall technical skill requirements of your organisation, even though it demands a comparatively high level of skill within the central team that delivers the solution.
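To make this concrete, here is a minimal sketch of such an application in Python. It assumes the legacy pandas-style Great Expectations API and a hypothetical catalogue endpoint (CATALOGUE_URL); a production version would instead authenticate to the systems named in the contract’s connection blocks and run inside an Azure Function.

# Minimal sketch of contract-driven automation, not the full solution.
# Assumes the legacy pandas-style Great Expectations API; CATALOGUE_URL
# and the catalogue payload shape are hypothetical placeholders.
import json

import great_expectations as ge
import requests

CATALOGUE_URL = "https://catalogue.example.com/api/v1/assets"  # hypothetical

def run_contract(contract_path: str) -> None:
    with open(contract_path) as f:
        contract = json.load(f)["contract"]

    for dataset in contract["datasets"]:
        # Action 1: assess the data against the contract's quality rules.
        # For the sketch we read a local CSV; a real implementation would
        # connect to the system named in connection["system_type"].
        data = ge.read_csv(dataset["connection"]["connection_object"])
        result = data.validate(
            expectation_suite=dataset["schema"]["schema_feed"]["lfs_path"]
        )
        print(f"{dataset['name']}: {'passed' if result.success else 'failed'}")

        # Action 2: populate the data catalogue with the contract definitions.
        requests.post(
            CATALOGUE_URL,
            json={
                "name": dataset["name"],
                "owner": contract["data_product_definition"]["data_product_owner"],
                "domain": contract["data_product_definition"]["domain"],
                "quality_check_passed": result.success,
            },
            timeout=30,
        )

if __name__ == "__main__":
    run_contract("contract.json")

Because the application is driven entirely by the contract, data product teams never touch this code: updating the JSON document is enough to change what gets validated and catalogued.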

3. Orchestration

We configure the central code to trigger on a set of conditions:

  1. When the data contract is updated: the automation must rerun to confirm the data meets the new definitions.
  2. When the data itself is updated: the automation must rerun to confirm the data continues to meet the definitions.
  3. When the refresh cadence defined within the data contract is exceeded: the automation must rescan the data to confirm it was updated within the defined period.
  4. When a user triggers the automation manually.
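As an illustration, conditions 2 and 3 could be wired up with the Azure Functions Python programming model, as sketched below. The container path, cron schedule, and module name are placeholders; contract updates (condition 1) would follow the same blob-trigger pattern against the storage location that holds the contracts.

# Illustrative trigger wiring using the Azure Functions Python model.
# Paths, schedules and names are placeholders, not our production setup.
import azure.functions as func

from governance import run_contract  # hypothetical module holding the sketch above

app = func.FunctionApp()

# Condition 2: rerun the checks whenever the data itself is updated.
@app.blob_trigger(arg_name="blob", path="test-123/{name}",
                  connection="AzureWebJobsStorage")
def on_data_updated(blob: func.InputStream):
    run_contract("contract.json")

# Condition 3: rescan on a schedule (06:00 daily here) to confirm the
# refresh cadence defined in the contract has not been exceeded.
@app.timer_trigger(arg_name="timer", schedule="0 0 6 * * *")
def on_schedule(timer: func.TimerRequest):
    run_contract("contract.json")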

What are the results?

Mesh-AI has implemented this end-to-end automated data governance approach in a number of real-world situations. In all cases, it has been well received. Business-focused professionals can easily create and maintain a data contract, minimising the technical skill and time needed to implement good data governance. Whilst the approach takes time to implement, we have seen it deliver a clear return on investment.

We continue to develop this approach, including making it agnostic of catalogue providers and using more efficient processing methods. We will share how this approach evolves in a future post.
