Bringing Disparate Data Sources Under Control with Good Metadata

Published in Just Tech (Ministry of Justice Digital & Technology) · Sep 3, 2019

by Robin Linacre, Karik Isichei, George Kelly and Adam Booker (Data Engineering Team)


As a data engineering team, we are responsible for harnessing data and making it simple for our analysts to use. We process data from a plethora of sources, from decades-old database systems through to microservices on our Cloud Platform. Our data needs to be delivered quickly, both to live web apps and to analysts who use R, Python and Spark. This variety means we need to find a good trade-off between consistency and flexibility in data processing.

The more we work on this problem, the more we understand the importance of tools that standardise our data processing, transforming data sources from arcane, unreliable formats into dependable, reusable commodities.

Central to this effort is the development of a standard for machine-readable metadata, and the realisation that once this exists, it is useful at almost every stage of the data pipeline:

  • The metadata can serve as a specification for data providers, who can easily use our open source library to check conformance of their data against the spec, knowing we will be applying exactly the same checks (a simplified sketch of such a check follows this list).
  • At data ingestion, we can automate multiple checks of whether the incoming data conforms to the metadata, yielding detailed web reports of rows which failed the checks. See here for a simple demo.
  • During data processing, we can automatically harmonise the wide variety of incoming column data types (string, float, int etc.) from different database systems into a common set.
  • We can automatically convert our system-agnostic metadata into the format required by specific data storage solutions, automating the process of setting up databases. This makes it trivial, for instance, to generate the code needed to set up an AWS Athena database (also sketched after this list).
  • The metadata can be automatically added to a central, searchable data catalogue, enabling data discoverability. We have developed an open source GUI on top of our metadata to enable easy searching and automated SQL query generation.
  • Since the metadata for a particular table is just a JSON file, it can be placed under version control. The metadata is a necessary part of our ETL code, and so it lives in the same repository, meaning the metadata stays in sync with the code.
  • Finally, authoring metadata is simple and fast. Taking inspiration from the jsonschema generator, we can automatically generate a first draft of metadata from existing datasets, and use of our metadata schema enables autocompletion of manual edits in text editors like VS Code.
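
To give a concrete flavour of the first two bullets, here is a minimal sketch in Python of what a table's metadata and a conformance check might look like. The table, field names and helper function are illustrative assumptions, not our exact schema or library API, which apply richer checks and produce detailed reports.

import pandas as pd

# Hypothetical, simplified metadata for a single table; the real schema has
# more fields, so treat the structure here as illustrative only.
metadata = {
    "name": "court_hearings",
    "description": "One row per scheduled hearing",
    "columns": [
        {"name": "hearing_id", "type": "int", "nullable": False},
        {"name": "court_name", "type": "character", "nullable": False},
        {"name": "hearing_date", "type": "date", "nullable": True},
    ],
}

def check_conformance(df: pd.DataFrame, meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means the data conforms."""
    problems = []
    expected = [c["name"] for c in meta["columns"]]

    # Column names and order must match the spec exactly.
    if list(df.columns) != expected:
        problems.append(f"columns {list(df.columns)} != expected {expected}")

    # Non-nullable columns must not contain missing values.
    for col in meta["columns"]:
        if not col["nullable"] and col["name"] in df.columns:
            n_missing = int(df[col["name"]].isna().sum())
            if n_missing:
                problems.append(f"{col['name']}: {n_missing} missing values")

    return problems

df = pd.DataFrame({
    "hearing_id": [1, 2, None],
    "court_name": ["Leeds", "Cardiff", "Luton"],
    "hearing_date": ["2019-01-03", None, "2019-02-14"],
})
print(check_conformance(df, metadata))  # -> ['hearing_id: 1 missing values']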
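The same metadata can then drive type harmonisation and database setup. The sketch below (reusing the metadata dict above) uses a purely illustrative type mapping to collapse source-specific column types into a common set, and renders an Athena CREATE EXTERNAL TABLE statement; the mappings, storage format and bucket name are assumptions for the example rather than our actual configuration.

# Illustrative only: collapse database-specific source types into our common,
# system-agnostic set, then map that set onto Athena (Hive) column types.
SOURCE_TO_AGNOSTIC = {
    "NUMBER(10,0)": "int",         # Oracle
    "VARCHAR2(255)": "character",  # Oracle
    "nvarchar(max)": "character",  # SQL Server
    "float8": "float",             # Postgres
    "datetime2": "datetime",       # SQL Server
}

AGNOSTIC_TO_ATHENA = {
    "int": "bigint",
    "float": "double",
    "character": "string",
    "date": "date",
    "datetime": "timestamp",
}

def harmonise_type(source_type: str) -> str:
    """Map a source-system column type onto the common set (falling back to character)."""
    return SOURCE_TO_AGNOSTIC.get(source_type.strip(), "character")

def athena_create_table(meta: dict, s3_location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement from the table metadata."""
    cols = ",\n  ".join(
        f"{c['name']} {AGNOSTIC_TO_ATHENA[c['type']]}" for c in meta["columns"]
    )
    return (
        f"CREATE EXTERNAL TABLE {meta['name']} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"  # storage format assumed for the example
        f"LOCATION '{s3_location}';"
    )

print(harmonise_type("NUMBER(10,0)"))  # -> int
print(athena_create_table(metadata, "s3://example-bucket/court_hearings/"))  # hypothetical bucket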

A simple example

We have developed an online, interactive demo of some of our tools, which you can find here. This notebook demonstrates how we can:

  • Produce a validation report of a dataset which validates successfully against a metadata schema
  • Produce a validation report of a dataset which fails to validate
  • Auto-create a draft metadata schema from an existing dataset (sketched below)
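
To give a flavour of that last step, a first draft of metadata can be inferred from an existing dataset along roughly these lines. This is a simplified sketch: the inferred fields and the input file name are assumptions, and a human would still review and edit the draft.

import json
import pandas as pd

def draft_metadata(df: pd.DataFrame, table_name: str) -> dict:
    """Infer a draft metadata dict from a DataFrame, ready for human review and editing."""
    columns = []
    for name, dtype in df.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            agnostic = "int"
        elif pd.api.types.is_float_dtype(dtype):
            agnostic = "float"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            agnostic = "datetime"
        else:
            agnostic = "character"
        columns.append({
            "name": name,
            "type": agnostic,
            "nullable": bool(df[name].isna().any()),
        })
    return {"name": table_name, "columns": columns}

df = pd.read_csv("court_hearings.csv")  # hypothetical input file
print(json.dumps(draft_metadata(df, "court_hearings"), indent=2))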
