Throwing down the gauntlet to data quality with Data Schemas

Albert Franzi
Dec 21, 2018

Throwing down the gauntlet to data quality while keeping the data up to date by using Data Schemas.

Tørvikbygd ferjekai, Norway. by Albert Franzi.

This is the first of a series of articles in which we describe our approach to flexible and privacy-compliant data collection. Specifically, we will examine here how to create a good data quality culture through the use of data schemas.

This article introduces the concept of a data schema and its boundaries. The upcoming articles will cover how we handle schema evolution and data anonymization using JSLT, and how all this fits together in our data collection system.

Context

At Alpha Health we are developing genuine data products, and hence we regard data as one of our most valuable assets. In Rem!x, for instance, we collect information provided by the users to recommend activities that might increase their well-being. For such products, which are in constant development, data also evolves continuously to meet new requirements. Hence, from one product version to the next, some data entities are expected to change and even to look completely different. To keep track of the changes over time while improving the knowledge and understanding of our data, we need to accurately define our data entities using schemas, as well as how these schemas have to evolve from version to version.

What is a schema

The schema of a data entity is its DNA: it defines the entity's structure, format, content, and quality.

Defining a proper schema gives us a better and clearer understanding of the data, and a clear understanding of the data allows us to create better products for the users.

A schema makes explicit which data fields are expected, which are compulsory, which combinations of fields are valid, and even which ranges of values each field should fall in.
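As a rough illustration of what such a schema can express, the sketch below describes a hypothetical activity rating entity as a Python dict and checks an instance by hand using only the standard library. The field names (`userId`, `rating`, `comment`, `commentLanguage`) and the `check` helper are invented for this example; the helper covers only the three keywords shown and is not a JSON Schema validator.

```python
# Hypothetical schema: every field name here is invented for illustration.
RATING_SCHEMA = {
    "type": "object",
    "properties": {
        "userId": {"type": "string"},
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "comment": {"type": "string"},
        "commentLanguage": {"type": "string"},
    },
    # Compulsory fields.
    "required": ["userId", "rating"],
    # Valid combinations (draft-07 keyword): a comment must come with its language.
    "dependencies": {"comment": ["commentLanguage"]},
}


def check(instance, schema):
    """Return a list of problems; handles only required, dependencies, minimum/maximum."""
    problems = []
    for name in schema.get("required", []):
        if name not in instance:
            problems.append(f"missing required field: {name}")
    for name, needed in schema.get("dependencies", {}).items():
        if name in instance:
            for other in needed:
                if other not in instance:
                    problems.append(f"{name} requires {other}")
    for name, rules in schema.get("properties", {}).items():
        value = instance.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "minimum" in rules and value < rules["minimum"]:
                problems.append(f"{name} is below {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                problems.append(f"{name} is above {rules['maximum']}")
    return problems
```

Calling `check({"userId": "u1", "rating": 4}, RATING_SCHEMA)` returns an empty list, while an instance without `userId` or with a `rating` of 9 is reported.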

Besides, schemas can be regarded as a contract between data producers and data consumers, so they can improve their collaboration by using the proper data definition.

Standards

There are several standards that try to structure the data that surrounds us. Some of them are:

  • Schema.org is a collaborative active community with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.
  • JSON Schema (json-schema.org) is a vocabulary that allows annotation and validation of JSON documents, similar to Avro schemas or OpenAPI (formerly Swagger).
  • Open mHealth is an open standard for mobile health data.
  • H.860 is an ITU-T recommendation for multimedia e-health data exchange services. Its goal is to define a common health data schema.

But what value do we get from using schemas?

At Alpha Health, we use the JSON Schema standard because it lets us describe our existing data formats while providing clear human- and machine-readable documentation.

Besides, because JSON Schema is a widely adopted standard, we can leverage existing tooling. For instance, we can resort to existing open source libraries such as JSON Schema Validator to guarantee the quality of the data ingested into our system.

Ref: json-schema.org // Getting started step by step

JSON Schema can also be used to require that a given JSON document satisfies certain criteria. The specification provides keywords for this purpose and, in addition, defines a set of keywords to assist in interactive user interface generation.

The validation specification uses the concepts, syntax, and terminology defined by the JSON Schema core specification.

Designing Schemas

Representing the data with schemas is crucial to better understand what the data means. For this reason, it is really important to know how to define them properly.

In the next sections, we explain how to assemble a schema, which data types we can use for that purpose, how to apply patterns and formats to the content, and how to combine multiple schemas to create complex data specifications.

Identifying the schema

Ref: Understanding-json-schema # The-id-property

Every piece of data can potentially be defined by a schema, which means we will eventually end up with many schemas, one for each of our areas. To be able to validate a piece of data, the data needs to provide a reference to the schema it instantiates.

For that, we are going to include a “schema” field in our data with this information. This schema field will match against the $id field from the schema definition.

Example from Understanding the JSON schema

The $id property is a URI that serves two purposes:

  • It declares a unique identifier for the schema.
  • It declares a base URI against which $ref URIs are resolved.

$id also provides a way to refer to subschemas without using JSON Pointer. This means you can refer to them by a unique name, rather than by where they appear in the JSON tree.
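A minimal sketch of both purposes, assuming a hypothetical in-memory registry keyed by `$id`. The URIs, the `REGISTRY` dict, and the `schema_for` helper are invented for this example; only the `$id` keyword and the "schema" field in the data come from the text above.

```python
from urllib.parse import urljoin

# Hypothetical schemas; the $id URIs are placeholders, not real endpoints.
SCHEMAS = [
    {"$id": "https://example.com/schemas/user/1.0.0", "type": "object"},
    {"$id": "https://example.com/schemas/order/1.0.0", "type": "object"},
]

# Purpose 1: $id is a unique identifier, so it can serve as a lookup key.
REGISTRY = {schema["$id"]: schema for schema in SCHEMAS}


def schema_for(data):
    """Return the schema whose $id matches the data's "schema" field, or None."""
    return REGISTRY.get(data.get("schema"))


event = {"schema": "https://example.com/schemas/order/1.0.0", "orderId": "o-42"}
matched = schema_for(event)

# Purpose 2: $id is the base URI against which a relative $ref is resolved.
resolved = urljoin("http://example.com/root.json", "definitions.json#/address")
```

Here `matched` is the order schema, and `resolved` is the absolute URI that a `$ref` of `definitions.json#/address` would point to inside a schema whose `$id` is `http://example.com/root.json`.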

Types

Ref: Understanding-json-schema # Types

The type keyword is fundamental to JSON Schema. It specifies the data type for a schema.

At its core, JSON Schema defines the following basic types: string, numeric types (integer and number, which can be further constrained by ranges and multiples), object, array, boolean, and null.

Types can also be complemented with other keywords that further specify the data constraints. The following schema depicts two examples.

String and Number examples using type properties.
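A minimal sketch of two such schemas, one for a string constrained by length and one for a number constrained by range and multiple. The concrete limits and the `valid_string` and `valid_number` helpers are assumptions made for illustration; they implement only the keywords shown, using the standard library.

```python
import json

# A string of 2 to 3 characters, and a number from 0 to 100 in steps of 10.
STRING_SCHEMA = json.loads('{"type": "string", "minLength": 2, "maxLength": 3}')
NUMBER_SCHEMA = json.loads(
    '{"type": "number", "minimum": 0, "maximum": 100, "multipleOf": 10}'
)


def valid_string(value, schema):
    """Check the type plus minLength/maxLength only."""
    if not isinstance(value, str):
        return False
    if len(value) < schema.get("minLength", 0):
        return False
    return len(value) <= schema.get("maxLength", len(value))


def valid_number(value, schema):
    """Check the type plus minimum/maximum/multipleOf only."""
    # bool is a subclass of int in Python, but JSON booleans are not numbers.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    if value < schema.get("minimum", value) or value > schema.get("maximum", value):
        return False
    step = schema.get("multipleOf")
    return step is None or value % step == 0
```

With these definitions, `"AB"` and `30` are accepted, while `"A"`, `"ABCD"`, `35`, and `110` are rejected.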

Patterns & Formats

Ref: Understanding-json-schema # Regular-expressions

The pattern keyword is used to further restrict a string to a particular regular expression. The regular expression syntax is the one defined in JavaScript (ECMA 262 specifically).

Patterns are quite useful because they let us define extra constraints on the data content that enums or formats do not cover. To create and test regexes, we recommend the regex101 tool, which lets you test a regular expression against sample text and also provides a really good explanation of how the regex works.

North American telephone number
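As a sketch, a schema for a simple North American telephone number (an optional three-digit area code in parentheses, followed by `NNN-NNNN`) could look like the following. Python's `re` module is used here as a stand-in for an ECMA 262 engine, which is an assumption that holds for this particular pattern because it only uses syntax common to both; `matches_pattern` is a hypothetical helper.

```python
import re

PHONE_SCHEMA = {
    "type": "string",
    "pattern": r"^(\([0-9]{3}\))?[0-9]{3}-[0-9]{4}$",
}


def matches_pattern(value, schema):
    """Apply the schema's pattern to a string value."""
    # JSON Schema patterns are not implicitly anchored, so search() is used
    # and the anchors (^ and $) come from the pattern itself.
    return isinstance(value, str) and re.search(schema["pattern"], value) is not None
```

With this pattern, `"555-1212"` and `"(888)555-1212"` match, while `"(888)555-1212 ext. 532"` and `"(800)FLOWERS"` do not.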

Ref: Understanding-json-schema # Format

The format keyword allows for basic semantic validation on certain kinds of string values that are commonly used. This allows values to be constrained beyond what the other tools in JSON Schema, including Regular Expressions, can do.

Some of the formats specified in the JSON Schema specification are date-time, email, hostname, ipv4, ipv6, uri, and uri-reference. These formats should allow validating the most common fields used in user or network data.
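To make the idea concrete, here is a rough sketch of how two of these formats might be checked with the Python standard library. The property names `createdAt` and `clientIp` and all helper functions are invented for this example, and the `date-time` check is only an approximation of RFC 3339. Note also that validators may treat `format` as optional, so it is worth checking how your library handles it.

```python
import ipaddress
from datetime import datetime

# Hypothetical schema: the property names are invented for illustration.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "createdAt": {"type": "string", "format": "date-time"},
        "clientIp": {"type": "string", "format": "ipv4"},
    },
}


def is_date_time(value):
    """Approximate date-time check: ISO 8601 with a time part and a UTC offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return False
    return "T" in value and parsed.tzinfo is not None


def is_ipv4(value):
    """Check that the string is a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


FORMAT_CHECKERS = {"date-time": is_date_time, "ipv4": is_ipv4}


def format_errors(instance, schema):
    """Return the names of string properties that fail their declared format."""
    failed = []
    for name, rules in schema["properties"].items():
        checker = FORMAT_CHECKERS.get(rules.get("format"))
        value = instance.get(name)
        if checker and isinstance(value, str) and not checker(value):
            failed.append(name)
    return failed
```

An instance such as `{"createdAt": "2018-12-21T10:00:00+00:00", "clientIp": "192.168.0.1"}` passes both checks, whereas a date without a time part or an address like `256.0.0.1` is reported.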

Combining Schemas

Ref: Understanding-json-schema # Structuring a complex schema

One of the most powerful properties of JSON Schema is the ability to compose schemas into more complex ones. This is useful for structuring a schema into parts that can be reused in a number of places. Even in real life, we are used to splitting the objects we see into different parts (e.g., a car contains wheels, doors, seats, belts, an engine, etc.). By defining each object separately, we get better and reusable object definitions.

Example of combining multiple schemas (address, product & user) to form the Order event schema.
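A minimal sketch of such a composition, assuming hypothetical field names and a placeholder `$id`. The address, product, and user schemas live under `definitions` and are reused through `$ref`. The `resolve` and `missing_fields` helpers are invented for this example: they only follow local `#/...` references without JSON Pointer escaping and only check `required`, so they are not a full validator.

```python
# Hypothetical order event schema; all field names and the $id are placeholders.
ORDER_EVENT_SCHEMA = {
    "$id": "https://example.com/schemas/order-event",
    "type": "object",
    "definitions": {
        "address": {"type": "object", "required": ["street", "city"]},
        "product": {"type": "object", "required": ["sku", "price"]},
        "user": {
            "type": "object",
            "required": ["name", "shippingAddress"],
            "properties": {"shippingAddress": {"$ref": "#/definitions/address"}},
        },
    },
    "required": ["user", "products"],
    "properties": {
        "user": {"$ref": "#/definitions/user"},
        "products": {"type": "array", "items": {"$ref": "#/definitions/product"}},
    },
}


def resolve(node, root):
    """Follow local "#/..." references until a concrete subschema is reached."""
    while "$ref" in node:
        target = root
        for part in node["$ref"][len("#/"):].split("/"):
            target = target[part]
        node = target
    return node


def missing_fields(instance, schema, root, path=""):
    """Return dotted paths of required fields that are absent from the instance."""
    schema = resolve(schema, root)
    if not isinstance(instance, dict):
        return []
    problems = [path + name for name in schema.get("required", []) if name not in instance]
    for name, sub in schema.get("properties", {}).items():
        if name not in instance:
            continue
        sub = resolve(sub, root)
        if sub.get("type") == "array":
            for i, item in enumerate(instance[name]):
                problems += missing_fields(item, sub["items"], root, f"{path}{name}[{i}].")
        else:
            problems += missing_fields(instance[name], sub, root, f"{path}{name}.")
    return problems
```

Because the address definition is written once and referenced from the user schema, a missing `city` in the shipping address is caught without repeating the address rules anywhere else.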

Guidelines

The JSON Schema specification is pretty open and flexible, but sometimes it is important to enforce certain styles and conventions in order to facilitate collaboration across engineering and data teams (or producers and consumers in general). This is precisely the approach we are taking at Alpha.

It is much easier to understand a large codebase when all the code in it is in a consistent style.

These are two great JSON guidelines to use as a basis for defining your own:

In our case, we have decided to stick to the Zalando guidelines, but instead of naming properties in “snake_case” we define them in “camelCase”.
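A convention like this can be checked automatically. The sketch below is one possible lint, not part of any guideline: it assumes that "camelCase" means a lowercase first letter followed by letters and digits only, and the `non_camel_case_properties` helper and the sample schema are invented for this example.

```python
import re

# Assumed definition of camelCase: lowercase first letter, then letters/digits only.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def non_camel_case_properties(schema, path=""):
    """Recursively collect dotted paths of property names that are not camelCase."""
    offenders = []
    for name, sub in schema.get("properties", {}).items():
        if not CAMEL_CASE.match(name):
            offenders.append(path + name)
        offenders += non_camel_case_properties(sub, f"{path}{name}.")
    return offenders


# Hypothetical schema mixing both naming styles.
SAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "shipping_address": {
            "type": "object",
            "properties": {"zip_code": {"type": "string"}, "city": {"type": "string"}},
        },
    },
}

offenders = non_camel_case_properties(SAMPLE_SCHEMA)
```

Running a check like this in continuous integration is one way to keep producers and consumers aligned on the agreed style.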

We hope this article sheds some light on what a JSON schema is and which advantages it can bring us.

Stay tuned for the next article, where we will explain how we are using JSLT expressions to evolve data and face up to the GDPR by design.

If you are using JSON schemas in your company, don’t be shy and share with the community which use cases and experiences you have had with them 🧐.
