Data Modelling Isn’t Dead: But We Need It To Evolve

Whilst the engineering behind big data projects gets a lot of the limelight, we still need to keep focused on how the data delivers business value.

Carl Follows
Version 1
5 min read · Mar 8, 2022


This introduction to Data Vault modelling gives another approach to delivering that value.

[Cover image: connected 3D blocks. Photo by Shubham Dhage on Unsplash]

TL;DR

Data Vault modelling offers us a flexible data modelling approach, allowing our model to evolve with the business it represents.
This flexibility means we can leverage standardised data pipelines & reduce data engineering complexity as the business & its data model evolves.

Engineering Change

Organisations are increasingly thinking about their data analytics challenges as purely engineering problems: how to get timely data to people.

Previously, the focus was on modelling the organisation’s data, using methodologies such as Kimball star schemas. Latency was a secondary priority, since reports focused on the rear-view mirror and yesterday’s data was adequate to support strategic planning.

There is now an expectation of near real-time data at a high level of detail with the continual addition of new data points & deprecation of old ones — Variety & Velocity. This is driving a focus on data engineering & data delivery.

Organisational change

Predicting where you’ll be in a few years’ time requires being able to explain how you got to where you are now.

Businesses change & data systems need to effortlessly represent that change. As operational systems and processes evolve, so must the data model that describes them.

In the past, nightly loads allowed time for the transformation logic to be refactored and the data warehouse rebuilt into new models, though this often resulted in the loss of the old models and that version of the truth.

The improvements in data engineering and the ability to apply schema-on-read allow us to delay defining the data model. However, without careful planning, the data engineering behind these models can soon become complex as the pipelines try to adapt to a changing organisation.

Ultimately, a data model is a representation of the business; it must be able to evolve alongside the business without swathes of data engineering code being refactored. We need to represent the business not only as it is, but also as it was, along with all the interim states on that journey.

Data Vault Modelling

Building a layer between the operational systems and the analysis data marts allows the turmoil of organisational change to be abstracted away from data consumers who require a simplified view. Whether this intermediate layer is a data lake or a data warehouse, it must still be modelled in a way that allows it to evolve and retain historic information.

The Data Vault Methodology was conceived back in the 1990s by Dan Linstedt, has been continuously developed since, and is most completely described in the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Daniel Linstedt and Michael Olschimke.

Whilst many would describe a key benefit of the Data Vault Methodology as its ability to maintain a detailed audit trail, it’s this evolutionary ability that I find the most compelling reason for its adoption.

This is possible thanks to the physical separation of the business identifiers (into hubs) from their attributes (into satellites) and the use of bridging tables (links) which allow the cardinality of relationships to change.

Using multiple satellites per hub allows attributes to be mastered in different operational systems, with different lifecycles and rates of change.
This simplifies data pipelines and reduces the chance of complex dependencies forming within the data platform.
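As a sketch of how these three table classes might be represented, the Python dataclasses below use field names (hash keys, LoadDate, HashDiff) that follow common Data Vault 2.0 conventions; the exact names and types are assumptions for illustration, not something the methodology prescribes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical minimal row shapes for the three Data Vault table classes.

@dataclass
class HubRow:
    hub_hash_key: str        # hash of the business identifier
    business_key: str        # e.g. a product code or store id
    load_date: datetime      # last time the identifier was seen in source data
    record_source: str

@dataclass
class LinkRow:
    link_hash_key: str       # hash of the combined hub keys
    hub_hash_keys: tuple     # references to the related hubs
    load_date: datetime
    record_source: str

@dataclass
class SatelliteRow:
    parent_hash_key: str     # the hub (or link) this satellite describes
    load_date: datetime
    load_end_date: Optional[datetime]  # None while the row is current
    hash_diff: str           # hash of all attributes, for change detection
    attributes: dict         # e.g. {"name": "Widget", "price": 9.99}
```

Because attributes live only in satellites, a hub can have several satellites, each loaded by its own pipeline at its own cadence.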

Evolutionary Modelling in action

As an example, if we wanted to understand which products are available for sale at a store, we might create a model as follows.

Data Model of a relationship indicating products offered at a store
  • Hub tables contain just the business identifiers and the last time each identifier was supplied in the source data.
  • The link table records only the fact that the product has, at some point, been offered at the store, and nothing more.
  • Satellite tables contain all the attributes, including any effective dates of the product being on offer.

What’s powerful about this model is that data never needs updating, which helps audit and performance alike. Instead of updates, the HashDiff attribute is used to check whether an attribute has changed since the previous load; if so, a new record is created in the satellite table and the old record has its LoadEndDate set.
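A minimal sketch of that insert-only loading pattern, assuming satellite rows are plain dictionaries and SHA-256 as the HashDiff function (the methodology doesn’t mandate a particular hash):

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(attributes: dict) -> str:
    """Hash all attribute values in a fixed column order."""
    canonical = "|".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def load_satellite(rows: list, parent_key: str, attributes: dict) -> list:
    """Insert-only load: end-date the current row and append a new one
    only when the incoming HashDiff differs from the current one."""
    now = datetime.now(timezone.utc)
    new_diff = hash_diff(attributes)
    current = [r for r in rows
               if r["parent_key"] == parent_key and r["load_end_date"] is None]
    if current and current[0]["hash_diff"] == new_diff:
        return rows                      # nothing changed: no new version
    for r in current:                    # close off the superseded version
        r["load_end_date"] = now
    rows.append({
        "parent_key": parent_key,
        "load_date": now,
        "load_end_date": None,
        "hash_diff": new_diff,
        "attributes": attributes,
    })
    return rows
```

Loading the same attributes twice produces no new row, while a changed attribute end-dates the old version and appends a new one, preserving the full history.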

You could say there’s nothing special about this, that it’s very similar to the classic Type-2 Slowly Changing Dimension. However, when implemented rigorously it allows for a simple evolution of the data model as the business changes.

When the store evolves into a marketplace, you can see that a new link table is created, indicating which vendor is offering each product. We don’t remove the old link or satellite tables; we just set the LoadEndDate to mark their data as historic.

Aside from that, all changes are additive in nature, allowing for a smooth evolution from the old data model to the new, whilst simultaneously supporting any historical reporting needs.

Data Model of a relationship indicating products offered at a store by vendor
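That additive evolution could be sketched as follows. The idea of retiring a link by end-dating its rows while leaving the table in place follows the description above; the table names and in-memory representation are purely illustrative:

```python
from datetime import datetime, timezone

def retire_link(tables: dict, old_link: str, new_link: str) -> dict:
    """Additive schema evolution: keep the old link table for historical
    reporting, end-date its current rows, and add the new link alongside."""
    now = datetime.now(timezone.utc)
    for row in tables[old_link]:
        if row.get("load_end_date") is None:
            row["load_end_date"] = now   # mark as historic; never delete
    tables.setdefault(new_link, [])      # new link starts empty, loads additively
    return tables
```

Nothing is dropped, so queries over the old model keep working against the end-dated rows while new loads flow into the new link.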

Plan for your next Data Platform

Data Vault modelling is less intuitive than the more familiar Inmon and Kimball methodologies. It requires a more specialist modelling skill and hasn’t received the same degree of visibility as its peers.

Whilst it adds complexity, the standard patterns it defines lend themselves to a templated, metadata-driven pipeline; more than that, a platform driven by metadata is much easier to maintain than one that’s been handcrafted.
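As an illustration of what such a metadata-driven pipeline might look like, the sketch below drives every satellite load from one templated loop. The metadata fields, file names, and function names are all hypothetical; in a real platform the metadata might live in a config table or YAML file.

```python
# Hypothetical metadata describing each satellite load.
SATELLITE_METADATA = [
    {"source": "products.csv", "target": "sat_product", "key": "product_code"},
    {"source": "stores.csv",   "target": "sat_store",   "key": "store_id"},
]

def run_pipeline(metadata, extract, load):
    """One generic loader, parameterised by metadata, replaces a
    hand-crafted pipeline per table."""
    for entry in metadata:
        for record in extract(entry["source"]):
            load(entry["target"], record[entry["key"]], record)
```

Adding a new satellite then becomes a metadata change rather than new pipeline code, which is what keeps the platform maintainable as the model evolves.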

So next time you’re enhancing your data model take a look at the Data Vault methodology, you might find it answers some of the challenges you’re facing and allows you to create a model that can evolve alongside your business.

About the Author

Carl Follows is a Solution Architect in Version 1’s UK Digital, Data & Cloud practice, specialising in Data Analytics platforms built on Microsoft Azure.
