The Different Grains of Multi-Active Records

Patrick Cuba
The Modern Scientist
9 min read · Oct 8, 2024

Common feedback when data professionals first learn what a data vault model is, is that they foresee a large number of tables being modelled, deployed and managed. They’re not wrong; as my former mentor put it, you should expect one and a half times more tables in a data vault than in an equivalent 3rd normal form data model. However, when we compare the benefits and detriments of the various data modelling methodologies, data vault lands somewhere in the middle. Observe,

Serious data models

Inmon 3rd normal form data models are industry focussed, their constraints ruled by business processes. Kimball models offer a level of denormalisation that OLAP data platforms are optimised to serve, structured to answer business questions. Data Vault has an element of both: hub tables record the instances of the business objects a business cares about, while its patterns make full use of OLAP optimisations, but it requires upskilling to learn how to blend the two. Best of all, a data vault is non-destructive to change, where the other data models require refactoring.

To address the fact that a data vault does indeed mean we are managing one and a half times more tables than a 3NF data model, let’s establish a few soft constraints to aim for when modelling our Data Vault, so we do not go beyond this ratio:

  • Top-down modelling: we do not model weak hubs or “hubs that behave like links”; hub tables are the modelled representation of your business’s business-key repositories and nothing else. We will never find dates as hub tables, and we do not aim to replace Kimball modelling with a Data Vault (they’re complementary). You should never see a data vault model with 100+ hub tables; just how many business object types are you really modelling? Are you merely ensuring you have something to join on when you model these?
  • Dependent-child keys (related to the above point): when modelling codes, reference data and other sub-category identifiers, these cannot be modelled into hub tables simply because they cannot stand on their own. If you were to model them as independent hub tables, you must rely on creating more tables to link these ‘weak hub’ tables to everything else. It is also not an efficient way to avoid link-based satellite tables; as we showed in the previous post, link satellite tables work very well if you know how to use them. A pragmatic data vault also does not model dependent-child keys into link tables; we load them into satellite tables instead. The advantages of this approach are profound:

    ◦ Reduced number of modelled tables, simply because you do not have link tables built just to link to weak hub tables (creating hub-sprawl).

    ◦ Reduced link and link-satellite table complexity: if you have a dependent-child key in a link table, is its value included in the link hash key or not? Because the link table is a unique list, a dependent-child key modelled here must be true for the life of the relationship. And how do you trace dependent-child keys if you have a driver key and effectivity satellite modelled on the same link table too?

    ◦ Increased model scalability, by allowing for dependent-child key versioning and encouraging link table reusability; a link table may have zero (a non-descriptive link) or more link satellite tables, and each can operate at its own independent grain. Ultimately you’ll have fewer tables to manage here too.

By moving dependent-child key columns to the satellite table, the satellite table would appear to be multi-active; but it is not. It is tracking the record’s state (true changes) at a finer grain than that of a regular satellite table, but it is not tracking changes to a set of records as we would in a true multi-active satellite table. Keep the following image in mind as we traverse the scenarios to show the difference…

All three can exist off a hub or link table; satellite splitting permitted

Enough theory; let’s dive into how these satellite tables differ in structure and understand their respective use cases.

Table Structure

Satellite tables contain the descriptive state information of a parent hub or link table.

Note that no additional column is needed for a satellite table with a dependent-child key. Dependent-child keys are columns that are merely identified and included in the record digest (HashDiff); therefore, when tracking true changes we track by the satellite table’s hash key and one or more dependent-child keys. We will trace the effect of this in the subsequent sections; meanwhile, this will be our data model and pipeline for the remainder of the article.

The Data Model and Data Pipeline
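To make that rule concrete, here is a minimal Python sketch of the change-detection mechanics; the column names (hash_key, dep_child_key) are mine for illustration, not the book’s implementation. The dependent-child key is simply another column folded into the HashDiff, and a staged record is a true change only when its digest differs from the active record at the (hash key, dependent-child key) grain.

```python
import hashlib

def hashdiff(record: dict, columns: list[str]) -> str:
    """Digest the listed descriptive columns (the dependent-child key
    included); any change in any of their values yields a new HashDiff."""
    payload = "||".join(str(record.get(col, "")) for col in columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def is_true_change(staged: dict, active: dict, columns: list[str]) -> bool:
    """Track by (hash_key, dep_child_key): 'active' maps that composite
    key to the HashDiff of the currently active record at that finer grain."""
    key = (staged["hash_key"], staged["dep_child_key"])
    return active.get(key) != hashdiff(staged, columns)
```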

Scenarios

To keep these examples simple, we will use dummy key values without hashing, which makes this article easier to follow, and we will limit the descriptive attributes to just what is needed to illustrate the differences between the two satellite table types used in the model. We will depict three scenarios:

  • When attribute values change for the same dependent-child key
  • When a new dependent-child key appears
  • When the number of active records changes, a SET change

We will simulate the same data being loaded to both satellite table types to illustrate the difference.
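Before stepping through the days, here is a hedged Python sketch of the two load patterns being simulated; the table and column names are mine, for illustration only. The dependent-child key satellite versions records per (entity key, dependent-child key), while the multi-active satellite digests the whole staged set per entity key and inserts a new set whenever that set digest changes.

```python
import hashlib
from collections import defaultdict

def digest(rec: dict) -> str:
    # Record digest over the descriptive columns; 'code' is included.
    return hashlib.md5(f"{rec['code']}||{rec['attr']}".encode()).hexdigest()

def load_dep_key_satellite(sat: list, staged: list, load_date: int) -> None:
    """Insert a staged record only if its digest differs from the current
    active record for (entity_key, code) — the finer grain."""
    active = {}
    for row in sat:  # rows are appended in load order, so the last write
        active[(row["entity_key"], row["code"])] = row["hashdiff"]  # wins
    for rec in staged:
        hd = digest(rec)
        if active.get((rec["entity_key"], rec["code"])) != hd:
            sat.append({**rec, "hashdiff": hd, "load_date": load_date})

def load_multi_active_satellite(sat: list, staged: list, load_date: int) -> None:
    """Digest the whole staged set per entity key; when it differs from
    the currently active set, insert the entire staged set (a SET change)."""
    staged_sets = defaultdict(list)
    for rec in staged:
        staged_sets[rec["entity_key"]].append(rec)
    for entity, records in staged_sets.items():
        rows = [r for r in sat if r["entity_key"] == entity]
        latest = max((r["load_date"] for r in rows), default=None)
        old_set = sorted(r["hashdiff"] for r in rows if r["load_date"] == latest)
        new_set = sorted(digest(r) for r in records)
        if new_set != old_set:
            for rec in records:
                sat.append({**rec, "hashdiff": digest(rec), "load_date": load_date})
```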

Setting the scene, day 1 loading empty satellites

The ‘Code’ column in the above example is included in the calculation of the record hash digest (HashDiff); it is merely an attribute. We make it visible for the multi-active satellite in this example too, so it is easier to trace the difference between the satellite table types.

Now… let’s push through some changes…

Day 2

Day 2 we can see differences in satellite table content already

Dependent-child key satellite

  • Entity key ‘101’: when an attribute’s value changes it changes the record digest; because we are tracking changes for dependent-child key ‘A’, the new attribute value is now the current active record.
  • Entity key ‘456’: we see only one of the dependent-child keys change its descriptive content, and that is the only record that is versioned.

Multi-active satellite

  • Entity key ‘101’: any change generates a new set; here the outcome mirrors what the dependent-child key satellite does.
  • Entity key ‘456’: one of the records in the set has changed an attribute and therefore its record digest is new; what has been staged is now the new SET. The dependent-child key does not factor into how sets are determined; it’s still just a column contributing to the record digest.
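Running the earlier sketch against illustrative day 1 and day 2 data (dummy values in the spirit of the scenario; they are not the article’s exact tables) makes the divergence visible: the dependent-child key satellite versions only the changed records, while the multi-active satellite re-inserts each entity’s full new set.

```python
sat_dck, sat_ma = [], []

day1 = [
    {"entity_key": "101", "code": "A", "attr": "x"},
    {"entity_key": "456", "code": "A", "attr": "p"},
    {"entity_key": "456", "code": "B", "attr": "q"},
    {"entity_key": "456", "code": "C", "attr": "r"},
]
# Day 2: entity 101 changes an attribute under code 'A';
# entity 456 changes only the record with code 'B'.
day2 = [
    {"entity_key": "101", "code": "A", "attr": "y"},
    {"entity_key": "456", "code": "A", "attr": "p"},
    {"entity_key": "456", "code": "B", "attr": "q9"},
    {"entity_key": "456", "code": "C", "attr": "r"},
]

for day, staged in enumerate([day1, day2], start=1):
    load_dep_key_satellite(sat_dck, staged, load_date=day)
    load_multi_active_satellite(sat_ma, staged, load_date=day)

print(len(sat_dck))  # 6: day 1's four records plus the two changed records
print(len(sat_ma))   # 8: day 1's two sets plus both entities' full new sets
```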

Day 3

Day 3 sees even more dramatic differences between these satellite tables

Dependent-child key satellite

  • Entity key ‘101’: a new dependent-child key has arrived with a code value of ‘B’; this does not version the records under the previous dependent-child key. It simply becomes a new timeline we are now tracking for this entity and dependent-child key combination.
  • Entity key ‘456’: nothing has changed for this entity id, except that we received the same records for only two of the three members of the set. Nothing happens in the target satellite table.

Multi-active satellite

  • Entity key ‘101’: we simply version the record; however, anything with the dependent-child key value of ‘A’ is part of the old set of records, and the current set has a code value of ‘B’.
  • Entity key ‘456’: a change in the number of records in the set means we insert the staged set as the new active set for this entity.

As for the timelines, let’s plot these here:

dep-key satellite vs multi-active satellite

This is the essential difference between the satellite tables: tracking to a sub-category of the parent entity versus tracking changes in the active set of records for the parent entity, whether that parent is a business object (hub table) or a relationship (link table). More advanced scenarios are detailed in ‘The Data Vault Guru’ book, such as:

  • Dealing with Time-Blobs, when your source application has used type 2 slowly changing dimensions
  • Resolving inaccurate data vault representation caused by many-to-many cardinality; yes, even after loading your data into your data vault you can in fact get inaccurate metrics when querying it
  • Solving poorly modelled, denormalised source data

On the scale of master data modelling choices, multi-active satellites sit at the extreme end of the spectrum because their use case is very specific.

Now that that is out of the way, let’s discuss some other use cases… starting with modelling batched transactions into a data vault.

Intra-day key (the right way to model transactions)

Data vault makes a distinction between near-real-time streaming events and modelling these events in a batch-type workload. This refers to the use of non-historised link tables (formerly known as transactional link tables, or t-links) or non-historised satellite tables. Neither has a record digest, because both are meant to support analytical data where the time it takes to generate a hash key and a HashDiff, and the act of checking whether the record you’re about to load is a true change, is detrimental to the use case for processing that data. Yes, this mechanism should be guaranteed upfront by the upstream technology itself… that is, ‘exactly once’ semantics, with records appearing in the correct order.
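As a minimal illustration of that insert-only behaviour (my own sketch, assuming the upstream exactly-once guarantee holds), a non-historised load is a straight append with no digest and no true-change check:

```python
def load_non_historised(target: list, staged: list, load_date: int) -> None:
    """Insert-only append: no HashDiff is computed and no existence check
    is made; correctness rests on exactly-once, in-order delivery upstream."""
    for rec in staged:
        target.append({**rec, "load_date": load_date})
```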

For batched transactions we will have multiple ‘active’ records for the parent entity, although a new transaction does not end the previous transaction. It is an atomic fact, an event or transaction, something that happened at a point in time whose metrics we want to track and perhaps aggregate for downstream use cases. The uniqueness of a transaction is usually identified by the transaction ID, or the account number along with the transaction date. For an event it is likely the event ID and the epoch timestamp of when that event happened. In a data vault, of course, you do not model such a thing as a hub_transaction to store transaction IDs; these IDs have no meaning without the parent entity or, more likely, the parent entities. Batched transactions should instead be modelled into a link-satellite table with the transaction ID defined as the dependent-child key; we call this the intra-day key, as sketched below. For a full implementation of how we can do this visit this link, “Expanding to Dimensional Models”.
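A hedged sketch of that intra-day key pattern follows (table and column names are illustrative, not from the linked article): the transaction ID acts as the dependent-child key, so uniqueness in the satellite is the parent link’s hash key plus the transaction ID, and a new transaction never ends a previous one.

```python
def load_intraday_satellite(sat: list, staged: list, load_date: int) -> None:
    """Batched transactions tracked by (link hash key, transaction ID).
    Each transaction is an atomic fact: we only guard against re-loading
    the same transaction; nothing is versioned or end-dated."""
    seen = {(r["link_key"], r["transaction_id"]) for r in sat}
    for rec in staged:
        key = (rec["link_key"], rec["transaction_id"])
        if key not in seen:
            sat.append({**rec, "load_date": load_date})
            seen.add(key)
```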

The same pattern can be used to load a data vault when records land at a higher cadence than they are processed into the data vault… on Snowflake you can implement what I called a “Kappa Vault” architecture pattern, which negates the need to define an intra-day key and uses staging to guarantee the uniqueness of records loaded to a target satellite table; for details of this pattern visit this link.
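The published pattern is behind the link above; as a loose interpretation of the staging idea only (my sketch, not the Kappa Vault implementation), staging collapses the higher-cadence landed records so that only distinct changes per key, in order, reach the target satellite:

```python
import hashlib

def dedupe_staged(landed: list, key_cols: tuple, ts_col: str) -> list:
    """Collapse a higher-cadence micro-batch: order by key and event time,
    then keep a record only when it differs from the previous record for
    the same key, so the satellite load sees unique changes."""
    def row_digest(rec: dict) -> str:
        payload = "||".join(f"{k}={v}" for k, v in sorted(rec.items()) if k != ts_col)
        return hashlib.md5(payload.encode()).hexdigest()

    ordered = sorted(landed, key=lambda r: (tuple(r[c] for c in key_cols), r[ts_col]))
    out, prev = [], {}
    for rec in ordered:
        k = tuple(rec[c] for c in key_cols)
        hd = row_digest(rec)
        if prev.get(k) != hd:
            out.append(rec)
            prev[k] = hd
    return out
```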

Wrap-up

For a single data application sourced to populate a data vault, you might be faced with the prospect of loading the dependent-child key value to multiple target satellite tables; that is OK. A few target satellite tables may not need it; for example, if that same sourced table / view contains personally identifiable data, then the target PII satellite table likely will not be tracking to the dependent-child key grain!

The above patterns also open up ways of generating new data when populating a business vault. Yes, the soft business rule transformations you’ve developed may in fact need to track their outcomes in a satellite table with a dependent-child key, or in a multi-active satellite table!

I hope that clarifies and simplifies your understanding of the different grains of multi-active records. To close off: do not default all satellite table loads to multi-active; know and understand what grain of data is being loaded, so you can manage the data at the expected grain. The full implementation, explanation and code for all these scenarios are available in ‘The Data Vault Guru’ book, https://amzn.to/3d7LsJV

The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.
