Data Vault solves Time-Crime.

Patrick Cuba
The Modern Scientist
10 min readOct 22, 2024

It’s an age-old problem when processing data into a target data model designed to track true changes: are the records being processed in the correct order? Slowly changing dimensional records depend on this order being correct; otherwise, what we report our analytics on could in fact be the incorrect state of that business object or relationship. Imagine this scenario:

Problem statement: out-of-sequence data loads break timelines

To correct this time-crime, data engineers are forced to roll back the previously loaded true changes and reprocess the records in the correct order. This might include investigating when the crime first took place and scheduling downtime so that the data engineers on call can reverse the changes and loop-load the records in the correct sequence based on an extract or load date timestamp. The downtime is compounded by downstream repercussions if information based on the compromised data was already delivered to your business users, or worse, already sent to regulators.

Data Vault does indeed have a model and engineering pattern to dynamically correct timeline violations, in the form of a data model artefact called the extended record tracking satellite (XTS). Introduced at WWDVC in 2019, XTS relies on the applied date timestamp concept (making the data vault bi-temporal) and is used to track every change record in your raw and business vault satellites, including:

  • Satellites with a dependent-child key
  • Multi-active satellites
  • Status tracking satellites
  • Record tracking satellites

Without further ado, let’s introduce the structure, modifications to orchestration and the crime scenes XTS is designed to solve.

The Extended Record Tracking Satellite (XTS) Table

Table Structure

The XTS table structure is only about tracking the metadata of the surrounding satellite tables of a hub or a link table. There is no customer data in XTS; however, we do store the record digest (HashDiff) from those satellite tables, along with which satellite that HashDiff belongs to, in a column named after that satellite table. What gives XTS its special power to correct timelines is that it records every occurrence of a new record staged for loading into the target satellite (the adjacent satellite table tracks true changes only). XTS will therefore have far more records than the adjacent satellite tables it tracks, but XTS is wafer thin: whilst a satellite table may have one or hundreds of columns, XTS is always the same width with the same number of columns. Consider too that if XTS is stored on a columnar database, its compression will be optimised as well.
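As a concrete illustration of how a record digest is typically derived, here is a minimal Python sketch; the `||` delimiter, the trim/upper-case standardisation, and SHA-256 are illustrative assumptions rather than a mandated standard:

```python
import hashlib

def hashdiff(row: dict, columns: list) -> str:
    """Record digest (HashDiff) over a row's descriptive columns.

    The '||' delimiter, trim/upper-case standardisation and SHA-256
    are illustrative choices, not a prescribed standard.
    """
    payload = "||".join(str(row.get(col, "")).strip().upper() for col in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the digest is standardised before hashing, cosmetic differences (casing, padding) do not register as true changes.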

Now compare the above XTS table structure to the satellite table structure below and you’ll notice that most of the columns are the same except those tracking the state of the satellite table’s records. There is an additional column in the satellite table tracking whether the record loaded into that satellite was the result of a regular ‘INSERT’ SQL operation or a ‘COPY’ SQL operation (psst… a single SQL statement handles both!).
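To make that comparison concrete, the two layouts can be sketched as Python dataclasses; every column name below is an illustrative assumption, not a fixed standard:

```python
from dataclasses import dataclass
from datetime import datetime

# Column names below are illustrative assumptions, not a fixed standard.

@dataclass
class XtsRecord:
    dv_hashkey: str           # surrogate hash key of the parent hub or link
    dv_appliedts: datetime    # business-effective (applied) date timestamp
    dv_loadts: datetime       # when the record landed in the vault
    dv_satellite_name: str    # which adjacent satellite the digest belongs to
    dv_hashdiff: str          # record digest (HashDiff) of the staged row

@dataclass
class SatelliteRecord:
    dv_hashkey: str
    dv_appliedts: datetime
    dv_loadts: datetime
    dv_hashdiff: str
    dv_load_operation: str    # 'INSERT' (new delta) or 'COPY' (timeline correction)
    # ...plus anywhere from one to hundreds of descriptive business columns
```

Note that XtsRecord is fixed-width regardless of how wide the adjacent satellite grows.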

There are typically five scenarios with respect to detecting true changes when the staged delta record arrives out of sequence:

Colours depicting True Changes

Crime Scenarios

To visualise the true changes and the effect of a late-arriving record (a delta loading out of sequence), we will depict the first three scenarios, where no correction occurs, and the last two, where corrections must occur. Keep in mind that the same single SQL statement is used for all five scenarios: XTS tracks every time it sees a new HashDiff, while the satellite table only stores the record if it is a true change. All five scenarios begin in the same manner on Day 1…

Day 1 for all scenarios

Crime scenario 1: Entity state never changes

Every delta record is the same at every load

Scenario 1, Day 2 … deltas are the same

Day 2: No insert is executed because the staged HashDiff is the same as the incumbent record in the satellite table

Scenario 1, Day 3 … deltas are the same

Day 3: No insert is executed because the staged HashDiff is the same as the incumbent record in the satellite table

Crime scenario 2: Entity state changes at every load

Every delta record is different at every load

Scenario 2, Day 2 … all deltas are different

Day 2: We insert the delta record into the target satellite table because the HashDiff is different

Scenario 2, Day 3 … all deltas are different

Day 3: We insert the delta record into the target satellite table because the HashDiff is different

Crime scenario 3: Entity state changed later

Delta record is the same as the older record

Scenario 3, Day 2 … delta differs

Day 2: We insert the delta record into the target satellite table because the HashDiff is different

Scenario 3, Day 3 … delta is the same

Day 3: No insert is executed because the staged HashDiff is the same as the incumbent record in the satellite table

Crime scenario 4: Entity state is corrected

Delta record invalidates the state of the entity

Scenario 4, Day 2 … delta is the same

Day 2: No insert is executed because the staged HashDiff is the same as the incumbent record in the satellite table

Scenario 4, Day 3 … delta differs, causes a timeline error, XTS auto-corrects the timeline

Day 3: We insert the delta record into the target satellite table because the HashDiff is different; however, in doing so the current state of the business object becomes incorrect. We must copy the previous state of the business object and insert that record to correct the timeline. We do this with the help of XTS.

Crime scenario 5: Entity state changed earlier

Delta record shows that the new state happened earlier

Scenario 5, Day 2 … delta differs

Day 2: We insert the delta record into the target satellite table because the HashDiff is different

Scenario 5, Day 3 … delta occurred earlier, we must insert but now we have duplicates

Day 3: We insert the delta record into the target satellite table because the HashDiff is different; we now have a duplicate record in the target satellite table. If you are using point-in-time (PIT) tables you will not see a duplicate record; alternatively, you might use information mart build logic that dynamically condenses duplicate records in an underlying satellite table (you would have done the same thing if you were selecting only some attributes from a satellite table for a query that needs historical context).
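Such condensing logic can be sketched as a simple pass over the applied-date timeline, collapsing consecutive records with an identical digest and keeping the earliest. This is a hypothetical helper illustrating the idea, not a prescribed implementation:

```python
def condense(timeline):
    """Collapse consecutive records with the same digest, keeping the earliest.

    `timeline` is a list of (applied_ts, hashdiff) tuples already sorted
    by applied_ts; an illustrative sketch of mart-build condensing logic.
    """
    out = []
    for ts, h in timeline:
        if not out or out[-1][1] != h:
            out.append((ts, h))
    return out

# Scenario 5 leaves the same state twice on the applied-date timeline:
condense([(1, "A"), (2, "B"), (3, "B")])  # -> [(1, "A"), (2, "B")]
```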

The net effect of the timelines and their respective corrections for each scenario:
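All five scenarios can be simulated with a small Python sketch. This is a toy model of the behaviour described above, not the production SQL: XTS records every staged occurrence, the satellite stores only true changes, and a late insert triggers a 'COPY' of a shadowed later state (scenario 4), while scenario 5 simply leaves duplicates for downstream condensing:

```python
from bisect import insort

class Vault:
    """Toy simulation of XTS-driven timeline correction (illustrative only)."""

    def __init__(self):
        self.xts = []        # every staged occurrence: (applied_ts, hashdiff)
        self.satellite = []  # true changes only: (applied_ts, hashdiff, operation)

    @staticmethod
    def _state_before(ts, timeline):
        prior = [r for r in timeline if r[0] < ts]
        return prior[-1][1] if prior else None

    def load(self, applied_ts, hashdiff):
        insort(self.xts, (applied_ts, hashdiff))
        # A true change is any record whose digest differs from the state
        # immediately before it on the applied-date timeline.
        if self._state_before(applied_ts, self.satellite) != hashdiff:
            insort(self.satellite, (applied_ts, hashdiff, "INSERT"))
        # Scenario 4 repair: XTS saw a later occurrence that the satellite
        # skipped as a non-change at the time; a late insert has now shadowed
        # that state, so copy it back into the satellite.
        for ts, h in self.xts:
            if ts > applied_ts and h != hashdiff:
                stored_ts = [r[0] for r in self.satellite]
                if ts not in stored_ts and self._state_before(ts, self.satellite) != h:
                    insort(self.satellite, (ts, h, "COPY"))
                break
```

For example, loading states A (day 1), A (day 2, tracked by XTS but not inserted) and then a late-arriving B between them reproduces scenario 4: B is inserted and A is copied back in to close the timeline.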

Implementation

Adapted data pipeline

For XTS to influence entity state correction, the orchestration of table loads must be altered. However, it does not matter whether XTS runs before or after the adjacent satellite table’s load, because XTS is designed to correct timelines based on the already loaded record, not the record we have just staged. Depending on your data platform, no table locking would be required either. Satellite tables are single source and XTS is multi-sourced, but because record tracking is isolated to an individual record and Data Vault 2.0 is an INSERT-ONLY modelling paradigm, there should be no race condition in this implementation.

Don’t let the loads get too dirty!

The recommended implementation is to deploy one XTS per hub and one per link table. A single XTS will track the satellite table types in both raw and business vault. Because XTS is essentially a data driven solution for tracking records without storing any master (customer) data, we can extend its functionality further to track record retention policies as well.

Record Tracking

The regular record tracking satellite table is used to track that a record existed at all; its record digest is based on the hashed value of the applied date timestamp. The extended record tracking satellite table adds columns to support dynamic timeline auto-correction (as we have just seen) and can be extended further to assist with record retention and disposal requests.
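A minimal sketch of that record tracking digest, assuming an ISO-8601 serialisation of the timestamp and SHA-256 (both the format and the algorithm here are assumptions):

```python
import hashlib
from datetime import datetime, timezone

def record_tracking_digest(applied_ts: datetime) -> str:
    """Digest for a plain record tracking satellite: hashed from the applied
    date timestamp alone. Serialisation format and hash algorithm are
    illustrative assumptions."""
    return hashlib.sha256(applied_ts.isoformat().encode("utf-8")).hexdigest()
```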

Retention State can therefore be:

  • ‘Active’ — record is active and queryable (default).
  • ‘Archived’ — record is in cold storage, and you will need to restore the record before being able to query it. Once restored the state should be set back to ‘Active’.
  • ‘Purged’ — record is purged and unrecoverable.

The above applies to both batch and streaming data loads (non-historized link and satellite tables), except that for streaming data you should consider aggregating the data you need to archive or purge.
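The retention states and the transitions the text implies can be sketched as a small Python enum; the transition table itself is an assumption drawn from the descriptions above, not a standard:

```python
from enum import Enum

class RetentionState(Enum):
    ACTIVE = "Active"      # default: record is active and queryable
    ARCHIVED = "Archived"  # cold storage: restore before querying
    PURGED = "Purged"      # unrecoverable

# Transitions implied by the descriptions above (an assumption, not a standard):
_ALLOWED = {
    RetentionState.ACTIVE: {RetentionState.ARCHIVED, RetentionState.PURGED},
    RetentionState.ARCHIVED: {RetentionState.ACTIVE, RetentionState.PURGED},
    RetentionState.PURGED: set(),  # terminal state
}

def can_transition(current: RetentionState, target: RetentionState) -> bool:
    return target in _ALLOWED[current]
```

For instance, restoring an archived record moves it back to ‘Active’, while nothing leaves the ‘Purged’ state.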

Managing identifiable data

We have previously discussed splitting personally identifiable information (PII) into its own satellite table because of the flexibility this design gives your data vault in terms of securing and deleting sensitive customer data.

To manage these records where a disposal request is made we can also do the following:

  • The target record in a satellite table containing the personally identifiable information (PII) is updated in place, with the PII values permanently obfuscated but the HashDiff left unmanipulated. This ensures that if the same record resurfaces accidentally, it will not be loaded into that target satellite table and risk exposure downstream. It also means that all other records in other data vault hub, link and satellite tables related to that customer entity are no longer identifiable, and therefore do not affect existing aggregations derived from your data warehouse. We also record that a disposal request was made by inserting the value ‘True’ into the column ‘dv_disposal_requested’ in XTS.
  • If the same record resurfaces in staging in a subsequent load, we record that fact against the same entity in XTS (which we previously recorded as disposed) by setting the column ‘dv_disposed_record_reemerged’ to ‘True’. This is an important check, designed to ensure the organization is not accidentally processing customer data it no longer should; because we record it in XTS, simple dashboards can be built to highlight these potential issues.
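The two steps above can be sketched in Python. The flag column names follow the text; the mask value and the dictionary-based row representation are illustrative assumptions:

```python
def dispose_pii(sat_row: dict, xts_row: dict, pii_columns, mask="<REDACTED>"):
    """Obfuscate PII values in place while leaving the HashDiff untouched,
    so a resurfacing duplicate is still rejected as a non-change.
    The mask value and row shape are assumptions for illustration."""
    for col in pii_columns:
        sat_row[col] = mask
    xts_row["dv_disposal_requested"] = "True"
    return sat_row, xts_row

def on_restage(staged_hashdiff: str, xts_row: dict) -> dict:
    """If a disposed record's digest reappears in staging, flag it in XTS
    so simple dashboards can surface the issue."""
    if (xts_row.get("dv_disposal_requested") == "True"
            and staged_hashdiff == xts_row["dv_hashdiff"]):
        xts_row["dv_disposed_record_reemerged"] = "True"
    return xts_row
```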

Bi-temporal Data Vault

A bi-temporal data vault ultimately means you can version your data based on the applied date timestamp, while XTS corrects the timeline when a load matches scenario 4 or 5 (if the processed correction matches by record digest, no insert is performed). To use XTS your data vault must be bi-temporal anyway; this functionality ultimately adds even more flexibility: a record loaded by applied date timestamp is versioned by the load date timestamp. A newer load date timestamp is a newer version of that same applied timestamp for a business object (or relationship in a link satellite table). Let’s illustrate what we mean.

Versioned record may cause a need for a timeline correction

Notice how in XTS we have two records for the same surrogate hash key with the same applied date timestamp, the newer record (load date timestamp is 15-Jan-2024) is the new version of that record. Scenario 1 has now become scenario 4 with the need to copy the correct state from the older record to ensure the timeline continues to be correct for this business entity.

If a record that was previously omitted suddenly appears out of sequence for the applied date timestamp of 3-Jan-2024, then to determine which scenario is in play XTS must compare against the newer record for the applied date of 5-Jan-2024; yes, the comparison must be made against the version inserted with a load date of 15-Jan-2024.

A new insert must compare against the latest version of the record in XTS

Now scenario 4 has become scenario 5; the 3-Jan-2024 applied date record has the same HashDiff as the newer record for this business object, but because this staged record indicates that the change in the business object’s state occurred earlier, the staged record must be inserted into the satellite table, carrying with it a new version date (load date timestamp).
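Picking the version to compare against can be sketched as follows: for each hash key and applied date timestamp, only the row with the newest load date timestamp is the current version. The key names are illustrative assumptions:

```python
def latest_versions(xts_rows):
    """For each (hash_key, applied_ts) pair, keep only the row with the
    newest load_ts; out-of-sequence comparisons must be made against
    these versions. Key names are illustrative assumptions."""
    best = {}
    for row in xts_rows:
        key = (row["hash_key"], row["applied_ts"])
        if key not in best or row["load_ts"] > best[key]["load_ts"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r["applied_ts"])
```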

Wrap-up

Where you will find XTS: one per hub, one per link, ensuring satellite tables always have the correct state for a given business object or relationship

XTS purely tracks metadata about your records in a repeatable and scalable manner. It does require a different automation pattern: where you might previously have had a single data source staged into a target satellite table, XTS must now be part of that mix. The benefits are there, but you will need to consider whether the extra orchestration complexity is worth the payoff.

Some organizations do face ongoing upstream orchestration problems for their batched data. This can occur for any number of reasons and is especially difficult to solve in extremely complex software environments where the cost to unravel that complexity is just too high. XTS should not be used to paper over upstream operational issues, though; time has shown that these kinds of issues only grow, and by adding this complexity to a data vault, it too will eventually become too difficult to unravel.

The full implementation, explanation and code for all these scenarios are available in ‘The Data Vault Guru book’, https://amzn.to/3d7LsJV

The views expressed in this article are my own; you should test performance before committing to this implementation. The author provides no guarantees in this regard.

