Snowflake
Published in

Snowflake

Data Vault on Snowflake: Out-of-sequence data

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. A possible integration issue is the need to deal with a batch file that arrives out of sequence. Does that mean you need to roll back the previous day’s batch data to get the data sequence in order? Does it mean that the dashboard reports need to be rolled back and the corrections explained?

Episode catalogue,

1. Immutable Store, Virtual End-Dates

2. Snowsight dashboards for Data Vault

3. Point-in-Time constructs & Join Trees

4. Querying really BIG satellite tables

5. Streams & Tasks on Views

6. Conditional Multi-Table INSERT, and where to use it

7. Row Access Policies + Multi-Tenancy

8. Hub locking on Snowflake

10. Out-of-sequence data

9. Virtual Warehouses & Charge Back

11. Handling Semi-Structured Data

A reminder of the data vault table types,

For completeness we want to consider the following variations of satellite tables as well as they too can be compromised by late-arriving or out-of-sequence batch data,

Surrounding business objects (hub tables) and transactions (link tables) you may have a need for the following satellite table types. Yes, they too could be affected by out-of-sequence batch data.

All the above satellites can be managed with the addition of the following satellite table, apart from the effectivity satellite tables (EFS).

File-based automation has a file-based extract date, that date is essentially the applied date, the date that according to the source platform all the states of the business objects that source system is automating is valid. Whether that be a snapshot or delta feed is irrelevant, the data presented to the analytics platform is the current state of that business object.

The problem may occur (for numerous reasons) when we receive state data out of sequence, or late arriving. Since Data Vault tracks changes, an out of sequence load may present some business data integrity issues. But within data vault we have the extended automation pattern to deal with this, and deal with it dynamically. This pattern is only possible because with Data Vault 2.0 we do not physicalise the end dates (as we saw in episode 1) we virtualise them. Here’s the problem scenario.

Episode 10: Out-of-sequence data

Problem scenario

Let’s use the above to acclimatise to how we will present the problem and the solution below.

On the left is the landed data, for simplicity we will track a single business key. We have already processed the first two records into the target satellite table on the right. We received a Monday record, then a Wednesday record. Because each record’s HashDiff (the record digest we use to compare new against existing records) was the same (‘Sydney’ and ‘Sydney’ on Monday and Wednesday respectively) we end up with the first occurrence of Sydney in the target satellite table only. The late arriving record is the key’s state for Tuesday, its HashDiff differs to the older record of Monday (‘Sydney’). Therefore, we must insert that record and because we have inserted that record the active state of the key is now incorrect!

Our timeline now shows ‘Brisbane’ as the active record when it should be ‘Sydney’!

  • Monday: Sydney
  • Wednesday: Sydney, no need to insert into the satellite table because it is the same as Monday!
  • Tuesday: Brisbane, late arriving record, we must insert but now the timeline is not correct!

Data Vault does have an automation pattern to deal with batch/file-based data that arrives out of sequence. With a little ingenuity we can extend the record tracking satellite artefact to track records for all satellites around a hub or link table.

Extending the record tracking satellite

A single extended record tracking satellite (XTS) will be used to manage out of sequence data for each hub and link table. Data Vault’s record tracking satellite (RTS) records hashdiffs for the applied date; we will change that to record the record digest and extend RTS to include the target satellite table name within XTS itself, denoting which adjacent satellite that hashdiff belongs to.

Column descriptions

  • Hash Key is the hash key belonging to a satellite table
  • Load date is the date the record was loaded
  • Applied date is the package of time date
  • Record Target is the name of the satellite table where the HashDiff belongs to
  • HashDiff comes from the landed data but represents the applicable record-hash digest of the adjacent satellite table, we record it in XTS for every occurrence we see that record coming in from the landed content

XTS will be thin, we will record every hashdiff that comes in from the landed data for that satellite table, even if it has not changed. The adjacent satellite table will of course contain the descriptive attributes whereas XTS table will not.

Let’s now see how XTS can help loading data vault satellite tables through 5 scenarios.

Scenarios

Scenario 1: Every delta is the same

Let’s start with an easy example!

The late record arrived, and its hashdiff is the same as the previous and next records in the timeline. We record the hashdiff in XTS and load nothing into the satellite table because the hashdiff has not changed.

Scenario 2: Every delta is different

The late arriving delta record differs to the previous record; therefore, we must insert the record. The new record does not compromise the timeline and we record the hashdiff in the XTS table and insert the record into the adjacent satellite table.

Scenario 3: Late record is the same as previous

The late arriving record has the same hashdiff as the previous record in the timeline. No insert is needed in the satellite table and the timeline remains the same. The record hashdiff is inserted into the XTS table.

Scenario 4: Late record causes timeline issue

The late record’s hashdiff differs from the previous record’s hashdiff in the timeline, we must insert this delta. However, because we have done so, the timeline now appears incorrect. This is the problem scenario described earlier. We must now copy the previous record (Monday) in the satellite table and insert it as Wednesday’s record into the satellite table with the descriptive details from Monday and therefore correcting the timeline. Note that the virtual end-dates are naturally correct based on the physical table underneath. If the end dates were physicalised we would have to resort to running SQL UPDATEs on the table and churning more Snowflake micro-partitions then needed! Because we have stuck to the INSERT-ONLY paradigm this pattern can deal with any velocity of table loads elegantly.

And finally…

Scenario 5: Delta happened earlier

In this scenario the late arriving record must be inserted because Wednesday’s event/state occurred earlier, it happened on Tuesday. You will end up with a duplicate record in the satellite table but now the timeline is correct. Is the integrity of the satellite table now broken? You could argue no, because you are using the point-in-time (PIT) and bridge tables (episode 3) to fetch the single record applicable at a snapshot date, those query assistance tables will pick one record or the other based on that snapshot date.

For this and the above scenarios it does mean that when you have a correction event (scenario 4) you will likely need to rebuild your PIT and Bridge tables and the views based on those query assistance tables does not need any update at all! Remember, in Data Vault query assistance tables and information marts are disposable. That’s what sets them apart of the auditable raw and business vaults!

Orchestration

Alas! Orchestration is vital for making this pattern a success!

An adjacent satellite table can be updated before or after XTS has been updated with the same delta within a batch run. Correcting the timeline is about the record before and after the delta and not about the current delta whether it was or was not loaded into XTS.

Because of Snowflake READ COMMITTED transaction isolation level you do not need to lock the central XTS table for updating or reading — we discussed this point for hub table locking in episode 8. Remember, a raw vault satellite table is single source, therefore you will not have contention in XTS for a raw vault satellite table and therefore you could have as many threads as you like using and updating a common XTS table concurrently and without contention!

XTS influences satellite table loads

Conclusions

We have presented a data-driven and dynamic pattern for Data Vault satellite tables to absorb whatever you throw at it. Ultimately, out of sequence data can be a pain and cause delays, reloads and erroneous reporting from your data platform and this pattern can certainly help in alleviating this issue. Every out of sequence event should be recorded even if it were corrected so we can use that information to solve technical debt upstream. An incorrect state of a business object may have already been reported on via a dashboard or extract before we could have corrected the timeline and therefore it is paramount that you look to the root cause to your automation issues! Ultimately, if out-of-sequence data does not occur, even this dynamic data-driven approach can be turned off and the cost of analytics reduced! In data vault we advocate to push tech debt upstream but realise that some circumstances are unavoidable. XTS provides the dynamic method to absorb that pain.

Until next time!

Reference:

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store