Published in Snowflake

Data Vault on Snowflake: Immutable Storage, Virtual End-Dates

Snowflake continues to set the standard for data in the cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. In this and subsequent episodes we will discuss Snowflake features you should consider that will enable your Data Vault to scale as dynamically as Snowflake scales.

Over the next few months, we will release episodes to cover:

  1. Immutable Store, Virtual End-Dates
  2. Snowsight dashboards for Data Vault
  3. Point-in-Time constructs & Join Trees
  4. Querying really BIG satellite tables
  5. Streams & Tasks on Views
  6. Conditional Multi-Table INSERT, and where to use it
  7. Row Access Policies + Multi-Tenancy
  8. Hub locking on Snowflake
  9. Virtual Warehouses & Charge Back

A reminder of the Data Vault table types.

Immutable versus mutable is the difference between an object that cannot be changed and one that can. In Snowflake, the compressed and encrypted 16MB micro-partitions are immutable, and they make up Snowflake's mutable tables. Micro-partitions are immutable files that are not visible to the Snowflake user; rather, the table acts as a container of micro-partitions, and that is what the user interacts with. Some of those interactions are:

  • An SQL INSERT operation loads new records as new micro-partitions
  • An SQL DELETE operation commits records, and the micro-partitions that hold them, to Time-Travel
  • An SQL UPDATE operation inserts the new record state into the table and commits the old state of that record to Time-Travel

More on time-travel in a bit…
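To make the three interactions concrete, here is a minimal sketch; the table and column names are hypothetical, not from the article:

```sql
-- Illustrative satellite-like table; names are hypothetical
CREATE TABLE sat_customer (
    dv_hashkey_customer CHAR(32),
    start_date          DATE,
    customer_status     VARCHAR(20)
);

-- INSERT: new records land in newly written micro-partitions
INSERT INTO sat_customer VALUES ('3', '2022-01-01', 'ACTIVE');

-- UPDATE: the new record state is written to new micro-partitions;
-- the old micro-partition versions are committed to Time-Travel
UPDATE sat_customer SET customer_status = 'CLOSED'
 WHERE dv_hashkey_customer = '3';

-- DELETE: records disappear from the table, but their
-- micro-partitions remain recoverable via Time-Travel
DELETE FROM sat_customer WHERE dv_hashkey_customer = '3';
```

In each case the SQL looks like an ordinary row operation, but under the covers Snowflake only ever writes new micro-partitions and retires old ones to Time-Travel.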

Simplified data representation; the hash key is a deterministic digest of the parent key

A Data Vault satellite table contains the descriptive state of a business object (based on a hub table) or the descriptive state of a unit of work (based on a link table). A satellite table resembles a Kimball Slowly Changing Dimension (SCD) Type 2, complete with start and end dates, and each SQL UPDATE by the parent key (hash key) will in fact write two records into the table: the re-written old record with its new end date, and the new current record. As demonstrated above, hash keys 3, 5 and 6 have updates to an existing record, and therefore we need the new state of these records to reflect what they look like once the SQL UPDATE has completed. And remember, micro-partitions are immutable.

An SQL SELECT query before the update on January 4th returns:

  • HashKey 1, StartDate: 01-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 2, StartDate: 01-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 3, StartDate: 01-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 4, StartDate: 02-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 5, StartDate: 02-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 6, StartDate: 03-JAN-2022, EndDate: 31-DEC-9999
Satellite table as a Snowflake Base Table with Time-Travel and Fail-Safe

At rest, the pre- and post-UPDATE records exist on disk at the same time; however, the pre-UPDATE records are only available through Time-Travel. Time-Travel is not the same as an SCD Type 2 dimension; think of it instead as a live backup of the satellite table. Time-Travel can be set from 0 to 90 days, and records are retrievable using Snowflake's extended SQL Time-Travel syntax. Once records fall out of the Time-Travel period, they move into the 7-day Fail-safe period and are only retrievable by contacting Snowflake support. Beyond those seven days the records are purged from Snowflake and are unretrievable.

  • Permanent tables can have Time-Travel set from 0–90 days; the 7-day Fail-safe period cannot be changed.
  • Transient tables can have 0–1 days of Time-Travel and do not have Fail-safe.
Time-Travel is a live backup of your data that you configure using Snowflake's extended SQL
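A brief sketch of that extended syntax, again with an illustrative table name; the `AT` clause and the `DATA_RETENTION_TIME_IN_DAYS` parameter are standard Snowflake SQL:

```sql
-- Query the current state of the satellite table
SELECT * FROM sat_customer;

-- Query the table as it looked at an earlier point in time
SELECT * FROM sat_customer
  AT(TIMESTAMP => '2022-01-04 00:00:00'::TIMESTAMP_NTZ);

-- Or relative to the current time (here: 86,400 seconds = one day ago)
SELECT * FROM sat_customer AT(OFFSET => -86400);

-- Time-Travel retention is set per table (0-90 days for permanent tables)
ALTER TABLE sat_customer SET DATA_RETENTION_TIME_IN_DAYS = 30;
```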

Querying the table after the update returns the current state of the records within that table; querying with Time-Travel returns the previous state of the table. Note the storage needed for hash keys 3, 5 and 6. An SQL SELECT query on January 18th therefore returns:

  • HashKey 1, StartDate: 01-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 2, StartDate: 01-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 3, StartDate: 01-JAN-2022, EndDate: 15-JAN-2022
  • HashKey 3, StartDate: 16-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 4, StartDate: 02-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 5, StartDate: 02-JAN-2022, EndDate: 16-JAN-2022
  • HashKey 5, StartDate: 17-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 6, StartDate: 03-JAN-2022, EndDate: 16-JAN-2022
  • HashKey 6, StartDate: 17-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 7, StartDate: 16-JAN-2022, EndDate: 31-DEC-9999
  • HashKey 8, StartDate: 16-JAN-2022, EndDate: 31-DEC-9999
SELECT Query using Time Travel
Data Vault 2.0 satellite tables have no END-DATE column

Data Vault 2.0 recognised the expense of executing SQL UPDATE operations some years ago, and with the advancement of the SQL language itself, Data Vault 2.0 as a practice has deprecated the END-DATE column on the satellite table. Using the SQL LEAD() analytic function, the end date of a parent key is now virtualised by defining an SQL VIEW over the satellite table, right after that table has been created.

LEAD() analytical function
Create View as Select from Table
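A minimal sketch of such a view, assuming an illustrative satellite table (the names are hypothetical, not from the article):

```sql
-- Virtual end-dates: no END-DATE column is stored in the satellite
CREATE OR REPLACE VIEW vw_sat_customer AS
SELECT
    dv_hashkey_customer,
    dv_loaddate AS start_date,
    -- The end-date is the next load date for the same parent key;
    -- the current (open) record defaults to the high-date
    LEAD(dv_loaddate, 1, '9999-12-31'::TIMESTAMP_NTZ)
        OVER (PARTITION BY dv_hashkey_customer
              ORDER BY dv_loaddate) AS end_date,
    customer_status
FROM sat_customer;
```

Consumers query the view exactly as they would an end-dated SCD Type 2 table, while the underlying satellite is only ever inserted into.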

Because loading the data now requires only SQL INSERT operations and no SQL UPDATEs, table updates are only ever inserts of new records. This also means it does not matter at what velocity the data arrives: Snowflake is not rapidly (or slowly) churning new micro-partitions trying to keep the table up to date, nor exploding the number of micro-partitions held in Time-Travel and Fail-safe storage, as it would with a high-churn table. It also means the Snowflake table does not need column clustering set; the satellite table will naturally cluster by START-DATE. Most analytics is based on the current date by parent key anyway, so there is no need to recluster the table at all.

Cloning and Time-Travel

Zero-copy clone

Finally, a clone of a satellite table can be created across environments within a Snowflake account. A clone is a snapshot of the table at a point in time: any new data added to PROD is not accessible in DEV (where the clone exists), and any new data added to DEV is not accessible in PROD. This Snowflake-patented capability accelerates your DevOps processes by making instant copies of production-quality data available to develop on; no storage is replicated, only metadata.

Cloning can be extended to entire schemas or even entire databases, and you can even perform a schema swap for blue/green testing!

For example:

STEP 1: Clone schema to a DEV schema, perform changes

STEP 1: Clone

STEP 2: Catch up data loads to Dev clone

STEP 2: Catch-up

STEP 3: Swap Dev and Prod Schema

Schema Swap
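The three steps above can be sketched with Snowflake's `CLONE` and `SWAP WITH` syntax; the schema names are illustrative:

```sql
-- STEP 1: clone the production schema (metadata-only, no storage copied)
CREATE SCHEMA dev_datavault CLONE prod_datavault;

-- STEP 2: run catch-up loads and develop against the DEV clone;
-- changes in either schema are invisible to the other

-- STEP 3: blue/green cutover by swapping the two schemas
ALTER SCHEMA prod_datavault SWAP WITH dev_datavault;
```

The swap is a metadata operation, so the cutover is near-instant and the old production schema remains available under the DEV name if you need to roll back.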

In Summary

Separating compute and storage and managing micro-partitions entirely through metadata accelerates your DevOps processes, and because Data Vault 2.0 is an INSERT-ONLY methodology it naturally fits and scales as Snowflake scales. There is no need to track the previous state of a record via Time-Travel, because with insert-only loading none is ever pushed there; history remains in the table itself.

However, you can indeed use Snowflake’s Time-Travel feature to take a clone of your Data Vault at a snapshot point in time to accelerate your DevOps development and testing or to whatever your heart desires!

Data Vault Automation and Orchestration


The views expressed in this article are my own. You should test performance before committing to this implementation; the author provides no guarantees in this regard.
