Snowflake
Published in

Snowflake

Data Vault on Snowflake: Hub locking on Snowflake

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. There will be scenarios where you may need to consider transaction isolation and such a scenario does exist for Data Vault, the common hub table.

Episode catalogue,

10. Out-of-sequence data

9. Virtual Warehouses & Charge Back

11. Handling Semi-Structured Data

A reminder of the data vault table types,

Episode 8: Hub locking on Snowflake

We explored the concept of Passive Integration in the previous episode. This is of course the ideal situation for a Data Vault model, from a technology perspective if we are looking to integrate by and concurrently load to common hub tables per business object definition. Given that Snowflake is READ COMMITTED transaction isolation level, how then, do we guarantee that independent hub table loaders leave the target hub table with the same integrity after the load? Let’s explore how this is done in Snowflake and what this could mean to your data vault model.

Parallel loading and eventual consistency

As depicted above, any landed source data is staged and loaded to modelled hub, link, and satellite tables.

Notice how each source loads to a common hub table in the above example, they must leave the hub table in the same integrity as each hub-loader found it; a unique list of business objects defined by business key + business key collision code + multi-tenant id.

The Problem

The problem is the same as what we described in episode 6 when discussing conditional multi-table inserts; if each thread attempting to load to the same target hub table is executed at the same time, due to the nature READ COMMITTED transaction isolation each thread views the target table without UNCOMMITTED transactions coming from other threads. Observe…

The Problem

Duplicates break the integrity of the data vault model, and you will start to see information marts based on these structures starts to suffer too.

By the way, one of the tiles in our dashboard (episode 2) was checking for this issue!

The Solution

Snowflake does in fact allow for table locking and the syntax is simple, change the SQL INSERT statement into an SQL MERGE statement and the target hub table defined in the MERGE statement will be locked when it is its turn to update the target table. Of course, this implies that the hub-loader template is changed, and all configured hub-loaders will then use the SQL MERGE INTO statement instead! Observe…

The Solution

Three SQL Data Manipulation Language (DML) lock Snowflake tables for updates,

On the grand scheme for our automation the first animation can be updated to the following.

Parallel loading and eventual consistency and hub locking

Does this approach add latency? Probably, but it will be so minor because hub tables typically only process a few condensed records at a time using anti-semi joins. But the beauty of this approach is that we did not need to explicitly lock the target table using externally defined semaphores, Snowflake instead randomly decides which thread will get processed first and for hub tables, that’s just fine!

While you’re there…

SQL MERGE statements allow for SQL INSERTs and UPDATEs, it is an opportunity to explore a Data Vault artefact that was previously deprecated, the “Last Seen Date” column in the hub and link table. Because hub and link tables are wafer thin and tend to be short, the number of micro-partitions that make up these tables is therefore very small. Why not then allow for SQL UPDATEs to the hub and link table? To revisit how Snowflake stores tables as micro-partitions please visit episode 1, bit.ly/3LWR8qR

Yes, the cost of SQL UPDATEs were expensive operations, but these are true for satellite tables where we might see as little a single descriptive column to hundreds of them. SQL UPDATEs are still not recommended here! Let’s update our previous example to show where Last-Seen-Date can be useful. Observe…

The Suggestion

Other Data Vault artefacts

Certain other Data Vault artefacts were developed to track this very information because they are INSERT-ONLY table structures, these are:

Compare SAT, RTS, STS, HUB Last Seen Date

Because “Last Seen Date” only records the latest occurrence of the business object or unit-of-work it is not a reliable source to check:

Driving Key and Effectivity

If the data source does not contain a business date tracking the effectivity of the driving entity of the relationship, or you wish to track effectivity of a different driving entity than that of what is tracked in the data source then the need for an Effectivity Satellite arises, see: bit.ly/3oS4k70. Yes, a LAST_SEEN_DATE column in the link table will give you what the current relationship is without needing one of the most complex data vault patterns, let’s explore by way of an example.

What you lose if you don’t use EFS

No matter what the driving key is when utilising the LAST_SEEN_DATE you will get the current active relationship for that driving key/relationship. It also does not require that you deploy multiple effectivity satellites for each driving key you want to track on a single link table. However, you will not be able to trace the historical movement of that driving to non-driving key relationship, that is the exclusive realm of the Effectivity Satellite unless (again) the source system provides this movement.

To lock or not to lock

Always test these scenarios for yourself! The idea behind this technique is to take advantage of what Snowflake technology provides and keeping Data Vault automation still single purpose.

As quoted from the Zen of Python: “There should be one — and preferably only one — obvious way to do it.”

Leverage Snowflake technology to ensure you meet your automation goals.

Reference:

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store