Snowflake
Published in

Snowflake

Data Vault on Snowflake: Conditional Multi-Table INSERT, and where to use it

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. In today’s episode we will explore the appropriate place to use a Conditional Multi-Table INSERT for your Data Vault, and where not to use it!

Episode catalogue,

10. Out-of-sequence data

9. Virtual Warehouses & Charge Back

11. Handling Semi-Structured Data

A reminder of the data vault table types,

We will again be using the PIT query assistance table in our Data Vault implementation.

Episode 6: Conditional Multi-Table INSERT, and where to use it

Snowflake’s SQL multi-table insert allows for you to insert data into multiple target tables in a single SQL statement in parallel from a single data source. The SQL conditional multi-table insert extends that capability by allowing for WHEN conditions to be configured for each of the target tables based on the content of the source data. Think of it like a CASE… WHEN condition you’d see in an SQL SELECT statement that executes IF…THEN…ELSE logic, except in this case 😊 the conditional output is defining the target table to send data content to!

Let’s see what that looks like:

from Snowflake docs, see bit.ly/3h97Ypz

In a Data Vault context there is a smart way to use the conditional multi-table insert and make it completely business driven. I’ll show you what I mean shortly, but first…

Where it should not be used

As data is landed before the data platform in a landing zone the content is staged by applying hard rules like trimming columns and ensuring the timestamps align. Data Vault staging will also add data vault tag columns, such as

· Surrogate hash keys — destined to be loaded to hub tables as hub-hash-keys, and if a relationship in the data exists, will also be destined to be loaded to link tables as link-hash-keys

Here it might be tempting to use this staged content as a base for a multi-table insert

a simple model, a business object, and its descriptive attributes

Table names and content have been simplified for clarity

SELECT DISTINCT will eliminate duplicates in the staged content and the data vault model integrity remains valid. In reality, data does not arrive in such perfect harmony, source-system data is profiled and will be mapped to hub, link, and satellite tables but what if we have modelled the following scenario,

a more realistic model, what does it mean for a multi-table insert?
multi-table insert will lead to model integrity issues

Integrity Lost! Why did it happen?

There are essentially two problems highlighted with using a multi-table insert here:

1. Record condensing

Data is profiled and modelled to specific target table grain. Our example identified a relation between two key columns that represent a linking of cards, since both are essentially card numbers they will be loaded to the same hub, hub_account (Hub1). A 3rd portion of the relationship exists and that is with the customer, those keys are loaded to hub_customer (Hub2). We also applied some satellite splitting because not all the content is about the related cards. Since this is coming from a single staged file the content used as a base for a multi-table insert statement must use a single SELECT statement over this data. A single SELECT DISTINCT does not distinguish which portions of the same file to apply this rule to, it applies to all and hence why the hub-satellite content is not deduped.

2. Multi-table insert threads are executed in parallel

This means if we were loading to the same hub table at the same time each thread executing the same load condition is not aware of the other thread’s execution condition even if it’s coming from the same SQL statement.

Remember, Snowflake uses a READ COMITTED transaction isolation level and thus an INSERT statement does not lock the target table for inserts and each executing thread will see the table without the other thread’s uncommitted statements. In other words, the parallel execution of the multiple threads in a multi-table insert sees the same target table state with each thread unaware that another thread is attempting to load to the same target table with the same load conditions with potentially overlapping content!

independent, configured hub, link and sat loaders

READ COMITTED transaction isolation is excellent for keeping the platform scalable and allowing for the extreme levels of concurrency possible on Snowflake by not risking object locking and possible race conditions! In a future episode we will discuss how to do table locking on Snowflake and where we will need it.

You could imagine that the above scenarios could be coded for using multi-table inserts, and you could design and build a conditional ingestion pattern that if you encounter this modelling scenario then use Pattern A else Pattern B and so on. This code suite will become monolithic and tedious to maintain! Just how many scenarios would you have to cater for?

The essence of an architecture approach that uses repeatable patterns does not have such switch conditions to load the same table type. Rather there should be only one way to load a hub, a link, or a satellite table and the automation of these patterns reuses those simple building blocks through parameterisation to deploy this as an idempotent technique, like service-orient architecture. No matter the data vault model, you simply configure the hub, link, and satellite loaders as such.

Where can we use multi-table inserts? Well…

Where you could use it,

Let’s discuss the components of the above architecture

· the example data model in the above animation contains a hub and two satellite tables and data are constantly being loaded to those tables using independent and parameterised hub and sat-loaders respectively — the loaders are executed in parallel. PIT flow manifold in the other hand can be executed in one of two ways:

1. as PIT table rebuilds — a window is selected as a time period for reporting, if we were loading for a period window from 1 Jan 22 to 30 Jun 22 and when parsed through the manifold the as_of table’s flags control the flow of surrogate hash keys and load dates directed towards the logarithmic PIT table constructs. For our illustration above, let’s make a simple example without showing table growth. Each satellite gets 1000 records a day and only deltas are loaded, so if we attempted to load 1000 records a day for the month of January, we will likely have less than 31,000 records in the satellite but 31,000 recorded keys and load dates in the daily-PIT table. Here’s a summary for each PIT table.

Remember, satellite tables only load changes to the previous current state, so if for a consecutive day the record state hasn’t changed then that record is not loaded.

2. as a part of the data pipeline — yes, you could be running the flow through the PIT flow manifold daily and incrementally publish keys and load dates to the logarithmic PIT tables. The PIT window in this case is the single date this publishing of keys and load dates is run on. If the number of hub business keys remains consistent as they do in the above example, then you will expect the same record count as the above table. If (as a prospering business does) the hub tables are growing, then the record counts will differ between this pipeline PITs and rebuilt PITs.

In both implementation types the code is identical, the variance is only in the selection of a PIT-window.

How do we make this business driven? Deploying the switchboard as reference table controlled by reporting needs of the business. The code for the PIT flow manifold is deployed once and it sends keys and load dates to the target PIT tables controlled by the switchboard! The animation above was purely for illustration but essentially you can design the as_of switchboard table to include switches for

Observe…

my_loaddate is the PIT Window, each PIT output is logarithmic

For PIT table rebuilds change the PIT window to a start and end date

Query assistance tables (PITs & Bridges) are disposable and only used to store keys and very light derived content — the type that does not need to be stored permanently because the metrics used for this calculation are stored in the raw and business vault. Because these constructs use inherent data platform techniques to boost query performance and simplify table join the information marts are intended to be deployed as views. Multi-table inserts is just another technique we can use in Snowflake to simplify our Data Vault deployment even further.

Other performance tips:

Reference:

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store