Data Vault on Snowflake: How to Twine on Snowflake

Patrick Cuba
The Modern Scientist
11 min read · Aug 27, 2024

To begin with, what do we mean by twine? In the data modelling community, we refer to twine as the technique of intertwining two or more datasets along a common key and timestamp. In other words, we aim to enrich time-variant data with the correct version of another time-variant dataset.

Sounds like an SQL join between dimensional facts and dimensions. Yes and no.

Although you can use twine to join multiple dimension tables around a fact table, Snowflake will choose a different query plan to that of a hash join (build and probe), and that's because twining is better suited to data engineering workloads dealing with trillions of records than to querying the already loaded dimensions and facts that traditional dimensional models are based on.

In this article we'll present use cases for when a regular SQL join is the right choice, when twining is better suited, and which data vault structures each approach fits.

Twine is a thread composed of two or more smaller strands twisted together.

Episode 24: When to Twine in a Data Vault

As varied as customer workloads are, Snowflake will decipher what the correct query plan should be. You can convince Snowflake to take a different query plan depending on the data model and the SQL syntax you use. Snowflake keeps track of table statistics for all tables (there's no need to update them yourself), and you can see the effect of those statistics in query pruning and in the query plan Snowflake executes. Sometimes different SQL syntax will produce the same query result yet execute a different query plan without needing to explicitly push SQL hints to Snowflake. Sometimes, even if the SQL syntax is the same, a different query plan is executed because the table statistics used in the query have changed.

Query plan selection has a profound effect on the performance of your workload on any database, and thus it is important to understand when to select one query plan over another. Snowflake will get this right most of the time on your behalf; but sometimes you must be more explicit about the instructions you send so that your Snowflake resources are used efficiently.

Data Vault is a discipline and methodology that supports both batch and near-real-time streaming semantics in its data modelling patterns. We have previously shown modelling techniques to optimize your batch and near-real-time streaming loading, as well as techniques for getting data out of a data vault efficiently as repeatable patterns, with an emphasis on Snowflake's query plan.

For loading data into a Data Vault we discussed,

For querying to get data out of a Data Vault we discussed,

Pay close attention to the first two articles on Data Vault querying; we will discuss terms from those articles that are relevant to this one.

Let’s kick-off by executing some dimensional modelling queries…

Facts & Dimensions

Fact tables are typically larger than the dimension tables they are related to. Whilst the attributes of a business entity change slowly, the metrics, events and activities those business entities participate in, together with other business entities, typically update often; from here on we will refer to them as facts. Aggregate the facts related to a business entity and they can change the state of that business entity (much like individual transactions add up to a balance). Here is a breakdown between what may be considered slowly changing attributes and fast, additive facts.

Streaming data has slowly changing 'dimensions' in interval tables

Traditional Kimball modelling tracks slowly changing attributes of a business entity differently from fast-changing facts. By far the most common method to track slowly changing dimensions is end-dating (versioning) each record that is no longer the current state of that business entity (aka slowly changing dimension type 2, or SCD Type 2). Facts remain insert-only; we load what we get either at the finest grain (the lowest level of aggregation) or as aggregated summaries by a period, for example as monthly balances or accumulated running totals. The difference matters because your facts can be fully additive, semi-additive or non-additive across time.

Examples of fact types and the different grain application, more complex business rules are in play when the fact type is not additive, and the grain is not at a transaction level.
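
As a minimal sketch of that end-dating mechanic (the dim_customer table and its columns are illustrative, not part of the article's model), an SCD Type 2 change could look like this:

-- close off the current version of the business entity...
update dim_customer
set valid_to = current_timestamp(), is_current = false
where customer_id = 1001
and is_current = true;

-- ...and insert the new version; facts, by contrast, are simply inserted as they arrive
insert into dim_customer (customer_id, customer_address, valid_from, valid_to, is_current)
values (1001, '12 New Street', current_timestamp(), null, true);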

To join facts to dimensions efficiently we use a temporal surrogate key that increments for every record inserted into a dimension table; we use that temporal column to join to the applicable fact grain. This ensures you only ever need one numerical column to join between fact and dimension, and you can slice and dice the facts and dimensions using the date dimension. Pretty standard!
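
A minimal sketch of that surrogate-key join (fact_transactions, dim_account and dim_date, together with their _sk columns, are illustrative names, not from the article's model):

-- one numerical surrogate key per dimension; slice and dice via the date dimension
select d.account_name
, dd.calendar_month
, sum(f.transaction_amount) as monthly_amount
from fact_transactions f
inner join dim_account d
on f.account_sk = d.account_sk
inner join dim_date dd
on f.date_sk = dd.date_sk
group by d.account_name, dd.calendar_month;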

OLAP database platforms like Snowflake use a built-in algorithm called the hash join to perform the SQL join between facts and dimensions. Because Snowflake table statistics are always up to date, Snowflake will recognize the type of SQL join you're instructing it to perform (a large middle table surrounded by smaller adjacent tables) and select the Right-Deep Join Tree query plan.

When we build a bridge table over a data vault, as we did in this article, it is essentially a fact table with additional calculated metrics and the temporal surrogate keys from those data vault satellite tables we repurposed as dimension tables. No data movement for those dimension tables, and the fact table is disposable.

Always check the query plan! Bridge building used a Right-Deep Join Tree and cost 19 seconds; bridge querying cost a further 1 minute 29 seconds! Lesson: grab all the dimensional columns you need in the build!

In bridge building you could end your query right there; with supporting facts and dimensions you can serve information delivery and business intelligence tools that expect Kimball models rather than data vault artefacts (especially tools that do not support binary surrogate hash keys).

Snowflake will hash every record in the adjacent (dimension) tables in the join through a Build phase and return every match from the middle (fact) table through a Probe phase. This is called Star-Join optimization, and it is the same technique we discussed in the construction of a SnoPIT for Data Vault. SnoPIT consistently beats regular Data Vault PIT table query performance because it reduces the network traffic between storage and compute whilst still producing the same result. And it does not break the Data Vault 2.1 standards!

-- join pattern between the repurposed fact (sat_link) and dimension (D1-D4) satellites;
-- the selected columns are illustrative
select hub_a.account_id
, sat_link.dv_applieddate
, d1.*
from datavault.lnk_account_customer lnk
inner join datavault.hub_account hub_a
on lnk.dv_hashkey_hub_account = hub_a.dv_hashkey_hub_account

-- chief (fact) table
inner join datavault.sat_card_transactionfile sat_link
on (lnk.dv_hashkey_lnk_account_customer = sat_link.dv_hashkey_lnk_account_customer)

-- Dimension date D1-D4 (repeat the pattern for D2 to D4)
left join datavault.sat_card_masterfile d1
on lnk.dv_hashkey_hub_account = d1.dv_hashkey_hub_account
and d1.dv_applieddate <= sat_link.dv_applieddate
and d1.dv_applieddate_end >= sat_link.dv_applieddate_end;

The Laws of Physics Apply

Star-join optimization performance has a catch: it is extremely performant for data at rest (files). Star-join optimization is an algorithm that makes efficient use of memory to optimise query performance for data models structured as a star.

  • What happens when our workload exceeds trillions of records that need to be processed near real-time?
  • Do you have the option to model your data into a star before taking advantage of star-join optimizations?
  • What if your data vault is not conveniently structured to optimally utilise star-join optimization?

With time-series data you often do not have the option to carefully model your data in a convenient way, and often the analytical need is closer to real-time.

How Twine is different

Instead of joining two or more tables horizontally, we need to pivot our data processing to join our tables vertically. The only common SQL instruction that fits that mode of processing is to execute a UNION ALL instead of an SQL JOIN when it comes to batch processing.

Pivot your ideas! Twine can gather multiple timepoints from multiple fact tables too!

Twine describes the method of taking one or more tables and applying their time-bounded records to the applicable records in another time-variant dataset, to twine (as described by Lars Rönnbäck in this white paper). The query plan will not be a Right-Deep Join Tree but a Left-Deep Join Tree, otherwise known as a nested-loop join.

Let’s take the data vault we developed in this article and use Twine to combine the following tables:

  • sat_card_transactionfile — batched transactions (facts / metrics)
  • sat_card_masterfile, sat_card_balancecategories, sat_card_transaction_header, sat_bv_cardsummary — dimension table equivalents.

select in_effect.id
, in_effect.${fact_columns}
, d1.${dimension_columns} -- repeat for D2 to D4
from (select twine.id
, twine.dv_hashkey_hub_account
, twine.timepoint
, twine.timeline

-- Nearest Dimension Date, D1 to D4
, max(case when twine.timeline = 'D1' then twine.timepoint end) over (
partition by twine.id order by twine.timepoint) as d1_dim_ValidFrom

, twine.${fact_columns}
from (-- Fact table
select h.account_id as id
, h.dv_hashkey_hub_account
, f.dv_applieddate as timepoint
, 'F' as timeline
, f.${fact_columns}
from datavault.sat_card_transactionfile f
inner join datavault.lnk_account_customer l
on f.dv_hashkey_lnk_account_customer = l.dv_hashkey_lnk_account_customer
inner join datavault.hub_account h
on l.dv_hashkey_hub_account = h.dv_hashkey_hub_account

union all -- repeat for each dimension

-- Dimension date D1-D4
select h.account_id as id
, d1.dv_hashkey_hub_account
, d1.dv_applieddate as timepoint
, 'D1' as timeline
, null as ${fact_columns}
from datavault.sat_card_masterfile d1
inner join datavault.hub_account h
on d1.dv_hashkey_hub_account = h.dv_hashkey_hub_account
) twine
) in_effect

-- Left Join Dimension tables D1-D4
left join datavault.sat_card_masterfile d1
on in_effect.dv_hashkey_hub_account = d1.dv_hashkey_hub_account
and in_effect.d1_dim_ValidFrom = d1.dv_applieddate
where in_effect.timeline = 'F';

Instead of joining tables by a common key, a UNION ALL stacks each table in the query on top of the others; the order does not matter. What matters is, for a single record, the maximum (i.e. nearest) date from the adjacent dimensions. We then fetch the dimension attributes we need for that nearest-date match.
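
To see the windowing at work, here is a toy version of the twine over a handful of made-up rows (the values are purely illustrative):

-- toy twine: stack 'D1' and 'F' rows, then carry the latest D1 timepoint onto each fact row
with twine (id, timeline, timepoint) as (
select * from values
(42, 'D1', '2024-01-01'::date),
(42, 'F' , '2024-01-15'::date),
(42, 'D1', '2024-02-01'::date),
(42, 'F' , '2024-02-10'::date)
)
select id
, timeline
, timepoint
, max(case when timeline = 'D1' then timepoint end) over (
partition by id order by timepoint) as d1_dim_ValidFrom
from twine
order by timepoint;
-- the 'F' rows resolve d1_dim_ValidFrom to 2024-01-01 and 2024-02-01 respectively,
-- which is exactly the D1 version the outer left join will fetch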

This would work equally well even if we were combining fact tables around the nearest dates. PIT tables function in a similar way, except we base the snapshot dates on the intentional application of a cartesian product. We turned the PIT code around, improved it by over 100%, and used it in a Dynamic Table in this article.

What it is to twine… achieves a Right-Deep Join Tree on the Twine result to bring the attributes nearest to the Fact timeline, and a saving of 2 seconds!

Star-join queries are optimal for getting information out quickly from data models modelled into facts and dimensions; they are not particularly useful for merging trillions of records with the dimensions applicable to those metrics at a point in time, and worse still if that table is in motion (streaming). Instead, to efficiently bring this data together like a zipper (as my colleague Greg Pavlik calls it) we need to twine. Snowflake provides a simpler SQL abstraction for achieving this through the SQL ASOF join instruction. Observe…

select h.account_id as id
, f.${fact_columns}
, d1.${dimension_columns} -- repeat for D2 to D4
from datavault.sat_card_transactionfile f
inner join datavault.lnk_account_customer l
on f.dv_hashkey_lnk_account_customer = l.dv_hashkey_lnk_account_customer
inner join datavault.hub_account h
on l.dv_hashkey_hub_account = h.dv_hashkey_hub_account

-- ASOF Join Dimension tables D1-D4
asof join datavault.sat_card_masterfile d1
match_condition(f.dv_applieddate >= d1.dv_applieddate)
on h.dv_hashkey_hub_account = d1.dv_hashkey_hub_account;

Not only is the code simpler; the query plan, resource footprint and execution times are smaller too.

For very large fact tables, ASOF joins perform very well!

This is all well and good, and we can efficiently process trillions of records in motion far faster than we did before; but what is the value of querying trillions of metrics against a dimension at a point in time? None. Humans will not rationalise with raw facts at the lowest grain, but they will want to understand the aggregate of those metrics at a point in time, and trends as metric data is processed incrementally. Yes, ASOF joins are useful for what the microservices (Domain Driven Design) industry calls Command Query Responsibility Segregation, or CQRS. CQRS separates data by the grain likely to be of use to your business case:

  • Commands are immutable instructions (events) persisted into a Write Model.
  • A separate and mutable Read Model is used to aggregate events from the Write Model.

In Snowflake we can mimic this pattern by

  • pushing our metric data into an Insert-Only table (Write Model) from Kafka and Snowpipe Streaming; and
  • using ASOF joins in a Dynamic Table (Read Model) to aggregate our facts.

Write (non-historized) and Read (information mart), oversimplified!
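
A minimal sketch of that Write/Read split, using the article's data vault tables but with an assumed warehouse name (dv_wh), an assumed dimension attribute (card_status) and an assumed fact column (transaction_amount):

-- Read Model: a Dynamic Table that twines every streamed fact to its nearest dimension
-- version and aggregates it to a business-friendly grain (names flagged above are assumptions)
create or replace dynamic table card_daily_summary
target_lag = '5 minutes'
warehouse = dv_wh
as
select h.account_id
, d1.card_status
, date_trunc('day', f.dv_applieddate) as activity_date
, sum(f.transaction_amount) as daily_amount
from datavault.sat_card_transactionfile f
inner join datavault.lnk_account_customer l
on f.dv_hashkey_lnk_account_customer = l.dv_hashkey_lnk_account_customer
inner join datavault.hub_account h
on l.dv_hashkey_hub_account = h.dv_hashkey_hub_account
asof join datavault.sat_card_masterfile d1
match_condition(f.dv_applieddate >= d1.dv_applieddate)
on h.dv_hashkey_hub_account = d1.dv_hashkey_hub_account
group by h.account_id, d1.card_status, date_trunc('day', f.dv_applieddate);

The Write Model here is simply the insert-only table fed by Snowpipe Streaming; the Dynamic Table refreshes within the target lag and serves as the information mart.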

As new metrics are processed incrementally at a point in time, the Dynamic Table's SQL syntax transforms and joins that data incrementally, providing an efficient construct as an information mart at the grain needed for your business. We are discussing analytics that need near-real-time latency; if you do not need a 5-minute window from business event to analytical value, then you should look to regular batch / micro-batch architecture patterns.

Using the right tool for the job

Because Data Vault tables are Insert-Only (satellite tables do not have end dates), it may be tempting to utilise ASOF joins to bring satellite tables together to build PIT tables for optimal querying. An ASOF join will in fact produce the same output when constructing a PIT table as a regular left join. This might be a useful use case if one of the tables in the join is non-historized (streaming). Let's examine the difference when we use an ASOF join instead of the standard EQUI-JOIN to construct PIT tables.
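
For comparison, a sketch of what PIT construction with an ASOF join could look like; the as_of_dates snapshot spine is an assumed helper table, the rest follows the article's naming:

-- one ASOF join per satellite; repeat the pattern for every satellite hanging off the hub
select h.dv_hashkey_hub_account
, c.as_of_date
, d1.dv_applieddate as sat_card_masterfile_applieddate
from datavault.hub_account h
cross join as_of_dates c
asof join datavault.sat_card_masterfile d1
match_condition(c.as_of_date >= d1.dv_applieddate)
on h.dv_hashkey_hub_account = d1.dv_hashkey_hub_account;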

Star-join outperforms in PIT building, limited to the month of January

Yes, a Left-Deep Join Tree is executed in the query plan, which is more memory efficient, but it will not be as performant in constructing PIT tables (the output of which is used to perform hash joins between PITs and satellite tables). However, if you are loading trillions of records into data vault non-historized link or satellite tables, you should consider supporting the Read Model with a Dynamic Table using an ASOF join as we described earlier; the dynamic table is your information mart in data vault parlance.

We hope this provided an adequate guide to when to use ASOF joins for your SQL-based workloads and when the regular constructs you're familiar with (Left, Right and Inner Joins) are more appropriate. Snowflake announced the Snowpipe Streaming construct in June 2023, followed by Dynamic Tables and ASOF joins, which are designed to simplify and optimise how you rationalise with your near-real-time streaming data needs.

It is important to understand when and where to use these constructs to ensure your total cost of ownership rarely exceeds your return on investment.

Until next time!

References

The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.
