Simple PIT Table Constructs
Many data engineers question why data vault splits staged data in your data pipeline into three table types (hubs, links and satellites) only to bring them back together when consuming that data. It is a great question. Data engineers, data architects and solution architects look for efficient methods to automate data movement to answer business cases; replicating and splitting data hardly seems efficient, so we had better have a good reason to do either!
Source application data comes in a variety of forms and data types: structured, semi-structured, 3rd normal form, or even as dimensional models! An advantage of following the data vault architecture and modelling pattern is that no matter what the polyglot form of the source data, or its cadence for providing batched snapshot/incremental or streaming data, the outcome is always the three standard table types we are familiar with — hubs, links and satellites.
- Hub tables form the passive integration point of your business objects between all your source applications (automation engines for your business processes) relevant to the business case you model.
- Link tables are the recorded interactions between business objects as business events, transactions or simply depicting that the business objects are related somehow. Raw vault link tables will record what is depicted in the source applications, business vault link tables will depict the business view of that interaction. They can certainly be different!
- Satellite tables record only true changes to the descriptive attributes supplied by the source data. If the changing attributes describe a business object, they are recorded in hub-satellite tables; if they describe the interaction between business objects, they are recorded in link-satellite tables. Each extract from a source application is loaded to one or more raw vault satellite tables; we never combine raw data from different sources into a single raw vault satellite table, and we never replicate raw vault satellite tables into business vault satellite tables. If we need to further transform that raw data, we record the output in business vault satellite tables. (A minimal DDL sketch of all three table types follows this list.)
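As a hedged illustration of the three table types, here are minimal DDLs for a hub, a link and a satellite. Every table, column and data-type choice below is an assumption made for the example, not a prescription; real implementations typically carry additional metadata columns.

```sql
-- Illustrative hub: the business key, its surrogate hash key and load metadata.
CREATE TABLE hub_account (
    dv_hashkey_hub_account  BINARY(20)    NOT NULL,  -- surrogate hash of the business key
    account_id              VARCHAR       NOT NULL,  -- the business key itself
    dv_loaddate             TIMESTAMP_NTZ NOT NULL,
    dv_recordsource         VARCHAR       NOT NULL
);

-- Illustrative link: the recorded interaction between two business objects.
CREATE TABLE lnk_account_customer (
    dv_hashkey_lnk_account_customer BINARY(20)    NOT NULL,  -- hash of the combined business keys
    dv_hashkey_hub_account          BINARY(20)    NOT NULL,
    dv_hashkey_hub_customer         BINARY(20)    NOT NULL,
    dv_loaddate                     TIMESTAMP_NTZ NOT NULL,
    dv_recordsource                 VARCHAR       NOT NULL
);

-- Illustrative hub-satellite: true changes of descriptive attributes over time.
CREATE TABLE sat_account_balance (
    dv_hashkey_hub_account  BINARY(20)    NOT NULL,  -- parent key (a hub or link hash key)
    dv_applieddate          TIMESTAMP_NTZ NOT NULL,  -- business-effective (applied) date
    dv_loaddate             TIMESTAMP_NTZ NOT NULL,
    dv_hashdiff             BINARY(20)    NOT NULL,  -- record digest for true-change detection
    balance                 NUMBER(18,2),
    status                  VARCHAR
);
```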
Hub tables reflect the top-down model with respect to the business object types a business cares about; hub_policy will be about policies, hub_loan_account about loan accounts, etc… By splitting the data into these three simple table types, we can:
- Consolidate everything we know about a policy, loan account or any other business object around what has been recorded from each source application as true changes (like slowly changing dimensions).
- Consolidate everything we know about an interaction / relationship / business event if that relationship happens to be recorded in different source applications (this rarely happens — it might indicate that you have redundant source applications automating the same business process!). However, the link table does introduce an opportunity to deploy specialist data vault artefacts, such as one or more effectivity satellite tables modelled off the same link table.
What can be challenging is bringing this data back together from multiple raw vault (one source table = one raw vault satellite table) and business vault satellite tables. Data vault has thought of this and provides the Point-In-Time (PIT) table structure to solve just that.
Let’s go!
Table Structures
Point-in-time tables (never SQL views; always physical tables) are used to take a snapshot (imagine taking a picture) of the applicable records for a business object or relationship (the parent entity) as of a point in time. A single PIT table can support what are called “managed PIT windows” and “logarithmic PIT structures” (more on these later); for now consider that:
- Scenario 1: For the current state of an entity or relationship, the PIT table will of course be a snapshot of the current records from the satellite tables surrounding that parent entity.
- Scenario 2: For historical snapshots it will be the state of that information as of that point in time; if the state of a parent entity never changes, the PIT table will keep referring to the same record it has pointed to since that record was introduced to a satellite table.
- Scenario 3: For a stateless parent entity (no descriptive attributes exist for this parent entity yet), the PIT table will refer to what is called the “Ghost Record” (a completely different concept to the Zero-Key) until records do start to arrive in that satellite table. How did that business key get there, you ask? It’s a stub-hub entry 😉
To clarify a few esoteric terms in the above dialogue:
- Ghost record — a single ghost record is inserted into a satellite table when the table is created. This is required to ensure that you will only ever need an SQL equi-join between that PIT table and surrounding satellite tables in a query (a minimal sketch of seeding one follows this list).
- Zero key — a dynamic concept designed to handle and interpret when a business key column is null. There is no need to pre-load any data vault artefact with zero-keys because they will naturally occur as they are assigned in staging. The dynamic nature of zero-keys also enables multi-cardinality relationship representation in a link table, which then guarantees that your joins between hub and link tables are also always equi-joins. Both of the above terms are compared and described in this blog.
- As of — what is the state of this business object or relationship as of this point in time. It is a concept we use in time-series analytics, and Snowflake has a specific construct called the ASOF join. We use the same concept (but not the same technique) in PIT table construction and management as an ASOF table; its construction and content can be compared to a date-time dimension from Kimball-style dimensional modelling (more on this in a later section).
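To make the ghost record concrete, here is a minimal sketch of seeding one into the illustrative balance satellite at creation time. The all-zero hash key and the 1900-01-01 applied date are common conventions, not mandated values; use whatever your automation standard prescribes.

```sql
-- Seed the illustrative balance satellite with its ghost record at creation time.
INSERT INTO sat_account_balance (
    dv_hashkey_hub_account, dv_applieddate, dv_loaddate, dv_hashdiff, balance, status
)
SELECT
    TO_BINARY(REPEAT('0', 40), 'HEX'),   -- ghost hash key (convention, not mandated)
    '1900-01-01'::TIMESTAMP_NTZ,         -- ghost applied date
    CURRENT_TIMESTAMP(),                 -- load date
    TO_BINARY(REPEAT('0', 40), 'HEX'),   -- ghost record digest
    NULL,                                -- descriptive attributes stay NULL
    NULL;
```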
These concepts will become clearer when we combine these sub-topics into a single image shortly!
Data dictionary for a PIT table:
The number of hash-key and applied-date-timestamp column pairs is determined by the number of satellite tables this PIT table will retrieve data from. Optionally, you can add descriptive attributes to the PIT table from the surrounding satellite tables for that parent entity, or even (if desired) include some lightweight transformations to derive column values (like a rolling balance or simple CASE-statement outputs). We intend to keep the PIT table thin and efficient, and because PIT tables are built on an OLAP platform, columnar compression will reduce the storage footprint of surrogate hash keys and applied dates repeated vertically across snapshot date periods.
PIT tables are essentially a copy of the keys and applied dates from raw and business vault, which makes PIT tables disposable. With the PIT table acting as a join index, the default method of building information marts on top of PIT tables is to deploy the marts as SQL views. One PIT can serve many views supporting multiple business cases.
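To make the structure concrete, here is a minimal, hedged DDL sketch of a daily PIT over the illustrative account hub and two of its satellites (the profile satellite is assumed); every name and data type is for illustration only.

```sql
-- Illustrative daily PIT for hub_account surrounded by two satellite tables.
-- One (hash key, applied date) pair is kept per participating satellite.
CREATE TABLE pit_account_daily (
    dv_hashkey_hub_account              BINARY(20)    NOT NULL,  -- the parent entity
    snapshotdate                        TIMESTAMP_NTZ NOT NULL,  -- from the ASOF table
    sat_account_balance_dv_hashkey      BINARY(20)    NOT NULL,
    sat_account_balance_dv_applieddate  TIMESTAMP_NTZ NOT NULL,
    sat_account_profile_dv_hashkey      BINARY(20)    NOT NULL,
    sat_account_profile_dv_applieddate  TIMESTAMP_NTZ NOT NULL
);
```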
Data dictionary for an ASOF table:
We touched on what an ASOF table is: similar to a date-time dimension, it carries the snapshot date in different formats along with business flags or indicators marking dates of important business events.
ASOF tables are infrequently updated. In consultation with the business, an ASOF table can be incrementally extended with the next 12 or more months’ worth of snapshot dates, along with business flags and indicators such as the financial year end date.
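A minimal sketch of such an ASOF table and an incremental extension of its calendar; the column names, flags and generated date range below are all assumptions for the example.

```sql
-- Illustrative ASOF (snapshot date) table.
CREATE TABLE asof_date (
    snapshotdate          TIMESTAMP_NTZ NOT NULL,
    snapshotdate_key      NUMBER(8,0)   NOT NULL,   -- e.g. 20250131, dimensional-style key
    is_business_day       BOOLEAN       NOT NULL,
    is_month_end          BOOLEAN       NOT NULL,
    is_financial_year_end BOOLEAN       NOT NULL
);

-- Incrementally extend the calendar with (for example) the next 365 daily snapshot dates.
INSERT INTO asof_date
WITH calendar AS (
    SELECT DATEADD(day, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1, '2025-01-01'::DATE) AS d
    FROM TABLE(GENERATOR(ROWCOUNT => 365))
)
SELECT
    d::TIMESTAMP_NTZ                  AS snapshotdate,
    TO_NUMBER(TO_CHAR(d, 'YYYYMMDD')) AS snapshotdate_key,
    DAYOFWEEKISO(d) BETWEEN 1 AND 5   AS is_business_day,       -- refine with a holiday calendar
    d = LAST_DAY(d)                   AS is_month_end,
    FALSE                             AS is_financial_year_end  -- set per your business calendar
FROM calendar;
```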
Let’s combine what we have learnt from above into a simple example (please note, to keep the example simple, we do not apply hashing in these examples):
- Scenario 1 (101): changes every day, therefore each daily snapshot of keys and applied date refers to a new record from the satellite table.
- Scenario 2 (456): does not have descriptive attributes from this satellite table on 1/1, therefore for this snapshot date the PIT table will refer to the satellite table’s ghost record. On 2/1 however, a descriptive attribute arrives, and it does not change on 3/1. Therefore, the snapshot date of 3/1 for this business object will refer to the applicable descriptive attributes as of that point in time, 2/1.
- Scenario 3 (789): does not receive descriptive details in this satellite table, thus for this snapshot window the PIT table will refer to the ghost record at each snapshot interval.
The ghost record enables equi-joins: you do not need a mix of inner and left joins to query multiple satellite tables surrounding a PIT table. This construct achieves what the dimensional modelling community calls a star-join and what database developers know as a hash-join. In your query plan you should see the nodes of the plan lean to the right; that shape is called a right-deep join tree. We have discussed this optimization and the SNOPIT implementation in this article.
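As a hedged sketch (reusing the illustrative names from the earlier DDLs; sat_account_profile and its segment column are assumed), an information-mart view over that PIT needs nothing but equi-joins because every satellite carries its ghost record:

```sql
-- Illustrative information mart view over the PIT; every join is an equi-join because
-- the ghost record guarantees a matching satellite row for each PIT entry.
CREATE OR REPLACE VIEW im_account_daily AS
SELECT
    p.dv_hashkey_hub_account,
    p.snapshotdate,
    bal.balance,
    bal.status,
    prof.segment
FROM pit_account_daily p
JOIN sat_account_balance bal
  ON  bal.dv_hashkey_hub_account = p.sat_account_balance_dv_hashkey
  AND bal.dv_applieddate         = p.sat_account_balance_dv_applieddate
JOIN sat_account_profile prof
  ON  prof.dv_hashkey_hub_account = p.sat_account_profile_dv_hashkey
  AND prof.dv_applieddate         = p.sat_account_profile_dv_applieddate;
```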
Management
The standard columnar construct of a PIT table remains consistent, but we have options for what the PIT table’s contents encapsulate and how. The topics below can be combined, making the PIT table pattern repeatable and scalable.
Managed PIT Windows
The simplest configuration is to define the start and end snapshot date timestamps; we do this by limiting the dates we use from the ASOF date table (or by creating an independent ASOF table with just the dates you need). You can define the snapshot start date as far back as you want, but of course there is a higher cost to building a PIT table if you are taking a snapshot of more data. Another technique to consider: when you need new snapshot dates added to a PIT table, do you rebuild the PIT table from scratch (a full refresh), or do you create an intermediary snapshot PIT table of just the new snapshot dates and insert those into the PIT table you have already constructed? The build cost may be lower in the second scenario, but it means you need to build the orchestration and reconciliation tests to ensure the referential integrity between the PIT table and adjacent data vault satellite tables is still correct.
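A hedged sketch of such a managed-window build, reusing the illustrative names from the earlier DDLs: for each ASOF date in the window we keep the latest satellite record at or before that date, and fall back to the ghost-record key and date where nothing has arrived yet.

```sql
-- Illustrative managed-window PIT build: one row per business key per snapshot date.
INSERT INTO pit_account_daily
WITH snapshots AS (
    -- the managed window: only these ASOF dates are (re)built
    SELECT snapshotdate FROM asof_date
    WHERE snapshotdate BETWEEN '2025-01-01' AND '2025-03-31'
),
bal AS (
    -- latest balance record at or before each snapshot date, per business key
    SELECT d.snapshotdate, s.dv_hashkey_hub_account, s.dv_applieddate
    FROM snapshots d
    JOIN sat_account_balance s ON s.dv_applieddate <= d.snapshotdate
    QUALIFY ROW_NUMBER() OVER (PARTITION BY s.dv_hashkey_hub_account, d.snapshotdate
                               ORDER BY s.dv_applieddate DESC) = 1
),
prof AS (
    -- the same pattern repeats for every other satellite around the hub
    SELECT d.snapshotdate, s.dv_hashkey_hub_account, s.dv_applieddate
    FROM snapshots d
    JOIN sat_account_profile s ON s.dv_applieddate <= d.snapshotdate
    QUALIFY ROW_NUMBER() OVER (PARTITION BY s.dv_hashkey_hub_account, d.snapshotdate
                               ORDER BY s.dv_applieddate DESC) = 1
)
SELECT
    h.dv_hashkey_hub_account,
    d.snapshotdate,
    -- fall back to the ghost record when no state exists yet for this key
    COALESCE(bal.dv_hashkey_hub_account, TO_BINARY(REPEAT('0', 40), 'HEX')),
    COALESCE(bal.dv_applieddate, '1900-01-01'::TIMESTAMP_NTZ),
    COALESCE(prof.dv_hashkey_hub_account, TO_BINARY(REPEAT('0', 40), 'HEX')),
    COALESCE(prof.dv_applieddate, '1900-01-01'::TIMESTAMP_NTZ)
FROM hub_account h
CROSS JOIN snapshots d
LEFT JOIN bal  ON bal.dv_hashkey_hub_account  = h.dv_hashkey_hub_account AND bal.snapshotdate  = d.snapshotdate
LEFT JOIN prof ON prof.dv_hashkey_hub_account = h.dv_hashkey_hub_account AND prof.snapshotdate = d.snapshotdate;
```

Incremental loads follow the same shape; only the snapshots window changes to cover the new dates being added.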
Caution must be exercised when building PIT tables too far back: if no referenceable state data exists for the business objects / relationships this PIT table covers, then every record in that PIT table (until there is real data in the satellite tables for that parent entity) will point to the ghost record. Snowflake in particular does not perform well if an excessive number of join keys carry the same value on the probe side of a star-join query. This is called probe-side skew, a topic we discussed in this article, which includes a solution for dealing with the skew at query time, i.e. in your information marts. Of course, you should not be creating that many ghost-record references in the first place, but if other join-key values are excessively represented in the PIT table, the article does articulate a method for dealing with that scenario.
Logarithmic PIT
Can you really say that you will need the same granularity of snapshots for fresh data and for data that has aged into weeks, months or years? Would your business case need to know the balance of an account from two years ago, on a Thursday morning at precisely 9:05 am? More than likely not. Here we will propose different PIT table cardinality structures more aligned to realistic business intelligence reporting needs.
What do we mean by this? The data pipeline that populates a PIT table can either load the required frequency of snapshots all at once in a table refresh or (as the previous section suggested) be orchestrated to load PIT increments into an existing PIT table. Regardless of which method you choose, you will build or load PIT tables according to a reporting period defined in your ASOF table; the increments you choose are up to your business reporting needs, for example:
- Intra-day PIT for data you need snapshots of multiple times in the day, let’s say 9am, noon and 4pm.
- End-of-Day Daily PIT, recording a snapshot for every calendar day, weekends included.
- Business Day PIT, only take snapshots according to a business day and exclude public and company holidays.
- Weekly PIT, Monthly PIT, Quarterly PIT, Financial Year End PIT… etc.
- Current Active PIT (CPIT) which only contains the surrogate hash keys and applied date timestamps for the active records for a business entity or relationship.
Depending on how deep your satellite tables have become (billions… trillions of records), retrieving the relevant surrogate hash keys and applied date timestamps separately for each PIT table type you have created may seem wasteful. If you are building PIT tables in a refresh, it is more than likely that your PIT tables will overlap in the surrogate hash keys and applied date timestamps they hold for a given snapshot date; likewise if you are incrementally populating PIT tables.
An alternative, then, is to construct tumbling-window pipelines.
Let’s elaborate on this!
Tumbling PIT Windows
If we are implementing logarithmic PIT tables, then rather than fetching the same surrogate hash keys and applied date timestamps from the data vault (raw and business vault) for each overlapping PIT table, we simply push/extract the keys we need from an existing PIT table into another PIT table, either as a refresh or incrementally.
From the PIT table with the highest granularity (we’ll call this the top tier or tier 1 PIT, like a daily PIT), extract the keys and dates you need to populate a lower tier PIT table (a tier 2 PIT, like a weekly PIT). The top tier PIT may become large (you could trim older snapshots if that fits your business reporting purpose), and for an even lower tier PIT table (a tier 3 PIT, like a monthly PIT) we can retrieve the snapshot keys and dates we need from the middle tier PIT. How deep these tumbling windows cascade (the number of tiers) is up to you; the method requires more orchestration but could offset the need to churn through very large PIT tables (maybe even consider Snowflake streams to process this). For older snapshot periods stored in, for example, a daily PIT there may be no need to keep that finer grain of keys and dates, so the lower tier PITs act as a summary of that snapshot period.
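A minimal sketch of one tumbling step, assuming a tier-2 monthly PIT with the same columns as the illustrative daily PIT and a month-end flag in the ASOF table: the lower tier is fed from the tier above it rather than from the satellites.

```sql
-- Illustrative tier-2 monthly PIT fed from the tier-1 daily PIT rather than re-reading
-- the satellites; only month-end snapshot dates not yet present are carried down.
INSERT INTO pit_account_monthly
SELECT p.*
FROM pit_account_daily p
JOIN asof_date d
  ON d.snapshotdate = p.snapshotdate
WHERE d.is_month_end
  AND p.snapshotdate > (SELECT COALESCE(MAX(snapshotdate), '1900-01-01'::TIMESTAMP_NTZ)
                        FROM pit_account_monthly);
```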
Implementation Styles
Results may vary! Pick which works best for your situation!
Refresh versus incremental PIT tables
We have discussed an implementation of incremental and refresh PIT tables before in this article. One of the major considerations with incremental PIT data pipelines is what happens if a record in the data vault (raw or business vault) is deleted (another is that, far enough back, the business entity / relationship did not yet exist and therefore would not even have a PIT record with ghost record references). Deletions can occur when required by your organization’s legal, contractual and regulatory obligations.
We have discussed the option to isolate identifiers into their own personally identifiable information (PII) satellite table and permanently obscure the identifier through an SQL update, leaving the record digest (hashdiff) unmanipulated (article 17 of GDPR). This also means that if the PII satellite table is included in the PIT table join, it does not need to be updated and your PIT table will not be referencing records that do not exist.
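A hedged sketch of that obscuring step, with hypothetical table and column names: only the identifying attributes are overwritten, while keys, dates and the hashdiff stay untouched so PIT joins and true-change detection keep resolving.

```sql
-- Illustrative "right to erasure" handling; all names here are hypothetical.
UPDATE sat_customer_pii
SET    full_name     = '*** ERASED ***',
       date_of_birth = NULL,
       email_address = NULL
WHERE  dv_hashkey_hub_customer IN (
           SELECT dv_hashkey_hub_customer
           FROM erasure_requests          -- hypothetical staging table of erasure requests
       );
```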
Remember, this is an equi-join: if one of the records in an adjacent satellite table is gone, the entire business object / relationship becomes invisible in the information mart for that snapshot date.
Using Snowflake, we have demonstrated how the same refresh PIT table data pipeline logic can be used to update itself incrementally as a Snowflake Dynamic Table. You set the desired target lag; the initial build will be a full refresh, but subsequent snapshots from the underlying data vault are incrementally added to the dynamic table. Which increments are added is still controlled by the business reporting flags/indicators you set up in the ASOF table.
Click here to read about the full implementation.
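A minimal sketch of that pattern, assuming the refresh PIT SELECT has been wrapped in a hypothetical view; whether Snowflake maintains it incrementally or falls back to a full refresh depends on the shape of the defining query.

```sql
-- Illustrative Snowflake Dynamic Table version of the refresh PIT; the defining query
-- (wrapped here in a hypothetical view) is the same SELECT used by the refresh pipeline.
CREATE OR REPLACE DYNAMIC TABLE pit_account_daily_dt
  TARGET_LAG = '1 day'          -- how stale the PIT is allowed to become
  WAREHOUSE  = transform_wh     -- assumed warehouse name
AS
SELECT * FROM v_pit_account_daily_build;
```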
Data dictionary for a SNOPIT table:
Within the Data Vault Alliance community forum, a challenge was proposed: find a way to improve SQL join performance on Snowflake, given that Snowflake’s proprietary flocon de neige (FDN) table format does not support table indexes and does not enforce primary or foreign keys. I proposed a Sequence-Number-Only-PIT (SNOPIT) construct as a response to the challenge and published the idea a few weeks later here. More recently SNOPIT has seen real-world implementation as described here.
The idea was simple: borrow a concept from dimensional modelling called temporal surrogate keys, i.e. implement an identity/auto-increment column in every satellite table and, instead of extracting the surrogate hash key and applied date timestamp into the PIT table, pull the auto-increment column instead. Snowflake’s architecture separates compute from storage, essentially wrapping instances of a cloud service provider’s virtual machines (AWS EC2, Azure Virtual Machines, Google Compute Engine) as Snowflake Virtual Warehouses that execute queries on top of Snowflake’s FDN storage in a Snowflake-managed bucket. In my mind, the crux of a solution for improving query performance must reduce the network traffic between these Snowflake-managed components and the client needing the results of a query; one method to achieve that is to reduce the bytes traversing the cloud service provider’s secure internal network. Using the identity/auto-increment column reduces the number of columns to join on between the PIT table and adjacent satellite tables, and the integer data type requires fewer bytes to represent a record uniquely. SNOPIT is a query engine optimization technique, therefore the same construct can be used if the underlying table format were Apache Iceberg.
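A hedged sketch of the idea with illustrative names: the satellite gains an identity column, the SNOPIT stores only that integer per satellite, and the information-mart join collapses to a single narrow integer column.

```sql
-- Illustrative only: the balance satellite from earlier, now carrying an identity column.
CREATE OR REPLACE TABLE sat_account_balance (
    dv_sid                 NUMBER(38,0)  IDENTITY,      -- temporal surrogate / sequence number
    dv_hashkey_hub_account BINARY(20)    NOT NULL,
    dv_applieddate         TIMESTAMP_NTZ NOT NULL,
    dv_hashdiff            BINARY(20)    NOT NULL,
    balance                NUMBER(18,2)
);

-- The SNOPIT keeps one integer per satellite instead of a (hash key, applied date) pair.
CREATE OR REPLACE TABLE snopit_account_daily (
    dv_hashkey_hub_account     BINARY(20)    NOT NULL,
    snapshotdate               TIMESTAMP_NTZ NOT NULL,
    sat_account_balance_dv_sid NUMBER(38,0)  NOT NULL
);

-- The information mart join is now a single narrow integer column per satellite.
SELECT p.dv_hashkey_hub_account, p.snapshotdate, s.balance
FROM snopit_account_daily p
JOIN sat_account_balance s
  ON s.dv_sid = p.sat_account_balance_dv_sid;
```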
SNOPIT outperforms the standard PIT table and delivers exactly the same output. In our test:
- Joining satellites without a PIT table: 8 minutes 19 seconds
- Using a PIT table construct: 19 seconds
- Using a SNOPIT table construct: 15 seconds
For a deep dive into PIT and SNOPIT tables, feel free to watch this webinar.
If we were to reimagine the previous example as a SNOPIT it would look like this:
Parallel loading PITs
Another option for populating PIT tables (and very useful in the construction of CPIT tables) is to populate a PIT table at the same time as the data vault table is being loaded. The CPIT is extremely useful for querying the current records in a satellite table when you do not want to enable auto-clustering for that satellite table. Because we deploy current views by default over a satellite table, what happens under the view is hidden from the user. Underneath, an equi-join is executed between the satellite and CPIT tables, as we described in the later section of this article, which in Snowflake executes a JoinFilter (aka a Bloom filter).
Yet another option for loading PIT table tiers is to load them all in parallel in one statement; yes, we can do that using Snowflake’s conditional multi-table insert, as described in this article.
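A hedged sketch of that pattern, assuming the PIT build SELECT has been staged in a hypothetical intermediate table and reusing the illustrative ASOF flags: each WHEN clause routes the same row to the tiers it qualifies for.

```sql
-- Illustrative conditional multi-table insert loading several PIT tiers in one statement.
-- pit_build_staging is a hypothetical table holding the output of the PIT build SELECT.
INSERT ALL
  WHEN 1 = 1        THEN INTO pit_account_daily   VALUES (hk, sd, bal_hk, bal_ad, prof_hk, prof_ad)
  WHEN is_month_end THEN INTO pit_account_monthly VALUES (hk, sd, bal_hk, bal_ad, prof_hk, prof_ad)
  WHEN is_fy_end    THEN INTO pit_account_fy      VALUES (hk, sd, bal_hk, bal_ad, prof_hk, prof_ad)
SELECT
    b.dv_hashkey_hub_account               AS hk,
    b.snapshotdate                         AS sd,
    b.sat_account_balance_dv_hashkey       AS bal_hk,
    b.sat_account_balance_dv_applieddate   AS bal_ad,
    b.sat_account_profile_dv_hashkey       AS prof_hk,
    b.sat_account_profile_dv_applieddate   AS prof_ad,
    d.is_month_end,
    d.is_financial_year_end                AS is_fy_end
FROM pit_build_staging b
JOIN asof_date d
  ON d.snapshotdate = b.snapshotdate;
```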
Scenarios
Almost all data vault consumption scenarios are possible if you adopt a bi-temporal data vault. Yes, PITs and Bridges use raw and business vault as a source. Let’s move on…
One PIT, many Information Delivery patterns
PIT tables provide an access path or (as a former colleague put it) the GPS coordinates of where to find the attributes you need for that snapshot date timestamp based on the business object or relationship you’re querying for. A single PIT table is reusable by many business cases seeking the same period of data for their own business needs from the same data vault satellite tables. A satellite table may have one or hundreds of descriptive attributes; the business case that initially brought all those attributes in (including the ones it didn’t need) now benefits subsequent business cases by reducing the development cycle needed to bring in those additional descriptive attributes.
This is one of the reasons why we say that the initial business cases used to start up a data vault initiative will cost more in the beginning, but subsequent business cases will deliver theirs at a faster pace and reduced development cycle time. (Think of Gregor Hohpe’s selling options metaphor).
Keep in mind that not all data is of equal value, and if the data you’re extracting from contains application data irrelevant to how the business functions, then you should omit those attributes.
Multi-active records
If, for a single business object or relationship, multiple records are current and active for that parent entity’s state as of a point in time, it does not affect the standard PIT table construct. Whether you’re reporting from one or more multi-active satellite tables or satellite tables with dependent-child keys, the PIT table is about finding the records applicable as of that point in time. Thus, for any satellite table type the PIT table construct does not change, because there is no need to record the satellite sub-sequence key (from a multi-active satellite table).
The same is true if we have modelled batched transactions from a source application. As we have discussed in this article, the way we load batched transactions into a satellite table is by defining either the transaction id or the transaction date timestamp as the dependent-child key of that satellite table (we call it the intra-day key here). Because the batched transactions carry a batch/extract date, i.e. the applied date, the PIT table only needs the surrogate hash key and applied date timestamp to return all the transactions applicable as of a point in time.
You must be purposeful in your information mart design: to combine records from multiple multi-active satellite tables around a PIT table, some of the multi-active rows should be flattened (perhaps into an array or a JSON key-value object column). Let’s say the dimensions you wish to include in your information mart are centred around batched transactions you have modelled into a satellite table with an intra-day key; then these transaction records will likely need to be kept at that grain and the other multi-active records flattened in the information mart. In all these scenarios the PIT table construct does not change.
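A hedged sketch of that design choice, assuming a hypothetical PIT built over a transaction-grain satellite (intra-day key) and a multi-active phone satellite (all names illustrative): the multi-active rows are collapsed to one array per parent key before joining, so the transaction grain is preserved and no accidental cartesian product appears.

```sql
-- Illustrative only: pit_account_daily_tx is a hypothetical PIT built over a
-- transaction-grain satellite (intra-day key) and a multi-active phone satellite.
CREATE OR REPLACE VIEW im_account_daily_tx AS
WITH phones AS (
    -- collapse the multi-active rows to one array per parent key and applied date
    SELECT dv_hashkey_hub_account,
           dv_applieddate,
           ARRAY_AGG(OBJECT_CONSTRUCT('phone_type', phone_type,
                                      'phone_number', phone_number)) AS phone_numbers
    FROM sat_account_phone
    GROUP BY dv_hashkey_hub_account, dv_applieddate
)
SELECT
    p.dv_hashkey_hub_account,
    p.snapshotdate,
    t.transaction_id,        -- the transaction grain is preserved
    t.amount,
    ph.phone_numbers         -- the multi-active rows arrive already flattened
FROM pit_account_daily_tx p
JOIN sat_account_transactions t
  ON  t.dv_hashkey_hub_account = p.sat_account_transactions_dv_hashkey
  AND t.dv_applieddate         = p.sat_account_transactions_dv_applieddate
JOIN phones ph
  ON  ph.dv_hashkey_hub_account = p.sat_account_phone_dv_hashkey
  AND ph.dv_applieddate         = p.sat_account_phone_dv_applieddate;
```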
For a SNOPIT to support these constructs, the multi-active portions of the SNOPIT must revert to using surrogate hash keys and applied dates for retrieval (SN+PIT). An exception is when you attempt to combine multiple satellite tables that have dependent-child keys to return the latest state by that sub-category key; for a given as-of date you might need to fetch the state of a dependent-child key that has no updates for consecutive snapshot periods. Fundamentally this means your PIT tables need an additional sub-category column to equi-join on; we will use the record hash (HashDiff).
An additional scenario you must then factor in is whether the returned PIT table value should find a match by the dependent-child key in the adjacent satellite table, or whether the state of one dependent-child key in a satellite should match all records with that same parent key in the adjacent satellite table; the latter is a potential cartesian product, so you must understand the data before deciding what the PIT construct should do.
- Scenario 1: match by dependent-child key
- Scenario 2: cartesian join
What are the alternatives?
PITs and Bridges are not the only methods to get your data out of a data vault! They are merely repeatable patterns proven to make use of the data platform’s OLAP algorithms for efficient data queries.
A few alternative approaches you might consider:
- Querying directly from the data vault — for your SQL-savvy business analysts.
- Stem and Leaf approach — an approach described in the “data vault guru” book that caters for the de-duplication of data if you are only selecting a subset of columns from a satellite table.
- Twine — pivot how you join data based on an anchor satellite table, using UNION ALL to return the as-of record state from adjacent satellite tables.
- Pushing the content to facts and dimensions, either by reusing data vault satellite tables as dimensions along with a bridge table that includes derived content, or by pushing new data from a satellite table using Snowflake streams & tasks or dynamic tables to build type 2 slowly changing dimensions (SCD Type 2).
- If all your data comes from a single satellite table, then you do not need to use a PIT construct at all! Remember that you should include the business key in the satellite table to support this.
- You could even go old school and consider constructing a supernova pattern.
- Have you tried the Activity Schema pattern in Business Vault?
Wrap it up
Data vault hub, link and satellite tables are based on repeatable patterns in modelling and data pipelines. This simplicity, however, enables a plethora of querying patterns that are also repeatable and take advantage of a platform’s OLAP capabilities.
- PIT tables retrieve descriptive attributes based on a hub or link table; by the way, you could also include the contents stored in a status tracking satellite, an effectivity satellite (link-based PIT tables only) and the extended record tracking satellite (details here).
- Bridge tables either shorten the path between far-flung hub tables in the data vault model, resolve link effectivity, or persist aggregated metrics. We discuss all three here.
We have also shown that using PIT tables is not the only method for retrieving data out of the data vault. As you can see, surrogate hash keys and identity/auto-increment columns are mere tools used to bring the data in the data vault together. You should never see these tools propagate beyond PIT and Bridge tables because they have no business meaning whatsoever.
However you decide to get your data out of the data vault, just ensure it is sustainable, repeatable, scalable and (above all) purposeful.
Until next time!
More references
- Snowflake blog series — https://www.snowflake.com/en/blog/authors/patrick-cuba/
- Data Vault Guru book — https://www.amazon.com/Data-Vault-Guru-pragmatic-building/dp/B08KJLJW9Q
- Zero to Data Vault on Snowflake — https://www.snowflake.com/webinar/thought-leadership/zero-to-data-vault-on-snowflake-2023-04-27/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.