Data Vault on Snowflake: Data Retention & Storage Tiering — Reporting
This article is an addendum to the previous article detailing the use of Data Vault’s eXtended record Tracking Satellite (XTS) for record management. As a refresher, we added the following columns to a data vault satellite table
- dv_snowflake_query_id — Generated GUID, recording the query that loaded that record to that satellite table. With this column we can retrieve more query details and related artefacts from the Snowflake.Account_Usage schema, for example
o Using query_history we can articulate the query that was executed to load the satellite table, this is the compiled query. In other words, as parameterised data vault automation do — the SQL code that gets executed on Snowflake is the compiled code with all macros and parameters resolved.
o Using access_history we can trace lineage (source & target) for each satellite table and what policies were in effect at the time of execution.
- dv_xts_event — ‘insert’/’copy’, identifying that if the record inserted into the satellite table is a regular insert or is the result of a timeline-correction (copy) event
We also added the following metadata columns to XTS:
- dv_sequence_violation — True/False, the record we’re loading is older than a record we have previously seen for a business entity. Relate this to scenarios 4 and 5 in this article.
- dv_record_retention_state — ‘Active’/’Archived’/’Purged’, stipulating in what state the record for the business entity or transaction is.
- dv_disposal_record_requested — True/False, identifying if the record has been requested to be forgotten.
- dv_disposed_record_reemerged — True/False, identifying that a previously obfuscated record has re-emerged.
And now to demonstrate the types of reporting we can derive from XTS…
COPY (out of sequence) events
Highlighting that data is arriving out of sequence should be a focus of concern for the health of your data orchestration.
Record State / Certificate of Destruction
Generally, as your data ages you will see more data going into archival and purging.
Re-emergence Report
Another report that should highlight a concern that data you shouldn’t have exists in a source application!
An image is worth 1000 words
Keeping a full audit history of the state of your data is an important part of data management. This approach with XTS ensures that we capture only that state, and nothing that can be used to reidentify those records. Ultimately, data management like data vault is a discipline and requires the same rigour as the data vault approach itself. Data vault in turn has the repeatable patterns you need to manage data management effectively.
References
- Storage tiering, coming soon…
- Streamlit in Snowflake — https://www.snowflake.com/en/data-cloud/overview/streamlit-in-snowflake/
- Snowflake Notebooks — https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks
- Snowflake Streamlit on Github — https://github.com/PatrickCuba/the_data_must_flow/tree/master/record_retention_reporting
The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.