Data Vault on Snowflake: Handling Semi-Structured Data
Snowflake continues to set the standard for data in the cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You can expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers. Data vaults have traditionally been built on structured, tabular data; with Snowflake you can load semi-structured data (such as JSON and XML) and query it in place. How do we design our data vault to cater for semi-structured content?
[Figure: a reminder of the data vault table types]
[BONUS] Episode 11: Handling Semi-Structured Data
Pay now or pay later: semi-structured data delays the need to design strongly typed database structures to store data. Data producers are free to push data as predefined key-value pairs, while it is the job of data consumers to parse the semi-structured data into a meaningful format, i.e. schema-on-read. Among the other characteristics of semi-structured data, you will find:
- Semi-structured data is not strongly typed, reducing the need to define the schema upfront. However, most BI tools need the content in a structured format, so a degree of parsing (data typing) is required before the data is usable by most front-end tools.
- Data appears as key-value pairs within documents (or objects), whereas structured data is tabular, defined as rows and strongly typed columns. Not only can you include the typical primitive data types we see in structured data, but you can natively store complex data types such as lists, dictionaries, and arrays.
- Evolving the schema is easier, more flexible, and more scalable. There is no need to migrate the structure from one tabular form to another, or to define common relational table constructs such as indexes and primary and foreign keys.
- Relationships between keys can be embedded within a single semi-structured document, persisting where they do exist and omitting them when they don’t.
Snowflake makes semi-structured data even more attractive by extending its SQL with the ability to query JSON and XML in place within Snowflake’s VARIANT column data type. Strictly speaking, semi-structured data can arrive as part of a batch-oriented workload or as a real-time streaming workload. Traditional batch-oriented workloads are file-based and have been the mainstay for decades as a cheap way to push/pull content into an analytical data platform, because those use cases do not need the lowest possible latency to derive analytical value; these are the traditional requirements such as daily, weekly, and historical reports and dashboards. Streaming use cases, however, need real-time responses and analysis, and are often about processing current data, as close as possible to the business event that triggered the processing.
In the context of this episode, we will describe considerations for semi-structured data within the data vault, regardless of whether the data is streaming, micro-batch, or batch oriented. At the end of the day, the data will come to rest somewhere, and if it is not needed right away it can certainly be used for future analysis.
Where do we see semi-structured?
As you may have heard, the elements of Big Data include volume, velocity, variety, veracity, and value. With Snowflake’s native support for semi-structured data, you can bring your use case from a variety of sources, such as:
- Internet of Things (IoT), e.g. smart healthcare, vehicle-to-everything (V2X), smart buildings, smart homes and appliances, and transportation, among others
- Web, clickstream, social media, email, and ride sharing applications
- Computer logging and monitoring, SIEM (Security Information and Event Management)
- Stock exchange, inventory management, fraud detection
- And more… oh my!
With Snowflake, data ingestion does not need to land in Snowflake’s native tables (although that is preferred); semi-structured data can also be accessed through external tables over formats such as Parquet, Avro, ORC, JSON, and XML. Should you choose to ingest (COPY) into Snowflake’s native tables, then look to options such as:
- Using a partner ELT tool to configure and orchestrate data ingestion
- Using cloud service provider’s messaging service to trigger Snowpipe to ingest new data
- Combine Kafka (open source and Confluent), or another message/streaming-based tool (e.g. Amazon Kinesis), to notify Snowpipe via Snowflake’s Snowpipe REST endpoints
Does Data Vault support semi-structured data?
Yes, and here are guidelines and considerations when designing semi-structured data use for data vault on Snowflake.
1. Keep semi-structured data within the structured satellite table
To enable this, and to continue to support the scalability of data vault on Snowflake without the threat of refactoring, we will introduce a new data vault metadata column, dv_object. This new column stores the original semi-structured data as the VARIANT data type; on Snowflake it can be populated with JSON, XML, array, or object content that can be queried in place.
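As a sketch of what such a table could look like, here is a minimal satellite DDL. Everything except dv_object and the general metadata-column pattern is a hypothetical example; your table names, key names, and metadata columns may differ:

```sql
-- Hypothetical raw vault satellite carrying the original semi-structured
-- payload in dv_object (VARIANT) alongside the usual metadata columns.
CREATE TABLE IF NOT EXISTS rv_sat_card_transactions (
    dv_hashkey_hub_account  BINARY(20)   NOT NULL,  -- parent hub hash key
    dv_loaddate             TIMESTAMP_TZ NOT NULL,  -- load timestamp
    dv_applieddate          TIMESTAMP_TZ NOT NULL,  -- business event timestamp
    dv_recsource            VARCHAR      NOT NULL,  -- record source
    dv_object               VARIANT      NOT NULL   -- original JSON/XML, queryable in place
);
```

Keeping the payload in a single VARIANT column means the source schema can evolve without any ALTER TABLE on the satellite.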
2. What is the parent entity of the satellite?
A satellite is the child table of either a hub or a link table; profile the semi-structured content to identify the business keys the descriptive content is based on. If the content is transactional, then we are likely to deploy a link-satellite table. Keep in mind that link tables are designed to serve many-to-many/associative (M:M) cardinality, but can also house explicit or mandatory (1:1 / 1:M) and optional (0..1:1..M) cardinalities within that same link table.
3. Pre-parse essential content into Raw Vault satellites
Semi-structured content is slower to query, and if the content is nested or hierarchical you may need to flatten it before getting to the attributes you need. As part of the staging step, and informed by your data profiling, persist the essential (e.g. critical data element) key-value pairs as strongly typed, structured columns within the same satellite table. These could also be columns that improve pruning performance on Snowflake.
If business-event-to-analytical-value latency is critical, avoid flattening nested content in staging; instead, allow the dashboard to pick and flatten the content at query time.
From the JSON in dv_object we have extracted:
- the timestamp (business event timestamp) to be used as the applied date column.
- contid is the business key, but as you will see below it is an untreated business key; the treated business key is sent to the hub table
- eventID is an internal id used by the source system
These simple extractions are treated as hard rules and will make querying the satellite table faster. Notice how we did not flatten the nested content; doing this at ingestion time would delay loading the satellite table.
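As an illustration, the hard-rule extractions above could be expressed in staging like this. The key names (timestamp, contid, eventID) come from the example; the stage name, target column names, and key treatment are assumptions:

```sql
-- Staging view: promote essential key-value pairs to typed columns
-- while leaving the nested content unflattened inside dv_object.
SELECT
    dv_object:"timestamp"::timestamp_tz     AS dv_applieddate, -- business event timestamp
    dv_object:"contid"::string              AS contid,         -- untreated business key
    UPPER(TRIM(dv_object:"contid"::string)) AS account_bk,     -- treated key sent to the hub
    dv_object:"eventID"::string             AS event_id,       -- source system internal id
    dv_object                                                  -- full payload, queried in place
FROM stg_card_transactions;
```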
4. Do we pursue creating hash keys or build a satellite table with natural keys only?
You could do either. If you choose to use natural keys without hash keys, then those natural keys must persist in the satellite table in order to join to all the respective hub tables. You might choose this because you have deemed the act of hashing business keys to produce hash keys costly and time consuming. Understand, though, that queries running on this satellite will then need to use more columns in every join.
- A data vault should be produced as either natural-key or hash-key based and not a mix of the two. Introducing a mixed architecture pattern essentially means that you must be selective on how you use each table in data vault which is bad practice and introduces inconsistencies into the data vault experience.
- Using concepts like business key collision codes and multi-tenant ids means that you must include these columns in every SQL join you perform on data vault which again can lead to an inconsistent experience on the data vault
- We can mitigate the cost of hashing by using Snowflake features such as Streams and Tasks, which, as we showed in a previous episode, can minimise the amount of data being processed into data vault and therefore reduce the cost to hash.
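For completeness, a hedged sketch of hash-key generation as it might run in staging. The delimiter and the collision-code and tenant-id column names are illustrative, not prescriptive:

```sql
-- Derive the surrogate hash key from the treated business key plus the
-- business key collision code and multi-tenant id, so downstream joins
-- need only one column.
SELECT
    SHA1_BINARY(
        UPPER(TRIM(contid)) || '||' || dv_bkcc || '||' || dv_tenantid
    ) AS dv_hashkey_hub_account
FROM stg_card_transactions;
```

Baking the collision code and tenant id into the hash key is what spares every downstream query from carrying those columns in its joins.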
5. Does the semi-structured content include personally identifiable or sensitive content?
An important step when profiling data is identifying sensitivity levels of the content. You have several options to consider here:
- Isolating semi-structured content in a satellite splitting activity and obscuring the content from the general business user
- Use Snowflake’s data masking features to mask semi-structured content in place (uses Snowflake’s role-based access control and dynamic data masking). See: bit.ly/3nms1Tz
- Obscure the entire VARIANT column and only persist the authorised key-value pairs as structured columns, applying masking/tokenisation as hard rules.
- Pre-parse the content before the data is even loaded to Snowflake by hashing/encrypting that content in place, see “Crypto Shredding” — bit.ly/3bts6SB
- Split the semi-structured content itself between sensitive and non-sensitive content
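To illustrate the in-place masking option, a minimal dynamic data masking policy on the VARIANT column might look as follows; the policy name, role name, and masked stub are assumptions:

```sql
-- Return the payload only to authorised roles; everyone else sees a stub.
CREATE MASKING POLICY IF NOT EXISTS mask_dv_object AS (val VARIANT)
  RETURNS VARIANT ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE OBJECT_CONSTRUCT('masked', TRUE)
  END;

-- Attach the policy to the satellite's VARIANT column.
ALTER TABLE rv_sat_card_transactions
  MODIFY COLUMN dv_object SET MASKING POLICY mask_dv_object;
```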
6. Do we load the content to data vault at all?
If this is auditable data about one or more business objects, then the answer is yes. The subsequent question is: when do we load it to data vault? If data vault serves as a bottleneck for a business case, then instead of pushing the data through data vault to serve your real-time needs, you could offload the data into data vault asynchronously as it is being served to your real-time customers. The value in loading this data into data vault is the auditable history, the business context, and the subsequent batch-oriented analysis the data could serve.
The data is immediately available to the information mart view based on data vault, or it can be further processed into business vault, as we will see below.
7. Always new, immutable content by definition
Streaming events are always new events. No past updates are permitted, and therefore the standard satellite table loader can be avoided. The standard loader checks whether new content differs from the current record per parent key in the target satellite table. Because our content is always new, such a check is no longer necessary; we simply load what has arrived. The same replay-ability is possible when using these non-historised satellite or link table loads in combination with Snowflake data pipelines (streams and tasks), and because we INSERT without needing to check whether a record is new, the target satellite table also does not need a record-hash (hashdiff) column. This also makes the applied timestamp, in a streaming context, the business event timestamp.
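A non-historised load therefore collapses to a plain INSERT from a stream. The table and stream names below follow the earlier hypothetical example:

```sql
-- Non-historised satellite load: every streamed event is new, so there
-- is no hashdiff and no comparison against the current record.
INSERT INTO rv_sat_card_transactions (
    dv_hashkey_hub_account, dv_loaddate, dv_applieddate,
    dv_recsource, dv_object
)
SELECT dv_hashkey_hub_account,
       CURRENT_TIMESTAMP(),     -- load timestamp
       dv_applieddate,          -- the business event timestamp
       dv_recsource,
       dv_object
FROM stg_card_transactions_stream;  -- a Snowflake stream on the staged table
```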
8. Is the content overloaded?
An overloaded data source is the result of combining far too many business processes into a single data file. In such a situation we look to remedy the issue by pursuing one of the following solutions, ranked from the most to the least favoured option, starting with:
a. Solve it at the source — operational reality may mean that this path is not feasible, or at least not prioritised yet.
b. Solve it in staging — a routine within or before staging must be designed and built to filter the essential from the overloaded content, but it does become a point of maintenance for the analytics team. Build error traps, and keep all content somewhere to remain auditable, with overloaded content persisted in its own satellite table. Minimise duplication of data.
c. Solve it using Business Vault — the least favoured option, as always. This results in more tables to maintain and manage, and more tables to join.
9. What about late arriving data?
Streaming ingestion with late arriving data is a different beast from dealing with late arriving batch data. It is best handled by the platform itself through streaming concepts such as windowing, watermarking, and triggers (to name a few). A link in the reference section below provides more details on how this works; ultimately it is not something data vault is expected to manage.
10. How do you test semi-structured data?
Finally, if the automated test framework we described in an earlier episode were executed after every load, it would introduce unnecessary latency between business events and analytical value. Instead, streaming content reconciliation between what was landed and what was loaded should be checked on a scheduled basis, for example once a day. This can be further optimised by using Snowflake streams to compare only new content in the landing area against only new content in the target data vault tables.
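A scheduled reconciliation of this kind could be sketched as below; the stream names are hypothetical, and note that streams only advance their offset when consumed in a DML statement:

```sql
-- Daily reconciliation: count newly landed rows against newly loaded rows
-- using streams, rather than re-scanning the full tables.
SELECT
    (SELECT COUNT(*) FROM landed_card_transactions_stream) AS landed_rows,
    (SELECT COUNT(*) FROM rv_sat_card_transactions_stream) AS loaded_rows;
```

A mismatch between the two counts flags the load window for investigation without ever blocking the load itself.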
Querying Semi-Structured data
With the content prepped for efficient querying, the semi-structured data can evolve freely without impacting the integrity of the satellite table. Nested arrays can be queried in place, or we can look to use business vault to flatten out the content so that you hide this querying complexity from the business end-user.
And to do that we can use Snowflake’s lateral flatten operation…
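A hedged example of that lateral flatten, exploding a nested array at query time; the items array and its keys are assumptions about the payload shape, and the table follows the earlier hypothetical example:

```sql
-- Explode the nested items array in dv_object into one row per element.
SELECT
    s.dv_hashkey_hub_account,
    s.dv_applieddate,
    f.value:"sku"::string          AS sku,
    f.value:"amount"::number(18,2) AS line_amount
FROM rv_sat_card_transactions s,
     LATERAL FLATTEN(input => s.dv_object:"items") f;
```

Wrapping this query in a business vault view is what hides the flattening complexity from the business end-user.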
Here we have shown a pragmatic framework for loading and querying semi-structured data within a data vault on Snowflake. Load it into data vault now or load it later; regardless, it should live somewhere in the data vault, where the source of the facts is accessible from a single data platform and relatable to the overall enterprise data model.
References:
- Questioning the Lambda Architecture, bit.ly/3CtWZQo
- Streaming 101: The world beyond batch, bit.ly/2NkzTDa
- Microservice Principles: Smart Endpoints and Dumb Pipes, bit.ly/3ni24EN
- Originally posted here: https://www.snowflake.com/blog/handling-semi-structured-data/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.