Data Vault on Snowflake: …and the other 80% of the world’s data
Shockingly, most of the world's data is not in a structured or semi-structured form; according to Gartner, 80% of the world's data is unstructured. But what qualifies as unstructured data? Is it valuable to an enterprise? And how do we automate the necessary tasks when we're working with data that has no structure?
You could argue that all data has some structure, so let's properly qualify which category 80% of the world's data falls into by first defining structured and semi-structured data:
- Structured data follows a normal form of records and columns; every record has the same number of columns, and the data collected into this tabular structure tracks instances of related concepts as entities and entity details. Each column contains a data-typed value that either identifies an entity or describes the entity (or describes the relationship between entities). Work must be done upfront to ensure that the right data is recorded in the right target column, and if a new columnar value needs to be recorded about that entity in the same table, the data team must decide where those new values should go. If this is a table tracking history, what do we do for the records that did not have a value for this column historically? Do we (or can we) backfill that data across time?
- Semi-structured data solves this schema-evolution problem by tracking entity and descriptive content as key-value pairs. Instead of requiring the columnar structure and data typing to be defined upfront, data typing is defined when the data is read (aka schema-on-read; see the short sketch after this list). If new attributes are needed you simply add new key-value pairs without worrying about how that could impact historical records about the same entity; however, the same key should always contain the same type of value. Semi-structured data types also support advanced data types like objects, structs, maps and arrays: content that cannot be easily maintained in structured (primitive) data columns, something Snowflake pioneered.
- Both structured and semi-structured data place constraints on how each column or key must be used; unstructured data has no constraint except that the recorded format should represent a similar subject or concept of interest. With no structure to speak of, to derive value from unstructured data you need to apply structure to the streams of unstructured content. That structure is a constraint, and that structure is an intentional bias; to process the data at scale, both must be automated in a repeatable pattern. Because of its unstructured nature, this data can be interpreted in different ways. The same unstructured data file can accommodate multiple constraints and biases (bias-on-read), as long as neither the constraints nor the biases are directly tied to the data. Unstructured data includes audio, emails, social media posts, images, video files and PDFs; the interpretation depends on desired outcomes such as sentiment analysis, named entity recognition, text summarization, translation, text classification, semantic search and anomaly detection, to name a few.
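To make the schema-on-read idea concrete, here is a minimal Snowflake sketch; the table, keys and values are hypothetical and only illustrate how a new key can appear without altering the table definition:

```sql
-- Hypothetical table: semi-structured payloads land in a VARIANT column as-is
CREATE OR REPLACE TABLE customer_events (
    event_id NUMBER,
    payload  VARIANT
);

-- Schema-on-read: data types are asserted only when the keys are queried.
-- A new key (loyalty_tier) can appear in later payloads with no ALTER TABLE;
-- older records simply return NULL for it.
SELECT
    payload:customer_id::STRING  AS customer_id,
    payload:order_total::NUMBER  AS order_total,
    payload:loyalty_tier::STRING AS loyalty_tier
FROM customer_events;
```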
To operationalise structured and semi-structured data processing, we use data integration tools to extract, transform and load that data into a data warehouse, or we subscribe to a topic, load the data into a data lake and transform on read. To operationalise unstructured data processing, we need artificial intelligence (AI). Gartner has plotted this relationship between data forms and their typical uses into a quadrant charting the relationship between knowns and unknowns.
- A data warehouse is where you will find structured data
- A data lake is where you will find semi-structured and unstructured data
- The combination is where you will find a data lakehouse.
- The rest is left up to data science.
Fractional distillation is the process by which oil refineries separate crude oil into different, useful hydrocarbon products based on their relative molecular weights in a distillation tower. The same metaphor applies to the distillation of unstructured data across Gartner's quadrant of knowns and unknowns:
- Known Knowns (KK): These are the facts we know we need from the data.
- Unknown Knowns (UK): Given the facts presented, the predictions that could be made.
- Known Unknowns (KU): Where we know there are probable facts we do not yet know.
- Unknown Unknowns (UU): Neither facts nor insights are yet known.
Just as today's oil drilling technology is only able to retrieve around 30–50% of the oil in an existing reserve, processing the same unstructured data using different modern techniques derives more analytical value from that same data (with crude oil, different products are derived by applying different temperatures). Sure, we could say we already do that when processing structured and semi-structured data from a data lake into information marts, but those are already defined in the form of data-typed columns and key-value pairs respectively. Unstructured data has nothing defined, and although there is an original intent in processing that data, like a cold case reopened, unstructured data can be processed under a new lens when new information is sought.
This article discusses the techniques you could inject into your data pipelines and the architectural considerations necessary for processing unstructured data in Snowflake.
Episode 25: Processing Unstructured Data into Data Vault
Data vault is a discipline and methodology to store and process structured and semi-structured data. Even in the context of this article that will not change; what differs here is the data source and what we do to process unstructured data into a data vault. Yes, a data vault is your auditable data source and is used as a base for information delivery; to do that repeatedly you need repeatable data patterns. Therefore, unstructured data must be processed before the data vault, in the pre-staging layer ~ i.e. this is where we apply the techniques that derive value from the unstructured content. This might be simpler than you imagine, because all businesses need to know three things about their data:
- What are the business objects and how do we identify them? Stored as hub tables containing the unique list of business keys. You extend these with your business ontology describing each entity type.
- What other business entities are they related to? Stored as link tables containing the unique list of relationships, interactions, events and transactions between business entities.
- What is the current and historical state of those entities and relationships? Stored as satellite tables tracking true changed state of business entities and relationships.
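As a minimal sketch of those three table types (the table and column names here are illustrative, not a prescribed model), the structures could look like this:

```sql
-- Illustrative Data Vault structures for a customer-and-account example
CREATE TABLE hub_customer (
    dv_hashkey_customer  BINARY(20),     -- surrogate hash of the business key
    customer_id          STRING,         -- the business key itself
    dv_loaddate          TIMESTAMP_NTZ,
    dv_recsource         STRING
);

CREATE TABLE lnk_customer_account (
    dv_hashkey_customer_account BINARY(20),  -- hash of the combined business keys
    dv_hashkey_customer         BINARY(20),
    dv_hashkey_account          BINARY(20),
    dv_loaddate                 TIMESTAMP_NTZ,
    dv_recsource                STRING
);

CREATE TABLE sat_customer_details (
    dv_hashkey_customer  BINARY(20),
    dv_loaddate          TIMESTAMP_NTZ,
    dv_hashdiff          BINARY(20),         -- change detection over the descriptive columns
    customer_name        STRING,
    customer_segment     STRING,
    dv_recsource         STRING
);
```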
With these three requirements for enterprise data, we know what we want from unstructured data.
- A unique identifier for the business object or objects we care about.
- What other business objects are these business objects interacting with? The unstructured file itself may be the result of an event or the recording of a transaction; these things involve more than one business object we care about.
- The descriptive content we can gather about those business objects, or descriptive content about the relationship itself, and nothing else.
Now that we know what we want, how do we go about getting it?
Scraping the barrel
Not all data is of equal value, but the data that is valuable is categorised and modelled into the appropriate table structures describing business objects and their interactions. To achieve the same outcome from unstructured data we must use reliable technology to scrape what we want from documents in a repeatable template; for that we will use Snowflake's Document AI. Document AI is an industry pattern that uses Natural Language Processing (NLP) to process documents (unstructured data) reliably and turn those outcomes into structured table columns. Snowflake's Document AI uses Snowflake's own proprietary model called Arctic TILT, an industry-leading TILT model for document processing using NLP.
To use the tool successfully, we follow the same machine learning pipeline we have discussed before (a sketch of the inferencing call follows these steps):
- Train the model on a set of sample documents; Snowflake recommends using fifty or more sample documents. Verify the outcomes and correct false positives where needed. Here you will define values (as the Document AI interface describes them) that will be turned into structured columns.
- Deploy the trained model to execute its inferencing on the rest of your document pile.
- Monitor the outcomes to track for model drift.
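Once a model build has been trained and published, the deployment step is a SQL call. A minimal sketch, assuming a hypothetical model build named doc_ai_invoices (version 1) and a stage named doc_stage:

```sql
-- Run the published Document AI model over one staged file; the result is a JSON
-- payload containing each defined value together with its confidence score.
SELECT doc_ai_invoices!PREDICT(
           GET_PRESIGNED_URL(@doc_stage, 'invoices/inv_00042.pdf'),
           1   -- model build version
       ) AS predicted_values;
```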
Because we are using AI to process unstructured data, the framework we will discuss relies on having existing business objects and their respective business keys already loaded to the data vault. In other words, should this pipeline identify business objects that have not been previously identified, these should be diverted to an exception table rather than being loaded directly into the auditable data vault. The exception table must then be analysed by subject matter experts to determine the following:
- If the business object was correctly identified (true positive), the records are allowed to progress further into your data vault.
- If the business object was incorrectly identified (false positive) then we need to further train the model. AI training is a continuous exercise, necessary when model drift is detected.
Yes, although AI is used with a measure of reliance, we recognise that it can make mistakes; we therefore avoid having to recover from a corrupted data vault by only processing data on business objects we already know about.
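A hedged sketch of that diversion, assuming hypothetical staging, exception and hub tables: extracted business keys are only allowed through when they already exist in the hub, and everything else is held for review.

```sql
-- Divert predictions whose business key is unknown to the data vault (illustrative names)
INSERT INTO stg_docai_exceptions
SELECT s.*
FROM   stg_docai_predictions s
LEFT JOIN hub_customer h
       ON h.customer_id = s.customer_id
WHERE  h.customer_id IS NULL;   -- the model surfaced a key the hub has never seen

-- Only records with known business keys progress to the data vault loaders
SELECT s.*
FROM   stg_docai_predictions s
JOIN   hub_customer h
       ON h.customer_id = s.customer_id;
```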
Refining Pipeline
The pipeline for processing unstructured data in Snowflake follows a familiar route (a consolidated SQL sketch follows these steps):
1. Create a named internal or external Snowflake stage to store documents and define a directory table upon stage creation. Files loaded into this stage are recorded in the directory table.
2. Train a Document AI model based on training data and give it a name.
3. Create a transient staging table to store prediction outputs; the output will include the identified Document AI values recorded in a JSON column and file details, such as the file name, recorded in additional columns.
4. Create a stream object on the directory table to detect when new documents are loaded to the stage.
5. Create a Snowflake task to process new files as they arrive based on the stream; the task will execute the PREDICT function to return Document AI values as a JSON payload and use pre-signed URLs to access the files for prediction.
6. Create a view over the staging table with the standard data vault metadata columns:
a. Ensure the file name is defined as the data vault record source column (DV_RECSOURCE).
b. Ensure the file timestamp is recorded as the data vault applied date timestamp (DV_APPLIEDDATE).
c. Load date is recorded as current_timestamp() (as always).
7. Create a stream object on the view for each target data vault object you will be loading:
a. For each hub table (min. 1x stream).
b. If more than one stream is required for the business objects extracted from a document, this implies the need to record a relationship and therefore a link table. Some documents might require the application of Zero-Keys. (min. 0x stream)
c. Descriptive details (non-business keys) are loaded to satellite tables; if personally identifiable information (PII) is recorded, it must be loaded to its own PII satellite table. (min. 0x stream) The record source will serve as a reference to the document in which the PII is recorded, and should the need arise to de-identify a customer, you will know which document contains that information.
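Putting the steps above together, here is a consolidated sketch. All object names (doc_stage, doc_stream, stg_docai_predictions, doc_ai_invoices, v_stg_docai and the extracted value invoice_number) are hypothetical, and the Document AI model build is assumed to have been trained and published beforehand:

```sql
-- 1. Internal stage with a directory table recording the documents loaded to it
CREATE OR REPLACE STAGE doc_stage
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- 2. (the Document AI model build, here called doc_ai_invoices, is assumed to have
--    been trained and published in Snowsight)

-- 3. Transient staging table for the prediction output
CREATE OR REPLACE TRANSIENT TABLE stg_docai_predictions (
  file_name          STRING,
  file_last_modified TIMESTAMP_NTZ,
  predicted          VARIANT        -- Document AI values returned as a JSON payload
);

-- 4. Stream on the directory table to detect newly arrived documents
CREATE OR REPLACE STREAM doc_stream ON STAGE doc_stage;
-- Depending on how files arrive, ALTER STAGE doc_stage REFRESH may be needed so the
-- directory table (and therefore the stream) sees new documents.

-- 5. Task that runs the PREDICT function over each new file via a pre-signed URL
CREATE OR REPLACE TASK t_docai_predict
  WAREHOUSE = docai_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('doc_stream')
AS
INSERT INTO stg_docai_predictions (file_name, file_last_modified, predicted)
SELECT relative_path,
       last_modified,
       doc_ai_invoices!PREDICT(GET_PRESIGNED_URL(@doc_stage, relative_path), 1)
FROM   doc_stream
WHERE  metadata$action = 'INSERT';

ALTER TASK t_docai_predict RESUME;

-- 6. View adding the standard data vault metadata columns
CREATE OR REPLACE VIEW v_stg_docai AS
SELECT predicted,
       predicted:invoice_number[0]:value::STRING AS invoice_number,  -- example extracted value
       file_name           AS dv_recsource,    -- record source = the document itself
       file_last_modified  AS dv_applieddate,  -- applied date  = the file timestamp
       CURRENT_TIMESTAMP() AS dv_loaddate
FROM   stg_docai_predictions;

-- 7. One stream per target data vault object (change tracking is required on the
--    staging table for streams on the view)
CREATE OR REPLACE STREAM str_hub_customer ON VIEW v_stg_docai;
CREATE OR REPLACE STREAM str_sat_invoice  ON VIEW v_stg_docai;
```

The two streams on the view stand in for the hub and satellite loaders; a link-table stream would follow the same pattern whenever more than one business object is extracted from the same document.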
Although we have now processed what we need from the set of documents, the same set of documents might later be used for extracting other business objects, relationships and descriptive attributes. Yes, that's the difference with unstructured data: we apply bias-on-read, even though the documents themselves were created with a bias in mind.
Does that cover 80% of the world’s data?
Documents still present some structure even though we classify them as unstructured; the structure is the bias (the intent of the document itself). For multi-modal analysis the intent of extracting data is the same:
- who or what is being identified;
- what business process, relationship, event or transaction is taking place between business objects, and;
- descriptive state information about the above
Today we showed how we can process documents reliably and repeatedly using Snowflake; tomorrow we will look at video, audio and images, the other modes of unstructured data with an enormous variance in bias! You should factor in data and model governance to ensure that the wrong bias is not extracted from unstructured data. One way to do that is to ensure that your data warehouse is content-rich (from other data sources) around a business key, so that the facts derived from unstructured data fit the narrative already present in your data vault.
We will leave it there; until next time!
References
- Querying Metadata for Staged Files — https://docs.snowflake.com/en/user-guide/querying-metadata
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.