Data Vault in Snowflake: Amp it up on Data Vault knowledge

Patrick Cuba
The Modern Scientist
9 min read · Sep 10, 2024

According to Gartner’s Hype Cycle for Artificial Intelligence 2024, Generative AI (GenAI) sits at the “Peak of Inflated Expectations”, which Gartner describes as “an innovation that becomes the subject of over-enthusiasm and unrealistic projections.”

ChatGPT (Generative Pre-trained Transformer), the chatbot and virtual assistant developed by OpenAI, has recently caught the public’s attention; the business value lies in the corporate application of Large Language Models (LLMs) and the perceived return on investment. LLMs are expensive to train, requiring GPUs running 24/7 for weeks at a time over a large corpus of data. GPUs deployed at a Cloud Service Provider (CSP) are power hungry and must support redundancy similar to other Cloud infrastructure, which means adequate GPUs should be available within a region, or you need to build in best practices to ensure you can continue training the model should a fault occur.

Once trained, an LLM responds to prompts using the context it was trained on, and thus its value declines as the LLM ages. To keep the LLM up to date and usable for your business context, your enterprise will need to either:

  • Fine-tune the LLM with more recent data
  • Augment the LLM with chunks of data to match to using a Retrieval Augmented Generation (RAG) framework
  • Engineer prompts so that the LLM responds with meaningful context; in turn, the LLM caches what it learns from your prompts.

In this article we will discuss the Snowflake architecture components you can use today to deploy a naïve RAG architecture that learns from and answers prompts based on “The Data Vault Guru” (yes, my book).

Episode 23: Amping up on Data Vault knowledge

“Data Vault is a System of Business Intelligence containing the necessary components needed to accomplish enterprise vision in Data Warehousing and Information Delivery.” — Dan Linstedt, Data Vault inventor

There’s a lot to unpack in the above quote; in my interpretation, these are the goals of a data vault:

  • Business Architecture — a Data Vault meets enterprise vision by mapping to what Business Architecture defines as Business Objects; Information Mapping (Business Glossary) realises a Business Object as data (Corporate Memory) with the purpose of establishing the Single Version of the Truth (SVOT)
  • Passive integration — a Data Vault is a top-down model; however, applications built or bought to run a business are merely automation engines for business processes and rules, and the same business entities exist in the source application with a unique business key. Data vault passively integrates business entities by business key into hub tables; the gaps in the source (business processes as the business sees them versus how the vendor provides them) are typically solved at source, in pre-staging or in business vault (in that order, shift left).
  • Data mapping — source data models ‘map’ to hubs, links and satellites, conforming the enterprise software landscape’s data into repeatable three-table structures to record the corporate memory.
  • Equi-join — the data vault is simple; you should endeavour to ensure consuming data from a data vault can be done using EQUI-JOINs, utilising the OLAP capabilities of the platform (Snowflake) — see the sketch after this list.
  • Repeatable patterns — with a limited set of tables comes a limited set of parameterised templates to load, test and query a data vault loaded in parallel.
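To make the equi-join point concrete, below is a minimal Snowpark sketch run from a Snowflake Notebook. The hub and satellite names and columns (HUB_CUSTOMER, SAT_CUSTOMER_DETAILS, HUB_CUSTOMER_HK and so on) are hypothetical, and point-in-time logic is omitted for brevity.

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()  # the session provided by the Notebook / SiS runtime

# Hypothetical hub and satellite; the only join condition is an equality on the
# hub's hash key, i.e. an equi-join.
hub = session.table("HUB_CUSTOMER")
sat = session.table("SAT_CUSTOMER_DETAILS")

customers = (
    hub.join(sat, hub["HUB_CUSTOMER_HK"] == sat["HUB_CUSTOMER_HK"])
       .select(hub["CUSTOMER_ID"], sat["CUSTOMER_NAME"], sat["LOAD_DATE"])
)
customers.show()
```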

Bill Inmon (father of the Data Warehouse) updated his definition of what a Data Warehouse is when he was sold on the concept of the Data Vault: “A data warehouse is a subject-oriented, integrated (by business key), time-variant and non-volatile collection of data in support of management’s decision-making process, and/or in support of auditability as a system-of-record.”

You can purchase the book and spend the time to understand the data vault concepts above, but to summarise the content efficiently and retrieve the answers you need, we could also vectorize it. Let’s describe the components in Snowflake we will use to read my book!

  • Vector Data Type — a new data type available in Snowflake FDN tables to store embeddings.
  • Snowflake Notebooks — Snowflake’s implementation of the popular notebooks interface for interactive and context rich programming environments built into Snowsight.
  • Streamlit in Snowflake (SiS) — also built into Snowsight (and usable from Snowflake Notebooks), a component-rich interface to rapidly build interactive apps.
  • Snowflake Cortex built-in LLM functions — instant access to LLM models trained by Mistral AI, Reka, Meta, Google and Snowflake’s own Arctic model.
  • Python functions — use Snowpark Python libraries within your user-defined (table) functions to work with Python dataframes.
  • Directory tables — store unstructured file metadata for data you load into a Snowflake internal or external stage.
  • Streams — an offset deployed on a Snowflake-accessible object to track which of the object’s changes have been consumed.

The proposed architecture…

Naive RAG (one of several RAG architectures).

Let’s put the content to good use and build our application to read my book!

Step-by-Step: New Context

All code is available in the reference section of this article.

1. Create Stage with a Directory table

Snowflake stages (internal or external) are simply blob storage areas for structured (CSV), semi-structured (JSON) or unstructured (PDFs, images) data. Creating a directory ‘table’ on a stage registers the relative_path and file_url of unstructured files in the stage area. Optionally, you could define a stream on the directory table, making the upload code repeatable so it detects and processes new PDFs as they arrive in the Snowflake stage.
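A minimal Snowpark sketch of this step is below; the stage, stream and file names (docs_stage, docs_stream, data_vault_guru.pdf) are illustrative.

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Internal stage with a directory table so unstructured files are catalogued;
# server-side encryption is recommended for unstructured data access functions.
session.sql("""
    CREATE STAGE IF NOT EXISTS docs_stage
      DIRECTORY = (ENABLE = TRUE)
      ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
""").collect()

# Optional: a stream on the stage's directory table to pick up newly uploaded PDFs
session.sql("CREATE STREAM IF NOT EXISTS docs_stream ON STAGE docs_stage").collect()

# Upload the book and refresh the directory table metadata
session.file.put("data_vault_guru.pdf", "@docs_stage", auto_compress=False)
session.sql("ALTER STAGE docs_stage REFRESH").collect()

# relative_path and file_url are now queryable
session.sql("SELECT relative_path, file_url FROM DIRECTORY(@docs_stage)").show()
```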

2. Define PDF Chunker

The Data Vault Guru book is over 650 pages long with nearly 170,000 words; we define the PDF chunk size and overlap to adjust for LLM response precision and relevance. Small chunks improve precision but may require more indexing and thus increase computational overhead; they may also lack relevant context for a query, which can be mitigated by adjusting the chunk overlap.

  • Smaller chunks with higher overlap are better for answering specific questions.
  • Larger chunks (200–300 words) are better for document retrieval

We will use the LangChain library, which is available in Snowpark; a sketch of the chunker follows.
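Here is a sketch of the chunker, assuming the handler is registered as a Snowpark Python UDTF named PDF_CHUNKER that returns one row per chunk; the chunk size and overlap values are illustrative and should be tuned as described above.

```python
from io import BytesIO

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.files import SnowflakeFile
from snowflake.snowpark.types import StringType, StructField, StructType

session = get_active_session()

class PdfChunker:
    def process(self, file_url: str):
        # Read the staged PDF via its scoped URL and extract plain text
        with SnowflakeFile.open(file_url, "rb") as f:
            reader = PdfReader(BytesIO(f.read()))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)

        # Split into overlapping chunks; tune for precision versus context
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,    # characters per chunk
            chunk_overlap=200,  # shared context between neighbouring chunks
            length_function=len,
        )
        for chunk in splitter.split_text(text):
            yield (chunk,)

# Register the handler as a table function callable from SQL
session.udtf.register(
    PdfChunker,
    name="PDF_CHUNKER",
    output_schema=StructType([StructField("chunk", StringType())]),
    input_types=[StringType()],
    packages=["snowflake-snowpark-python", "PyPDF2", "langchain"],
    replace=True,
)
```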

3. Define the Vector store

LLMs do not understand free text directly; instead, text is represented by embedding tokens (words, parts of words or paragraphs of words) into a multi-dimensional array called a vector. Snowflake supports storing vectors in a vector data type, and embeddings are the vectors the LLM model uses to ‘understand’ and process the text. At the time of writing, Snowflake supports two embedding functions with three specific embedding models.
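A minimal vector store is just a table with a VECTOR column; the table name and 768-dimension choice below are illustrative and must match the embedding function you pick.

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# One row per chunk; the VECTOR dimension must match the embedding model used
session.sql("""
    CREATE TABLE IF NOT EXISTS book_chunks (
        relative_path STRING,
        chunk         STRING,
        chunk_vec     VECTOR(FLOAT, 768)
    )
""").collect()
```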

4. Upload and Parse the PDF into the Vector Store as chunks

This step populates our vector store by combining the above components into a neat Snowflake table with a vector data type. PDF chunks are produced by the PDF chunker table function in combination with Snowflake’s BUILD_SCOPED_FILE_URL function to retrieve the unstructured content.
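Continuing the sketch (same illustrative names: docs_stage, book_chunks, PDF_CHUNKER), the split and the embedding can be done in a single INSERT; EMBED_TEXT_768 and the snowflake-arctic-embed-m model were available at the time of writing, so check your region’s Cortex availability.

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Chunk every staged PDF and embed each chunk in one statement
session.sql("""
    INSERT INTO book_chunks (relative_path, chunk, chunk_vec)
    SELECT
        d.relative_path,
        c.chunk,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', c.chunk)
    FROM DIRECTORY(@docs_stage) d,
         TABLE(PDF_CHUNKER(BUILD_SCOPED_FILE_URL(@docs_stage, d.relative_path))) c
""").collect()
```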

You will now see the chunked plain text adjacent to its tokenised representation.

Chunking and Embedding side-by-side

Step-by-Step: Streamlit and Notebooks

1. Setup Streamlit in Snowflake

Snowflake includes out-of-the-box LLM models, subject to Snowflake region availability. Within this Streamlit app we will include a prompt template with specific instructions for whichever LLM model you select to generate a response.

The template instructs the LLM to answer your query in a specific way no matter what prompt you pass to the model (you can include additional instructions or change the template). To augment the prompt with relevant chunks retrieved from the vector store we just loaded, we embed the prompt with the same embedding function we used earlier and compare the embedded prompt against the embedded chunks using Snowflake’s Cortex Vector Cosine Similarity function. The most relevant chunks for the incoming prompt are then included in a call to Snowflake’s Cortex Complete function to generate a response. You will also select an LLM model supported in your Snowflake account’s region.
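A condensed sketch of that retrieval-and-generation step is below; the table and model names carry over from the earlier sketches, and the prompt template is deliberately simple.

```python
import streamlit as st
from snowflake.snowpark.context import get_active_session

session = get_active_session()

model = st.selectbox("LLM model", ["mistral-large", "llama3-8b", "snowflake-arctic"])
question = st.text_input("Ask the Data Vault Guru")

if question:
    # Embed the question and fetch the most similar chunks from the vector store
    rows = session.sql("""
        SELECT chunk
        FROM book_chunks
        ORDER BY VECTOR_COSINE_SIMILARITY(
                   chunk_vec,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ?)) DESC
        LIMIT 5
    """, params=[question]).collect()
    context = "\n".join(r["CHUNK"] for r in rows)

    # Augment the prompt with the retrieved context and ask the chosen model
    prompt = (
        "Answer the question using only the context between the <context> tags.\n"
        f"<context>{context}</context>\nQuestion: {question}"
    )
    answer = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS response",
        params=[model, prompt],
    ).collect()[0]["RESPONSE"]
    st.write(answer)
```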

2. Integrate to Snowflake Notebooks

Snowflake Notebooks are a great way to provide a context-rich programming experience for developers and data scientists. You can include markdown, documentation and multi-language support within a single interface. Additionally, Snowflake Notebooks support Git integration, and you can even run your Snowflake notebooks on a schedule.

Integrated Streamlit into a Snowflake Notebook

Let’s Prompt!

Now that we are all set up, let’s get to work!

Just some benefits of a Data Vault, including hallucinations!

With an updated chatbot we’re able to retrieve the content we need quickly and efficiently, but of course beware that an LLM is always vulnerable to hallucinations; check the output as it is returned from the LLM model.

Other ways to leverage LLMs with Snowflake

Although we have built a simple naïve RAG architecture using completely native Snowflake features, you are not limited to what Snowflake supports within its security perimeter when building your LLM-based solutions. Snowflake also has the features you can use to connect to other LLM providers securely.

Integrate with OpenAI

To reach out to OpenAI’s APIs (or another LLM provider) and leverage their LLM capabilities, these are the features you will need to get familiar with (a sketch follows the list):

  • Secrets — securely store credentials for API authentication and OAuth, including how they are used.
  • Network Rules — to control allowed IP addresses for ingress and egress rules
  • External Access Integration — creating secure access to specific network locations external to Snowflake
  • API Integration — alternative to external access, this integration restricts API access to an API gateway
  • As an External Function — calling code resident on the Cloud Service Provider, like AWS Lambda.
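A hedged sketch of how these pieces fit together is below, using the external access integration route: a network rule, a secret holding the API key, the integration itself, and a Python UDF that calls OpenAI’s chat completions endpoint. All object names and the model are illustrative; supply your own key and adjust to your provider.

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

for ddl in [
    # Allow egress to the provider's API host only
    """CREATE OR REPLACE NETWORK RULE openai_egress
         MODE = EGRESS TYPE = HOST_PORT VALUE_LIST = ('api.openai.com')""",
    # Store the API key as a secret rather than in code
    """CREATE OR REPLACE SECRET openai_key
         TYPE = GENERIC_STRING SECRET_STRING = '<your-openai-api-key>'""",
    # Tie the rule and secret together in an external access integration
    """CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION openai_access
         ALLOWED_NETWORK_RULES = (openai_egress)
         ALLOWED_AUTHENTICATION_SECRETS = (openai_key)
         ENABLED = TRUE""",
    # A Python UDF that reads the secret and calls the chat completions API
    """CREATE OR REPLACE FUNCTION ask_openai(prompt STRING)
         RETURNS STRING LANGUAGE PYTHON RUNTIME_VERSION = '3.10'
         HANDLER = 'ask' PACKAGES = ('requests')
         EXTERNAL_ACCESS_INTEGRATIONS = (openai_access)
         SECRETS = ('api_key' = openai_key)
       AS
       $$
import _snowflake, requests

def ask(prompt):
    headers = {"Authorization": "Bearer " + _snowflake.get_generic_secret_string("api_key")}
    body = {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": prompt}]}
    r = requests.post("https://api.openai.com/v1/chat/completions",
                      headers=headers, json=body, timeout=60)
    return r.json()["choices"][0]["message"]["content"]
$$""",
]:
    session.sql(ddl).collect()

# Usage: session.sql("SELECT ask_openai('What is a data vault hub?')").show()
```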

For an example of this extension see the references section below.

Other Snowflake Data Science features

Snowflake’s LLM and ML features under the Cortex banner are flourishing, from offerings as a service to functions you can leverage within your pipelines. At the time of writing, the ML functions are:

  • Forecast for time-series data with a minimum of 12 data points
  • Anomaly detection of outliers within a time-series
  • Contribution explorer for streamlining and improving root cause analysis of changes in observed metrics.
  • Classification (not the same as auto-classification) is used to sort data into different classes using patterns detected in training data.

Over and above the LLM functions we have already mentioned, Snowflake also features the following (a short sampler follows this list):

  • Summarize — summarise a given text
  • Sentiment — score the sentiment of an input text
  • Extract_Answer — extract the answer to a given question from a given text
  • Translate — translate a given text into another supported language
  • Fine-tuning — customise LLMs for a specific task
  • Cortex Search — a service to search over your data using LLMs
  • Snowflake Copilot (text to SQL) — a Snowflake product that assists Snowflake users with programming and other features.
  • Snowflake Cortex Analyst (text to answer for business users) — a Snowflake product to reliably answer business questions based on structured data stored in Snowflake, using YAML.
  • Document AI — uses the proprietary Arctic-TILT (Text Image Layout Transformer) LLM model to consistently extract data from documents.
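As a quick sampler of a few of these functions (signatures as documented at the time of writing; availability varies by region), with illustrative input text:

```python
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Each call runs server-side against the Cortex LLM functions named above
session.sql("""
    SELECT
        SNOWFLAKE.CORTEX.SUMMARIZE('Data Vault is a system of business intelligence built from hubs, links and satellites loaded in parallel.') AS summary,
        SNOWFLAKE.CORTEX.SENTIMENT('I love repeatable loading patterns') AS sentiment,
        SNOWFLAKE.CORTEX.EXTRACT_ANSWER('A data vault uses hubs, links and satellites', 'What table types does a data vault use?') AS answer,
        SNOWFLAKE.CORTEX.TRANSLATE('Ein Data Vault besteht aus Hubs, Links und Satelliten', 'de', 'en') AS translation
""").show()
```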

DIKW Pyramid

Conclusions — Humans in the Loop

Setting up an LLM in Snowflake is easy peasy, even I can do it! LLMs continue their surge in demand, and they come with the same caveats as all other current artificial intelligence and machine learning data products: they need humans in the loop. The same needs your other enterprise data products have in data quality, observability, modelling and governance (while protecting sensitive content) apply to your Generative AI initiative. LLMs are a great tool to free up your and your employees’ time spent on time-consuming tasks like running brute-force searches online; however, most outputs should be checked for hallucinations by including citations. They can also be used in customer-facing applications, but again should be constantly measured for hallucinations.

Are LLMs ready for high-stakes AI applications? I’d say we are not there yet!

Here’s an interesting example when asking ChatGPT this simple prompt.

That’s not correct!

Until next time!

References

The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.
