Build End-to-End RAG Applications with Snowflake Cortex

Tom Christian
12 min read · Jan 4, 2024


Can you perform end-to-end RAG solely within a single data platform? Not only is it possible, but the Cortex platform makes the process remarkably straightforward.

Announced during Snowflake’s November Snowday, Cortex is a new, intelligent, fully managed service for empowering analytics and rapidly building GenerativeAI applications. To put that vision to the test, I’ve created this blog to demonstrate how features within Cortex enable the holy grail of simple Retrieval Augmented Generation, all within the comfort of a Snowflake Notebook. Below I’ll cover some of the depths of this topic; if you’d rather skip straight to the raw code, you can find the notebook on GitHub here.

What is RAG – and why is it important?

Large Language Models [LLMs], and their power and capability, need no introduction. They are general-purpose tools trained on an enormous corpus (e.g. the internet), meaning they can quickly adapt to an increasingly wide variety of tasks… but there’s a catch.

The fatal flaw for most models is that they are not — you pray, as you look over at your InfoSec team — trained on your own highly valuable data. The majority of the internet up until a point in time, yes. The knowledge that would allow you to build a custom, accurate application built on your business data? (Not unless you happen to be the New York Times).

Without that knowledge, LLMs are liable to fail. Or worse, prone to hallucinations — leading to incorrect answers and a loss of trust.

Retrieval Augmented Generation [RAG] gives you a framework to address that gap, and much more. Simply put, it’s a method of providing a model with relevant, up-to-date data from a repository alongside the original question, without the need to build, train, tune, operationalise, and monitor a model of your own. RAG, along with similar techniques, is a cost-efficient means to increase the accuracy and trust of your GenerativeAI applications.

Traditional RAG spread across multiple systems (src: here)

But, how does it work? In a traditional sense, RAG requires multiple services running across a disparate architecture following this flow:

  • A process for ingesting your data (text, audio, PDF, etc).
  • That data then needs to be prepared, or chunked, into smaller contextually rich blocks.
  • Converting those chunks into embeddings and storing them in a vector database.
  • Initialising, hosting, and running a model — or using a model service such as GPT.
  • An application for taking the user’s prompt, searching the vector store, retrieving the most relevant chunk, and passing that alongside the prompt to the model.

These steps are typically isolated and rarely colocated, forcing data to move, infrastructure to be managed, and the entire process to be governed. You once again glance at your InfoSec colleagues in fear. They shake their heads, disapprovingly.

But fear no longer dearest reader — the rest of this blog is dedicated to demonstrating how you can use a powerful framework like RAG, without leaving the Snowflake platform. All thanks to Cortex, Snowpark, and Notebooks.

Empower your Llama — Snowflake makes it simple.

Cortex is king

For the uninitiated, what gets me excited about Cortex is its ability to offer the power of leading open Large Language Models directly within your data platform, all in an incredibly easy-to-use manner, broadening the reach of users who can directly take advantage of LLMs further than ever before.

As demonstrated in this announcement video, Cortex provides general-purpose functions — LLM use-case-specific functions, each with a fine-tuned model underneath — as well as vector functions for powerful search, without the need to manage or host the model or the infrastructure required to do so. All you need to do is run the function most appropriate to your use case.

To show Cortex’s power in action, I’ll be using a Snowflake Notebook to demonstrate how Cortex makes RAG incredibly easy. From start to finish.

Start with your data, obviously

Your RAG-requiring data will come in many shapes and forms. For the walkthrough, you’ll be using PDF documents, which could be made up of SEC filings, car manufacturer guides, research papers on critical topics, or even podcast transcriptions. RAG is a flexible framework that you can use regardless of the subject matter. However, if you already have raw text, feel free to skip this step.

To extract the raw contents from the documents, you’ll be using Snowpark. Snowflake allows for storing unstructured content in a stage, providing URLs (in this case a temporary, or scoped, URL) you can use to process documents at scale. Part 1 is straightforward enough: the function you’ll build uses PyPDF2 to extract all the text found within each PDF document you feed it from a stage, returning blocks of text.
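Before that, if your documents aren’t staged yet, a minimal sketch of the setup might look like this (the stage, database, and schema names are assumptions; the directory table and server-side encryption are what make the directory() listing and scoped URLs work later on):

# Hedged sketch: create an internal stage with a directory table enabled
# (stage and schema names are assumptions - adjust to your environment)
session.sql("""
    CREATE STAGE IF NOT EXISTS LLM_DEMO.RAG.DOCUMENTS
        DIRECTORY = (ENABLE = TRUE)
        ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
""").collect()

# After uploading your PDFs (e.g. via Snowsight or PUT), refresh the directory table
session.sql("ALTER STAGE LLM_DEMO.RAG.DOCUMENTS REFRESH").collect()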

#Create a Snowpark based function to extract text from PDFs
from PyPDF2 import PdfFileReader
from snowflake.snowpark.files import SnowflakeFile
from io import BytesIO

def readpdf(file_path):
    whole_text = ""
    with SnowflakeFile.open(file_path, 'rb') as file:
        f = BytesIO(file.readall())
        pdf_reader = PdfFileReader(f)
        for page in pdf_reader.pages:
            whole_text += page.extract_text()
    return whole_text

As this is a function that you’re likely to reuse, registering it as a User Defined Function is the logical next step. Note the stage location; update it to wherever you’d like this function securely stored.

#Register the UDF.
from snowflake.snowpark.types import StringType

session.udf.register(
    func = readpdf
    , return_type = StringType()
    , input_types = [StringType()]
    , is_permanent = True
    , name = 'SNOWPARK_PDF'
    , replace = True
    , packages=['snowflake-snowpark-python','pypdf2']
    , stage_location = 'LLM_DEMO.RAG.UDF'
)

In stunningly simple fashion, Snowflake Notebooks allow you to switch between Python and SQL seamlessly. For your next cell, you’ll take advantage of that to build your table full of rich extracted text. Notice how you can also pull important metadata about each file (e.g. the file name) alongside the raw text.

CREATE OR REPLACE TABLE RAW_TEXT AS
SELECT
    relative_path
    , file_url
    , snowpark_pdf(build_scoped_file_url(@stage_name, relative_path)) as raw_text
FROM directory(@stage_name);

Time to take a chunk out of this text

Chunking, for the most part, is a fun name for splitting large bodies of text into smaller ones. There are many ways to chunk, and your approach will vary depending on the text you’re using. No matter how you choose to chunk, it’s important both to stay within the token limit the model enforces on you and to ensure that whatever contextually important information is required stays together in the same chunk.

For example, if your original text contained the sentence:

Drinking coffee 20 times a day has been linked to a vastly increased alertness in humans.

Only for the next line to be:

However, that same study also found those subjects reported a resting BPM of 175 as well as the inability to sleep, at all.

You’d want to colocate that information in the same chunk, so as not to distort the original meaning or sentiment.

It’s also worth considering that chunking not only allows you to stay within the token limit of models, but also leads to increased efficiency and reduced cost. Smaller, tailored chunks allow you to prompt models with only the specific text they need to consider, rather than the document as a whole.

To chunk, you’ll be using LangChain (a popular LLM library) running and managed solely within Snowpark. The use of LangChain here neatly simplifies the step through the purpose-built RecursiveCharacterTextSplitter.

#A class for chunking text and returning a table via UDTF
import pandas as pd
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter

class text_chunker:

    def process(self, text):
        text_raw = []
        text_raw.append(text)

        text_splitter = RecursiveCharacterTextSplitter(
            separators = ["\n"], # Define an appropriate separator. New line is good typically!
            chunk_size = 1000, #Adjust this as you see fit
            chunk_overlap = 50, #This lets text have some form of overlap. Useful for keeping chunks contextual
            length_function = len,
            add_start_index = True #Optional but useful if you'd like to feed the chunk before/after
        )

        chunks = text_splitter.create_documents(text_raw)
        df = pd.DataFrame(chunks, columns=['chunks', 'meta'])

        yield from df.itertuples(index=False, name=None)

Successful chunking will require you to adjust the parameters above to match your own use case. Adjusting the separator and size will aid you greatly. In an effort to keep as much context per chunk as possible, overlap will ensure each chunk includes a portion of the previous chunk within it too. As a starting point, you can always use relatively large chunks and reduce the scope as the structure becomes clear.
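Before registering anything, it can help to try the splitter on a single document in the notebook and eyeball the output. Here’s a quick hedged sketch, reusing the RAW_TEXT table from the extraction step:

# Hedged sketch: sanity-check chunk_size / chunk_overlap on one extracted document
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = session.table("RAW_TEXT").select("RAW_TEXT").limit(1).collect()[0][0]

splitter = RecursiveCharacterTextSplitter(separators=["\n"], chunk_size=1000, chunk_overlap=50)
docs = splitter.create_documents([sample_text])

print(f"{len(docs)} chunks produced")
print(docs[0].page_content[:300])  # peek at the first chunk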

Once complete, you can register this class as a User Defined Table Function [UDTF], as it will generate multiple results per row. Keeping the metadata log of your chunks is optional, but may prove useful further on.

#Register the UDTF - set the stage location

schema = StructType([
    StructField("chunk", StringType()),
    StructField("meta", StringType()),
])

session.udtf.register(
    handler = text_chunker,
    output_schema = schema,
    input_types = [StringType()],
    is_permanent = True,
    name = 'CHUNK_TEXT',
    replace = True,
    packages = ['pandas', 'langchain'],
    stage_location = 'LLM_DEMO.PODCASTS.UDF'
)

Once more, you can switch to SQL to build your next table, full of beautifully chunked up… chunks. This example also includes the file name, allowing you to easily identify chunks per file, and to filter within an application later on.

--Create the chunked version of the table
CREATE OR REPLACE TABLE CHUNK_TEXT AS
SELECT
    relative_path,
    func.*
FROM raw_text AS raw,
    TABLE(chunk_text(raw_text)) as func;
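As a quick sanity check, you can count how many chunks each document produced before moving on. A hedged sketch:

# Hedged sketch: how many chunks did each document produce?
session.sql("""
    SELECT relative_path, COUNT(*) AS chunk_count
    FROM CHUNK_TEXT
    GROUP BY relative_path
    ORDER BY chunk_count DESC
""").show()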

Embed your data — with Cortex

Thinking back to the diagram at the start: at this point in your RAG journey you’d normally be scrambling around to convert your newly fashioned chunks into embeddings, while simultaneously firing up a new vector database to store the resultant series of meaningful numbers. Cortex drastically changes this pattern in a dopamine-releasing, easy-to-use way.

The hero here comes in the form of a ready-built function: embed_text(). All embed_text requires is your text to be converted, as well as your choice of model (currently e5-base-v2 is supported)… and that’s it. You’re done, seriously. Let’s look at the SQL you’d need to perform this historically arduous task:

--Convert your chunks to embeddings
CREATE OR REPLACE TABLE VECTOR_STORE AS
SELECT
RELATIVE_PATH as EPISODE_NAME,
CHUNK AS CHUNK,
snowflake.cortex.embed_text('e5-base-v2', chunk) as chunk_embedding
FROM CHUNK_TEXT;

One single function, the result? Embeddings stored natively inside a Snowflake table.

Am I lucid dreaming — or is it really this easy?

Search for what matters most

Embeddings are one piece of the RAG puzzle — what enables the framework to return meaningful results is being able to search across that sea of numbers for the closest matching result.

Cortex provides a set of highly useful search-based functions for this specific task. You’ll be using one of these, VECTOR_L2_DISTANCE, which identifies the most similar embedding to another given embedding through the Euclidean distance (the square root of the sum of squared differences between the two vectors, maths ❤). Given RAG as the purpose, this is exactly the tool you need: when a user asks a question such as “What makes time perceived to be slower?”, you’re able to find the closest matching chunk of text to that particular ask.
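Purely for intuition, here’s a tiny Python sketch of what that distance computes over two vectors. Cortex evaluates this natively over the stored embeddings, so you never need to run it yourself:

# Illustrative only: Euclidean (L2) distance between two embedding vectors
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Smaller distance = more similar, e.g.:
print(l2_distance([0.1, 0.9], [0.2, 0.8]))  # ~0.14, close
print(l2_distance([0.1, 0.9], [0.9, 0.1]))  # ~1.13, far apart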

Taking a look at the example SQL below, the first step is ensuring that the user’s question (or prompt) is also converted into an embedding before you attempt to search; vector distance functions compare vectors, not raw text. Once converted, your table of chunks and embeddings can be sorted by the closest matching result based on the distance, returning the raw text required.

SELECT EPISODE_NAME, CHUNK FROM LLM_DEMO.RAG.VECTOR_STORE
ORDER BY VECTOR_L2_DISTANCE(
    snowflake.cortex.embed_text('e5-base-v2',
        'What makes time perceived to be slower?'
    ), CHUNK_EMBEDDING
) LIMIT 1;

Asking that question of raw podcast transcripts returns a result from an episode on “Time Perception & Entrainment..” that appears to cover off the primary drivers of perception. Fantastic!

Given your enthusiasm for this topic, time must be flying in the background.

Yet another note on chunking

In the above example, you’re surfacing the most relevant chunk and that alone. However, it’s important to note that there are a few approaches to chunk retrieval that you may wish to consider. For example:

  1. Providing the top k chunks. Rather than returning a singular chunk, you could opt to provide a handful of chunks and have the LLM determine which piece of information is most suitable given the question. This may be useful where the chunk size is smaller, the question is vague, or the answer features multiple times across a given document. Retrieving and summarising several chunks could yield more information (see the sketch after this list). A word of caution here: providing more chunks will increase the input token cost of interacting with a model.
  2. The closest matching chunk + adjacent chunks. In the initial chunking section, we covered how overlap helps to reduce potential context loss across text. Another strategy you could consider is providing the top chunk as well as each adjacent chunk, based on the start index. Your chunks in this instance could be smaller, which would also reduce the need for storing overlap, potentially leaving cleaner text segments.
  3. Your method of search. In the example, you’re using the Euclidean distance — but you may find more accurate returns using the cosine similarity or inner product (for example). Snowflake Cortex allows you to pick between each method by altering the function appropriately; there’s an example in the sketch after this list.

Thankfully, given the simplicity of the vector functions, testing each approach is remarkably quick.
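To make options 1 and 3 concrete, here’s a hedged sketch of both tweaks against the same vector store. It assumes VECTOR_COSINE_SIMILARITY is available in your account alongside VECTOR_L2_DISTANCE; check the Cortex documentation for the vector functions currently on offer.

# Hedged sketch: option 1 - retrieve the top 3 chunks instead of a single one
session.sql("""
    SELECT EPISODE_NAME, CHUNK
    FROM LLM_DEMO.RAG.VECTOR_STORE
    ORDER BY VECTOR_L2_DISTANCE(
        snowflake.cortex.embed_text('e5-base-v2', 'What makes time perceived to be slower?'),
        CHUNK_EMBEDDING
    )
    LIMIT 3
""").show()

# Hedged sketch: option 3 - rank by cosine similarity instead (highest similarity first)
session.sql("""
    SELECT EPISODE_NAME, CHUNK
    FROM LLM_DEMO.RAG.VECTOR_STORE
    ORDER BY VECTOR_COSINE_SIMILARITY(
        snowflake.cortex.embed_text('e5-base-v2', 'What makes time perceived to be slower?'),
        CHUNK_EMBEDDING
    ) DESC
    LIMIT 1
""").show()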

RAG? “Completed it mate”.

So you have your valuable text, chunked into the perfect segments, converted into embeddings and fully searchable — so close to completion, you can almost taste it. Now just find a model, host that model as a service, run the infrastructure to manage it… or, use our new friend Cortex and skip that step entirely.

The final function is the fully flexible Cortex.Complete(). What complete() allows you to do is select your model of choice (currently Meta’s Llama 2) and provide it with any given prompt, leaving the difficult model-management process to Snowflake. All you have to do is call the model as and when you need it, based on the size and model type you need.
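Before wiring retrieval in, a minimal hedged sketch of complete() on its own looks like this (the prompt is just an illustration):

# Hedged sketch: call COMPLETE directly with a model and a prompt - no retrieval yet
response = session.sql("""
    SELECT snowflake.cortex.complete(
        'llama2-7b-chat',
        'In one sentence, what is Retrieval Augmented Generation?'
    ) AS response
""").collect()[0]['RESPONSE']

print(response)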

When it comes to RAG, you’re going to combine your user’s question as well as the top chunk within your model prompt. For example:

--Pass the chunk we need along with the prompt to get a better structured answer from the LLM
SELECT snowflake.cortex.complete(
    'llama2-7b-chat',
    CONCAT(
        'Answer the question based on the context. Be concise.',
        'Context: ',
        (
            SELECT chunk FROM LLM_DEMO.RAG.VECTOR_STORE
            ORDER BY vector_l2_distance(
                snowflake.cortex.embed_text('e5-base-v2',
                    'How should I optimise my caffeine intake?'
                ), chunk_embedding
            ) LIMIT 1
        ),
        'Question: ',
        'How should I optimise my caffeine intake?',
        'Answer: '
    )
) as response;

Using the same strategy as before, but asking a different question, you’ll see how the chunk forms the “context” section of the prompt, effectively giving the LLM the answer alongside the question. The LLM refines that result in an elegant fashion, even using a smaller model (llama2-7b-chat here).

Don’t tell me when to drink my coffee Llama!

However, if you’ve already put in the effort to fine tune a model of your own that you’d like to run within the Snowflake platform itself, take a look at Snowpark Container Services (now in Public Preview!).

Streamlit… in a Notebook… in Snowflake

Effectively complete, you let out a scream of joy — even your InfoSec team manage a collective thumbs up, maybe even a tear. But wait! There’s one final bonus portion. Snowflake Notebooks allow you to go one step further: executing Streamlit code directly in the Notebook itself. App-ception! 🤯

Think of this as a means of testing that your framework is running as expected, that your chunks are optimised, and that search returns the information you need, before porting that code over to a fully-fledged Streamlit-in-Snowflake app.

To bring this to life, enter the following code in a cell in your Snowflake notebook, allowing you to parameterise the prompt as well as the model you’d like to use.

import streamlit as st # Import python packages
from snowflake.snowpark.context import get_active_session
session = get_active_session() # Get the current credentials

st.title("Ask Your Data Anything :snowflake:")
st.write("""Built using end-to-end RAG in Snowflake with Cortex functions.""")

model = st.selectbox('Select your model:',('llama2-70b-chat','llama2-13b-chat','llama2-7b-chat'))

prompt = st.text_input("Enter prompt", placeholder="What makes time perceived to be slower?", label_visibility="collapsed")

quest_q = f'''
select snowflake.cortex.complete(
    '{model}',
    concat(
        'Answer the question based on the context. Be concise.','Context: ',
        (
            select chunk from LLM_DEMO.RAG.VECTOR_STORE
            order by vector_l2_distance(
                snowflake.cortex.embed_text('e5-base-v2',
                    '{prompt}'
                ), chunk_embedding
            ) limit 1
        ),
        'Question: ',
        '{prompt}',
        'Answer: '
    )
) as response;
'''

if prompt:
    df_query = session.sql(quest_q).to_pandas()
    st.write(df_query['RESPONSE'][0])

Insight in an app in a notebook in Snowflake

The final result — an end-to-end RAG process entirely contained within a single notebook, all running on Snowflake.

Snowflake Notebooks have the added advantage of being pretty.

Finishing Thoughts

Cortex paves the way towards making RAG a simple enjoyable process. Given its popularity as a framework, I’m excited to see these features within the Snowflake platform in the near future — as well as the customised, tailored, LLM-powered applications you’ll build as a result.

If you’d like access to the raw code without scrolling back to the top, once again you can find the notebook here. And as ever, this excitement, appalling sense of humour, British spelling, and opinion are all my own and not that of my employer.

A special thanks to Dan Hunt, Ripu Jain, Jessie Felix, Venks Mantha, and Doneyli De Jesus for reviewing.
