Build a RAG-Based AI Application on SEC Filing Data Using Snowflake Cortex and Streamlit

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

5 min readJan 31, 2024

Snowflake Cortex makes building a Retrieval-Augmented Generation (RAG) application in Snowflake simple. In this article, we’ll explore how to use Snowflake for the main components of a RAG application: data chunking, vector embeddings, vector similarity, and using an LLM to respond to the user. Additionally, we’ll use Streamlit in Snowflake for a friendly user interface.

What we won’t be doing is setting up a additional infrastructure, a separate vector database, or any external API calls. Snowflake Cortex makes it easy to bring Gen-AI work to the data instead of moving your data outside of your Snowflake security and governance perimeter. All the code to run these examples is available on GitHub.

Dataset

We will be using the Cybersyn LLM Training Essentials dataset from the Snowflake Marketplace. This dataset includes text-based datasets, one of which is corporate SEC filings.

For the sake of simplicity, we will get the most recent 10-K data for each company in the dataset. We’ll create a table that includes the company name, Cybersyn’s document ID, some information about the company that we can use for filtering and exploration, and the full text of the 10-K filing.

create or replace table sec_reports_base as 
SELECT
    txt.sec_document_id,
    companies.cik,
    txt.variable_name as document_type,
    companies.company_name,
    companies.sic_code_category,
    companies.sic_code_description,
    companies.country,
    txt.period_end_date,
    txt.value
FROM 
    llm_training_essentials.cybersyn.sec_report_text_attributes AS txt
    JOIN llm_training_essentials.cybersyn.sec_cik_index AS companies ON (companies.cik = txt.cik)
WHERE 
    txt.period_end_date >= '2020-01-01'
    and document_type = '10-K Filing Text'
qualify row_number() over (partition by companies.cik order by period_end_date desc) = 1

Chunking the Data

Most of these 10-K filings are too large to feed into an LLM prompt, so our next step is to chunk the data into smaller pieces. For this, I used a Snowflake Table Function that Tom Christian outlined in his Medium Article called Build End-to-End RAG Applications with Snowflake Cortex. The code for this UDTF is available in the GitHub repository for this project.

For the sake of not taking the contents of chunk out of context, the UDTF is set up so that each chunk has about 20% overlap with the prior chunk in the sequence. I encourage you to try different chunk lengths, overlaps, and separators depending on your use case.

Vector Embeddings

Vector embeddings are a numerical representation of the semantic meaning of data. We can use the Cortex embed_text function to create embeddings for your content chunks and store it those embeddings alongside the rest of our data in a column with the VECTOR data type. We’ll also store some additional metadata about the document and the location of the chunk from the document.

create or replace table content_chunks_10k as 
select 
    sec_document_id,
    document_type,
    company_name,
    sic_code_category,
    sic_code_description,
    country,
    period_end_date,
    chunk as content_chunk,
    snowflake.cortex.embed_text('e5-base-v2', content_chunk) embedding,
    start_index,
    row_number() over (partition by sec_document_id order by sec_document_id, start_index) as document_index_rownum,
    row_number() over (order by sec_document_id, start_index) as rownum
from 
    sec_reports_base,
    table(chunk_text(value))
where 
    length(content_chunk) <= 1000;

Content Retrieval

Now that we have our data chunked and embedded, we can start asking questions of the data. To do this, we need to get the embedding for our question, then use a vector distance function to find the content chunks that are most semantically similar to our question. In this example, we’re using cosine distance, but Cortex also has functions for Euclidean distance and inner product.

set question = 'How many customers does the company have?';

-- use vector search to find relevant content chunks
select 
    content_chunk
from 
    content_chunks_10k 
where 
    company_name = 'SNOWFLAKE INC.'
order by 
    vector_cosine_distance(embedding, 
                           snowflake.cortex.embed_text('e5-base-v2', $question)) desc
limit 10;

Running this query gets us ten content chunks that are related to our question. Because we’ve sorted by the vector distance, the top should be the most relevant to our question and each subsequent result should be less relevant than the previous one.

This looks pretty good. But we can create a more user-friendly result by feeding these results into an LLM.

Complete Retrieval-Augmented Generation Workflow

Now that our data has embeddings and we’ve experimented with the Cortex vector distance functions, we can use the Cortex SUMMARIZE function to take our question and the most relevant content from our content chunks to provide a concise, user-friendly response. We can also bring the original chunk from the document to verify the context that informed the LLM’s response.

-- get content chunk by using vector distance function
with context as (
    select 
        content_chunk
    from 
        content_chunks_10k 
    where 
        company_name = 'SNOWFLAKE INC.'
    order by 
        vector_cosine_distance(embedding, 
                               snowflake.cortex.embed_text('e5-base-v2', 
                                                           $question)
                               ) desc
    limit 1
) -- provide content to cortex complete function
select 
    snowflake.cortex.complete('llama2-70b-chat',
                              'Use the provided context to answer the question. Be concise. ' ||
                              '###
                               CONTEXT: ' ||
                              context.content_chunk ||
                              '###
                              QUESTION: ' || $question ||
                              'ANSWER: ') as response,
    context.content_chunk
from 
    context;

Query Result

Streamlit Application

In order for an organization to fully benefit from Gen-AI, it needs to be available to everyone in the organization, not just a select few with AI expertise. Cortex functions allow anyone with basic SQL or Python skillsets to easily leverage Gen-AI within Snowflake.

Streamlit allows us to take that a step further and allow data scientists and analysts to allow anyone in your organization to leverage your unstructured data and advanced AI capabilities.

Now we have an application where a user can select a company and ask questions from the 10-K filings rather than skimming until they find the relevant information.

Extra Fun

By changing our prompt, we can have the LLM provide responses that are just as correct but perhaps a bit more fun. Here are a few examples.

Prompt: Use only the context provided to answer the question. Respond like Ted Lasso.

Prompt: Use only the context provided to answer the question. Respond with a limerick.