Building a RAG based Blog AI assistant using Streamlit, OpenAI and LlamaIndex

Retrieval Augmented Generation (RAG) is one of the most popular approaches to building LLM applications today.

Published in

Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

5 min read · Nov 6, 2023

Reference: https://www.maartengrootendorst.com/blog/improving-llms/

It is more robust than a simple prompt engineering approach, yet more cost-effective than a complex, uncertain fine-tuning process. In this article, we will explore how to build a Blog AI assistant that answers questions based on the blogs it has indexed. Refer to the quickstart for step-by-step instructions on how to build the LLM assistant.

Build a Retrieval Augmented Generation(RAG) based LLM assistant using Streamlit, OpenAI and…

This quickstart will cover the basics of Retrieval Augmented Generation (RAG) and how to build an LLM assistant using…

quickstarts.snowflake.com

If you are new to LLMs and to building applications with them, check out the first blog in the series. There, we explored how to build an LLM application in Snowflake using prompt engineering and how to evaluate the LLM app using an interactive Streamlit web app.

Key Features & Technology

Before building the LLM assistant, let us go over the key terms and what they mean.

What is a large language model (LLM)?

A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Some examples of popular LLMs are GPT-4, GPT-3, BERT, LLaMA, and LaMDA.

What is OpenAI?

OpenAI is the AI research and deployment company behind ChatGPT, GPT-4 (and its predecessors), DALL-E, and other notable offerings. Learn more about OpenAI. We use OpenAI in this guide, but you are welcome to use the large language model of your choice in its place.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like GPT-4 by adding an information retrieval system that provides the model with relevant contextual data. Through this retrieval system, we can give the LLM additional information, such as knowledge about a specific industry or a company’s proprietary data.

What is LlamaIndex?

Applications built on top of LLMs often require augmenting these models with private or domain-specific data. LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data.

What is Streamlit?

Streamlit enables data scientists and Python developers to combine Streamlit’s component-rich, open-source Python library with the scale, performance, and security of the Snowflake platform. Learn more about Streamlit.

RAG architecture

The approach has three main steps.

1. Choose a foundation model to generate text. A foundation model alone is not enough, though: if I were to question it about the specifics of Snowpark and other recently released features, GPT-4 may not be able to answer.

Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-customize-rag.html

2. Augment the input prompt (i.e., your question) with relevant documents. If we provide the model with the Snowpark documentation or quickstart, it will be capable of answering questions. However, the context length of these models is limited. The base GPT-4 model has a context length of 8,192 tokens, which is roughly 6,000 words of English text. The Snowpark documentation is far longer than that. What can be done?
   We take the documentation and split it into chunks of ~500 words each. We then convert each of these chunks into a vector embedding, store the embeddings in a vector store, and build an index for easy retrieval.
3. Query the foundation model for answers. During the inference phase, the input prompt is converted into a vector embedding, the vector store is searched for the text chunk most similar to the input prompt, and that chunk is passed to the foundation model.
   The model then uses this relevant chunk of the document to answer the query.
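
The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the code from the repository: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector store" is an ordinary list, and all function names are made up for this example. The final prompt is what would be sent to the foundation model.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, store: list[tuple[str, Counter]]) -> str:
    """Return the stored chunk whose vector is most similar to the question."""
    query_vec = embed(question)
    best_chunk, _ = max(store, key=lambda item: cosine_similarity(query_vec, item[1]))
    return best_chunk


def build_prompt(question: str, context: str) -> str:
    """Augment the user's question with the retrieved chunk."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Index time: chunk the document and store (chunk, vector) pairs.
# A tiny chunk size is used here so the short sample text yields two chunks.
document = (
    "Snowpark lets developers run Python code inside Snowflake. "
    "Streamlit turns Python scripts into shareable web apps."
)
chunks = chunk_words(document, chunk_size=8)
store = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, find the closest chunk, augment the prompt.
question = "What does Snowpark run inside Snowflake?"
context = retrieve(question, store)
prompt = build_prompt(question, context)
print(prompt)
```

In a real application, `embed` would call an embedding model and the list would be replaced by an index or vector store, but the flow of chunk, embed, retrieve, and augment stays the same.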

Challenges in this approach

How can you split the document into meaningful chunks so the context is not lost?
What are the different indexes you can build?
How can you decide on the type of index to build for faster retrieval?

This is where LlamaIndex comes in. It abstracts away the complexity of smart chunking and indexing of the document. All you need to do is select the type of index that fits your use case and let LlamaIndex do the work.
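
To make the first challenge concrete, here is a minimal sketch of one common way to keep context across chunk boundaries: split on sentence boundaries and repeat the last sentence of each chunk at the start of the next. This is an illustration under simple assumptions (a naive regex sentence splitter, hypothetical function names), not how LlamaIndex implements its splitters.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_words: int = 500, overlap: int = 1) -> list[str]:
    """Group whole sentences into chunks that aim for at most `max_words` words.

    The last `overlap` sentences of each chunk are repeated at the start of the
    next one, so context that spans a chunk boundary is not lost. A single
    sentence longer than `max_words` is kept whole rather than cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        word_count = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and word_count > max_words:
            # Close the current chunk and carry the overlap into the next one.
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "A b c. D e f. G h i. J k l."
for chunk in chunk_sentences(sample, max_words=6, overlap=1):
    print(chunk)
```

With the sample text, every chunk after the first starts with the sentence that ended the previous chunk, which is the overlap at work.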

In this example, we use the TreeIndex for document retrieval. The TreeIndex builds a hierarchical tree from a set of nodes which become leaf nodes in the tree.

At inference time, it queries the index by traversing from the root nodes down to the leaf nodes. Once the leaf node or nodes most relevant to the user prompt are found, the index returns a response. This response is then combined with the user prompt to chat with the model.
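
The root-to-leaf traversal can be illustrated with a small, self-contained sketch. This is a toy model of the idea, not LlamaIndex's implementation: here a parent node simply concatenates its children's text and the traversal scores children by keyword overlap, whereas a real tree index would typically rely on the LLM for both. The `Node`, `build_tree`, and `query_tree` names are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)


def keywords(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_tree(chunks: list[str], num_children: int = 2) -> Node:
    """Build the tree bottom-up: chunks become leaves, parents group children.

    A parent's text is just its children's text joined together, a toy
    stand-in (assumption) for a generated summary.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    level = [Node(text=chunk) for chunk in chunks]
    while len(level) > 1:
        level = [
            Node(
                text=" ".join(child.text for child in level[i : i + num_children]),
                children=level[i : i + num_children],
            )
            for i in range(0, len(level), num_children)
        ]
    return level[0]


def query_tree(root: Node, question: str) -> str:
    """Walk from the root to a leaf, following the child that shares the
    most keywords with the question at each level."""
    question_words = keywords(question)
    node = root
    while node.children:
        node = max(
            node.children,
            key=lambda child: len(keywords(child.text) & question_words),
        )
    return node.text


chunks = [
    "Snowpark runs Python code inside Snowflake.",
    "Streamlit builds interactive web apps in Python.",
    "LlamaIndex ingests and indexes private documents.",
    "OpenAI provides the GPT-4 language model.",
]
root = build_tree(chunks)
print(query_tree(root, "Which tool indexes private documents?"))
```

The leaf text returned by `query_tree` plays the role of the retrieved context that is combined with the user prompt.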

Build an LLM assistant

The first step is to clone the git repository to your local machine by running the command:

git clone https://github.com/Snowflake-Labs/sfguide-blog-ai-assistant.git

Next, install the dependencies by running the following command:

cd sfguide-blog-ai-assistant && pip install -r requirements.txt

Next, customize the Blog AI assistant to answer questions about a blog or blogs of your choice by updating the PAGES variable in the `data_pipeline.py` file.
Run `python data_pipeline.py` to download the blogs as markdown files into the local directory.
Run `python build_index.py` to split the blogs into chunks that can be used to augment the input prompt.
Run `streamlit run streamlit_app.py` to launch the UI for the LLM assistant.

Note

In this example, we do not use a vector database to store the embeddings for each document chunk. However, you can play around with different vector stores to achieve that. In a future blog, we will explore how to work with a vector database to build a RAG based LLM chat app.

References:

Step by step quickstart on how to build a RAG based Blog AI assistant
Source code repo: https://github.com/Snowflake-Labs/sfguide-blog-ai-assistant

Conclusion & Resources

If you are looking to build more LLM Apps using Snowflake and Streamlit, check out these quickstarts:

Thanks for Reading!
