Why your next AI product needs RAG implemented in it
Showcasing Retrieval Augmented Generation (RAG) for chatbots and a step-by-step tutorial on how to build one for yourself or others
Chatbots, such as the very popular ChatGPT, use large language models (LLMs) like GPT to generate responses. Hence, they can easily answer our burning questions using the data they have been trained on. To me, they feel like digital encyclopedias, pulling information from the vast knowledge they have soaked up.
In fact, they are pretty handy for everything from generating food recipes and planning trips to untangling tricky math problems.
If you are a developer already building such chatbots, leveraging LLMs isn't that difficult either. Please refer to my earlier blog posts below for tutorials and app demos…
Let’s recap an over-simplified workflow for building simple chatbots
- The end-user sends their queries (i.e. prompts)
- The query is passed to and processed by the LLM (i.e. the pre-trained knowledge base). Under the hood, the end-user's prompt and the system prompt are combined into the model input.
- Finally, a well crafted response is generated and returned
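The three steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: `build_messages`, `fake_llm` and `simple_chatbot` are hypothetical names, and `fake_llm` merely stands in for a call to a hosted model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_messages(system_prompt: str, user_query: str) -> List[Message]:
    """Combine the system prompt and the end-user's query into one model input."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]


def fake_llm(messages: List[Message]) -> str:
    """Hypothetical stand-in for a real LLM call; it only echoes the question."""
    question = messages[-1]["content"]
    return f"(model answer to: {question})"


def simple_chatbot(user_query: str, llm: Callable[[List[Message]], str] = fake_llm) -> str:
    # Step 1: the end-user sends a query.
    # Step 2: the query and the system prompt are passed to the LLM.
    messages = build_messages("You are a helpful assistant.", user_query)
    # Step 3: the generated response is returned.
    return llm(messages)


print(simple_chatbot("How do I plan a trip to Oslo?"))
```

Note that nothing in this flow looks anything up: the answer can only come from what the model already knows.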
Regular use of ChatGPT has highlighted the importance of crafting our prompts. A well-framed prompt yields a more accurate response from the LLM. In a previous post (linked below), I have demonstrated how we can even visualise data and create plots just by asking the right prompts.
Is prompting the next programming?
There’s an elephant in the room we can’t ignore any longer!
These simple chatbots have two major drawbacks:
- Limited to trained information: Bots like ChatGPT are not trained with new information after a certain point. So, they might not be aware of the latest happenings in the world.
- Makes things up or hallucinates: When unsure, LLMs can say things that sound true but aren't accurate, or fill in gaps with irrelevant responses.
To overcome these limitations, chatbots enhanced with Retrieval Augmented Generation (RAG) become super powerful!
Below is an insightful article by MetaAI researchers who first showcased these perspectives.
I think it is worth sharing the main concept of RAG from the article’s abstract to provide a clearer and wider understanding:
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures.
Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.
Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks.
We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation.
This article motivated me to build and compare basic chatbots with those enhanced by RAG…
In a RAG-based chatbot implementation, our workflow requires a few additional steps:
- User Interaction / Prompt usage: As with a simple chatbot, the end-user submits a query to our RAG-based chatbot.
- Orchestration of Prompt / Prompt template: Related conversation history or more context is added to the prompt (this step also comes into play later, when augmenting the prompt with the retrieved context).
- Retrieval / Pulling data from an external knowledge base: Before sending the prompt to the LLM, the system consults retrieval tools. These tools often include knowledge bases and APIs, for instance Wikipedia or vector databases like Pinecone or Weaviate. The retrievers aim to obtain context from the knowledge base.
- LLM Processing: The retrieved context is added to the prompt, and this augmented prompt (user prompt + system prompt + context) is sent to the LLM.
- Response Generation: The LLM, now given a more informative prompt, crafts a relevant and informed response.
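To make the retrieve, augment and generate steps concrete, here is a toy sketch. It is not how the app below works internally: the `KNOWLEDGE_BASE` list, the word-overlap `retrieve` function and the `augment` helper are all assumptions made for illustration, and a real system would use a vector index instead of counting shared words.

```python
from typing import List

# Toy "knowledge base"; in the app built below this role is played by a vector index.
KNOWLEDGE_BASE = [
    "RAG stands for Retrieval Augmented Generation.",
    "FAISS is a library for similarity search over vectors.",
    "Streamlit is a Python framework for building web apps.",
]


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def augment(system_prompt: str, context: List[str], query: str) -> str:
    """Build the final prompt: system prompt + retrieved context + user prompt."""
    context_block = "\n".join(context)
    return f"{system_prompt}\n\nContext:\n{context_block}\n\nQuestion: {query}"


query = "what is faiss"
context = retrieve(query, KNOWLEDGE_BASE)
final_prompt = augment("Answer using only the context.", context, query)
print(final_prompt)  # this string is what would be sent to the LLM
```

The only difference from a simple chatbot is the extra lookup before the model is called, which is exactly where fresh or private information enters the prompt.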
Thus, this approach using RAG enables LLMs to deliver precise and current information, even if their base training data does not change.
🤖 Build a chatbot with RAG pipeline
Now that we have an overall idea of the key aspects of RAG-based chatbots, let's try to build and deploy one! We will be using LangChain and Databutton for building this chatbot.
A big shoutout to open-source platforms like LangChain and LlamaIndex — they have immensely simplified the orchestration layer and the integration with LLMs through the suite of tools that they offer.
🧠 Brain of our app — External Knowledge bases
Since our external data sources are PDF files from end-users, let's start by writing a few functions to ingest that data.
# Importing the modules necessary
import databutton as db
import streamlit as st
import re
import time
from io import BytesIO
from typing import Any, Dict, List, Tuple
import pickle
from langchain.docstore.document import Document
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from pypdf import PdfReader
import faiss


def parse_pdf(file: BytesIO, filename: str) -> Tuple[List[str], str]:
    # Initialize the PDF reader for the provided file.
    pdf = PdfReader(file)
    output = []
    # Loop through all the pages in the PDF.
    for page in pdf.pages:
        # Extract the text from the page.
        text = page.extract_text()
        # Replace word splits that are split by hyphens at the end of a line.
        text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
        # Replace single newlines with spaces, but not those flanked by spaces.
        text = re.sub(r"(?<!\n\s)\n(?!\s\n)", " ", text.strip())
        # Consolidate multiple newlines to two newlines.
        text = re.sub(r"\n\s*\n", "\n\n", text)
        # Append the cleaned text to the output list.
        output.append(text)
    # Return the list of cleaned texts and the filename.
    return output, filename


def text_to_docs(text: List[str], filename: str) -> List[Document]:
    # Ensure the input text is a list. If it's a string, convert it to a list.
    if isinstance(text, str):
        text = [text]
    # Convert each text (from a page) to a Document object.
    page_docs = [Document(page_content=page) for page in text]
    # Assign a page number to the metadata of each document.
    for i, doc in enumerate(page_docs):
        doc.metadata["page"] = i + 1
    # Initialize the text splitter once, with specific chunk sizes and delimiters.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=4000,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        chunk_overlap=0,
    )
    doc_chunks = []
    # Split each page's text into smaller chunks and store them as separate documents.
    for page_doc in page_docs:
        # Split the page's text into chunks.
        chunks = text_splitter.split_text(page_doc.page_content)
        # Convert each chunk into a new document, storing its chunk number, page number, and source file name in its metadata.
        for i, chunk in enumerate(chunks):
            # Use a separate name so the page-level document is not overwritten.
            chunk_doc = Document(
                page_content=chunk,
                metadata={"page": page_doc.metadata["page"], "chunk": i},
            )
            chunk_doc.metadata["source"] = f"{chunk_doc.metadata['page']}-{chunk_doc.metadata['chunk']}"
            chunk_doc.metadata["filename"] = filename
            doc_chunks.append(chunk_doc)
    # Return the list of chunked documents.
    return doc_chunks
We will parse each of the uploaded PDFs, split the text, and chunk it to create a list of documents. Note: we make sure the metadata (page number, chunk number, and filename) is retained for every chunk.
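To show what "retaining the metadata" means in practice, here is a simplified sketch that needs no external packages. It is an assumption-laden stand-in, not the LangChain code above: `Chunk` is a hypothetical replacement for `Document`, and `pages_to_chunks` cuts text at fixed character positions rather than using `RecursiveCharacterTextSplitter`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Simplified stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def pages_to_chunks(pages: List[str], filename: str, chunk_size: int = 20) -> List[Chunk]:
    """Cut each page into fixed-size pieces and record where each piece came from."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        pieces = [page_text[i:i + chunk_size] for i in range(0, len(page_text), chunk_size)]
        for chunk_number, piece in enumerate(pieces):
            chunks.append(
                Chunk(
                    page_content=piece,
                    metadata={
                        "page": page_number,
                        "chunk": chunk_number,
                        "source": f"{page_number}-{chunk_number}",
                        "filename": filename,
                    },
                )
            )
    return chunks


# Two fake "pages": 45 and 10 characters long.
chunks = pages_to_chunks(["a" * 45, "b" * 10], "demo.pdf")
for chunk in chunks:
    print(chunk.metadata)
```

Every chunk carries its page, chunk number and filename, which is what later allows the chatbot to point back to where an answer came from.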
🛠️ Indexing is crucial while working with LLMs
A vector database does not store or work directly with raw text, so the text must first be converted into vectors. This step is often referred to as creating embeddings, which capture the semantic and contextual information of the data. Here, OpenAIEmbeddings produces the vectors and the FAISS Python package indexes them.
def docs_to_index(docs, openai_api_key):
    # Embed every document and build a FAISS index over the resulting vectors.
    index = FAISS.from_documents(docs, OpenAIEmbeddings(openai_api_key=openai_api_key))
    return index


def get_index_for_pdf(pdf_files, pdf_names, openai_api_key):
    documents = []
    # Parse and chunk each uploaded PDF, collecting all chunks in one list.
    for pdf_file, pdf_name in zip(pdf_files, pdf_names):
        text, filename = parse_pdf(BytesIO(pdf_file), pdf_name)
        documents = documents + text_to_docs(text, filename)
    # Build a single index covering all uploaded files.
    index = docs_to_index(documents, openai_api_key)
    return index
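To build some intuition for why vectors make search possible, here is a toy sketch. It makes a big simplifying assumption: the hypothetical `embed` function just counts words, whereas real embedding models produce dense learned vectors that capture meaning. The comparison step, cosine similarity, is a common way to score how close two vectors are.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Dict[str, int]:
    """Toy 'embedding': a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = embed("refund policy for orders")
doc_a = embed("our refund policy covers all orders")
doc_b = embed("the office is closed on sunday")

print(cosine_similarity(query, doc_a))  # higher: shares several words with the query
print(cosine_similarity(query, doc_b))  # lower: shares no words with the query
```

An index such as FAISS exists to run this kind of comparison efficiently over many stored vectors.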
Here’s where I discuss embeddings & semantic search in detail
The best practice is to store embeddings in a vector database. A vector database is immensely powerful and easy to work with when dealing with vectorized data. Popular vector stores include Pinecone and Weaviate.
🤩 Building the front-end
Let’s simultaneously build the front-end part where we would typically allow our end-user to upload any PDF file, index it, and finally chat with it!
# Import necessary libraries
import databutton as db
import streamlit as st
import openai
from my_pdf_lib import get_index_for_pdf
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
import os
# Set the title for the Streamlit app
st.title("RAG enhanced Chatbot")
# Set up the OpenAI API key from databutton secrets
os.environ["OPENAI_API_KEY"] = db.secrets.get("OPENAI_API_KEY")
openai.api_key = db.secrets.get("OPENAI_API_KEY")
# Upload PDF files using Streamlit's file uploader
pdf_files = st.file_uploader("", type="pdf", accept_multiple_files=True)
Next, we need to write a function that creates a vector index from the content of the uploaded PDF files and stores it in the session state. However, I would highly recommend using a vector database to store such vector embeddings.
# Cached function to create a vectordb for the provided PDF files
@st.cache_data
def create_vectordb(files, filenames):
    # Show a spinner while creating the vectordb
    with st.spinner("Vector database"):
        vectordb = get_index_for_pdf(
            [file.getvalue() for file in files], filenames, openai.api_key
        )
    return vectordb


# If PDF files are uploaded, create the vectordb and store it in the session state
if pdf_files:
    pdf_file_names = [file.name for file in pdf_files]
    st.session_state["vectordb"] = create_vectordb(pdf_files, pdf_file_names)
Below is a schematic representation to illustrate that whenever a prompt comes from the end-user, the system first interacts with external databases such as vector databases instead of going directly to the LLM.
Based on prior discussions in this blog and also the above schematic, we would like the system to first interact with external databases. Hence, writing a well-crafted customised prompt which is designed to take in further context (i.e. we are augmenting our prompt here) is crucial!
# Define the template for the chatbot prompt
prompt_template = """
You are a helpful Assistant who answers to users questions based on multiple contexts given to you.
Keep your answer short and to the point.
The evidence are the context of the pdf extract with metadata.
Carefully focus on the metadata specially 'filename' and 'page' whenever answering.
Make sure to add filename and page number at the end of sentence you are citing to.
Reply "Not applicable" if text is irrelevant.
The PDF content is:
{pdf_extract}
"""
Note: The above prompt is not robust or well tested and is solely crafted for this demo app. The prompt can be further tweaked and tested (please leave your suggestions in the comment section below if you have better prompts)
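Here is a small sketch of how the `{pdf_extract}` placeholder gets filled. The shortened `demo_template` and the two `retrieved_chunks` are made up for illustration; only the use of Python's built-in `str.format` mirrors the app.

```python
# Shortened, hypothetical version of the template above; only the placeholder matters here.
demo_template = """
Answer based on the PDF extract below.
Reply "Not applicable" if text is irrelevant.
The PDF content is:
{pdf_extract}
"""

# Hypothetical retrieved chunks, standing in for real search results.
retrieved_chunks = [
    "Invoices are due within 30 days.",
    "Late payments incur a 2% fee.",
]

# The augmented prompt becomes the system message sent to the LLM.
system_message = {
    "role": "system",
    "content": demo_template.format(pdf_extract="\n".join(retrieved_chunks)),
}
print(system_message["content"])
```

One thing to watch out for with `str.format`: any literal curly braces in the template itself must be doubled (`{{` and `}}`), otherwise they are treated as placeholders.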
💬 Building the Chat UI
This is a typical Streamlit ChatUI which we will use for this chatbot.
# Get the current prompt from the session state or set a default value
prompt = st.session_state.get("prompt", [{"role": "system", "content": "none"}])
# Display previous chat messages
for message in prompt:
    if message["role"] != "system":
        with st.chat_message(message["role"]):
            st.write(message["content"])
# Get the user's question using Streamlit's chat input
question = st.chat_input("Ask anything")
# Handle the user's question
if question:
    vectordb = st.session_state.get("vectordb", None)
    if not vectordb:
        # st.chat_message is Streamlit's chat container (st.message does not exist)
        with st.chat_message("assistant"):
            st.write("You need to provide a PDF")
            st.stop()
For understanding each step better, refer to my previous blogs!
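If the role of the session state is unclear, the following sketch may help. It assumes a plain dictionary in place of `st.session_state`, and `load_history` and `visible_messages` are hypothetical helper names; the idea is simply that the conversation is a list of messages whose first entry is the system message, which is stored but never displayed.

```python
from typing import Dict, List

Message = Dict[str, str]

# Plain dict standing in for st.session_state (an assumption for this sketch).
session_state: Dict[str, List[Message]] = {}


def load_history(state: Dict[str, List[Message]]) -> List[Message]:
    """Return the stored conversation, or start one with a placeholder system message."""
    return state.get("prompt", [{"role": "system", "content": "none"}])


def visible_messages(history: List[Message]) -> List[Message]:
    """Everything except the system message is shown in the chat window."""
    return [m for m in history if m["role"] != "system"]


history = load_history(session_state)
history.append({"role": "user", "content": "What is on page 2?"})
history.append({"role": "assistant", "content": "A summary of payment terms."})
session_state["prompt"] = history  # saved for the next rerun

print(visible_messages(load_history(session_state)))
```

Keeping the system message at index 0 is what later allows the app to swap in a freshly augmented prompt on every question.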
⚙️ Retrieving the semantically similar contexts from Index Store
Now we fetch relevant contexts to augment our prompt! This part is crucial in our RAG enhanced chatbot. When the user submits a query, we want to get the top N semantically similar hits from our vectorized data.
    # (continues inside the `if question:` block above)
    # Search the vectordb for similar content to the user's question
    search_results = vectordb.similarity_search(question, k=3)
    # search_results
This improves the relevance of the responses, something a simple chatbot lacks.
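Conceptually, a top N search scores every stored vector against the query vector and keeps the best few. The sketch below shows that idea with hand-made 2-D vectors and cosine similarity. These are assumptions for illustration only: `top_k` and `toy_index` are hypothetical, and the actual index may use a different distance metric and far more efficient search.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], index: List[Tuple[str, Sequence[float]]], k: int = 3) -> List[str]:
    """Return the texts of the k stored vectors most similar to the query vector."""
    best = heapq.nlargest(k, index, key=lambda item: cosine(query_vec, item[1]))
    return [text for text, _ in best]


# Hand-made 2-D vectors, purely illustrative; real embeddings have hundreds of dimensions.
toy_index = [
    ("chunk about pricing", [1.0, 0.0]),
    ("chunk about refunds", [0.8, 0.6]),
    ("chunk about holidays", [0.0, 1.0]),
    ("chunk about shipping", [0.6, 0.8]),
]

print(top_k([1.0, 0.1], toy_index, k=2))
```

Choosing `k` is a trade-off: more hits give the LLM more context, but also a longer and noisier prompt.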
➕ Augmenting the semantically relevant context with the prompt
We loop over the list of search_results and concatenate them into a single string, which will later be passed to the prompt.
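    # (continues inside the `if question:` block above)
    # Join the retrieved chunks into a single string, separated by newlines
    pdf_extract = "\n ".join([result.page_content for result in search_results])
    # Update the prompt with the pdf extract
    prompt[0] = {
        "role": "system",
        "content": prompt_template.format(pdf_extract=pdf_extract),
    }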
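The prompt template asks the model to cite 'filename' and 'page', while the snippet above passes only the page content of each hit. One possible variation, sketched here as an assumption rather than as part of the original app, is to prepend that metadata to every chunk. `Result` is a simplified stand-in for a retrieved document and `format_context` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """Simplified stand-in for a retrieved document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_context(results: List[Result]) -> str:
    """Prefix every chunk with its filename and page so the model can cite them."""
    blocks = []
    for result in results:
        filename = result.metadata.get("filename", "unknown")
        page = result.metadata.get("page", "?")
        blocks.append(f"[filename: {filename}, page: {page}]\n{result.page_content}")
    return "\n\n".join(blocks)


results = [
    Result("Invoices are due within 30 days.", {"filename": "terms.pdf", "page": 2}),
    Result("Late payments incur a 2% fee.", {"filename": "terms.pdf", "page": 3}),
]
print(format_context(results))
```

The resulting string could be passed as `pdf_extract` in place of the plain concatenation, at the cost of a slightly longer prompt.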
🌊 Generating Responses 🤖
Next, we pass the prompt and multiple contexts to the LLM to generate a relevant answer based on the end-user query. Also, we stream the generated responses from the LLM to give a ChatGPT-like vibe!
    # (continues inside the `if question:` block above)
    # Add the user's question to the prompt and display it
    prompt.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.write(question)
    # Display an empty assistant message while waiting for the response
    with st.chat_message("assistant"):
        botmsg = st.empty()
    # Call ChatGPT with streaming and display the response as it comes
    # (openai.ChatCompletion is the interface of the pre-1.0 openai package)
    response = []
    result = ""
    for chunk in openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=prompt, stream=True
    ):
        text = chunk.choices[0].get("delta", {}).get("content")
        if text is not None:
            response.append(text)
            result = "".join(response).strip()
            botmsg.write(result)
    # Add the assistant's response to the prompt (once)
    prompt.append({"role": "assistant", "content": result})
    # Store the updated prompt in the session state
    st.session_state["prompt"] = prompt
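The streaming pattern itself is independent of any particular LLM. Here is a minimal sketch with a fake generator in place of the model: `fake_stream` and `render_stream` are hypothetical names, and each "frame" represents what the placeholder in the chat window would show at that moment.

```python
from typing import Iterator, List


def fake_stream(answer: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response: yields one word at a time."""
    for word in answer.split(" "):
        yield word + " "


def render_stream(stream: Iterator[str]) -> List[str]:
    """Accumulate chunks and record every intermediate text a UI would display."""
    parts: List[str] = []
    frames: List[str] = []
    for text in stream:
        parts.append(text)
        # Re-join everything received so far, as the app does before each redraw.
        frames.append("".join(parts).strip())
    return frames


frames = render_stream(fake_stream("RAG adds retrieved context"))
for frame in frames:
    print(frame)
```

Each frame is the full text so far, so the UI simply overwrites the same placeholder and the answer appears to grow word by word.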
☕️ Conclusion
Congratulations! Together we built a chatbot enhanced with RAG (Retrieval Augmented Generation) 🎉
In brief, typically the building blocks of such RAG based chatbots include:
a) Data retrieval from vectorized user information
b) Context-based augmentation of prompts based on end-user queries
c) Generation of more reliable responses to end-user queries
Integrating such RAG based approaches for customised LLM based products increases the chance of getting more context-relevant and precise information, as well as ensuring that the responses are tailored to specific user queries. All-in-all just a better chatbot experience.
Check out this informative piece by Trygve Karper, where he discusses a handful of YC startups leveraging the power of RAG — https://medium.com/databutton/some-ycombinator-rag-startups-cba3cca88274
To get started quickly, you can use the “Chat with PDF” template within Databutton 🚀
Alternatively, you can find the entire code in this repo: https://github.com/avrabyt/RAG-Chatbot/tree/main
It is also explained in detail in this video:
https://youtu.be/Yh1GEWqgkt0?si=p-gu9CBl4GTK4ESx
📖 Hungry for more? I would urge you to read the following!
- LangChain RAG document: https://python.langchain.com/docs/expression_language/cookbook/retrieval
- Deci blog on RAG using LangChain: https://deci.ai/blog/retrieval-augmented-generation-using-langchain/
- "Retrieval augmented generation: Keeping LLMs relevant and current": https://stackoverflow.blog/2023/10/18/retrieval-augmented-generation-keeping-llms-relevant-and-current/
- LlamaIndex RAG concepts: https://gpt-index.readthedocs.io/en/latest/getting_started/concepts.html
- What is RAG, from IBM: https://research.ibm.com/blog/retrieval-augmented-generation-RAG
- RAG based YC Startups: https://medium.com/databutton/some-ycombinator-rag-startups-cba3cca88274
Acknowledgement
Thanks to Björn Lapakko for proofreading and candid feedback 💜