RAG on the Fly
A simple way of querying documents for relevant data without setting up a persistent vector database.
Traditional Retrieval Augmented Generation (RAG) relies on a vector database, and in its newest forms on graph databases as well. We study a use case where long-term storage and retrieval of the data is not required and a quick solution is needed instead. Here we deploy a technique that follows the same procedure as ‘Traditional RAG’, but rather than storing the text chunks in a persistent vector database it makes use of in-memory vector stores to query and generate results.
Long-term storage is good for applications where a fixed set of domain data is available and querying is always against the same fixed set of documents.
Our use case dictates that the documents themselves are dynamic and real time, for example website data scraped directly from a website or text received from an API provider.
In such cases it might seem tempting to pass the whole text to the LLM along with the query, but this poses the following problems:
- The text might be too long, and if the relevant data sits in the middle of the text the LLM can get ‘lost in the middle’.
- For exact responses, passing the whole text (for example tables and reports) might cause the LLM to hallucinate.
- Passing the complete text is computationally expensive, and many times more costly when using LLM API providers like OpenAI.
RAG on the Fly solves these problems by following the same strategy as normal RAG but introducing in-memory vector databases for the vector similarity search.
With an in-memory vector database the indexing is done on the go; only the vectors need to be provided to the database. This is very good for use cases where RAG is needed but there is no need, or no resources, for a vector DB and CRUD-operation setup, and it makes it easier for a novice user to apply RAG without standing up a dedicated vector database.
There are different in-memory vector databases available for use; some are listed below:
- FAISS (Faiss is a library for efficient similarity search and clustering of dense vectors.)
- HNSWLib (HNSWLib is an in-memory vector store that can be saved to a file. It uses the HNSWLib library.)
- LangChain in-memory (LangChain offers an in-memory, ephemeral vectorstore that stores embeddings in memory and does an exact, linear search for the most similar embeddings.)
- Redis (Redis is a fast open source, in-memory data store.)
All of these have integrations available in LangChain (https://python.langchain.com/docs/integrations/vectorstores/).
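For instance, here is a minimal sketch of the same idea through LangChain's FAISS integration; the sample texts, the query, and the assumed packages (langchain-community, langchain-openai, faiss-cpu) are placeholders for illustration:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Placeholder documents and query, for illustration only.
texts = [
    "Paris is the capital of France.",
    "FAISS performs similarity search over dense vectors.",
]
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build the in-memory index directly from the raw texts, no external database needed.
store = FAISS.from_texts(texts, embeddings)

# Retrieve the most similar chunk for an ad-hoc query.
docs = store.similarity_search("What is the capital of France?", k=1)
print(docs[0].page_content)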
For our use case we use FAISS directly in plain Python, the simplest of these options, to show the ease of use of the library.
Here is the step-by-step code for RAG on the Fly.
First, perform the necessary imports:
import os
import numpy as np
import faiss  # Facebook AI Similarity Search
import re  # Regular expressions
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))  # load your API key from the environment
Define your helper functions; these include text splitting:
def split_full_text(text, chunk_size=500, overlap=50):
    # Split raw text into overlapping fixed-size character chunks.
    text = text.replace("\n", " ")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append({"text": chunk})
        start += chunk_size - overlap  # step forward, keeping some overlap between chunks
    return chunks
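As a quick sanity check, a 1,200-character input yields three overlapping chunks under the default chunk_size and overlap (the sample string is a placeholder):
sample = "x" * 1200  # placeholder text
chunks = split_full_text(sample)
print(len(chunks))                       # 3
print([len(c["text"]) for c in chunks])  # [500, 500, 300]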
Now define your embedding and LLM response functions (the system prompt below is an example; adjust it to your use case):
def get_openai_embeddings(texts, model="text-embedding-3-small"):
    texts = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=texts, model=model)
    embeddings = [data.embedding for data in response.data]
    return embeddings
# system_prompt is referenced below; this is an example instruction, adjust it to your use case.
system_prompt = (
    "You are a helpful assistant. Answer the QUESTION using only the "
    "information given in TEXT."
)

def get_openai_response(question, text):
    try:
        user_query = f"QUESTION: {question}\nTEXT: {text}"
        text = (client.chat.completions.create(
            model='gpt-4o-mini',
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_query}
            ]
        ).choices[0].message.content)
        return text
    except Exception as e:
        return str(e)
Now we define the in-memory vector search with FAISS and use the functions above to get a response from the LLM:
def faiss_search(chunk_list, query_text, top_n=4):
    try:
        texts = [chunk['text'] for chunk in chunk_list]
        # FAISS expects float32 vectors
        chunk_embeddings = np.array(get_openai_embeddings(texts), dtype="float32")
        dim = chunk_embeddings.shape[1]
        index = faiss.IndexFlatL2(dim)  # exact L2-distance index built on the fly
        index.add(chunk_embeddings)
        query_embedding = np.array(get_openai_embeddings([query_text])[0], dtype="float32").reshape(1, -1)
        top_n = min(top_n, len(chunk_list))  # don't request more neighbours than chunks exist
        distances, indices = index.search(query_embedding, top_n)
        most_relevant_chunks = [chunk_list[i] for i in indices[0]]
        relevant_text = ' '.join([chunk['text'] for chunk in most_relevant_chunks])
        openai_response = get_openai_response(query_text, relevant_text)
        return openai_response
    except Exception as e:
        return str(e)
Finally, we consolidate all of the above and write a function that only takes the text and the query as input:
def search_chunks(text, query_text):
    try:
        # split_full_text chunks the input text before searching
        split_chunks = split_full_text(text)
        relevant_chunks = faiss_search(split_chunks, query_text)
        return relevant_chunks
    except Exception as e:
        return str(e)
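Putting it together, a quick usage sketch (the sample text and question below are placeholders):
sample_text = (
    "The Eiffel Tower was completed in 1889 and stands about 330 metres tall. "
    "It was built by Gustave Eiffel's company for the 1889 World's Fair in Paris."
)  # placeholder document, e.g. scraped website text or an API response
answer = search_chunks(sample_text, "When was the Eiffel Tower completed?")
print(answer)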
Pros:
- There is no need to set up any vector database for querying text.
- Text querying is fast, depending on which LLM and embedding model are used.
- It is the simplest form of RAG.
Cons:
- Real-time querying might become slow if the chosen embedding model is slow; batched embedding and an async approach may be needed to handle this case (a batching sketch follows this list).
- In local testing the response time can go up to 7~8 seconds, depending on the length of the text and assuming the LLM responds in roughly constant time.
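As a rough sketch of the batching idea mentioned above (the batch size of 100 is an arbitrary assumption), the embedding helper could be wrapped like this:
def get_openai_embeddings_batched(texts, model="text-embedding-3-small", batch_size=100):
    # Embed texts in fixed-size batches to keep individual requests small.
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = [t.replace("\n", " ") for t in texts[i:i + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        embeddings.extend(data.embedding for data in response.data)
    return embeddings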
Conclusion:
RAG on the Fly is a simple and useful technique for querying your documents for very specific answers by making use of in-memory vector stores.