This technique will make your LLM Smarter and more Context-Aware: RAG on Steroids

Shivansh Kaushik
7 min read · Oct 31, 2023


In recent times, Retrieval-augmented Generation (RAG) has emerged as a transformative paradigm, making a significant impact in the development of context-aware Large Language Model (LLM) applications. The rapid advancements in LLMs, led by models like OpenAI’s GPT series, have undeniably revolutionized natural language understanding and generation. These LLMs, honed on vast amounts of online data, have opened new horizons for human-AI interactions. However, inherent limitations have persisted, such as occasional inaccuracies and the challenge of verifying the sources of their responses. By seamlessly integrating retrieval-based and generative components, RAG empowers LLMs to tap into external knowledge sources.

Basic RAG pipeline

If you’re new to Retrieval-Augmented Generation and LLMs in general, I would encourage you to check out my other articles before proceeding further.

I started working with LLMs and learning about the concept of RAG a while back, but as I went deeper down this rabbit hole, I came across a technique that enhances the quality of the LLM’s responses and makes it even more context-aware.

The technique I’m talking about is RAG-Fusion!!

RAG-Fusion

The core idea behind RAG-Fusion is to enhance the capabilities of RAG by introducing additional steps in the workflow, ultimately leading to more refined and comprehensive text generation.

The Foundational Technologies

  1. General-Purpose Programming Language: RAG-Fusion, like its predecessor RAG, relies heavily on a general-purpose programming language, often Python. This language provides the framework for implementing the various components of the system.
  2. Dedicated Vector Search Database: The second key technology in RAG-Fusion is a vector search database, such as Elasticsearch or Pinecone. This database is responsible for retrieving relevant documents and data for the text generation process.
  3. Potent Large Language Model: RAG-Fusion harnesses the power of a large language model, such as ChatGPT, to craft the generated text. This language model is at the heart of the text generation process.

RAG-Fusion’s Workflow

RAG-Fusion extends the RAG workflow by incorporating additional steps to enhance the quality and depth of generated text. The workflow can be summarized as follows:

Multi-Query Generation

Traditional search systems rely on a single query input from users. However, this approach can be limiting. RAG-Fusion addresses this limitation by generating multiple queries from different perspectives. This multi-query generation is achieved through prompt engineering: the language model itself is prompted to rewrite the user’s question from several angles.

  • Prompt Engineering: The system issues specific instructions to the language model, guiding it to generate multiple queries related to the original user query.
  • Diversity and Coverage: The generated queries offer diverse angles or perspectives on the original question, ensuring a broader range of information is considered during the search process.
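
For example, a question like “How did the author come up with the name Y Combinator?” might be expanded into queries such as “origin of the name Y Combinator”, “why is the startup accelerator called Y Combinator”, “meaning of the term Y combinator”, and “history behind naming Y Combinator”. (These are illustrative outputs; the exact wording will vary from run to run.)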

Reciprocal Rank Fusion (RRF)

RRF is a technique for combining the ranks from multiple search result lists into a single unified ranking. Developed by researchers at the University of Waterloo and Google, RRF consistently produces better results than the individual systems it combines or standard reranking methods. Because it relies only on ranks rather than raw scores, it is particularly effective at combining results from queries whose scores are not on the same scale.

  • RRF Algorithm: The RRF algorithm calculates a new score for each document based on its ranks in different lists of results. This helps ensure that the most relevant documents appear at the top of the final list, making it a powerful approach for result aggregation.
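
Concretely, each document d receives a fused score of

RRF_score(d) = sum over queries q of 1 / (k + rank_q(d))

where rank_q(d) is the document’s position in the result list for query q and k is a smoothing constant (the original paper uses 60). Documents that sit near the top of several lists accumulate the largest scores, even if no single list ranked them first.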

Check out this paper to learn more about RRF

Generative Output

To preserve the user’s original intent and generate high-quality text, the reranked documents and all queries are fed into a large language model prompt, which follows a typical RAG approach. This step ensures that the final output considers all the queries and the reranked list of results, resulting in a nuanced and reliable text generation process.

Source: Forget RAG, the Future is RAG-Fusion

You can read more about the ideas behind RAG-Fusion in this concise article by its creator: Forget RAG, the Future is RAG-Fusion.

Implementation

  • Installing dependencies
pip install openai langchain weaviate-client sentence-transformers
  • Imports
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Weaviate
import openai
  • Initialize your OpenAI API key
openai.api_key = "your-key"
  • Loading Data

We will be using an essay by Paul Graham as our context source. The essay runs to about 75,000 words, which makes it a good-sized corpus for our experiment.

loader = TextLoader('paul_graham.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

texts = text_splitter.split_documents(documents)

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

We use LangChain’s built-in text splitter to split the text into chunks of 1,000 characters (with a 50-character overlap), which makes it easier to retrieve precisely the relevant pieces of information.

For generating embeddings, we use the all-MiniLM-L6-v2 sentence-transformers model from Hugging Face.

  • Vector DB Initialization

For this example, we will use Weaviate as our vector DB, since it is straightforward to set up and lets us easily retrieve a score for each retrieved document (we will use these scores to rank documents when computing RRF).

weaviate_url = 'your-weaviate-url'
db = Weaviate.from_documents(texts, embeddings, weaviate_url=weaviate_url, by_text=False)
retriever = db.as_retriever(search_kwargs={"additional": ["certainty"]})
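
As an optional sanity check (not part of the original walkthrough), you can run a single query against the retriever and confirm that each returned document carries a certainty score in its metadata; the query string below is just a placeholder:

docs = retriever.get_relevant_documents("What did the author work on?")
for d in docs:
    # each retrieved chunk should expose its Weaviate certainty score
    print(round(d.metadata['_additional']['certainty'], 3), d.page_content[:80])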

RAG-Fusion Pipeline

With our data in place, we now set up the functions and work on the actual RAG-fusion pipeline.

I’ll first share the entire code and then explain what each function does, which will make the process easier to follow.


def generate_similar_queries(original_query):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates multiple search queries based on a single input query."},
            {"role": "user", "content": f"Generate multiple search queries related to: {original_query}"},
            {"role": "user", "content": "OUTPUT (4 queries):"}
        ]
    )

    generated_queries = response.choices[0]["message"]["content"].strip().split("\n")
    return generated_queries


def vector_search(query):
    search_results = {}
    retrieved_docs = retriever.get_relevant_documents(query)
    for i in retrieved_docs:
        search_results[i.page_content] = i.metadata['_additional']['certainty']
    return search_results


def reciprocal_rank_fusion(search_results_dict, k=60):
    fused_scores = {}

    for query, doc_scores in search_results_dict.items():
        for rank, (doc, score) in enumerate(sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)):
            if doc not in fused_scores:
                fused_scores[doc] = 0
            previous_score = fused_scores[doc]
            fused_scores[doc] += 1 / (rank + k)
            print(f"Updating score for {doc} from {previous_score} to {fused_scores[doc]} based on rank {rank} in query '{query}'")

    reranked_results = {doc: score for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)}
    print("Final reranked results:", reranked_results)
    return reranked_results


def generate_output(original_query, reranked_results):
    reranked_docs = [i for i in reranked_results.keys()]
    context = '\n'.join(reranked_docs)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers user's questions based on the context provided.\nDo not make up an answer if you do not know it, stay within the bounds of the context provided, if you don't know the answer, say that you don't have enough information on the topic!"},
            {"role": "user", "content": f"CONTEXT: {context}\nQUERY: {original_query}"},
            {"role": "user", "content": "ANSWER:"}
        ]
    )

    response = response.choices[0]["message"]["content"].strip()
    return response


# Generate query variations, search for each, fuse the rankings, then answer.
original_query = "How did the author come up with the name Ycombinator?"
generated_queries = generate_similar_queries(original_query)

# Retrieve documents for every generated query.
all_results = {}
for query in generated_queries:
    search_results = vector_search(query)
    all_results[query] = search_results

# Re-rank with Reciprocal Rank Fusion and generate the final answer.
reranked_result = reciprocal_rank_fusion(all_results)
final_output = generate_output(original_query, reranked_result)
print(f"Generated Response -> {final_output}")
  • Similar Query Generation

This function passes the original user query to the LLM and generates four similar queries. We do this to get variations of the original query, which lets us look at it from multiple angles and capture its meaning more fully.

def generate_similar_queries(original_query):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates multiple search queries based on a single input query."},
            {"role": "user", "content": f"Generate multiple search queries related to: {original_query}"},
            {"role": "user", "content": "OUTPUT (4 queries):"}
        ]
    )
    generated_queries = response.choices[0]["message"]["content"].strip().split("\n")
    return generated_queries
  • Retrieve Relevant Documents

The queries produced by the previous function are passed into the vector search function, which retrieves the relevant documents from our vector DB. It returns a dictionary in which each key is the retrieved text and the value is its certainty score.

def vector_search(query):
    search_results = {}
    retrieved_docs = retriever.get_relevant_documents(query)
    for i in retrieved_docs:
        search_results[i.page_content] = i.metadata['_additional']['certainty']
    return search_results
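
For a single query, the returned dictionary looks roughly like {"…text of chunk A…": 0.91, "…text of chunk B…": 0.87, …}; the certainty values shown here are purely illustrative.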
  • Reciprocal Rank Fusion

This function is the heart of our pipeline: we pass in all the documents retrieved for the four queries and use RRF to re-rank them, returning the most relevant documents first. To learn more about the Reciprocal Rank Fusion algorithm, check out this link.

def reciprocal_rank_fusion(search_results_dict, k=60):
    fused_scores = {}

    for query, doc_scores in search_results_dict.items():
        for rank, (doc, score) in enumerate(sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)):
            if doc not in fused_scores:
                fused_scores[doc] = 0
            previous_score = fused_scores[doc]
            fused_scores[doc] += 1 / (rank + k)
            print(f"Updating score for {doc} from {previous_score} to {fused_scores[doc]} based on rank {rank} in query '{query}'")

    reranked_results = {doc: score for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)}
    print("Final reranked results:", reranked_results)
    return reranked_results
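
To make the scoring concrete: with k=60, a chunk that appears at rank 0 (the top position, since enumerate starts at zero) in two of the four result lists receives a fused score of 1/60 + 1/60 ≈ 0.033, while a chunk that appears only once at rank 3 scores 1/63 ≈ 0.016, so the first chunk ends up higher in the final ranking.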

This method provides richer context retrieval and is a major improvement over plain Retrieval-Augmented Generation.

Just by looking at the user’s query from different perspectives, we are able to produce more coherent output and make the LLM more context-aware and helpful.

I took the original code from the author and made some tweaks to get it to work, but if you really want to understand the impact of this technique, you have to try it out yourself.

Check out this GitHub repository for the complete code:

Follow for more such articles!

References

  1. Forget RAG, the Future is RAG-Fusion
  2. Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods
