Why Is Your RAG Not Working?

Saurabh Singh
7 min read · Feb 18, 2024



Question answering systems have existed for a long time, but with the advent of large language models like GPT, Retrieval Augmented Generation (RAG) has become a very promising approach to question answering because it incorporates knowledge from external databases.

Information retrieval is a crucial part of any question answering system: it surfaces the most relevant documents from a large collection. Curating a robust retrieval process is therefore an essential step in building reliable RAG systems.

This article delves into the inner workings of a naive RAG system, identifies its pitfalls (particularly in simple retrieval scenarios) and proposes solutions to overcome these limitations, paving the way for more sophisticated and efficient information retrieval. All the code can be found in this notebook.

Naive RAG & Simple retrieval

Let's first understand how a naive RAG system works. It follows a traditional process of indexing, retrieval and generation.

Fig. Working of a Naive RAG system
  1. Indexing: Plain text is extracted from documents (PPTs, PDFs, HTML, etc.) and segmented into smaller, more manageable chunks. These chunks are transformed into vector representations using an embedding model and stored in a vector DB.
  2. Retrieval: The user query is converted into an embedding vector and used to run a similarity search on the index created in step 1. The system retrieves the top K chunks that are most similar to the query.
  3. Generation: The user query, along with the top K chunks as context, is synthesized into a prompt that is fed into the LLM to generate a response.
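
To make these steps concrete, here is a minimal sketch of the pipeline in Python. It is illustrative only: it assumes the sentence-transformers library (also used later in this article) for embeddings, uses a plain in-memory NumPy array instead of a vector database, and leaves the LLM call as a placeholder.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Indexing: embed the document chunks and keep the vectors in memory
# (in practice the chunks come from your parsed documents and the vectors
# live in a vector DB)
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = embedder.encode(chunks, normalize_embeddings=True)

# 2. Retrieval: embed the query and take the top K most similar chunks
query = "What drove revenue growth?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]
scores = index @ query_vec  # cosine similarity, since vectors are normalized
top_k = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

# 3. Generation: synthesize a prompt from the query and retrieved chunks
context = "\n\n".join(top_k)
rag_prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# response = call_llm(rag_prompt)  # placeholder for whichever LLM client you use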

Pitfalls of simple retrieval

Retrieval in a naive RAG system works by fetching the document chunks whose embeddings are closest to the query vector in the embedding space. But just because items are semantically close as vectors under a particular embedding model does not mean they contain the actual answer to the question. Let's look at these issues one by one, using Microsoft's Annual Report 2022 for the examples.

Limitations of general representation

The embedding model used to embed queries and data may have no knowledge of the specific task or query at retrieval time, so it provides only a general-purpose representation, which can lead to less than optimal performance.

Relevancy and Distraction

A significant issue in vector-based retrieval is the presence of distractors: results that seem relevant due to their proximity in the embedding space but are contextually irrelevant. These distractors can mislead the LLM and lead to suboptimal outputs. Let's look at an example from querying the Microsoft Annual Report 2022; note that the retrieved chunk is completely irrelevant to the query.

Fig. Presence of distractors in the retrieved documents

Irrelevant Queries and Results

If a query is irrelevant to the knowledge base, there is a high chance that no chunk actually answers it. But because the retrieval step of a RAG loop always returns the nearest neighbors, the context window will be made up entirely of distractors, which can mislead the LLM in the generation step. In the example below, you can see that the retrieved chunks are spread far from the actual query.

Fig. Irrelevant queries have widely spread retrieved chunks
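
This failure mode is easy to demonstrate. Reusing the toy index from the pipeline sketch above, even a completely unrelated query still returns K "nearest" neighbors; only the uniformly low similarity scores hint that the context consists of distractors.

# An unrelated query still returns K nearest neighbours from the index;
# inspecting the similarity scores is one way to spot a context window
# made up entirely of distractors.
irrelevant_query = "What is the best recipe for banana bread?"
q_vec = embedder.encode([irrelevant_query], normalize_embeddings=True)[0]
scores = index @ q_vec
for i in np.argsort(scores)[::-1][:3]:
    print(f"score={scores[i]:.2f}  chunk={chunks[i][:60]}")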

Solutions

Now that we have seen the pitfalls of a simple retrieval system, let's look at some solutions that can improve the performance of the retrieval step in the RAG loop.

Query Expansion

Query expansion is a technique used to improve search results by enriching the query with additional information so that it captures a broader range of relevant documents. In modern RAG systems, an LLM is used to generate this additional context and enable better retrieval. Let's look at a few such methods.

Expansion with generated answers:

In this technique, we pass the query to an LLM and ask it to generate a hypothetical, or imagined, answer. The generated answer is then concatenated with the original query before retrieval. The expanded query may therefore recover relevant documents that have no lexical overlap with the original query.

Fig. Query expansion with hypothetically generated answer

Let's look at a prompt that can be used to create the hypothetical answer.

prompt = """You are a helpful expert financial research assistant. Provide an example answer to the given question, \
that might be found in a document like an annual report.
Question: {query}"""
query="Was there significant turnover in the executive team?"
generate_answer_prompt= prompt.format(query=query)

Fig. Positional change in the query: original (red), enhanced (yellow)
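
Putting the pieces together, the flow is: ask the LLM for a hypothetical answer, append it to the original query, and run retrieval on the expanded query (essentially the HyDE idea). The sketch below assumes a hypothetical call_llm helper for whichever LLM client you use, plus the embedder and index from the earlier pipeline sketch.

# Query expansion with a hypothetically generated answer.
# `call_llm` is a placeholder for your LLM client; `embedder`, `index`
# and `chunks` come from the indexing sketch shown earlier.
hypothetical_answer = call_llm(generate_answer_prompt)

# Concatenate the original query with the imagined answer and retrieve
expanded_query = f"{query} {hypothetical_answer}"
expanded_vec = embedder.encode([expanded_query], normalize_embeddings=True)[0]
scores = index @ expanded_vec
retrieved_documents = [chunks[i] for i in np.argsort(scores)[::-1][:5]]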

Expansion with multiple queries:

In this technique, we use the LLM to generate several queries related to the original one and then pass these queries, along with the original query, to the vector database. All of the retrieved chunks are then passed as context to the generation step to answer the original query. This makes the retrieval task less under-specified, so the system is better able to retrieve the target documents.

Fig. Multi-query retrieval

The prompt used to generate multiple queries looks like this:

prompt ="""You are a helpful expert financial research assistant. Your users are asking questions about an annual report. \
Suggest up to five additional related questions to help them find the information they need, for the provided question. \
Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic.\
Make sure they are complete questions, and that they are related to the original question.\
Output one question per line. Do not number the questions.
Query: {query}"""
query = "What were the most important factors that contributed to increases in revenue?"

Multiple queries cover more related parts of the data than a single query does, increasing the chances of finding the correct answer.
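
In code, the flow is to ask the LLM for the related questions, run retrieval once per question (including the original), and de-duplicate the union of retrieved chunks before generation. A sketch under the same assumptions as before (a hypothetical call_llm helper and the embedder and index from the pipeline sketch):

# Multi-query expansion: retrieve for the original query and for each
# LLM-suggested related query, then de-duplicate the combined chunks.
multi_query_prompt = prompt.format(query=query)
related_queries = [q.strip() for q in call_llm(multi_query_prompt).splitlines() if q.strip()]

retrieved_documents = []
for q in [query] + related_queries:
    q_vec = embedder.encode([q], normalize_embeddings=True)[0]
    top = [chunks[i] for i in np.argsort(index @ q_vec)[::-1][:3]]
    retrieved_documents.extend(c for c in top if c not in retrieved_documents)  # keep order, drop duplicates

# `retrieved_documents` now forms the context for answering the original query.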

Cross-Encoder Re-ranking

When we perform a retrieval search on the index, the retrieved chunks are not necessarily returned in order of relevance. Given the limited context length of LLMs, it's important to pass only the chunks that are most relevant to the answer. Hence, we need to re-rank the retrieved chunks before the generation step so that the context includes the most relevant ones.

Fig. RAG with re-ranking

In simple terms, cross-encoding is a method used to compare and understand the relationship between two pieces of text. A cross-encoder processes the two input texts together as a single input, which allows the model to directly compare and contrast them. It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair.

Fig. Cross Encoder Re-ranking

The score is then used to rank the retrieved chunks in order of their relevance. From the re-ranked set of results, only the top ranked results are used for the generation step.

Let us look at the Python code to perform re-ranking, which can be plugged in wherever it is needed. We will use the cross-encoder from the sentence-transformers library.

import numpy as np
from typing import List
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rank_documents(cross_encoder: CrossEncoder, query: str, retrieved_documents: List[str]) -> dict:
    """
    Ranks retrieved documents based on their relevance to a given query using a cross-encoder model.

    Parameters:
    - cross_encoder (CrossEncoder): A cross-encoder model from the sentence-transformers library.
    - query (str): The query string for which the documents are to be ranked.
    - retrieved_documents (List[str]): A list of document strings that have been retrieved as potentially relevant to the query.

    Returns:
    - dict: A dictionary where the key is the rank position (starting from 0 for the most relevant document)
      and the value is the document string. The documents are ranked in descending order of relevance to the query.

    Usage:
    ranked_docs = rank_documents(cross_encoder, query, retrieved_documents)

    Note: This function requires the sentence-transformers library and a pretrained cross-encoder model.
    """
    # Score every (query, document) pair with the cross-encoder
    pairs = [[query, doc] for doc in retrieved_documents]
    scores = cross_encoder.predict(pairs)

    # Indices of documents sorted by descending relevance score
    ranks = np.argsort(scores)[::-1]

    # Map each rank position (0 = most relevant) to the corresponding document
    ranked_docs = {rank_pos: retrieved_documents[doc_idx] for rank_pos, doc_idx in enumerate(ranks)}
    return ranked_docs

# usage
ranked_docs = rank_documents(cross_encoder, query, retrieved_documents)
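
Because the returned dictionary is keyed by rank position, selecting the context for the generation step is just a matter of keeping the first few ranks, for example:

# Keep only the top-ranked chunks (here the top 3) as context for the LLM
top_n = 3
context = "\n\n".join(ranked_docs[rank] for rank in sorted(ranked_docs)[:top_n])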

Conclusion

While the naive RAG system presents a valuable framework for integrating retrieval and generation in AI applications, its effectiveness depends heavily on the accuracy and relevance of the retrieval process. The solutions proposed here are just a few techniques that can improve the efficacy of the retrieval system. Other techniques, such as embedding adaptors, fine-tuning the embedding models and deeper chunking approaches, can also be applied to enhance retrieval performance. We will study these techniques in detail in upcoming articles.
