Improve Retrieval Augmented Generation (RAG) with Re-ranking

ASHPAK MULANI
Feb 24, 2024


In the world of GenAI, you’ll often come across the term RAG (Retrieval Augmented Generation). Essentially, RAG is about giving additional relevant information (context) to large language models (LLMs) along with a query to help them generate better, more relevant responses.

Setting up a basic RAG system isn’t too complicated, but it often falls short of delivering highly accurate responses. One of the main reasons is that the setup doesn’t always provide the most precise context to the LLM.

In the architecture depicted below, only the top_k responses from the vector search are passed to the LLM as context. But what happens if other returned vectors (amber colored) contain more relevant information related to the query? In that case, we’re not passing this additional relevant information to the LLM, which can lead to less accurate responses.

The concern here is that by only considering the top_k responses, we might miss out on valuable context that could improve the accuracy of the LLM’s responses. This limitation highlights the need for a more robust approach to selecting and providing context to the LLM, ensuring that it has access to the most relevant information to generate accurate responses.

RAG implementation with vector search

What is the problem?

In RAG, the primary focus lies in conducting semantic searches across extensive datasets, which could contain tens of thousands of documents. To perform semantic search, these documents are transformed into vectors, enabling comparison with query vectors using similarity metrics like cosine similarity.

However, during the conversion of documents into vectors, there’s potential for information loss as vectors represent content in a compressed numerical format. Additionally, larger documents often need to be divided into smaller chunks for embedding into vector format, making it challenging to maintain the context across all the smaller parts.

When implementing vector search in RAG, the issue of context loss becomes apparent. This is because we typically only consider the top_k results from the vector search, potentially overlooking relevant information that falls below this cutoff. Consequently, when the LLM receives top_k results as context that may not be entirely relevant to the query, it can lead to poor response quality from the LLM.

The reason why we can’t simply send all search results from vector search to the LLM is twofold:

  1. LLM Context Limitation: LLMs have constraints on how much text can be passed to them, known as the “context window.” While recent advancements have led to larger context windows, such as Anthropic’s Claude with 100K tokens or GPT-4 with 32K tokens, a larger context window doesn’t guarantee more accurate results. Even with a larger window, there’s still a limit to how much information the LLM can effectively process.
  2. LLM Recall Performance: LLM recall refers to the ability of the model to retrieve information from the given context. Research indicates that LLM recall performance can degrade if too many tokens are included in the context window. Therefore, simply stuffing more information into the context window isn’t always a viable solution, as it can negatively impact the LLM’s ability to recall relevant information.

These points are discussed in more detail in the paper referenced here.

Up to this point, it’s evident that implementing RAG involves more than just storing documents in a vector database and layering an LLM on top. While this approach may yield results, it falls short of delivering production-grade performance.

What is the fix?

As part of refining RAG implementation, one crucial step is re-ranking.

A re-ranking model is a type of model that calculates a matching score for a given query and document pair. This score can then be utilized to rearrange vector search results, ensuring that the most relevant results are prioritized at the top of the list.

RAG implementation with vector search and re-ranking

In summary, the initial step involves retrieving relevant documents from a large dataset using vector search due to its speed. Once these related documents are obtained, re-ranking is applied to prioritize the most relevant ones at the top. These top-ranked documents, which align closely with the user’s query, are then passed to the LLM to enhance the accuracy and precision of results.

It’s important to note that re-ranker models are typically slower compared to vector search. Therefore, they are not utilized in the initial step of finding relevant documents related to the user query to maintain efficiency.

Implementation

Let’s use the Hugging Face datasets library to load an existing dataset as a sample for this example.

We will use an existing Hugging Face dataset that contains ML papers. arXiv papers seem like a good source, so we will use the first data source returned below.

Load the dataset using the datasets library from Hugging Face. The dataset has more than 100K items.
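Below is a minimal sketch of this step. The article doesn’t name the exact dataset, so the identifier used here (CShorten/ML-ArXiv-Papers, an arXiv ML-papers dataset with title and abstract fields) is an assumption; any similar dataset will work.

```python
from datasets import load_dataset

# Load an arXiv ML-papers dataset from the Hugging Face Hub.
# The dataset name is an assumption; any dataset exposing 'title' and 'abstract' works.
dataset = load_dataset("CShorten/ML-ArXiv-Papers", split="train")

print(dataset)                      # shows the features and row count (100K+ items)
print(dataset[0]["title"])
print(dataset[0]["abstract"][:200])
```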

Initialize the OpenAI and Pinecone client objects.
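A minimal sketch, assuming the current openai and pinecone Python SDKs (older versions configure these at module level rather than via client objects) and API keys exported as environment variables:

```python
import os

from openai import OpenAI
from pinecone import Pinecone

# Client objects used in the following steps; keys are read from the environment.
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
```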

Create a Pinecone index to store the embeddings. The index must be created with the same vector dimension as the embedding model; for example, text-embedding-ada-002 produces vectors with 1,536 dimensions.
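A sketch of index creation, assuming a serverless Pinecone index and a hypothetical index name; the only hard requirement is that the dimension matches the embedding model (1536 for ada-002):

```python
from pinecone import ServerlessSpec

index_name = "rag-reranking-demo"  # hypothetical index name

# Create the index only if it doesn't already exist.
# dimension=1536 matches text-embedding-ada-002; metric is cosine similarity.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)
```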

To efficiently store and manage our data from Hugging Face, which includes fields like ‘title’ and ‘abstract’, we’ll store the embeddings of the ‘abstract’ field in a vector database. Additionally, we’ll keep the ‘title’ and ‘abstract’ as metadata fields to maintain a plain-text representation of the embedded data. This approach allows for easier interpretation of the data associated with each vector when retrieving search results from the vector database.

To store or update (upsert) records in the vector store while preserving metadata for each individual record, we’ll need to define a mapping of objects. This mapping will enable us to associate each vector with its corresponding metadata, such as the title and other relevant information. By establishing this mapping, we can efficiently manage and query our data while retaining important contextual details for each record. For further reference on how to perform upsert operations with metadata, you can consult the documentation or resources specific to your chosen vector database.

This is the important step where we create embeddings for the input data and store them in the vector DB.
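The sketch below combines the metadata mapping described above with the embedding step: each record’s abstract is embedded with ada-002 and upserted together with its title and abstract as metadata. The batch size, ID scheme, and number of records are illustrative assumptions.

```python
batch_size = 100
num_records = 10_000  # demo subset; use len(dataset) to embed everything

for start in range(0, num_records, batch_size):
    batch = dataset[start : start + batch_size]   # dict of column -> list of values
    titles = batch["title"]
    abstracts = batch["abstract"]

    # Create embeddings for the whole batch in one API call.
    res = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=abstracts,
    )
    embeddings = [record.embedding for record in res.data]

    # Mapping of id -> vector values -> metadata, so each vector keeps a
    # plain-text representation of what was embedded.
    vectors = [
        {
            "id": str(start + i),
            "values": embeddings[i],
            "metadata": {"title": titles[i], "abstract": abstracts[i]},
        }
        for i in range(len(abstracts))
    ]
    index.upsert(vectors=vectors)
```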

Now that we have all the data embedded in the vector DB, let’s query the database with a simple question and retrieve the top 25 matching vectors, as shown below.
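A sketch of the query step; the question text is just an example, not the one used in the article:

```python
query = "why do we need reinforcement learning from human feedback?"  # example question

# Embed the query with the same model used for the documents.
query_embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002",
    input=[query],
).data[0].embedding

# Retrieve the top 25 matching vectors along with their metadata.
search_results = index.query(vector=query_embedding, top_k=25, include_metadata=True)

for i, match in enumerate(search_results.matches):
    print(i, round(match.score, 4), match.metadata["title"])
```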

Now, let’s delve into the critical aspect of ranking the matching vectors returned by the query.

In this discussion, we’ll utilize an existing Cohere model for reranking, focusing on the implementation within RAG rather than the intricacies of building and training reranking models. It’s worth noting that there are methods available to train custom reranking models tailored to specific requirements.

To begin, we’ll instantiate a Cohere reranking model and pass it the items returned by the vector search alongside the original query. This process yields re-ranked results, enhancing the relevance of the returned documents.
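A sketch using the Cohere Python SDK; the model name and top_n value are assumptions (rerank-english-v2.0 was Cohere’s general-purpose English rerank model at the time of writing):

```python
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Use the abstracts returned by the vector search as rerank candidates.
docs = [match.metadata["abstract"] for match in search_results.matches]

# Score each (query, document) pair and keep the three most relevant documents.
reranked = co.rerank(
    query=query,
    documents=docs,
    top_n=3,
    model="rerank-english-v2.0",
)

for result in reranked.results:
    # result.index points back into docs, i.e. the original vector-search position.
    print(result.index, round(result.relevance_score, 4), docs[result.index][:120])
```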

In the results block displayed below, items numbered 11, 24, and 16 appear to be the most relevant to the query based on ranking. Typically, these blocks might have been disregarded during top-k filtering, potentially leading to the LLM receiving less relevant context. This underscores the significance of reranking in refining the relevance of search results.

In comparing normal search results to reranked search results, there’s a notable shift in the ranking of blocks. For instance, the 0th block in normal results is replaced by the 11th block in reranked results, the 1st block by the 24th block, and so forth.

Let’s consider comparing two blocks: the 2nd block from normal search with the 16th block from reranked search. This analysis reveals a clear distinction, indicating that the 16th block from reranked search is significantly more relevant to our query compared to the 2nd block from normal search. Such a comparison underscores the efficacy of reranking in improving result relevance and ensuring that top-ranked documents better align with user queries before passing them in context to LLM.

Conclusion

In summary, this article has demonstrated the substantial benefits of reranking within the RAG framework. By implementing a reranking process, we’ve observed a notable increase in the relevance of retrieved information. This enhancement translates to significantly improved performance for RAG, as we maximize the inclusion of pertinent information while minimizing noise input into our LLM.

Through our exploration, we’ve highlighted how a two-stage retrieval system offers the advantages of both scale and quality performance. Utilizing vector search enables efficient searching at scale, while the incorporation of reranking ensures that only the most relevant documents are prioritized, thus enhancing the overall quality of results within the RAG framework.

Reference: https://www.pinecone.io/learn/series/rag/rerankers/
