RAG: Part 5: Retrieval

Mehul Jain
9 min read · Apr 9, 2024

--

Retrieval is a crucial component of any information retrieval system, particularly in search engines and question-answering (QA) applications. RAG combines information retrieval with LLMs to generate responses. While retrieval identifies relevant documents, reranking refines these results, prioritizing the most pertinent information for the LLM.

Photo by Brett Jordan on Unsplash

In this series of blogs, we have discussed how we can chunk the data, convert it to embeddings, and store them in vector DBs.

In this blog, we will delve into the retrieval process and the reranking algorithms commonly used.

Retrieval

Retrieval is the process of finding documents or passages from a large collection that are most relevant to a given query. In RAG, the retrieved documents become the source material for the LLM to understand the context and generate its response.

  • Efficiency: Retrieval methods need to be efficient, especially for large document collections. Techniques like indexing and caching can significantly improve retrieval speed (a small indexing sketch follows this list).
  • Relevance Tuning: Retrieval models can be tuned based on specific tasks and user needs. For instance, you might prioritize documents with factual information for some queries or creative content for others.
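
To make the efficiency point concrete, here is a minimal sketch of approximate vector search with an inverted-file (IVF) index, assuming the FAISS library; any ANN library (Annoy, ScaNN, HNSWlib) would play the same role, and the dimensions and random vectors are placeholders for real chunk embeddings.

```python
# pip install faiss-cpu numpy   (assumed dependencies)
import numpy as np
import faiss

d = 384                                        # embedding dimension (e.g., a MiniLM encoder)
doc_vectors = np.random.rand(10_000, d).astype("float32")  # stand-in for real chunk embeddings

# IVF index: cluster vectors into `nlist` cells, then scan only a few cells per query
# instead of comparing the query against every stored vector.
nlist = 100
quantizer = faiss.IndexFlatL2(d)               # coarse quantizer used for cell assignment
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(doc_vectors)                       # learn the cell centroids
index.add(doc_vectors)

index.nprobe = 8                               # cells scanned per query: speed vs. recall trade-off
query_vec = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query_vec, 5)    # ids of the 5 nearest chunks
```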

Search Methods

There are various approaches to information retrieval, each with its strengths and weaknesses:

  • Keyword Search: This is the most basic method, where documents are matched based on the presence of specific keywords from the query. While simple and efficient, it can struggle with synonyms, paraphrases, and complex information needs.
  • Vector Search: This method represents documents and queries as vectors in a high-dimensional space. Documents with similar semantic meanings will have closer vectors. Vector search is efficient for large datasets and captures semantic relationships better than keyword search.
  • Hybrid Search: Combining multiple search methods can leverage the strengths of each. For instance, you could use a keyword search for initial retrieval and then refine it with a vector search for reranking, or vice versa (a minimal hybrid-scoring sketch follows this list).
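
As a rough illustration of the hybrid idea, the sketch below blends a crude keyword-overlap score with cosine similarity over precomputed embeddings. The equal weighting (alpha = 0.5) and the toy scoring functions are assumptions, not a prescribed recipe.

```python
# Hybrid scoring sketch: blend a keyword-overlap score with a dense cosine score.
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document (crude keyword match)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_scores(query, docs, query_vec, doc_vecs, alpha=0.5):
    """alpha weights keyword relevance; (1 - alpha) weights semantic similarity."""
    return [
        alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, dv)
        for doc, dv in zip(docs, doc_vecs)
    ]
```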

Now that we have a list of the chunks most similar to the user query, we want to filter or reorder them so that we can select the top-k chunks and pass them to the LLM.

Reranking

Before we discuss reranking algorithms, let’s discuss why reranking matters in RAG:

  • Initial Retrieval Might Not Be Perfect: The initial retrieval stage in RAG can return a large pool of documents that might be generally relevant, but lack the specific focus needed for the LLM to generate an accurate response.
  • Diversity: Reranking helps add diversity to the retrieved chunks. If two chunks contain similar information, both will score highly against the query, but the information they contribute is redundant.
  • Improves Information Quality for LLM: Reranking ensures the most relevant and informative documents are fed to the LLM, leading to higher quality outputs.
  • Reduces Noise: By filtering out less relevant documents, reranking reduces the chance of the LLM incorporating misleading or irrelevant information.

Approaches

Learning to Rank

Learning-to-Rank (LTR) methods are a powerful approach for reranking in RAG systems. They leverage machine learning to refine the initial retrieval results, prioritizing documents most relevant to the specific user query. Here’s a deeper look at LTR methods:

Core Concept:

  • Unlike traditional information retrieval methods with handcrafted ranking functions, LTR models learn to rank documents based on training data.
  • This training data consists of document-query pairs with an associated relevance score indicating how relevant the document is to the query on a predefined scale (e.g., binary relevance labels or graded relevance scores).
  • The LTR model analyzes this data, learning the patterns and features that distinguish highly relevant documents from less relevant ones.

There are three main categories of LTR models, each with its own approach:

  • Pointwise Methods: These models treat each document-query pair independently. They predict a relevance score for each document based on its features (e.g., keywords, document length) and the query. Common examples include linear regression, random forests, and gradient-boosting machines (a minimal pointwise sketch follows this list).
  • Pairwise Methods: These models directly compare pairs of documents within the retrieved set. They learn to determine which document in a pair is more relevant to the query. Common examples include RankNet and RankSVM.
  • Listwise Methods: These models consider the entire ranked list of documents at once. They aim to optimize a ranking metric like Mean Average Precision (MAP) or Normalized Discounted Cumulative Gain (NDCG) to produce the best overall ranking. Examples include ListNet and LambdaMART.
  • LambdaMART: This model uses gradient boosting to learn a ranking function from labelled data.
  • RankSVM: This approach leverages SVMs to learn a hyperplane that separates relevant from irrelevant documents.
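
To make the pointwise idea concrete, here is a minimal sketch that trains a gradient-boosting regressor on toy query-document features and uses it to rerank candidates. The feature set, the tiny labelled dataset, and the example queries are illustrative assumptions.

```python
# Pointwise LTR sketch: regress a graded relevance label from simple query-document features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def features(query: str, doc: str) -> list:
    """Toy features: raw term overlap, overlap ratio, and document length."""
    q_terms, d_terms = set(query.lower().split()), doc.lower().split()
    overlap = len(q_terms & set(d_terms))
    return [overlap, overlap / max(len(q_terms), 1), len(d_terms)]

# Toy training data: (query, document, graded relevance label on a 0-2 scale).
train = [
    ("reset password", "How to reset your account password", 2),
    ("reset password", "Changing your password and security settings", 1),
    ("reset password", "Pricing plans and billing FAQ", 0),
]
X = np.array([features(q, d) for q, d, _ in train])
y = np.array([label for _, _, label in train])
model = GradientBoostingRegressor().fit(X, y)

# Rerank new candidates for a query by predicted relevance, highest first.
query = "reset password"
candidates = ["Account password reset guide", "Holiday schedule for the office"]
scores = model.predict(np.array([features(query, d) for d in candidates]))
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```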

Cross-Encoding ReRanker

Cross-encoders are a widely used neural approach for reranking in retrieval systems. While traditional Learning-to-Rank (LTR) models work from handcrafted document features and relevance scores, cross-encoders learn relevance directly from the raw text of the query and each candidate document.

Core Concept:

  • A cross-encoder is a neural network (typically a transformer such as BERT) that takes the query and a candidate document together as a single input sequence and outputs a relevance score for the pair.
  • Because the model attends over the query and the document jointly, it captures fine-grained interactions between them. This makes cross-encoders more accurate than bi-encoders (which embed queries and documents separately and compare vectors), at the cost of running the model once per query-document pair.

Training Process:

  1. Data Preparation: Training data consists of query-document pairs with relevance labels indicating how relevant each document is to the query.
  2. Model Architecture: The query and the document are concatenated into a single input (e.g., query [SEP] document) and passed through one transformer encoder; a small classification or regression head on top predicts the relevance score.
  3. Relevance Learning: During training, the model’s parameters are adjusted so that relevant query-document pairs receive higher scores than irrelevant ones, typically using a binary cross-entropy or pairwise ranking loss. This ensures the model prioritizes documents that actually address the query.

Reranking with Cross-Encoders:

  1. Scoring Stage: Once trained, the cross-encoder is run on the query paired with each document returned by the initial retrieval stage.
  2. Relevance Calculation: Each forward pass produces a relevance score for that (query, document) pair; higher scores indicate documents more likely to address the query.
  3. Ranked List Generation: Documents are then ranked by their relevance scores. Documents with higher scores are positioned higher in the final ranked list.

A few commonly used rerankers are cross-encoders from the SBERT (sentence-transformers) family, MiniLM-based MS MARCO cross-encoders, and ColBERT (a late-interaction variant).
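
A minimal reranking sketch using the sentence-transformers CrossEncoder class with a public MS MARCO MiniLM checkpoint; model names and library versions change, so treat this as a sketch rather than a canonical recipe.

```python
# Cross-encoder reranking sketch (assumes: pip install sentence-transformers).
from sentence_transformers import CrossEncoder

# A small public reranker trained on MS MARCO; any cross-encoder checkpoint works here.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do vector databases store embeddings?"
retrieved_chunks = [
    "Vector databases index embeddings for fast approximate nearest-neighbour search.",
    "Our cafeteria menu changes every Friday.",
    "Embeddings are stored alongside metadata and retrieved by similarity.",
]

# Score each (query, chunk) pair jointly, then keep the highest-scoring chunks.
scores = model.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]
top_k = reranked[:2]
```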

Image source: ColBERT

Maximum Marginal Relevance

MMR is a reranking technique used in information retrieval systems like RAG. It aims to achieve a balance between two key aspects of reranking:

  1. Relevance to the Query: Documents retrieved should be highly relevant to the user’s query.
  2. Diversity of Information: The retrieved documents should offer a variety of perspectives and avoid redundancy.

Core Concept:

Step 1: Initial Retrieval: A set of documents is retrieved based on the query using a standard retrieval method like keyword matching or vector search.

Step 2: Iterative Selection: MMR selects documents one by one, at each step choosing the document that best balances the following criteria:

  • High Relevance to the Query: Each chosen document should be highly relevant to the user’s query. This ensures the overall retrieved set remains informative.
  • Minimal Similarity to Previously Selected Documents: Documents with high similarity to already chosen ones are penalized. This encourages diversity in the retrieved set, preventing redundancy and ensuring a wider range of information is captured.

Step 3: Similarity Measure: MMR typically uses a similarity measure like cosine similarity to assess the relatedness between documents and the query, as well as between documents themselves.

Step 4: Balancing Act: A weight (often denoted by λ) is used to balance the importance of relevance and diversity. A higher λ emphasizes relevance, while a lower λ prioritizes diversity. The optimal weight depends on the specific application and desired outcome.

Step 5: Final Ranked List: After iteratively selecting documents based on MMR criteria, a final ranked list is obtained. This list prioritizes highly relevant documents while incorporating diverse perspectives to provide a more comprehensive understanding of the topic.
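
A minimal sketch of the iterative MMR selection over precomputed embeddings; cosine similarity and λ = 0.7 are assumed defaults rather than recommendations.

```python
# MMR selection sketch: trade off query relevance against redundancy among picks.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Return indices of k documents chosen by Maximum Marginal Relevance.

    lam close to 1.0 favours relevance; lam close to 0.0 favours diversity.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to documents that have already been selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```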

BM25

Best Match 25 (BM25, the 25th in a series of “Best Match” weighting schemes developed for the Okapi retrieval system) is a widely used and effective method for ranking documents based on their relevance to a specific query.

Core Concept:

  • BM25 assigns a score to each document in a collection, indicating how relevant it is to the user’s query.
  • Documents with higher scores are considered more relevant and are positioned higher in the ranked list presented to the user.

BM25 takes into account several factors when calculating a document’s score:

  1. Term Frequency (TF): This measures how often a query term appears in a document. The more frequently a term appears, the higher the initial weight given to the document. However, BM25 saturates this effect rather than rewarding raw frequency, so overly repetitive documents do not dominate.
  2. Inverse Document Frequency (IDF): This factor considers how rare a term is across the entire document collection. A term that appears in many documents is less informative for distinguishing relevant documents. Conversely, terms that are specific and occur in only a few documents receive a higher IDF weight, boosting the score of documents containing them.
  3. Document Length: BM25 normalizes term frequency by document length relative to the average length of the collection. This prevents long documents from ranking highly simply because they repeat the query terms many times, while still giving credit to shorter documents that cover the query concisely.

The BM25 score for a document (d) and a query (q) is typically calculated as a sum over the query terms t (a from-scratch sketch follows the parameter list):

score(d, q) = sum over t in q of: idf(t) * tf(t, d) * (k1 + 1) / (tf(t, d) + k1 * (1 - b + b * |d| / avgdl))

  • tf(t, d) represents the term frequency of term t in document d.
  • idf(t) represents the inverse document frequency of term t.
  • |d| is the length of document d, and avgdl is the average document length in the collection.
  • k1 is a tuning parameter that controls term-frequency saturation: repeated occurrences of a term keep adding to the score, but with diminishing returns (typical values are around 1.2–2.0).
  • b is a tuning parameter that controls document-length normalization: b = 1 fully normalizes for length, b = 0 ignores it (a typical value is 0.75).
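
A minimal from-scratch sketch of the scoring function above; k1 = 1.5 and b = 0.75 are typical defaults, and the whitespace tokenization is deliberately naive.

```python
# From-scratch BM25 sketch matching the formula above.
import math
from collections import Counter

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    N = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / N
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in tokenized_docs if t in d) for t in set(query_terms)}
    idf = {t: math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            numer = tf[t] * (k1 + 1)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf[t] * numer / denom
        scores.append(score)
    return scores

docs = [d.lower().split() for d in [
    "BM25 ranks documents by term frequency and inverse document frequency",
    "The weather was pleasant all week",
]]
print(bm25_scores("bm25 document ranking".split(), docs))
```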

LLM ReRanker

LLM ReRanker is a relatively new approach that utilizes the capabilities of the Large Language Model (LLM) itself to refine the retrieved documents. Unlike traditional reranking methods that rely on pre-defined features or external models, LLM ReRankers leverage the LLM’s understanding of language and context to assess the relevance of retrieved documents for the specific query.

Core Concept:

  • Instead of training a separate reranking model, LLM ReRankers directly involve the LLM in the reranking process.
  • This approach allows the LLM to consider not only the retrieved documents themselves but also the context of the user’s query and the overall task at hand.

Implementation Approaches:

There are two main ways LLM ReRankers can be implemented:

Generative Ranking:

  • The LLM takes the query and all retrieved documents as input.
  • It then generates a ranking score or relevance label for each document based on its understanding of how well the document aligns with the query and contributes to the overall task.
  • This approach leverages the LLM’s ability to grasp the nuances of language and consider the broader context for effective reranking (a minimal prompt-based sketch follows this list).
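
A minimal sketch of generative ranking via prompting. Here call_llm is a hypothetical stand-in for whatever chat or completions client you use; the prompt wording and the output format are assumptions.

```python
# Generative LLM reranking sketch. `call_llm` is a hypothetical function that
# takes a prompt string and returns the model's text reply.
def build_rerank_prompt(query: str, chunks: list) -> str:
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "You are ranking passages for a question-answering system.\n"
        f"Question: {query}\n\n"
        f"Passages:\n{numbered}\n\n"
        "Return the passage indices ordered from most to least relevant, "
        "as a comma-separated list (for example: 2,0,1)."
    )

def llm_rerank(query: str, chunks: list, call_llm) -> list:
    reply = call_llm(build_rerank_prompt(query, chunks))
    order = [int(i) for i in reply.strip().split(",") if i.strip().isdigit()]
    return [chunks[i] for i in order if i < len(chunks)]
```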

Extractive Ranking with Feedback:

  • The LLM first analyzes the query and generates a short summary or key points of what it considers the most relevant information needed to address the query.
  • This summary acts as a “feedback loop” for the retrieval stage. Documents that contain the information identified by the LLM in its summary are then prioritized in the final ranking.
  • This approach leverages the LLM’s ability to identify key information from text and guides the reranking process based on its understanding of the query’s intent (a small sketch of this idea follows).
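
A small sketch of the same idea: the LLM first extracts key terms, and the retrieved chunks are then reordered by how much of that vocabulary they cover. call_llm is again a hypothetical stand-in, and term overlap is a deliberately simple proxy for “contains the identified information”.

```python
# Extractive-ranking-with-feedback sketch. `call_llm` is a hypothetical function
# that takes a prompt string and returns the model's text reply.
def extract_key_terms(query: str, call_llm) -> set:
    prompt = (
        f"List the key facts or terms a passage must contain to answer: {query}\n"
        "Answer with a short comma-separated list."
    )
    return {t.strip().lower() for t in call_llm(prompt).split(",") if t.strip()}

def rerank_by_feedback(query: str, chunks: list, call_llm) -> list:
    key_terms = extract_key_terms(query, call_llm)
    def coverage(chunk: str) -> int:
        # Count how many of the LLM-identified terms appear in the chunk.
        text = chunk.lower()
        return sum(1 for term in key_terms if term in text)
    return sorted(chunks, key=coverage, reverse=True)
```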

Cohere has also developed a hosted reranking model (Cohere Rerank) that is simple to use.
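
As an illustration, here is a hedged sketch of calling a hosted reranker like Cohere’s, assuming the cohere Python SDK; parameter names and model IDs vary by SDK version, so check the current documentation.

```python
# Hosted reranker sketch (assumes: pip install cohere). The API key, model ID,
# and response fields reflect the SDK at the time of writing and may change.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

response = co.rerank(
    model="rerank-english-v3.0",
    query="how do I rotate my API keys?",
    documents=[
        "Rotating credentials: a step-by-step guide.",
        "Our office is closed on public holidays.",
    ],
    top_n=1,
)
for result in response.results:
    print(result.index, result.relevance_score)
```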

Choosing the Right Reranking Algorithm:

The optimal reranking algorithm for your RAG system depends on various factors:

  • Data Availability: Supervised learning models like LTRs require labelled data, which might not be readily available.
  • Task Complexity: If your task demands a deep understanding of semantic relationships, neural rerankers might be more suitable.

Conclusion

Reranking is a powerful technique for enhancing the performance of RAG models. By incorporating a reranking stage, you can ensure your RAG system delivers the most relevant and informative responses to user queries.

  • Improved Accuracy and Relevance: By focusing on the most relevant information, reranking leads to more accurate and informative responses from the LLM.
  • Enhanced Recall at the Top: Reranking can surface relevant documents that were retrieved but buried low in the initial ranking, so they actually make it into the top-k context passed to the LLM.
  • Better Efficiency for LLMs: By providing higher quality inputs, reranking reduces the computational burden on the LLM, leading to more efficient processing.

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.
