Enhancing Retrieval Augmented Generation with Two-Stage Retrieval, FlashRank, and Query Expansion

Sherinegeorge
3 min read · Sep 3, 2024


In the era of large language models (LLMs) and Retrieval Augmented Generation (RAG), one of the biggest challenges is efficiently retrieving and utilizing relevant information from vast datasets. This is where two-stage retrieval systems come into play, designed to balance the competing needs of retrieval recall and LLM recall. By leveraging both FlashRank and Query Expansion techniques, we can further refine this process, leading to more accurate and cost-effective outcomes.

What is a Two-Stage Retrieval System?

A two-stage retrieval system is a sophisticated method used in RAG pipelines that involves two critical steps: initial retrieval and reranking.

  1. Initial Retrieval: This first stage uses a vector database powered by bi-encoders or sparse embedding models to quickly retrieve a set of potentially relevant documents. The primary focus here is on retrieval recall — casting a wide net to ensure that as much relevant information as possible is captured.
  2. Reranking: After the initial retrieval, the second stage employs a reranker, often a cross-encoder, to reorder the documents based on their relevance to the user’s query. This step is crucial for maximizing LLM recall, ensuring that the most contextually relevant documents are prioritized and passed to the LLM.
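The two stages above can be sketched in plain Python. Everything here is illustrative: the embeddings, the document IDs, and both scoring functions are toy stand-ins (a real system would use a vector database for stage one and a cross-encoder model for stage two).

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy "bi-encoder" index: precomputed document embeddings,
# standing in for a vector database.
DOC_EMBEDDINGS = {
    "doc_cooking": [0.9, 0.1, 0.0],
    "doc_rag":     [0.1, 0.9, 0.2],
    "doc_rerank":  [0.0, 0.8, 0.6],
}

def first_stage_retrieve(query_vec, k=2):
    """Stage 1: cast a wide net by vector similarity (favors recall)."""
    scored = [(cosine(query_vec, v), doc) for doc, v in DOC_EMBEDDINGS.items()]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def rerank(query_text, doc_ids):
    """Stage 2: a stand-in cross-encoder scoring each (query, doc) pair."""
    def pair_score(doc_id):
        # A real cross-encoder runs a full transformer pass over the
        # concatenated pair; here we fake relevance with token overlap.
        return len(set(query_text.split()) & set(doc_id.split("_")))
    return sorted(doc_ids, key=pair_score, reverse=True)

candidates = first_stage_retrieve([0.0, 0.9, 0.5], k=2)
final = rerank("what is rag", candidates)
```

Note how the reranker can reorder the first-stage candidates: a document that was only second-best by raw vector similarity can move to the front once the query and document are examined together.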

Why Use Rerankers?

While bi-encoders are efficient in compressing a document’s meaning into a single vector, this process can lead to information loss. Rerankers, however, perform a more nuanced analysis by examining both the query and the document together, resulting in more precise relevance scores. The trade-off is that rerankers are slower, as they require full transformer inference for each query-document pair, unlike vector searches that use precomputed embeddings.

The Role of Context Windows in LLMs

A key limitation when working with LLMs is the context window — the maximum amount of text the model can process at one time. While increasing retrieval recall by fetching more documents seems beneficial, it can paradoxically degrade LLM recall if the additional information is not highly relevant. This occurs because LLMs struggle to effectively utilize information that is buried deep within their context windows.
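One practical consequence is that documents should be packed into the context window in relevance order, and the rest dropped rather than buried. A minimal sketch (the word-count "tokenizer" and the budget are simplifications; real systems use the model's actual tokenizer):

```python
def pack_context(docs, budget_tokens):
    """Greedily add documents (already sorted by relevance) until the
    context budget is exhausted; remaining documents are dropped rather
    than buried deep in the window where the LLM is unlikely to use them."""
    packed, used = [], 0
    for doc in docs:
        cost = len(doc.split())  # crude stand-in for a real tokenizer
        if used + cost > budget_tokens:
            break
        packed.append(doc)
        used += cost
    return packed

docs = [
    "reranked most relevant passage about rerankers",
    "second passage with supporting details",
    "marginally related passage that would only dilute the context",
]
context = pack_context(docs, budget_tokens=12)
```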

Optimizing Retrieval with FlashRank

Enter FlashRank, an ultra-lightweight reranking library that makes the second stage fast and inexpensive. By scoring the first-stage candidates with compact cross-encoder models, FlashRank ensures that the most pertinent documents fill the LLM’s limited context window, without overwhelming the model with irrelevant information.

  • Focused Context Use: By keeping only the top-ranked documents, FlashRank lets the LLM concentrate on the most significant content, enhancing LLM recall.
  • Cost Efficiency: Instead of running an expensive cross-encoder over every document, FlashRank uses small, fast models and reranks only the candidate set from the first stage, reducing computational cost and speeding up the reranking process.
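The cost-saving idea, reranking only the most promising candidates rather than the full list, can be sketched in plain Python. This is not FlashRank's actual API; the scorer below is a hypothetical token-overlap stand-in for a cross-encoder, which in a real pipeline would be the expensive per-pair call:

```python
def selective_rerank(query, candidates, score_fn, rerank_top=3):
    """Rerank only the head of the candidate list. `candidates` is assumed
    to be pre-sorted by first-stage similarity; each call to `score_fn`
    stands in for one full transformer pass, so we pay that price only
    for the most promising documents."""
    head, tail = candidates[:rerank_top], candidates[rerank_top:]
    reranked = sorted(head, key=lambda d: score_fn(query, d), reverse=True)
    return reranked + tail  # tail keeps its cheap first-stage order

# Hypothetical scorer: token overlap as a stand-in for a cross-encoder.
def overlap_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

candidates = ["alpha beta", "query terms here", "beta gamma",
              "delta", "epsilon", "zeta"]
result = selective_rerank("query terms", candidates, overlap_score, rerank_top=3)
```

Only the first three candidates are rescored; the remaining three keep their original order, so the reranking cost is capped regardless of how many documents the first stage returns.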

Enhancing Initial Retrieval with Query Expansion

Another technique to improve the initial retrieval stage is Query Expansion. This approach broadens the search criteria by expanding the original user query with additional related terms or synonyms. By doing so, it captures a more diverse range of documents, thereby increasing retrieval recall.

  • Semantic Enrichment: Query Expansion involves enriching the original query with semantically related terms, such as synonyms or related concepts. This broadens the search and helps in retrieving documents that might not directly match the original query but are contextually relevant.
  • Improved Recall: By expanding the scope of the query, Query Expansion increases the likelihood of retrieving all relevant documents, particularly in domains with varied terminologies.
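A minimal sketch of dictionary-based expansion. The synonym table here is hypothetical; in practice the expansions might come from a thesaurus, embedding-space neighbors, or an LLM prompt:

```python
# Hypothetical synonym table (a real system would source these from a
# thesaurus, embeddings, or an LLM).
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fix": ["repair"],
}

def expand_query(query):
    """Append known synonyms for each query term to widen the search."""
    terms = query.split()
    expanded = list(terms)
    for term in terms:
        for syn in SYNONYMS.get(term, []):
            if syn not in expanded:
                expanded.append(syn)
    return " ".join(expanded)

expanded = expand_query("fix car engine")
```

The expanded query now also matches documents that say "repair" or "automobile" but never use the user's exact wording, which is precisely the recall gain described above.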

Combining FlashRank and Query Expansion for Optimal Results

Integrating FlashRank and Query Expansion into a two-stage retrieval system creates a powerful synergy:

  • Enhanced Initial Retrieval: Query Expansion increases the number and diversity of documents retrieved, ensuring a comprehensive set of potentially relevant information.
  • Refined Reranking: FlashRank then refines this expanded set, homing in on the most contextually relevant documents to feed into the LLM, optimizing the use of its context window.

By leveraging these techniques, RAG pipelines can effectively balance retrieval recall with LLM recall, enhancing overall performance and efficiency.

Conclusion

In the evolving landscape of AI and LLMs, optimizing retrieval systems is crucial for maximizing the effectiveness of RAG pipelines. Two-stage retrieval systems, when augmented with strategies like FlashRank and Query Expansion, offer a robust solution. They not only ensure comprehensive retrieval of relevant information but also fine-tune this information to make the most efficient use of the LLM’s capabilities. By adopting these advanced techniques, we can pave the way for more accurate, efficient, and cost-effective AI applications.
