Improving RAG Retrieval Quality: 4 Optimization Guidelines from Shortwave

pamperherself
7 min read · Sep 6, 2024



Previously, I wrote an article titled Hands-on Experience with AnythingLLM: RAG Tools Seem Superficial and Too Rigid, which touched on optimizing the RAG mechanism and improving retrieval quality. Recently, I finally had time to read a long-saved article on the same topic, and it turns out this was already a known issue back in March: even the CEO of Perplexity AI expressed frustration that content retrieved via RAG's vector similarity is often of poor quality and fails to surface the most accurate information.

This article provides insights into the challenges of RAG retrieval, using the design approach of the AI email assistant company Shortwave as a case study. It’s rare for companies to disclose the detailed technical mechanisms behind their products.

In Section 01, we’ll also explore whether to choose fine-tuning or RAG, and whether RAG is still needed with LLMs that support long context windows.


01

Long-context LLMs are now available (Gemini 1.5, for example, has been tested with up to 10M tokens of context), and many believe RAG is unnecessary, since hundreds of documents can simply be uploaded for the LLM to read and process.

However, three key issues must be considered:

  1. Relevance: Uploading such a large volume of documents still requires assessing which ones are most relevant to the query; otherwise, the answer may drift off target.
  2. Performance impact: How does uploading a large number of documents affect model performance? Can the model still process them efficiently and answer related questions accurately?
  3. Computational cost: Uploading too many documents significantly increases compute costs, which is one reason many LLMs still cannot support very long context windows. For example, GPT-4 may fail outright when asked to process around 30k characters of content.

An interesting analogy for long context windows: even when a machine has plenty of RAM, many operations still read from and write to the hard drive rather than keeping everything in memory.

RAG & Fine-tuning: Which to Choose and What’s the Difference?

https://arxiv.org/abs/2401.08406
https://arxiv.org/abs/2312.05934

According to the research from the above papers:

RAG generally outperforms (supervised/unsupervised) fine-tuned language models in terms of generation quality, especially in scenarios requiring external knowledge.

RAG not only maintains high performance while using fewer computational resources but also has the flexibility to handle information retrieval accuracy issues. Specifically, when the retrieved information is inaccurate or harmful, RAG allows adjustments or replacements in the index without retraining the entire model.

Furthermore, RAG’s modular design allows organizations to use dedicated knowledge bases according to their needs, avoiding mixing all data in an opaque black-box model, thereby enhancing the model’s transparency and customizability. This architecture is especially attractive to enterprises and research institutions as it better manages proprietary data and knowledge.
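The point about correcting bad retrievals without retraining is easy to picture in code. Below is a minimal sketch of a toy in-memory vector index (the `embed` callable is a stand-in for any embedding model, and the class itself is illustrative, not any particular library's API): fixing or removing a document touches only the index, never the LLM's weights.

```python
import numpy as np

class ToyVectorIndex:
    """A toy in-memory vector store used only to illustrate index edits."""

    def __init__(self):
        self.docs = {}      # doc_id -> text
        self.vectors = {}   # doc_id -> embedding

    def upsert(self, doc_id, text, embed):
        # Adding or replacing a document is a pure index operation;
        # the model itself is untouched.
        self.docs[doc_id] = text
        self.vectors[doc_id] = embed(text)

    def delete(self, doc_id):
        # Removing inaccurate or harmful content is just as cheap.
        self.docs.pop(doc_id, None)
        self.vectors.pop(doc_id, None)

    def search(self, query, embed, k=3):
        # Cosine similarity between the query and every stored document.
        q = embed(query)
        scored = [
            (doc_id, float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))))
            for doc_id, v in self.vectors.items()
        ]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]
```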

02

Current RAG systems perform best when the knowledge base is highly relevant, dense, and contains detailed information.

To keep RAG effective, it's crucial to apply stringent quality evaluation to the initial retrieval and re-ranking mechanisms, or to use an evaluator such as EffectiveGPT (mentioned in the last part of the linked article). The key criteria are:

  • Relevance: The most crucial evaluation criterion. Use Mean Reciprocal Rank (MRR) to evaluate how well the system ranks by relevance. The system should ensure that the most relevant content appears at the top, which is a key indicator of initial retrieval quality (a minimal MRR sketch follows this list).
  • Density: When relevance is similar, prefer documents with higher knowledge density, meaning those of higher quality and more refined information, such as content written by experts.
  • Details: When tool invocation is involved, detailed descriptions of each knowledge base or tool help the LLM understand its purpose, which matters most in scenarios requiring precise calls, such as SQL databases.
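MRR is straightforward to compute. Here is a minimal sketch; `ranked_ids` and `relevant_id` are hypothetical inputs, one pair per test query.

```python
def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_id) pairs, one per query."""
    total = 0.0
    for ranked_ids, relevant_id in queries:
        if relevant_id in ranked_ids:
            # Reciprocal rank: 1 for a top hit, 1/2 for second place, etc.
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
        # Queries whose relevant document never appears contribute 0.
    return total / len(queries)

# Example: the relevant document appears 1st, 3rd, and not at all.
print(mean_reciprocal_rank([
    (["d1", "d2"], "d1"),        # RR = 1.0
    (["d4", "d5", "d3"], "d3"),  # RR = 1/3
    (["d7", "d8"], "d9"),        # RR = 0
]))  # -> (1.0 + 0.333 + 0) / 3 ≈ 0.444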

Currently, vector-based semantic similarity search results may not be as accurate as traditional keyword searches (like BM25). Therefore, combining semantic similarity search with traditional keyword search, as suggested by the CTO of Sourcegraph, can enhance the overall quality of RAG retrieval.
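A rough sketch of such hybrid retrieval is shown below: BM25 keyword scores are blended with embedding cosine similarity. It assumes the rank_bm25 and sentence-transformers packages, and the 50/50 weighting is an arbitrary illustration, not Shortwave's or Sourcegraph's actual formula.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "Quarterly revenue report for the sales team",
    "Notes from yesterday's engineering stand-up",
    "Travel reimbursement policy and expense forms",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, convert_to_tensor=True)

def hybrid_search(query, k=2, alpha=0.5):
    # Keyword side: BM25 scores, normalized to [0, 1].
    kw = bm25.get_scores(query.lower().split())
    kw = kw / (kw.max() or 1.0)
    # Semantic side: cosine similarity between query and document embeddings.
    sem = util.cos_sim(encoder.encode(query, convert_to_tensor=True), doc_vecs)[0]
    # Blend the two signals and return the top-k documents.
    blended = alpha * kw + (1 - alpha) * sem.cpu().numpy()
    order = blended.argsort()[::-1][:k]
    return [(docs[i], float(blended[i])) for i in order]

print(hybrid_search("how do I get reimbursed for travel expenses"))
```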

Practical measures to improve RAG can be referenced from the AI Search mechanism of the AI email assistant Shortwave:

Key Components of the Optimization:

  • Query Reformation: Rewrite or optimize the user’s original query to make it more suitable for the retrieval system. For example, restructuring sentences, removing ambiguities, or adding contextual information can improve the clarity and relevance of the query.
  • Feature Extraction: Extract key features from the user’s query, such as time, location, and names. These features help define the retrieval scope and enhance matching accuracy.
  • Recency Bias Extractor: This tool identifies and prioritizes information most relevant to the current time. In Shortwave’s AI system, it determines the timeliness of information based on features like dates or times mentioned in the query, giving higher priority to more recent content in the initial retrieval. This approach helps quickly filter the most relevant and up-to-date information, especially in scenarios where recency is critical.
  • Keyword + Embedding Initial Retrieval: Use keyword search for initial retrieval to find the most precise and obvious matches, ensuring high relevance. Complement it with vector search to handle synonyms, multimodal information, and queries with grammatical errors. Embedding methods help capture broader semantic relationships, even when keywords don’t fully match.
  • Heuristic Re-ranking: Use the features extracted from the query to narrow the scope of the vector search. For instance, if the query mentions a time or a name, those features receive the highest weight (boost) during re-ranking so the most relevant answers stand out, while less relevant content is down-weighted or penalized to keep irrelevant information out of the results. The material most aligned with the extracted features receives the largest boost (see the heuristic re-ranking sketch after this list).
  • Cross-Encoder Re-ranking with MS MARCO MiniLM: The MiniLM model runs locally to fine-tune matching scores based on the reformulated query and the top-ranked documents from the previous step (see the cross-encoder sketch after this list).
  • Although MiniLM gives better results, it is computationally intensive; to save resources, it only re-ranks the top documents from the heuristic re-ranking stage.
  • The cross-encoder's inputs are the reformulated query and those top-ranked documents. The model scores each document against the reformulated query, producing more precise relevance scores.
  • The scores from the cross-encoder (MiniLM) then feed back into the heuristic re-ranking, refining the ordering of the retrieved material: high-scoring content is boosted and low-scoring content is penalized. This dynamically adjusts the ranking so that the final scores reflect how precisely each document matches the query.
  • Final Output: The final, optimized ranking results are provided to the LLM after dual optimizations from Cross-Encoder and heuristic re-ranking.
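Heuristic re-ranking sketch: a simplified illustration of boosting candidates that match features extracted from the query (a named sender, recency) and decaying the rest. The feature names, weights, and decay rate are illustrative assumptions, not Shortwave's actual values.

```python
from datetime import datetime

def heuristic_rerank(candidates, query_features, now=None):
    """candidates: dicts with 'text', 'sender', 'date' (datetime), 'base_score'."""
    now = now or datetime.now()
    scored = []
    for doc in candidates:
        score = doc["base_score"]
        # Boost documents from a sender explicitly named in the query.
        if query_features.get("sender") and doc["sender"] == query_features["sender"]:
            score *= 2.0
        # Recency bias: recent documents keep their score, older ones decay.
        age_days = max((now - doc["date"]).days, 0)
        score *= 1.0 / (1.0 + 0.05 * age_days)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```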
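Cross-encoder sketch: a minimal example of the final scoring stage using the public MS MARCO MiniLM cross-encoder from sentence-transformers. Only the top candidates from the previous step are scored, since cross-encoders are too expensive to run over everything; how the scores are folded back into the heuristic ranking is left out here.

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_rerank(reformulated_query, top_docs, keep=5):
    # Each (query, document) pair is scored jointly by the model.
    pairs = [(reformulated_query, doc["text"]) for doc in top_docs]
    scores = cross_encoder.predict(pairs)
    # High-scoring documents rise, low-scoring ones fall; the top few are
    # what finally gets handed to the LLM.
    ranked = sorted(zip(scores, top_docs), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:keep]]
```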

Summary of the 4 Mechanisms:

  • High relevance, high density, and richly detailed knowledge bases (these three standards can also serve as evaluators for retrieval quality).
  • Optimize and rewrite queries to make user questions clearer.
  • Extract relevant features from the query and use keyword search for precise matching.
  • One or more re-ranking mechanisms.

03

The example of Shortwave gave me a better understanding of AI products. It’s essentially about designing specific AI tools based on LLMs that meet user needs with a single click. These tools are encapsulated using various open-source models, custom prompts, and mechanisms.

For instance, Shortwave's Summary tool is quite simple, merely a prompt wrapper, whereas its AI Search follows the steps described in Section 02, involving complex, scenario-specific processing.

Determining when to invoke these tools requires matching the tools to the user’s questions.
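As a hypothetical illustration of that matching step, here is a toy router that maps a question to a tool name. The tool names and the keyword heuristic are mine, not Shortwave's; a production system would more likely let the LLM choose the tool via function calling.

```python
def route_tool(question: str) -> str:
    q = question.lower()
    if any(kw in q for kw in ("summarize", "summary", "tl;dr")):
        return "summarize_thread"   # simple prompt-wrapped tool
    if any(kw in q for kw in ("find", "search", "when did", "who sent")):
        return "ai_search"          # the multi-stage pipeline from Section 02
    return "general_chat"

print(route_tool("Summarize this thread for me"))   # -> summarize_thread
print(route_tool("Who sent the Q3 budget email?"))  # -> ai_search
```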

Postscript

Shortwave, as a commercial AI email company, shares RAG optimization mechanisms that reflect the current state of AI development more closely. At the very least, it helps clarify the direction and lets us see how a mature product actually operates.

To find substantive content, besides academic papers, it's also worth following updates from prominent figures in the AI field; by comparison, domestic (Chinese-language) information sources often feel like mere headline reposts.


By: pamperherself

My Chinese-language AI articles are published on the WeChat Official Account @博金斯的 ai 笔记.


AI and fashion blogger | Portrait photographer | YouTube / Instagram: @pamperherself