Optimizing RAGs: Overcoming Architecture Hurdles for Peak Performance — Part 2

Anurag Mishra
5 min read · Dec 4, 2023


The challenges in the different components of a RAG-based architecture

Introduction

In this series on optimizing RAG solutions, we delve into the challenges commonly confronted when building scalable, production-ready RAG systems. If you haven't read the first part of this series, kindly visit it here.

Third Component: Storing Embeddings for faster retrieval

Once embeddings have been generated, it becomes crucial to store them efficiently to facilitate swift retrieval. This data type therefore calls for a specialized storage approach.

There are various methods available for storing embeddings, each catering to specific requirements:

  1. Vector Libraries: These store vector embeddings in in-memory indexes to facilitate fast similarity searches. Most vector libraries share a few key characteristics: they store only the vectors, the index data is immutable, and querying during import is limited. For a production-ready application, a vector library might therefore not be the most optimal choice. Examples include FAISS and Annoy (see the FAISS sketch after this list).
  2. Vector Stores: One of the core features that sets vector databases apart from libraries is the ability to store and update your data. Vector databases have full CRUD (create, read, update, and delete) support, which solves the limitations of a vector library, and they are more focused on enterprise-level production deployments. Examples include Weaviate, Pinecone, and Redis.
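
To make the distinction concrete, below is a minimal sketch of the vector-library workflow using FAISS. The dimension and the random vectors are illustrative stand-ins for real document embeddings (e.g., from a sentence-transformers model).

```python
# A minimal sketch of the vector-library workflow using FAISS.
import numpy as np
import faiss

dim = 384                                   # must match your embedding model
index = faiss.IndexFlatL2(dim)              # exact L2 search, held entirely in memory

doc_embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in embeddings
index.add(doc_embeddings)                   # bulk load; no record-level CRUD semantics

query_embedding = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_embedding, 5)   # top-5 nearest neighbours
print(ids[0])                               # row indices of the closest documents
```

Note how everything lives in process memory and documents are loaded in bulk; this is exactly the limitation that vector stores address with full CRUD support.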

In addition to vector stores, a prevalent trend is to store documents as a knowledge graph in a graph database such as Neo4j. The more structured representation enables richer handling of the information.
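As a hedged illustration, here is how document chunks and the entities they mention might be written to Neo4j with the official Python driver (pip install neo4j). The connection details and the Document/Chunk/Entity schema are assumptions for this sketch, not a standard.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_chunk(tx, doc_id, chunk_id, text, entities):
    # Link the chunk to its parent document.
    tx.run(
        "MERGE (d:Document {id: $doc_id}) "
        "MERGE (c:Chunk {id: $chunk_id}) SET c.text = $text "
        "MERGE (d)-[:HAS_CHUNK]->(c)",
        doc_id=doc_id, chunk_id=chunk_id, text=text,
    )
    # Link the chunk to each entity it mentions.
    for name in entities:
        tx.run(
            "MERGE (e:Entity {name: $name}) "
            "MERGE (c:Chunk {id: $chunk_id})-[:MENTIONS]->(e)",
            name=name, chunk_id=chunk_id,
        )

with driver.session() as session:
    session.execute_write(add_chunk, "doc-1", "doc-1-chunk-0",
                          "OpenAI was founded in San Francisco in 2015...", ["OpenAI"])
driver.close()
```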

Fourth Component: Ways to Handle the Query to the LLM

Sometimes the query that a user puts to the RAG needs to be processed before the most relevant response can be found. The following operations can be applied before document retrieval:

  1. Query Decomposition: When a query requires information spread across multiple documents, it is advisable to decompose it into distinct sub-queries, process them separately, and combine the responses (see the sketch after this list).

2. Multi-Document Agents: If certain fixed operations or documents must be performed or processed, it is often better to build multiple specialized agents, each optimized for its own flow, rather than forcing everything through a single flow.

3. Query Rewriting: Query rewriting is a strategy for improving relevancy. Incoming queries can be rewritten before they are submitted for retrieval; the rewrites produce more relevant search results with higher conversion rates.
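
As a concrete illustration of query decomposition, here is a minimal sketch using the OpenAI Python SDK (pip install openai). The model name, the prompt, and the retrieve_and_answer/combine helpers are illustrative assumptions, not part of a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def decompose(query: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Split the user question into independent sub-questions, one per line."},
            {"role": "user", "content": query},
        ],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

sub_queries = decompose("Did any of the former OpenAI employees start their own company?")
# partial_answers = [retrieve_and_answer(q) for q in sub_queries]  # hypothetical RAG helper
# final_answer = combine(partial_answers)                          # e.g., one more LLM call
```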

Additionally, the choice of approach may vary depending on the specific type of query your RAG handles.

Fifth Component: Retrieve Better

Retrieval is the heart of the entire RAG pipeline and significantly influences its performance. Various methods are employed for retrieval; the primary ones are outlined below:

  1. Keyword Search: Uses traditional full-text search methods. Content is broken into terms through language-specific text analysis. It works well when exact matches have to be found.
  2. Vector Search: Documents are converted from text to vector representations using an embedding model. Retrieval is performed by generating a query embedding and finding the documents whose vectors are closest to the query’s.


3. Hybrid Search: Performs both keyword and vector retrieval and applies a fusion step to select the best results from each technique (see the RRF sketch after this list).

4. Hybrid Search + Semantic Ranking: Semantic ranking uses the context or semantic meaning of a query to compute a new relevance score over preranked results from Hybrid Search.
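
To illustrate the fusion step, here is a self-contained sketch of Reciprocal Rank Fusion (RRF), one common way to merge keyword and vector result lists. The document IDs are made up, and k=60 is the constant commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Each document scores 1/(k + rank) in every list it appears in;
    # documents ranked highly by both searches accumulate the most score.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # illustrative IDs from full-text search
vector_hits = ["doc1", "doc4", "doc3"]    # illustrative IDs from nearest-neighbour search
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # fused ranking
```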

However, simple vector similarity search might not be sufficient when the LLM needs information from multiple documents, or even just multiple chunks, to generate an answer. For example, consider the following question:

Did any of the former OpenAI employees start their own company?

If you think about it, this question can be broken down into two questions.

  • Who are the former employees of OpenAI?
  • Did any of them start their own company?

And if the answers to these questions exist in two different paragraphs or documents, vector similarity search might struggle with such multi-hop questions. In such cases, a knowledge graph can be a more appropriate approach, as the sketch below illustrates.
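
As a hedged sketch of why a graph helps here, the multi-hop question becomes a single query once employment and founding are edges in the graph. The WORKED_AT/FOUNDED schema below is an illustrative assumption, continuing the earlier Neo4j sketch.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hop 1: people who worked at OpenAI; hop 2: companies they founded.
cypher = (
    "MATCH (p:Person)-[:WORKED_AT]->(:Company {name: 'OpenAI'}) "
    "MATCH (p)-[:FOUNDED]->(c:Company) "
    "RETURN p.name AS person, c.name AS company"
)
with driver.session() as session:
    for record in session.run(cypher):
        print(record["person"], "founded", record["company"])
driver.close()
```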

Each of these retrieval methods has its merits and drawbacks. The choice depends on the specific use case; select the one that aligns best with your requirements.

Additionally, augmenting retrieval with metadata can further refine the selection of relevant documents.

Sixth Component: Improve and Evaluate RAGs Response

Improving LLM Context: In a RAG architecture, we combine the relevant documents within the LLM prompt as context. The number of documents being passed can noticeably affect the LLM's response; a minimal sketch of this prompt assembly follows.
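
In the sketch below, `top_chunks` is a hypothetical list of already-retrieved passages; how many you include, and in what order, is one of the main levers on answer quality.

```python
def build_prompt(question: str, top_chunks: list[str], max_chunks: int = 4) -> str:
    # Cap the number of chunks stuffed into the context window.
    context = "\n\n".join(top_chunks[:max_chunks])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```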

Evaluation of Retrieval: Normalized discounted cumulative gain (NDCG) is a measure of ranking quality. The value of NDCG is determined by comparing the relevance of the items returned by the search engine to the relevance of the items that a hypothetical "ideal" search engine would return.
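
A small worked example of NDCG in plain Python, assuming graded relevance labels (2 = highly relevant, 1 = somewhat relevant, 0 = irrelevant) for the returned results:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance discounted by log2 of rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

retrieved = [2, 0, 1]                        # relevance, in the order the engine returned results
ideal = sorted(retrieved, reverse=True)      # the ordering an "ideal" engine would produce
print(round(dcg(retrieved) / dcg(ideal), 3)) # NDCG; 1.0 means the ranking matches the ideal
```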

To evaluate RAGs: Two main approaches are generally used to evaluate RAG systems:

  1. Human-in-the-Loop: Human evaluation provides the most natural measure of quality but does not scale well. It involves developing a testing framework in which human evaluators assess responses on parameters like relevance, harmfulness, and hallucination across different values of chunk size, model arguments, and retrieval method.
  2. LLMs as Judge: Recent research has proposed using LLMs themselves as judges to evaluate other LLMs. This approach, called LLM-as-a-judge, demonstrates that large LLMs like GPT-4 can match human preferences with over 80% agreement when evaluating conversational chatbots (a hedged sketch follows this list).
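
Here is a hedged sketch of an LLM-as-a-judge evaluator, again assuming the OpenAI Python SDK; the rubric and the 1-to-5 scale are illustrative choices, not a standard.

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, context: str, answer: str) -> str:
    # Ask a strong model to grade a RAG answer against its retrieved context.
    prompt = (
        "Rate the answer from 1 (poor) to 5 (excellent) for relevance to the "
        "question and faithfulness to the context, then give a one-line "
        f"justification.\n\nQuestion: {question}\n\nContext: {context}\n\n"
        f"Answer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```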

Conclusion:

In this article, we explored different ways to store embeddings (challenge 3), how to better query RAGs (challenge 4), how to retrieve more relevant chunks (challenge 5), and how to improve and evaluate RAG responses (challenge 6).

As this is a fast-moving area, I will keep updating this article with new advancements. I frequently write about developments in Generative AI and Machine Learning, so feel free to follow me on LinkedIn (https://www.linkedin.com/in/anurag-mishra-660961b7/).
