A few things that can improve your complex, multi-agent RAG project

5 min read · Jun 18, 2024

Retrieval Augmented Generation (RAG) is widely used today for creating chatbots across various domains. Optimizing a RAG system is crucial for its success. However, many RAG projects fail to meet expectations. Here are a few key points to understand when working on a RAG project for a specific domain:

Domain Knowledge

Ensure you have a deep understanding of the specific domain you are working on. For example, if you’re creating a chatbot for the medical domain, it needs to respond to medical questions. To do this effectively, you should understand the types of questions users might ask and how detailed the answers need to be.

Knowing this can help you decide how to break down the text into manageable chunks. For instance, if answers are typically found across several pages, you’ll need to create chunks that are large enough to include all necessary information but small enough to be processed efficiently. Understanding these aspects will guide you in designing a chatbot that provides accurate and comprehensive responses, tailored to the needs of users in the medical domain.

Quality of Data

High-quality and relevant data are essential. The chatbot’s performance heavily depends on the quality of the data it retrieves and generates from. Data must be accurate and up-to-date. Inaccurate information can mislead users and potentially have harmful consequences.

Data should be consistent in terms of format, terminology, and style. Inconsistencies can confuse the chatbot and result in inconsistent or conflicting responses. Preprocess the data as needed to ensure the retriever can accurately extract relevant information. This may involve tasks such as cleaning the data, standardizing formats, and removing noise or irrelevant content.
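As a rough illustration of the cleaning and standardizing steps described above, here is a minimal sketch using only the Python standard library. The function names `clean_text` and `deduplicate` and the sample strings are made up for this example; a real pipeline would likely need domain-specific rules on top.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize one raw document before it is chunked and indexed."""
    text = unicodedata.normalize("NFKC", raw)  # unify Unicode forms (e.g. non-breaking spaces)
    text = re.sub(r"<[^>]+>", " ", text)       # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


def deduplicate(docs):
    """Drop empty documents and case-insensitive exact duplicates, keeping order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Illustrative input only: the same sentence in two formats, plus a blank entry.
raw_docs = ["<p>Store  the\u00a0sample   at room temperature.</p>",
            "store the sample at room temperature.",
            "   "]
cleaned = deduplicate([clean_text(d) for d in raw_docs])
print(cleaned)
```

Running the cleaning step before deduplication matters here: the two variants only become identical once tags and whitespace are normalized.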

Effective Retrieval Mechanism

Implement an efficient retrieval mechanism to fetch the most relevant information. Poor retrieval can lead to irrelevant or incorrect responses. There are two primary types of vector search: sparse vector search and dense vector search. Each has its own characteristics, and combining both can offer significant advantages.

Sparse Vector Search

Sparse vector search relies on traditional term-based retrieval methods such as TF-IDF (Term Frequency-Inverse Document Frequency) or BM25.
Documents and queries are represented as high-dimensional sparse vectors where each dimension corresponds to a term from the vocabulary.
The presence or absence of terms and their frequency in the documents determine the vector values.
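To make the term-based scoring concrete, here is a small sketch of BM25 in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the function names and the sample documents are illustrative, and a production system would normally rely on a search engine or library rather than this loop.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency is dampened and normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "chunk size affects retrieval quality",
    "embedding models map text to vectors",
    "rerankers reorder retrieved chunks",
]
scores = bm25_scores("chunk size", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. Note that "chunks" in the third document does not match "chunk": exact term matching is both the strength and the limitation of sparse search.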

Dense Vector Search

Dense vector search uses dense embeddings generated by neural networks, such as BERT or other transformer models.
Both queries and documents are embedded into a continuous vector space whose dimensionality is much lower than the vocabulary-sized vectors used by sparse search.
Similarity is measured using metrics like cosine similarity or Euclidean distance.
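The two similarity measures mentioned above can be sketched in a few lines. The three-dimensional vectors below are hand-written stand-ins for real embeddings (which would come from an embedding model and have hundreds of dimensions), and the document ids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy embeddings, assumed for illustration only.
doc_vectors = {
    "doc_retrieval": [0.9, 0.1, 0.0],
    "doc_generation": [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    doc_vectors,
    key=lambda doc_id: cosine_similarity(query_vector, doc_vectors[doc_id]),
    reverse=True,
)
print(ranked)
```

With cosine similarity a higher value means a closer match, while with Euclidean distance a lower value does, so the sort direction has to flip if you switch metrics.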

Using both sparse and dense vector search methods in a hybrid approach can leverage the strengths of each, providing several benefits:

Enhanced Relevance: Combines term-based precision with semantic understanding, improving overall retrieval relevance.
Robustness: Handles a wider variety of queries, including those requiring exact matches and those needing semantic interpretation.
Improved Coverage: Sparse methods can cover rare terms and specific matches, while dense methods cover broader, context-based matches.
Balanced Efficiency: Sparse methods can handle the bulk of retrieval quickly, while dense methods can refine and re-rank results for better accuracy.
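One common way to combine the two result lists is reciprocal rank fusion (RRF). This is only one option and not necessarily the one used in the project described here; weighted score blending is another. The sketch below assumes each retriever returns an ordered list of document ids, and uses the conventional constant `k=60`. The ids are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of a sparse and a dense retriever for the same query.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.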

Chat History and Query Refiner

Integrating chat history and a query refiner into a Retrieval Augmented Generation (RAG) system can significantly enhance the user experience by improving the relevance and accuracy of responses, especially in complex and interactive conversations. However, it is essential to balance the benefits with the potential costs and latency involved.

Chat history involves keeping track of the conversation context, including previous user queries and system responses. It enables the chatbot to understand follow-up questions better by referring to previous interactions, which leads to more accurate and contextually relevant responses. When a follow-up can be answered from the conversation so far, the system does not need to retrieve information from the dataset again, which makes the process a bit faster.
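A simple way to keep chat history without letting the prompt grow forever is a sliding window over the most recent turns. The sketch below is one possible shape for this, with a hypothetical `ChatHistory` class and placeholder messages; the window size is an assumption you would tune against cost and latency.

```python
from collections import deque


class ChatHistory:
    """Keeps only the most recent turns of a conversation."""

    def __init__(self, max_turns=3):
        # A deque with maxlen silently drops the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_message, assistant_message):
        self.turns.append((user_message, assistant_message))

    def as_prompt(self, new_question):
        """Render the remembered turns plus the new question as plain text."""
        lines = []
        for user_message, assistant_message in self.turns:
            lines.append(f"User: {user_message}")
            lines.append(f"Assistant: {assistant_message}")
        lines.append(f"User: {new_question}")
        return "\n".join(lines)


history = ChatHistory(max_turns=2)
history.add_turn("first question", "first answer")
history.add_turn("second question", "second answer")
history.add_turn("third question", "third answer")

prompt = history.as_prompt("follow-up question")
print(prompt)
```

Because the window holds two turns, the first exchange has already been dropped by the time the follow-up arrives.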

A query refiner processes user queries to correct mistakes, clarify ambiguities, and enhance query quality before retrieving information. Implementing a query refiner can increase the computational cost and latency of the system, as additional processing is required before the retrieval stage. In our case it added only a marginal improvement, so we do not use a query refiner.

Generation Quality

Focus on the generation aspect of RAG to ensure that the responses are coherent, contextually accurate, and useful.

Evaluation

Regularly evaluate and test the system to identify and fix issues. This helps improve the system’s performance and reliability over time. Ragas (RAG Assessment) is a framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. RAG denotes a class of LLM applications that use external data to augment the LLM’s context. There are existing tools and frameworks that help you build these pipelines, but evaluating them and quantifying their performance can be hard. This is where Ragas comes in. You can generate a test dataset, store it in Langfuse, and evaluate the pipeline against it. Metrics to track include:

Answer Relevance
Context Recall
Context Precision
Context Relevancy
Context Entities Recall
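To build intuition for what the context metrics measure, here is a deliberately simplified, id-based version of context recall and context precision. This is not the Ragas API: Ragas computes its metrics with an LLM judge over the actual text, whereas this sketch assumes you already have hand-labeled ids of the relevant chunks. The function names and ids are illustrative.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant chunks that the retriever actually returned."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids)
    return hits / len(relevant_ids)


def context_precision(retrieved_ids, relevant_ids):
    """Rank-aware precision: average of precision@k at each relevant position.

    Relevant chunks that appear early in the list push the score up.
    """
    hits = 0
    precisions = []
    for k, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical labels for one test question.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d1", "d3", "d5"}

recall = context_recall(retrieved, relevant)
precision = context_precision(retrieved, relevant)
print(round(recall, 3), round(precision, 3))
```

In this example two of the three relevant chunks were retrieved, and they sit at positions 1 and 3, so recall is 2/3 and the rank-aware precision is the average of 1/1 and 2/3.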

Experiment with the following:

Craft the prompt

Write the system prompt based on the problem statement you’re trying to solve. This is one of the most important steps: if you don’t get the expected output, try modifying the system prompt.
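One way to make the system prompt easy to modify between experiments is to keep it as a template and assemble the final prompt in a single function. The wording of `SYSTEM_PROMPT`, the `build_prompt` helper, and the sample inputs below are all assumptions for illustration, not the prompt used in this project.

```python
# Template text is an example only; adapt the instructions to your problem statement.
SYSTEM_PROMPT = (
    "You are an assistant for {domain} questions. "
    "Answer only from the context below. "
    "If the context does not contain the answer, say you do not know."
)


def build_prompt(domain, context_chunks, question):
    """Combine the system prompt, numbered context chunks, and the user question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        f"{SYSTEM_PROMPT.format(domain=domain)}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    domain="medical",
    context_chunks=["first retrieved chunk", "second retrieved chunk"],
    question="example question",
)
print(prompt)
```

Keeping the template in one place means each prompt variant can be evaluated against the same test dataset without touching the rest of the pipeline.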

Chunk size

Choosing an optimal chunk size is crucial for accurately fetching information and generating a correct response. You can start experimenting with chunk sizes from 256 up to 500. Before conducting these experiments, it is important to first integrate the evaluation pipeline using Ragas. This ensures that the system is set up to assess the performance of different chunk sizes, so you can identify the most effective one for accurate information retrieval and response generation. There are different types of chunking methods, such as token-based and character-based.
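The difference between character-based and token-based chunking can be sketched as follows. The overlap parameter is an addition not discussed above, and whitespace splitting is only a stand-in for a real tokenizer (an embedding model's own tokenizer would count tokens differently). All function names are hypothetical.

```python
def _windows(items, size, overlap):
    """Yield overlapping windows over a string or a list."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(items):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end
        start += step
    return chunks


def chunk_by_characters(text, chunk_size=256, overlap=32):
    """Split text into chunks of at most chunk_size characters."""
    return _windows(text, chunk_size, overlap)


def chunk_by_tokens(text, chunk_size=256, overlap=32):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(window) for window in _windows(tokens, chunk_size, overlap)]


# Tiny sizes so the behavior is easy to see.
char_chunks = chunk_by_characters("abcdefghij", chunk_size=4, overlap=1)
token_chunks = chunk_by_tokens("one two three four five", chunk_size=3, overlap=1)
print(char_chunks)
print(token_chunks)
```

Character-based chunks can cut words in half, while token-based chunks respect word boundaries, which is one reason the two methods can score differently in evaluation even at the same nominal size.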

Embedding Model

Experimenting with different embedding models available on Hugging Face is essential to improving the performance of the Retrieval-Augmented Generation (RAG) system.

Reranker Model

When exploring different reranker models, it’s important to consider their impact on latency. Selecting an optimal reranker top-k and retriever top-k value can help manage this latency while still improving the performance of the system. By experimenting with various reranker models, you can identify the most effective one for refining the retrieval and ranking process in the system. I have experimented with the following reranker models:

Cohere reranker
Cross-encoder reranker
Flash reranker
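The interplay between retriever top-k and reranker top-k can be shown with a small two-stage sketch. The scoring functions here are cheap stand-ins, not any of the rerankers listed above: `term_overlap` plays the role of a fast retriever and `overlap_ratio` plays the role of a slower, more precise reranker. All names and sample documents are made up for illustration.

```python
def retrieve_then_rerank(query, docs, retrieve_score, rerank_score,
                         retriever_top_k=10, reranker_top_k=3):
    """Stage 1 keeps retriever_top_k candidates; stage 2 reorders them and keeps reranker_top_k."""
    candidates = sorted(
        docs, key=lambda d: retrieve_score(query, d), reverse=True
    )[:retriever_top_k]
    # The (expensive) reranker only ever sees retriever_top_k documents,
    # so this value is the main lever on reranking latency.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )
    return reranked[:reranker_top_k]


def term_overlap(query, doc):
    """Stand-in retriever score: number of shared terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_ratio(query, doc):
    """Stand-in reranker score: shared terms relative to document length."""
    words = doc.lower().split()
    return term_overlap(query, doc) / len(words) if words else 0.0


docs = [
    "chunk size affects retrieval quality",
    "chunk size",
    "embedding models differ in quality",
    "reranker latency grows with top-k",
]
top = retrieve_then_rerank(
    "optimal chunk size", docs, term_overlap, overlap_ratio,
    retriever_top_k=2, reranker_top_k=1,
)
print(top)
```

Raising retriever top-k gives the reranker more chances to find the best chunk but increases latency roughly in proportion, which is why the two values are worth tuning together.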