Taking RAG Chatbots from POCs to Production

Kamalmeet Singh
6 min read · Apr 12, 2024


Retrieval-augmented generation (RAG) chatbots have gained popularity recently as they offer an efficient method for making data searchable in a meaningful way. I previously wrote a detailed post on how to get started with your own chatbots, which you can find here: Building Your Own OpenAI-Powered Chatbot.

However, the industry has begun to realize that while initiating a RAG-based search is straightforward, making it genuinely useful for end consumers is a different challenge altogether. Common issues encountered with real-world data include incorrect responses, hallucinations, and no response even when data is available.

In this post, I will cover some strategies that can be used to improve the overall user experience. I have divided the approach into four parts: Data Processing — cleaning up and processing input data; Query Processing — enhancing user queries for better results; Response Processing — fetching meaningful information and presenting it; and finally, Enhancing Overall User Experience — processing consumer feedback to improve the system.

The image below should give an idea of the overall architecture of a RAG-based solution.

RAG-based design for a chat agent

Data Processing

Setting up a RAG-based chatbot involves several steps, starting with data processing and pushing it to a vector database. Here are some improvements that can be made in this process:

Preprocess Data

Before pushing data to a vector database, ensure it is in a searchable state. Start by removing irrelevant content, including documents unnecessary for the LLM’s answers and noise such as special characters and stop words. Additionally, enhance semantic significance by replacing pronouns with names in split chunks, which can improve retrieval relevance.
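To make the idea concrete, here is a minimal cleanup sketch in Python; the regexes and stop-word list are illustrative placeholders, and a production pipeline would typically use a library such as spaCy or NLTK for this step.

```python
import re

# Illustrative stop-word list; in practice use a library list (NLTK, spaCy).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def clean_text(text: str) -> str:
    """Strip markup remnants, special characters, and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop leftover HTML tags
    text = re.sub(r"[^\w\s.,:;?!'-]", " ", text)  # drop special characters
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text.strip()

def remove_stop_words(text: str) -> str:
    """Optionally drop stop words before indexing keyword-heavy content."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

raw = "<p>The refund policy applies to **all** orders placed after Jan 1, 2024!</p>"
print(remove_stop_words(clean_text(raw)))
# -> "refund policy applies all orders placed after Jan 1, 2024!"
```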

Meaningful Metadata

Attaching meaningful metadata can be as straightforward as manually grouping related documents. Alternatively, a more creative approach is to use the LLM to generate summaries of all documents provided as context. The retrieval step can then initially search over these summaries and delve into details only when necessary. Some frameworks offer this as a built-in feature.
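A hedged sketch of the summary-as-metadata idea follows, using the OpenAI Python client; the model name and prompt wording are assumptions for illustration, not a prescription.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(document: str) -> str:
    """Ask the LLM for a short summary to store as chunk metadata."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarize the document in 2-3 sentences."},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content

documents = ["...full text of doc 1...", "...full text of doc 2..."]
records = [
    {"text": doc, "metadata": {"summary": summarize(doc), "source": f"doc-{i}"}}
    for i, doc in enumerate(documents)
]
# The retrieval step can first search over the summaries and only then
# load the full documents that look relevant.
```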

Chunking

The ideal chunk size varies depending on the use case and can be small, medium, or large. The only way to determine the right size is through experimentation and validation. Consider having different chunk sizes for various query types; for example, direct point queries may require concise responses (small chunks), whereas open-ended queries may need detailed answers (larger chunks).
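As a rough sketch of maintaining two chunk sizes, the snippet below uses LangChain's RecursiveCharacterTextSplitter; the specific sizes, overlaps, and file name are placeholders to tune through experimentation.

```python
# Import path depends on your LangChain version; older releases use
# `from langchain.text_splitter import RecursiveCharacterTextSplitter`.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("product_manual.txt").read()  # hypothetical source document

# Two splitters tuned for different query types: small chunks for pointed
# factual questions, larger chunks for open-ended "explain this" questions.
small_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=32)
large_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=128)

small_chunks = small_splitter.split_text(text)
large_chunks = large_splitter.split_text(text)
# Index the two sets separately (or tag them with metadata) and pick the
# right one at query time; the ideal sizes come from experimentation.
```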

Embeddings Algorithm

Embedding-based similarity is a standard retrieval mechanism in RAG. Your data is segmented and embedded within the index. Incoming queries are also embedded for comparison against the index’s embeddings. This embedding is typically performed by a pre-trained model, such as OpenAI’s text-embedding-ada-002. However, experimenting with different embeddings and validating their effectiveness in your use case is advisable.
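One way to run such an experiment is to score the same query-document pair with two candidate models and compare rankings over a labeled set; the sketch below assumes the OpenAI client and the sentence-transformers library, and the model names are just common examples.

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

query = "How do I reset my password?"
document = "To change your login credentials, open Settings > Security."

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Candidate 1: OpenAI embeddings (the model mentioned in the post).
client = OpenAI()
oa = [d.embedding for d in client.embeddings.create(
    model="text-embedding-ada-002", input=[query, document]).data]

# Candidate 2: an open-source sentence-transformers model.
st_model = SentenceTransformer("all-MiniLM-L6-v2")
st = st_model.encode([query, document])

print("ada-002 similarity:", cosine(oa[0], oa[1]))
print("MiniLM similarity:", cosine(st[0], st[1]))
# Run this over a labeled set of (query, relevant chunk) pairs and keep the
# model that ranks the relevant chunks highest for your data.
```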

Vector Database Choice

Several vector databases are available, including Pinecone, Milvus, Weaviate, and Chroma. Select a database that suits your use case, considering factors like support for similarity algorithms and metadata search.
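As an illustration of how little code a first experiment needs, here is a minimal Chroma sketch; the collection name, documents, and metadata fields are invented for the example.

```python
import chromadb

# In-memory client for experimentation; production deployments would use a
# persistent or hosted setup instead.
client = chromadb.Client()
collection = client.create_collection(name="support_docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Refunds are processed within 5 business days.",
        "Passwords can be reset from the account settings page.",
    ],
    metadatas=[
        {"source": "billing-faq", "reviewed": True},
        {"source": "security-guide", "reviewed": True},
    ],
)

# Chroma embeds documents with its default model unless you supply your own
# embedding function; make sure this matches the embeddings you validated.
results = collection.query(query_texts=["how do refunds work"], n_results=2)
print(results["documents"])
```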

Query Processor

The whole purpose of building the solution is to respond to user queries. Therefore, it makes sense to spend effort on making user queries more meaningful and on finding ways to retrieve relevant information for them.

Enhance Query

  • Rephrasing: If the system doesn’t find relevant context for a query, use an LLM to rephrase the query and try again. Queries that seem similar to humans may not be similar in embedding space.
  • Query Routing: Dynamically route queries to different RAG processes based on the requirements of the downstream task. For example, route users asking for specific answers to query specific chunks.
  • Hypothetical Document Embeddings (HyDE): Use an LLM to draft a hypothetical answer to the query and retrieve with its embedding, alongside or instead of the original query (see the sketch after this list).
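Here is a hedged sketch of query rephrasing and HyDE built on the OpenAI chat API; the model name and prompt wording are assumptions, and the retrieval step itself is left to your existing index.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def _ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def rephrase(query: str) -> str:
    """Fallback when the original query retrieved nothing useful."""
    return _ask(f"Rephrase this search query using different wording: {query}")

def hypothetical_answer(query: str) -> str:
    """HyDE: draft a plausible answer, then retrieve with its embedding."""
    return _ask(f"Write a short, plausible answer to: {query}")

query = "why was my card declined"
texts_to_embed = [query, hypothetical_answer(query)]
# Embed each string and search the index with both, then merge the retrieved
# chunks before generation. If nothing relevant comes back, retry with
# rephrase(query).
```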

Multi-Query

  • Alternate Queries: Use LLM to generate rephrased queries and find responses for all in parallel.
  • Sub-queries: LLMs tend to work better when they break down complex queries. Incorporate this into your RAG system by decomposing a query into multiple questions.
  • Parallel Processes: Run intermediate steps in parallel to reduce overall processing time, for example the multi-query and sub-query approaches mentioned above (see the sketch after this list).
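The snippet below sketches sub-query decomposition with parallel retrieval using a thread pool; the decomposition prompt, the model name, and the retrieve() placeholder are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose(query: str) -> list[str]:
    """Ask the LLM to break a complex query into simpler sub-questions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": "Split this question into 2-4 short sub-questions, "
                       "one per line:\n" + query,
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

def retrieve(sub_query: str) -> list[str]:
    """Placeholder for a lookup against your vector index."""
    return []  # hypothetical: return the top chunks for this sub-query

query = "Compare the refund policies for annual and monthly plans"
sub_queries = decompose(query)

# Fetch context for every sub-query in parallel to keep latency down.
with ThreadPoolExecutor() as pool:
    contexts = list(pool.map(retrieve, sub_queries))
# Flatten the retrieved chunks and pass them to the LLM in a single prompt.
```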

Prompt Engineering

  • Customize prompts to guide the LLM in responding appropriately. For example, a prompt for a customer support agent might be: “You are a customer support agent designed to be as helpful as possible while providing only factual information.”
  • Experiment with different prompts and allow the LLM to rely on its own knowledge if it can’t find a good answer in the context (a template sketch follows this list).
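A minimal prompt template along these lines might look as follows; the exact wording is an assumption to adapt and A/B test against your own traffic.

```python
SYSTEM_PROMPT = (
    "You are a customer support agent designed to be as helpful as possible "
    "while providing only factual information. Answer using the context below. "
    "If the context does not contain the answer, say so instead of guessing."
)

def build_messages(context_chunks: list[str], question: str) -> list[dict]:
    """Assemble the chat messages sent to the LLM for one turn."""
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Variants worth A/B testing: letting the model fall back on its own knowledge,
# requiring it to cite the chunks it used, or changing the tone per audience.
messages = build_messages(["Refunds take 5 business days."], "How long do refunds take?")
```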

Similarity Algorithm

  • Dense Passage Retrieval (DPR): Instead of traditional retrieval methods, more advanced RAG setups use DPR, a neural-network-based approach to retrieving relevant passages. DPR is trained to understand the semantics of both the query and the documents, leading to more accurate retrieval.
  • Hybrid Approaches: Consider combining keyword-based search with embeddings. For example, use a keyword-based index for queries about a specific product but rely on embeddings for general customer support (see the hybrid-scoring sketch after this list).
  • Multiple Indexes: Having more than one index allows you to route queries to the appropriate index based on their nature. For example, have separate indexes for summarization questions, pointed questions, and date-sensitive questions. Define the purpose of each index in text, and at query time, the LLM will choose the appropriate option. Tools like LlamaIndex and LangChain can assist with this.
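To make the hybrid idea concrete, the sketch below blends BM25 keyword scores with dense embedding similarity; the rank_bm25 and sentence-transformers libraries, the 50/50 weighting, and the toy documents are all assumptions for illustration.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "The X200 router supports WPA3 and mesh networking.",
    "Refunds are processed within 5 business days.",
    "Reset the X200 by holding the power button for 10 seconds.",
]
query = "how to reset X200"

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
kw_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity over normalized sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)[0]
dense_scores = doc_emb @ q_emb

def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

# Blend the two score ranges; the 0.5/0.5 weighting is only a starting point
# to tune per query type (e.g., lean on keywords for product-name queries).
combined = 0.5 * norm(kw_scores) + 0.5 * norm(dense_scores)
print(docs[int(np.argmax(combined))])
```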

Response Processor

Once both the data and the user query have been converted to embeddings, the next step is to optimize how responses are retrieved and presented to the end user.

Number of Responses to be Fetched

A common technique is not to rely on a single chunk when creating a response. Fetch the top N results based on the use case, and then use re-ranking or LLM-based summarization to create a meaningful response for the end user.

Re-ranking

Keep in mind that the most similar chunk might not be the most relevant response. After retrieving the top results, you can implement a re-ranking step where a separate model (e.g., a smaller language model or a classifier) scores the relevance of each retrieved document to the query. This can help filter out less relevant documents before feeding them to the larger language model.
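A common way to implement this is a small cross-encoder that scores each (query, chunk) pair; the sketch below uses sentence-transformers, and the specific model name is a widely used public checkpoint rather than a recommendation from this post.

```python
from sentence_transformers import CrossEncoder

query = "how long do refunds take"
# Top-N candidates returned by the vector search (N larger than what you
# finally send to the LLM).
candidates = [
    "Refunds are processed within 5 business days.",
    "The loyalty program gives 2% cashback on refundable items.",
    "Passwords can be reset from the account settings page.",
]

# A cross-encoder scores (query, document) pairs jointly, which is usually
# more accurate than raw embedding similarity.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
top_k = reranked[:2]  # only the best few chunks go into the LLM prompt
```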

Use Metadata to Refine

Adding metadata to your chunks and using it to process results is a very effective strategy for improving retrieval. Date is a common metadata tag to add because it allows you to filter by recency. Other metadata tags can include the source of information, review status of the information, past usage, and summary of the chunk. These tags can help decide whether the current chunk should be used as part of the response or not.
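As an example of metadata-aware retrieval, the sketch below filters on year and review status using Chroma's where clause; the fields, values, and operator syntax are illustrative and should be checked against the database you actually use.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="policies")
collection.add(
    ids=["p1", "p2"],
    documents=[
        "2022 policy: refunds take 10 business days.",
        "2024 policy: refunds take 5 business days.",
    ],
    metadatas=[
        {"year": 2022, "reviewed": True},
        {"year": 2024, "reviewed": True},
    ],
)

# Filter by metadata at query time so only recent, reviewed chunks are
# candidates; operator syntax follows Chroma's `where` filters.
results = collection.query(
    query_texts=["how long do refunds take"],
    n_results=3,
    where={"$and": [{"year": {"$gte": 2024}}, {"reviewed": {"$eq": True}}]},
)
```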

Enhance Experience — Continuous Improvement

Improving the overall Retrieval-Augmented Generation (RAG) experience is essential, but ultimately, the end user’s perception of response quality is crucial. The key to any solution is capturing user feedback and utilizing it to enhance the system.

RAG Performance Monitoring

Evaluate your model’s performance using appropriate metrics such as BLEU, ROUGE, or F1 scores, and conduct qualitative assessments to identify areas for improvement. In addition, there are tools like LangSmith that can help evaluate the current model in use.
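As a starting point, the sketch below computes ROUGE scores for a tiny evaluation set using the rouge-score package; the evaluation pairs and the rag_answer() stub are placeholders for your own pipeline.

```python
from rouge_score import rouge_scorer

# Small evaluation set of (question, reference answer) pairs; in practice
# build this from documentation or past support tickets.
eval_set = [
    ("How long do refunds take?", "Refunds are processed within 5 business days."),
]

def rag_answer(question: str) -> str:
    """Hypothetical call into your RAG pipeline."""
    return "Refunds usually take about five business days to process."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for question, reference in eval_set:
    scores = scorer.score(reference, rag_answer(question))
    print(question, scores["rougeL"].fmeasure)
# Track these scores across releases alongside qualitative review of the
# answers; a drop after a prompt or index change is a signal to investigate.
```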

User Feedback Capture

User feedback can be captured in various ways. For instance, you can present multiple responses for the same query and let the user choose the best one, or you can provide like or dislike buttons. If you implement two sets of re-ranking algorithms and offer both response options to the user, you can determine which algorithm needs improvement based on user preferences.

Enhancing Responses

An important method is to allow users to provide a counter-query along with their feedback, such as “the previous response is missing an example.” The model can then use the previous query, response, and new query to provide a better user experience. Additionally, caching previous user queries and feedback is beneficial. For example, you can track which types of responses were favored by users for previous queries and replicate similar behavior for future queries.

Conclusion

In this post, we discussed enhancing the quality of Retrieval-Augmented Generation (RAG) based solutions. We began by addressing the improvement of data quality, focusing on cleaning and enriching the data to make it more meaningful. Then, we explored query processing and examined strategies to refine queries to retrieve more relevant responses. Following that, we delved into methods to maximize the usefulness of the responses obtained for user queries. Lastly, we briefly outlined approaches to enable continuous improvement of the system.


Kamalmeet Singh

Tech Leader - Building scalable, secured, cloud-native, state of the art software products | Mentor | Author of 3 tech books |