Refine, Retrieve, Rebuttal: Lime’s RAG Flow

Chao Ma
Lime Engineering
Apr 22, 2024

Starting the new year with a tech project is exciting, especially when it involves cutting-edge AI tools. Imagine being tasked with creating a RAG (Retrieval-Augmented Generation) chatbot for your organization — what an adventure! The task initially seems like a breeze thanks to the latest OpenAI models, like the remarkably accurate GPT-4 and its turbo version.

Partner this with the LangChain framework, and you have a solid foundation for rapidly developing an efficient RAG application.

Envision seamlessly integrating your Google Drive repository or JIRA with the chatbot. Combine that with powerful GPT models, and voilà! It’s ready to be launched as a handy Slack bot or a convenient web application.

However, as the project unfolds, unexpected challenges bubble up. Building our own RAG chatbot, we had to be prepared to deal with hurdles such as these:

  • Curbing Hallucinations: Despite an LLM’s prowess, it sometimes fabricates answers — known as ‘hallucinations’ — particularly when relevant data is absent.
  • Dealing with Jokes: Humor is part of any workplace, and people love to test AI’s wit. How can you detect and block jokes?
  • Contextual Integration: Merging conversations is tricky because it is only sometimes appropriate. If a question relies on information from a previous one, merging without that context could retrieve the wrong knowledge. Conversely, if we indiscriminately merge unrelated questions, we might retrieve irrelevant or inaccurate information.
  • Knowledge staleness: Knowledge can age like fine wine or sour milk. Old documents with outdated information may have high similarity to the question.
  • Unknown Answer Accuracy: LLMs expose no confidence scores, so we cannot tell how confident an answer is. How do you gauge the quality of answers?
  • Language Barriers: Language diversity adds complexity. If the question is in Italian, the knowledge is in French, and your prompt is in English, it is challenging for your LLM.
  • Sophisticated Retrieval: Distinguishing between a sea of similar knowledge is vital, especially when crucial details differ. If the embedding model cannot cover every case, how do you handle its failures?

While there are countless challenges we could dissect, today our spotlight shines brightly on the solutions. It is these creative fixes and strategic improvements that will transform our RAG retrieval chatbot from a concept to a cornerstone of efficiency in our organization. Let’s dive into the world of possibilities and explore how we can optimize our AI assistant to its full potential.

Diagram of Lime RAG Flow

Pre-Processing: Refine

Retrieving knowledge with the user’s original, unprocessed question often produces inaccurate results, so it’s better to first refine the question with an LLM. Here’s how to improve the questioning process:

  • Detect Jokes: Ask your LLM whether the question is humorous. If it confirms the question is a joke, do not proceed with it.
  • Context Merge: Present the LLM with the current question as well as several preceding ones from the same user. Inquire whether the question is entirely new or if it needs to be merged with previous ones. If merging is required, ask the LLM to formulate a new question that encapsulates the user’s overarching intent.
  • Entity Extraction: This step is crucial. Have the LLM identify key entities within the question. This prevents scenarios where the user is asking about your generation-4 products, but the knowledge base returns information on generation-3 products due to similarity in questions. By extracting entities such as “generation-4,” the LLM can help target the response more accurately.
  • Language Detection: Request the LLM to identify the language the user is currently employing.
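The four refinement steps above can be sketched as a single pre-processing call. This is a minimal illustration, not Lime’s actual implementation: `llm` is assumed to be any prompt-in/text-out callable, and the prompt wording and JSON fields are hypothetical.

```python
import json

def refine_question(llm, question, history):
    """Ask the LLM to classify and rewrite a question before retrieval.

    `llm` is any callable taking a prompt string and returning the model's
    text response; the JSON schema below is illustrative, not a real API.
    """
    prompt = (
        "You pre-process questions for a retrieval chatbot.\n"
        f"Previous questions: {history}\n"
        f"Current question: {question}\n"
        "Reply with JSON containing:\n"
        '  "is_joke": true/false,\n'
        '  "needs_merge": true/false,\n'
        '  "merged_question": a standalone question combining prior context,\n'
        '  "entities": a list of key entities (e.g. product names, versions),\n'
        '  "language": the ISO code of the user\'s language\n'
    )
    result = json.loads(llm(prompt))
    if result["is_joke"]:
        return None  # detected a joke: do not proceed to retrieval
    # Use the merged question only when the LLM says context is needed.
    query = result["merged_question"] if result["needs_merge"] else question
    return {"query": query,
            "entities": result["entities"],
            "language": result["language"]}
```

The extracted `entities` and `language` are carried forward: entities sharpen the similarity search, and the language tells the bot which language to answer in.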

Similarity Search: Retrieve

To enhance the quality of our similarity search, we’ve incorporated the following factors:

  • Fundamental Similarity: This is the degree of similarity between the user’s question and the knowledge available in our database.
  • Entity Similarity: We assess the similarity of entities detected in the question to ensure the documents retrieved are relevant and contain the queried entities.
  • Time Decay: We consider the age of each piece of knowledge, which may be indicated by the creation date of JIRA tickets or the last modification timestamp of documents.
  • User Feedback: User responses to our search results are used to refine the system, with positive feedback increasing and negative feedback decreasing the knowledge’s relevance score.
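One way to combine these four signals is to fold them into a single ranking score. The document fields, weights, and half-life below are placeholder assumptions to be tuned against real traffic, not values from Lime’s system:

```python
import math
import time

def relevance_score(doc, question_entities, base_similarity,
                    half_life_days=180.0, now=None):
    """Combine the four ranking signals into one score.

    `doc` is a dict with illustrative fields: `entities`, `updated_at`
    (epoch seconds), and `feedback` (net thumbs-up minus thumbs-down).
    """
    now = now or time.time()
    # Entity similarity: fraction of question entities the document covers.
    if question_entities:
        overlap = len(set(question_entities) & set(doc["entities"]))
        entity_sim = overlap / len(question_entities)
    else:
        entity_sim = 1.0
    # Time decay: halve the weight of knowledge every `half_life_days`.
    age_days = (now - doc["updated_at"]) / 86400
    freshness = 0.5 ** (age_days / half_life_days)
    # User feedback nudges the score up or down, bounded via tanh.
    feedback = 1.0 + 0.2 * math.tanh(doc["feedback"])
    return base_similarity * (0.5 + 0.5 * entity_sim) * freshness * feedback
```

With this shape, a document about “generation-3” scores lower for a “generation-4” question even when its embedding similarity is high, and a year-old document is penalized relative to a fresh one.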

Be wary of a common pitfall: most RAG sample code instructs you to configure ‘K’, the number of knowledge pieces retrieved. In our experience, it’s also crucial to set a second parameter, ‘min-token’, which forces the retrieval process to collect at least that many tokens before stopping. Otherwise, if ‘K’ is set to 1 and the retrieval fetches only a piece of knowledge consisting solely of a title from your repository, the context is insufficient for answering questions.
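A minimal sketch of this retrieval loop, assuming the documents are already sorted by relevance; the default whitespace token count is a stand-in for a real tokenizer:

```python
def retrieve(ranked_docs, k=4, min_tokens=200,
             count_tokens=lambda text: len(text.split())):
    """Take the top-K documents, but keep pulling more until the
    combined text reaches `min_tokens` tokens.

    `ranked_docs` is assumed sorted by descending relevance.
    `count_tokens` defaults to a crude whitespace count; swap in a
    real tokenizer for production use.
    """
    picked, total = [], 0
    for doc in ranked_docs:
        # Stop only once BOTH thresholds are satisfied.
        if len(picked) >= k and total >= min_tokens:
            break
        picked.append(doc)
        total += count_tokens(doc)
    return picked
```

So if the top hit is a bare title, the loop keeps fetching the next-ranked documents until the context is large enough to answer from.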

At Lime, we’ve developed our own retrieval framework to maintain complete control over all aspects of the retrieval process, instead of relying on third-party frameworks like LangChain.

Post Processing: Rebuttal

After sending the refined question along with the retrieved knowledge to your LLM for an answer, it’s crucial to be prepared for the subsequent step: Rebuttal.

There are instances where the knowledge sourced may not provide an adequate answer due to it being unavailable in the knowledge base, perhaps because of a technical issue. In such situations, our bot might encounter three specific problems:

  • Failure: If the LLM cannot provide an answer and responds with an “I don’t know” or a similar statement, it’s essential to gracefully handle the inability to generate an appropriate answer. Craft a polite and understanding response to the user instead of displaying the unhelpful answer.
  • Hallucination: Your bot might generate an answer that is unrelated to the knowledge within your domain. It’s a significant concern in current LLMs. To mitigate this, verify the accuracy by checking if the LLM’s answer aligns with the retrieved knowledge. If there’s no correlation, it’s safer to withhold the answer.
  • Tone Discrepancies: The LLM might respond with overly technical details to a non-technical user or provide an excessively brief reply. Reassess the answer’s tone and, if necessary, prompt the LLM to rephrase it to match the expected conversational tone.
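The failure and hallucination checks can be sketched as a post-processing gate. The give-up phrases, the grounding prompt, and the fallback messages are illustrative assumptions; `llm` is again any prompt-in/text-out callable:

```python
def rebuttal(llm, answer, retrieved_docs):
    """Run post-generation checks before showing an answer to the user.

    Returns either the original answer or a polite fallback message.
    """
    # Failure: catch explicit "I don't know"-style answers.
    giveups = ("i don't know", "i do not know", "i'm not sure")
    if any(phrase in answer.lower() for phrase in giveups):
        return ("Sorry, I couldn't find an answer to that. "
                "Could you rephrase the question?")
    # Hallucination: ask the LLM whether the answer is grounded in
    # the retrieved knowledge; withhold it if not.
    grounding_prompt = (
        "Answer strictly YES or NO: is every claim in the ANSWER "
        "supported by the SOURCES?\n"
        f"ANSWER: {answer}\n"
        f"SOURCES: {' '.join(retrieved_docs)}"
    )
    if "yes" not in llm(grounding_prompt).strip().lower():
        return ("Sorry, I couldn't find a reliable answer "
                "in our knowledge base.")
    return answer
```

Using a second LLM call as a grounding judge is a common pattern: it trades one extra round-trip for the ability to withhold answers that the retrieved knowledge does not support.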

Lastly, asking the user if the bot’s response met their expectations is important as it serves as the final safeguard. If they give a thumbs up, make sure to enhance the knowledge base for similar future inquiries. Conversely, if the response is unsatisfactory, consider reducing reliance on that knowledge.
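Recording that final thumbs-up or thumbs-down can be as simple as keeping a per-document counter that later feeds into ranking. A minimal sketch, with hypothetical names:

```python
def record_feedback(scores, doc_id, thumbs_up, step=1.0):
    """Adjust a document's net feedback counter after a user reaction.

    `scores` maps document ids to net feedback; positive reactions raise
    a document's future ranking weight, negative ones lower it.
    """
    delta = step if thumbs_up else -step
    scores[doc_id] = scores.get(doc_id, 0.0) + delta
    return scores[doc_id]
```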

Building an AI-driven chatbot is quite challenging. Some organizations opt to fine-tune open-source models such as Gemma or Llama; we abandoned that approach because our knowledge repository grows too rapidly to make daily retraining practical. Others have suggested using LLMs that can now handle inputs of up to 1 million tokens, potentially reducing retrieval effort. However, that raises issues of increased cost and latency.

We have maintained our commitment to the RAG solution and, after months of persistent effort, we have created a unique process called the Refine/Retrieve/Rebuttal flow. This method may diverge significantly from the workflows found in open-source projects, but it is the result of extensive analysis of hundreds of problematic cases, meticulously considered for weeks.

Finally, imagine that your chatbot is a diligent, tireless entity that thinks quickly but sometimes lacks knowledge and is prone to hallucinations. Constructing a workflow that helps it comprehend questions, locate knowledge, and verify its responses is, without a doubt, your most crucial task, right?
