Under the Hood of Farmer.chat: Journey to an Optimised, Production-Ready RAG-Powered Chatbot

Vineet Singh
digitalgreen-techblog

In our previous article, we gave a detailed description of the functional blocks of Farmer.chat, an AI-powered chatbot designed to bridge the information gap for extension workers and farmers. We highlighted the value Farmer.chat delivers by providing timely and accurate agricultural knowledge in a user-friendly conversational format. In this article, let’s delve deeper into the journey of creating a production-ready RAG-powered chatbot for Farmer.chat. While there are excellent resources on Medium and across the web on RAG and advanced RAG techniques, few provide a practical case study on how to evolve into a production-ready system solving real business problems. We hope this serves as a case study for the AI community and helps readers adopt some of the important points.

The perspective we offer readers is how to build technical blocks based on user needs, rather than a prescriptive RAG pipeline.

Why did we choose RAG to start with?

We get this question a lot: why did we choose RAG and not a fine-tuned model? Although it has been debated in various forums and papers, it is worthwhile to spend some time relating it to the business problem itself. As mentioned in the previous article, the main challenges of agriculture extension are, in a nutshell:

  • Unorganised information: generic and scattered information makes finding the right answer difficult.
  • Context matters: Farming success depends on local factors and agriculture domain knowledge.
  • Accuracy is key: Bad advice can hurt a farmer’s livelihood and trustworthy information sources are a must.

This article from OpenAI last year gives an interesting quadrant approach:

RAG vs Fine tuning

The article puts it out clearly:

Context optimisation: You need to optimise for context when the model lacks contextual knowledge because it wasn’t in its training set or it requires knowledge of proprietary information. This axis is where you need RAG.

LLM optimisation: You need to optimise the LLM when the model is producing inconsistent results with incorrect formatting, or the tone or style of speech is not correct. This axis is where you need LLM fine tuning.

There is an in-depth study done in the agriculture domain by the Microsoft team. The key conclusions are covered in tables 18–20 of this paper:

  • A key observation is that a base model like GPT-4 + RAG is almost as good as a fine-tuned base model.
  • A delta improvement is observed with a fine-tuned base model + RAG, but it is not consistent.
  • RAG is cost effective compared to fine-tuning. Fine-tuning + RAG may be slightly beneficial, but if we want relatively good contextual answers, developing a RAG-powered chatbot is faster and more cost effective.

Taking this decision early helped immensely, as we will see further in the article. These early decisions are key to success in your technology and product roadmap.

Note: we took this decision back in June 2023, while these in-depth studies and reports are mostly from late 2023 to early 2024.

In the initial days it is not important to back your decisions with a very thorough study; quick indicators are enough. Validation from studies by others is further reinforcement, and if they are not in line, one should reconsider the decision.

Agile approach: Build — Test — Deploy — REPEAT

We adopted a rigorous “build-test-deploy” cycle within an agile development framework to optimise Farmer.chat. This iterative approach allowed us to rapidly prototype, test, and refine the chatbot, ensuring continuous improvement.

It is important to understand that RAG, though it sounds simple, has many moving parts and a lot of techniques to improve each of those moving parts. In real-world development, we don’t have the luxury of implementing a super-refined RAG pipeline in one go. One needs to start simple and focus on one thing at a time.

One needs to know which part to focus on first to maximise the improvement-to-effort ratio.

These decisions are not straightforward and depend on your team structure and strength, type of content, the business problem at hand, and many other factors. Here we present our journey: starting with a small team, upgrading each version, and gradually increasing complexity.

Version 1: 2-week sprint

Our version 1, almost a year ago, was just a two-week sprint. Agriculture is a vast subject, so we focused on one or two crops in one region and a curated set of around 50 questions for those crops for evaluation.

Key decisions

  1. Messenger platform: we developed the bot for two messenger channels, WhatsApp and Telegram. The reason for choosing messenger applications was the ease of onboarding users and their familiarity with the interface. While WhatsApp is more widely used, Telegram offers simpler APIs to integrate, is free, and provides a better user experience for multi-modal content. Getting users to install Telegram was a one-time effort.
  2. LangChain’s power: we opted for LangChain, a popular open-source library. LangChain provides powerful tools for building LLM-enabled apps, with ready-made chains, agents, and the configurability to call different LLM or embedding models. LangChain provided the ConversationalRetrievalChain class, well suited for RAG.
  3. ChromaDB for knowledge storage: we selected ChromaDB as the vector store. We considered Pinecone, but ChromaDB satisfied the need to have a self-hosting option. (A minimal sketch of how these pieces fit together follows this list.)
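To make the version-1 setup concrete, here is a minimal sketch of the kind of pipeline described above, using the (now deprecated) ConversationalRetrievalChain with ChromaDB. The collection name, model choice, and example question are illustrative assumptions, not our exact production configuration:

```python
# A minimal sketch of the version-1 RAG pipeline (illustrative names and models).
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# In-memory Chroma collection holding the curated crop content.
vectorstore = Chroma(
    collection_name="crop_content",          # hypothetical collection name
    embedding_function=OpenAIEmbeddings(),
)

chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),           # rephrases the query and generates the answer
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,            # lets us inspect which chunks were used
)

result = chain({"question": "How do I control coffee berry disease?", "chat_history": []})
print(result["answer"])
```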

Prompt engineering played a crucial role in ensuring that the bot doesn’t answer or start conversing on topics outside the agriculture domain. It performed reasonably on the test questions for the crops.

The test in this version was about the ability to answer agriculture-related questions, not about completeness, relevance, or accuracy.

Here is the information flow of the ConversationalRetrievalChain class. Please note that this class is now deprecated in LangChain, and we moved away from LangChain after our first version.

Version 1: farmer.chat

Issues with version 1

  1. Hallucination: while this version didn’t answer out-of-context questions (not related to agriculture), it did answer questions that were related to agriculture and the crop but were not covered in the content.
  2. Limitations with LangChain: while LangChain is an excellent starting point, we realised some limitations:
  • we couldn’t log the rephrased query, so we couldn’t see whether it was adding keywords that caused issues in retrieval
  • the retrieval method (at least at that point) had two options, and in our production environment, the retrieval method that worked had the opposite sense to cosine similarity!
  • we found that retrieval done directly through the Chroma client gave different and better results than retrieval through LangChain

3. ChromaDB in-memory problem: while ChromaDB is easy to set up, AI-native, and fast, it runs in memory, and as the number of bot instances in different regions increased, the cloud compute requirement grew exponentially.

While hallucination is a business problem, the other issues were more technical in nature. However, they were intertwined: LangChain’s limitations didn’t allow us to properly address the issue of hallucination.

As a result, we decided to move away from LangChain to get better control.

Please note: this is not to suggest that LangChain is bad. On the contrary, we would advise anyone to use LangChain or LlamaIndex if it satisfies your business needs, but we suggest testing it thoroughly.

Version 2: focus on fixing hallucination

We set up the same flow, but with our own calls to the LLM and control over the prompts, both for rephrasing the query based on history and for generation. Once we could see the rephrased query and the retrieved chunks, we noticed the cause of the hallucination.

Hallucination in RAG occurs when irrelevant text chunks are passed in generating the answer.

It is a well-known quirk of LLMs: they are good at following positive instructions but do not reliably respect negative instructions. In this case, “do not generate an answer if the chunks in context are not relevant to the question” was not respected all the time.

We looked at each of the moving parts and first ruled out what not to focus on:

  • Embeddings: better embeddings were available, but the delta improvement reported on the MTEB leaderboard was really small
  • Chunking strategy: although it has a significant impact, as we see later, there were too many variations and it was not clear where to start
  • Query decomposition: it does maximise the retrieved text chunks, but logically it leads to more complete answers rather than less hallucination, as the irrelevant chunks remain as they are
  • Vector DB: it was easy to run quick trials and check that there would be some improvement, but not much, and it relates more to performance
  • Augmenting the knowledge base: quick trials revealed that the GPT models themselves were not providing hallucination-free question-answer pairs, and generating them was not that easy

Rather than more upstream steps, we decided to focus downstream. In essence, we want to maximise the text chunks with relevant information passed on for generation and minimise the text chunks with irrelevant information. Why not just filter and rank the retrieved text chunks?

Though there are many algorithms, we leveraged the LLM itself to reflect, reason through, and filter/rank the chunks. To evaluate the efficacy, we manually inspected the retrieved chunks for each question, identified which were completely relevant and which were completely irrelevant, and verified whether the results matched the LLM’s.
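As a rough illustration, a chunk filter of this kind can be as simple as asking the LLM to grade each retrieved chunk before generation. The prompt wording, model name, and helper below are illustrative assumptions, not our exact production code:

```python
# A minimal sketch of LLM-based chunk filtering (illustrative prompt and model).
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = """You are grading retrieved context for a farming question.
Question: {question}
Chunk: {chunk}
Reply with a single word: RELEVANT or IRRELEVANT."""

def filter_chunks(question: str, chunks: list[str]) -> list[str]:
    relevant = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-4",          # illustrative; any capable chat model works
            temperature=0,
            messages=[{"role": "user",
                       "content": FILTER_PROMPT.format(question=question, chunk=chunk)}],
        )
        verdict = response.choices[0].message.content.strip().upper()
        if verdict.startswith("RELEVANT"):
            relevant.append(chunk)
    return relevant
```

Only the chunks that survive this filter (optionally re-ranked) are passed to the generation prompt.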

With LLM-based filtering and ranking, we were able to correctly filter out 98.3% of irrelevant chunks on the test data.

This led to an overall improvement in completely factually correct responses from ~76% to ~100% on a test set of around 30 questions.

We also changed ChromaDB from in-memory mode to the ChromaDB client. Here is what the new flow looked like:

Version 2: farmer.chat

Issues with version 2

  1. While the ChromaDB client worked, it was a patch; Chroma still doesn’t provide an elegant way to scale, and we had to look for other options.
  2. High faithfulness -> many unanswered questions: what we were able to answer was correct and grounded in the content, but it meant a lot of questions now went unanswered!

Version 3: fixing the vector store DB

As the bot got a positive initial response, we were required to scale it to different geographies and to larger, more complex content. This meant the vector store, ChromaDB, which had been selected with minimal testing, needed thorough analysis.

This section can be a bit technical, and therefore we have provided the complete details here. Below we present only the conclusions and results.

Benchmark Result Analysis

The detailed result sheet can be found here: Vector DB Benchmarking. The important points observed from the result sheet are:

Qdrant is the best performing of them all: it has the highest QPS and the highest Recall@100 (at efConstruct 256).

Vespa and Elasticsearch are more refined on the memory-utilisation side; however, their latency is on the higher side.
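For context, here is a minimal sketch of what indexing and retrieval against Qdrant looks like. The collection name, embedding model, sample chunks, and local Qdrant instance are illustrative assumptions:

```python
# A minimal sketch of indexing and retrieval with Qdrant (illustrative values).
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model (384-dim)

def embed(text: str) -> list[float]:
    return encoder.encode(text, normalize_embeddings=True).tolist()

client = QdrantClient(url="http://localhost:6333")  # assumes a local Qdrant instance

client.recreate_collection(
    collection_name="farmer_chat_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

chunks = [
    "Arabica and Robusta are the two main coffee species grown commercially.",
    "SL28 and Ruiru 11 are popular arabica varieties in Kenya.",
]
client.upsert(
    collection_name="farmer_chat_docs",
    points=[PointStruct(id=i, vector=embed(c), payload={"text": c}) for i, c in enumerate(chunks)],
)

hits = client.search(
    collection_name="farmer_chat_docs",
    query_vector=embed("What are common varieties of coffee?"),
    limit=5,
)
print([hit.payload["text"] for hit in hits])
```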

Here is the revised information flow for version 3.

Version 3: farmer.chat

Version 4: fixing unintentional unanswered questions

As we enhanced our RAG pipeline, we also focused on creating an auto-evaluation pipeline, which was benchmarked against human evaluation. Our user research team and product and program teams pointed out that the inability to answer questions causes drop-off. This was corroborated by the data, but what came out further was that around 11% of unanswered questions were users asking about a different crop than the one they had selected, and 2–3% were other remarks (welcome messages, exits) that got converted into a question because of the chat history. This also resulted in more of a question-answering machine than a conversational flow.

We decided to have a user-intent-based flow, so as to create a conversational experience and route to the RAG pipeline only when the intent is farming related.

Tests showed that GPT-4 achieved 100% accuracy on a user-intent-classification sample with few-shot learning.

Achieving near 100% is important, because if this step is wrong, it can lead to even more absurd answers.
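As an illustration, a few-shot intent classifier can be a single prompted LLM call. The intent labels and examples below are illustrative assumptions, not our production prompt:

```python
# A minimal sketch of few-shot user-intent classification (illustrative labels).
from openai import OpenAI

client = OpenAI()

INTENT_PROMPT = """Classify the user's message into one of:
FARMING_QUESTION, CROP_SWITCH, GREETING, EXIT, OTHER.

Message: "How do I control aphids in chilli?" -> FARMING_QUESTION
Message: "I want to ask about coffee instead" -> CROP_SWITCH
Message: "Hello" -> GREETING
Message: "{message}" ->"""

def classify_intent(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",          # illustrative model choice
        temperature=0,
        messages=[{"role": "user", "content": INTENT_PROMPT.format(message=message)}],
    )
    return response.choices[0].message.content.strip()

# Only FARMING_QUESTION messages are routed to the RAG pipeline;
# other intents get conversational or routing responses instead.
```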

This was quickly rolled out and led to a change in the information flow, as shown below:

Version 4: Farmer.chat

Version 5: retrieval error due to common words

With more deployments and usage, some issues were reported where users’ questions remained unanswered even though they seemed valid.

A closer inspection of the retrieved chunks revealed that certain words that occur very frequently in some text chunks, when also used in the query, cause totally irrelevant chunks to be retrieved. For example: “What are common varieties of coffee?” vs “What are common varieties of coffee in Kenya?”

In this particular case, there are document sections where “Kenya”, or a phrase related to Kenya such as “KES”, occurs very often. This resulted in irrelevant retrieved text chunks, which were then rejected by filtering and ranking.

To fix this, we added another LLM call to shrink and simplify the query to its basic theme for retrieval.
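Here is a minimal sketch of such a query-shrinking step. The prompt wording is an illustrative assumption; the full query is still used for generation, and only retrieval uses the shortened theme:

```python
# A minimal sketch of the "shrink query for retrieval" step (illustrative prompt).
from openai import OpenAI

client = OpenAI()

SHRINK_PROMPT = """Reduce the question below to its core agricultural theme,
dropping location names and other high-frequency qualifiers.

Question: {question}
Core theme:"""

def shrink_query(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",          # illustrative model choice
        temperature=0,
        messages=[{"role": "user", "content": SHRINK_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content.strip()

# e.g. "What are common varieties of coffee in Kenya?" -> "common coffee varieties"
```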

This added another block in version 5.

Version 5: Farmer.chat

Version 6: Semantic chunking

One of the key limitations of RAG is that you can tell whether an answer is correct by looking at the text chunks passed at the generation step, but it is hard to ascertain whether the answer is complete, because relevant text chunks may have been missed.

Plainly speaking:

We can be sure that whatever is answered is correct, but can we be sure that whatever can be answered is answered?

The latter depends on context recall, which can vary depending on the content and the query. Context recall measures what percentage of the relevant text chunks are retrieved on average. To calculate it, one needs to know how many relevant chunks exist in the first place.

One thing that we considered initially, and that the literature suggests, is having a good chunking strategy. We had quickly implemented linear chunking, and though it works, it is not the best way. What if we could group sentences by topic, so that when a topic is asked about, the entire chunk gets retrieved? There is a great Jupyter notebook describing five levels of chunking by Greg Kamradt. We encourage the reader to go through the notebook and watch the associated video. We decided to implement semantic chunking (level four). This approach doesn’t rely on an LLM, just on calculating the cosine similarity between adjacent sentences.

Intuitively, semantic chunking simply says that if the topic changes, the similarity distance between the new sentence and the previous sentence(s) will show a sudden jump.
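Below is a minimal sketch of this idea. The embedding model and the distance threshold are illustrative assumptions; in practice the threshold is often set from a percentile of the observed distances:

```python
# A minimal sketch of semantic chunking by cosine distance between adjacent sentences.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model

def semantic_chunks(sentences: list[str], threshold: float = 0.35) -> list[list[str]]:
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between adjacent sentences; a sudden jump marks a topic change.
        distance = 1.0 - float(np.dot(embeddings[i - 1], embeddings[i]))
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```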

The intuition makes sense. However, we had to split out the tables and images first. We had tried GROBID earlier, which is widely used for scientific documents, but it didn’t work on our content. We found unstructured-io to be much better; though it misses the captions of images and tables, it is able to extract the text reasonably well.

Having a good PDF parser suitable for your content is necessary for semantic chunking.

Once we were able to do semantic chunking, we used it to create chunks and their respective topics.

This resulted in yet another change in the information flow of RAG.

Version 6: farmer.chat

It is important to understand the shift in evaluation beyond RAGAS

Our focus earlier was on factual correctness and relevance. Now, let us assume that there are certain questions from users that can be answered and certain questions that cannot be answered. And let us assume there are some factually correct answers and some factually incorrect answers. This creates a two-by-two matrix as below:

Evaluation: focusing on RAGAS

If we look at the RAGAS framework on faithfulness and relevance, we are trying to maximise the upper half while minimising the lower half.

What if we change the y-axis of the quadrant to “is answered” and “is not answered”, creating a revised confusion matrix for RAG as below:

Evaluation: focus on what is answered

In this revised approach, we maximise the first and third quadrants such that the answers are factually correct. This gives a clear definition of hallucination and retrieval error: whatever can be answered should be answered, provided it is factually correct.

It is important to relook at your evaluation criteria as per the business needs.

Our overall metric became the product of the percentage of faithful sentences and the percentage of questions answered.
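As an illustrative calculation (not our reported numbers): if 90% of generated sentences are faithful and 80% of questions are answered, the overall score is 0.9 × 0.8 = 0.72.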

With semantic chunking we were able to reduce the number of unanswered questions to nearly half.

Our journey: In a nutshell

Our journey involved multiple version iterations, each focusing on specific optimisation goals:

  • Version 1: A rapid prototype built using LangChain to demonstrate core functionality. While functional, it highlighted limitations as complexity grew.
  • Version 2: Focused on addressing “hallucination” (answers not grounded in the content). We moved away from LangChain.
  • Version 3: Optimised the vector database for efficient information retrieval.
  • Version 4: Streamlined user-intent-based routing to minimise unanswered questions. This improved the user experience as well.
  • Version 5: Implemented “shrink query for retrieval” to handle complex or edge-case queries.
  • Version 6: Employed semantic chunking to group sentences related to a topic together, reducing unanswered questions.

The road ahead

Our work on Farmer.chat continues. In the next article, we’ll explore how we employed automated evaluation techniques to accelerate the iteration process and further optimise Farmer.chat for real-world impact.

We currently have many LLM calls, and we have optimised them to keep latency in an acceptable range. Further enhancements are under way to decrease latency and cost. We will explore these in detail in future articles. Stay tuned.

Gratitude

We are grateful to our donors and projects which made this work possible, in particular the AIEP and GAIA projects by the Bill & Melinda Gates Foundation and GIZ.
