Structuring Retrieval Augmented Generation (RAG) Projects

Leonardo Schettini
Published in Crayon Data & AI · 7 min read · Jun 28, 2024

In the course “Structuring Machine Learning Projects”, Andrew Ng presents a chain of assumptions that we want to hold true to ensure that Machine Learning (ML) systems perform well in the real world. These assumptions form the basis of classical ML projects and directly define how models can be evaluated and improved.

When working with LLMs, the process differs from classical Machine Learning projects mainly because LLM-based systems, especially RAG systems, primarily leverage pre-trained models, which means the typical training phase isn’t the focus. Even so, principles and concepts from traditional Machine Learning can still be applied to LLM-based systems.

In this article, we’ll go over the chain of assumptions and related concepts presented by Andrew Ng, and draw a parallel to how the same ideas can be applied to Retrieval Augmented Generation (RAG) systems.

Orthogonalization

Chain of assumptions in ML

Inspired by the chain of assumptions in ML presented by Andrew Ng, we can think of a set of assumptions that, if true, will guarantee (to a certain extent) that a RAG system performs well.

  1. Data extraction must be correct (OCR, reading order)
  2. Data retrieval must be specific (data chunking, knowledge graph — retrieval relevancy)
  3. The LLM uses the retrieved data effectively (prompting — groundedness of the answer)
  4. The answer aligns with user intent (prompting — answer relevancy, no hallucination)

Knobs for RAG systems

For each of these assumptions in a RAG system, there are several “knobs” or factors that can be adjusted to improve the system’s performance. Here’s a breakdown:

1. Correct data extraction:

  • OCR system: The OCR system could be improved or replaced if it’s not accurately extracting text from documents (see the sketch after this list).
  • Reading order algorithm: If the reading order of the extracted text is incorrect, the algorithm determining this order could be adjusted or replaced.
  • Data extraction augmentation: Extracting insights and enriching data as it is processed is especially useful when building knowledge graphs and property graphs. This can improve the relevancy of the retrieved context to the user’s question and thus improve answer quality.
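To make the extraction step more concrete, here is a minimal sketch assuming the open-source pytesseract and pdf2image packages (and a local Tesseract install); the file name is hypothetical, and a real pipeline would add a proper reading-order step on top of the raw OCR output.

```python
# Minimal extraction sketch, assuming pytesseract + pdf2image are installed
# and a Tesseract binary is available. Pages are read top to bottom; a real
# system would reorder blocks (columns, tables, headers) before chunking.
from pdf2image import convert_from_path
import pytesseract

def extract_text(pdf_path: str) -> list[str]:
    pages = convert_from_path(pdf_path)  # render each page as an image
    return [pytesseract.image_to_string(page) for page in pages]

# Hypothetical usage:
# page_texts = extract_text("annual_report.pdf")
```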

2. Specific data retrieval:

  • Retrieval algorithm: The algorithm that determines which data chunks or knowledge graph nodes to fetch could be improved or replaced.
  • Data chunking process: If the data chunks being retrieved are too large, too small, or not relevant enough, the process for determining these chunks could be adjusted.
  • Embedding model: One can select an embedding model that is better suited for the task at hand. The MTEB Leaderboard ranks embedding models across different tasks (see the retrieval sketch after this list).
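As a concrete illustration of the chunking and embedding knobs, here is a minimal retrieval sketch assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (any model from the MTEB Leaderboard could be swapped in). The chunk size and overlap are arbitrary starting values, not recommendations.

```python
# Minimal retrieval sketch: fixed-size chunking plus dense retrieval with a
# sentence-transformers model. Chunk size, overlap and the embedding model are
# exactly the "knobs" listed above -- swap them and re-run the evaluation.
from sentence_transformers import SentenceTransformer
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any MTEB model could go here

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    chunk_emb = model.encode(chunks, normalize_embeddings=True)
    query_emb = model.encode([query], normalize_embeddings=True)
    scores = (chunk_emb @ query_emb.T).squeeze()  # cosine similarity (vectors are normalised)
    best = np.argsort(-scores)[:top_k]
    return [chunks[i] for i in best]
```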

3. Effective use of retrieved data by LLM:

  • Prompting process: If the model isn’t effectively using the retrieved data, the prompts given to the model could be adjusted (see the prompt sketch after this list).
  • Fine-tuning: The model could be fine-tuned on a dataset that’s more similar to the data it will be working with in production.
  • Use agents or different workflows to handle user messages: Without going into the complexities of agent-based systems, it’s possible to run different workflows depending on the user’s intent (e.g. normal question answering versus summarisation).
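To illustrate the prompting knob, here is a minimal sketch of a grounding prompt that instructs the model to answer only from the retrieved context. The template wording is illustrative, not a recommended standard.

```python
# Sketch of a grounding prompt: the instruction to answer *only* from the
# retrieved context is the main lever for groundedness. The wording is
# illustrative, not a recommended template.
def build_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```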

4. Alignment of answer with user intent:

  • Query augmentation: If the model isn’t understanding the user’s intent, the user’s query could be augmented with more context (see the sketch after this list).
  • Model choice: If the model is consistently misinterpreting user intent, a different model that’s better at understanding user intent could be used.
  • Breakdown of problem into sub-tasks: Alternatively, if the model is having trouble understanding the user intent, we may want to split requests into smaller steps.
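As a sketch of query augmentation, the function below rewrites the latest user message into a standalone query using the conversation history; `call_llm` is a hypothetical helper standing in for whichever model API you use.

```python
# Sketch of query augmentation: rewrite the latest user message into a
# standalone query using the conversation history.
# `call_llm` is a hypothetical helper, not a real library function.
def augment_query(history: list[str], user_message: str) -> str:
    prompt = (
        "Rewrite the last user message as a standalone search query, "
        "resolving pronouns and references from the conversation.\n\n"
        "Conversation:\n" + "\n".join(history) +
        f"\n\nLast message: {user_message}\nStandalone query:"
    )
    return call_llm(prompt)  # hypothetical LLM call
```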

You may have noticed that the assumptions, and consequently the knobs, that appear first can have a direct impact on the points that come after. For example, a better OCR model will enable better retrieval and answer generation, just as a different chunking strategy will affect not only retrieval but also answer generation. Similarly, errors in earlier processes will propagate to later ones. For this reason, it is important to have an evaluation setup that helps identify which components of the system need improvement.

RAG system evaluation

Following the chain of assumptions for RAG systems, we want to have an evaluation setup capable of validating that the assumptions do hold true.

Apart from the first assumption, correct data extraction, which primarily involves the OCR model and other algorithms for extracting additional information, the remaining three assumptions are tightly related to the runtime performance of the system. More specifically, they concern the two main components of RAG systems: the retriever and the answer generation. It’s important to evaluate these components individually, especially because the output of the retriever directly influences the correctness of the generated answer and its alignment with the original user intent.

RAG triad

The RAG triad, first introduced by TruLens, defines three evaluations that measure the relationships between the retrieved context, the generated answer, and the user intent. In other words, each evaluation aims to answer the question of how relevant one element is to another.

  • Context relevance: is the retrieved context relevant to the user intent?
  • Groundedness: is the generated answer supported by the context?
  • Answer relevance: is the generated answer relevant to the user intent?

Different evaluation frameworks use different terms for these concepts. For example, Ragas uses context precision, context recall, and answer relevance.
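Under the hood, these evaluations typically follow an LLM-as-judge pattern. Here is a heavily simplified sketch of that idea; the prompts are illustrative, `call_llm` is a hypothetical helper, and frameworks such as TruLens or Ragas ship better-calibrated implementations.

```python
# Sketch of the LLM-as-judge pattern behind the RAG triad. The prompts are
# illustrative only; `call_llm` is a hypothetical helper and is assumed to
# reply with a bare number between 1 and 5.
TRIAD_PROMPTS = {
    "context_relevance": "Rate 1-5 how relevant this context is to the question.\nQuestion: {question}\nContext: {context}\nScore:",
    "groundedness": "Rate 1-5 how well this answer is supported by the context.\nContext: {context}\nAnswer: {answer}\nScore:",
    "answer_relevance": "Rate 1-5 how relevant this answer is to the question.\nQuestion: {question}\nAnswer: {answer}\nScore:",
}

def evaluate_triad(question: str, context: str, answer: str) -> dict[str, float]:
    scores = {}
    for name, template in TRIAD_PROMPTS.items():
        prompt = template.format(question=question, context=context, answer=answer)
        scores[name] = float(call_llm(prompt)) / 5  # normalise to the 0-1 range
    return scores
```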

Single number evaluation metrics

As rightly suggested in the lesson “Single number evaluation metric”, ML systems benefit from a single evaluation metric, as it makes it straightforward to compare and rank different models and ultimately select the best one.

For RAG systems, especially when following the RAG triad, it is hard to prioritise one evaluation metric over the others. For some systems, it may be valid to treat one of the metrics from the RAG triad as the optimising metric and the others as satisficing metrics, concepts also presented in this other lesson from the course. Another option is to calculate a (harmonic) mean of all three values, as done by some evaluation frameworks such as Ragas.
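As a small sketch, assuming the three triad scores are already normalised to the 0–1 range, the harmonic mean can be computed with the standard library; the equal weighting here is an assumption, not a standard.

```python
# Harmonic mean of the three triad scores (all in [0, 1]). The harmonic mean
# penalises a low score on any single dimension more than an arithmetic mean
# would. Equal weighting is an assumption made for this sketch.
from statistics import harmonic_mean

triad = {"context_relevance": 0.9, "groundedness": 0.8, "answer_relevance": 0.7}
overall = harmonic_mean(triad.values())  # ~0.79
```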

When implementing complex or advanced RAG techniques, it is also important to monitor the cost and time to process users’ messages. These factors can impact the practicality and scalability of the system. For instance, a highly accurate retriever that takes an excessive amount of time to fetch relevant data may hinder the overall user experience, turning a good system into a bad one. For that reason, having cost and processing time as satisficing metrics is a good idea.
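Here is a sketch of that optimising/satisficing selection; the candidate configurations, scores, and budget thresholds are made up purely for illustration.

```python
# Optimising/satisficing selection: maximise the combined triad score subject
# to latency and cost budgets. All numbers below are made up.
candidates = [
    {"name": "baseline",    "score": 0.74, "latency_s": 2.1, "cost_usd": 0.002},
    {"name": "reranker",    "score": 0.82, "latency_s": 4.8, "cost_usd": 0.006},
    {"name": "multi-query", "score": 0.85, "latency_s": 9.5, "cost_usd": 0.015},
]

MAX_LATENCY_S, MAX_COST_USD = 5.0, 0.01  # satisficing thresholds (assumed)
feasible = [c for c in candidates
            if c["latency_s"] <= MAX_LATENCY_S and c["cost_usd"] <= MAX_COST_USD]
best = max(feasible, key=lambda c: c["score"])  # -> "reranker"
```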

Regardless of the selected approach, having individual scores for each element of the RAG triad is essential to prioritise fixes and improvements to the RAG system.

Error analysis

Another interesting concept presented in Andrew Ng’s course, specifically in this lesson, is error analysis: the process of manually examining the system’s mistakes. For a RAG system especially, this process can also give insights into how users are actually using the system, which can then help guide task prioritisation.

In traditional software engineering, errors and bugs are sometimes fixed in a FIFO manner, especially when the product is still in development and most errors have a similar priority. For a Machine Learning system, it’s best to evaluate multiple ideas for improving the model in parallel. To do so, we should examine and count the system’s errors to find the most prominent issues, i.e. those that would have the biggest impact if solved. Subject Matter Experts (SMEs) can help identify the errors that happen most commonly, helping us focus on the right issues.

A way to carry out the error analysis is to create a table and go through data samples manually:

Example of error analysis

It is beneficial to also analyse samples that are considered correct, as they might require fixing too. In any case, the error analysis gives us a “ceiling” on how much the system’s performance can improve by fixing each of the errors we found.
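A minimal sketch of this counting step, with made-up samples and error categories: tallying how many samples each category affects gives the improvement ceiling mentioned above.

```python
# Sketch of tallying a manual error-analysis table. Each labelled sample lists
# the error categories it exhibits; the categories and counts are made up.
from collections import Counter

labelled_samples = [
    {"id": 1, "errors": ["retrieval_miss"]},
    {"id": 2, "errors": ["wrong_reading_order", "retrieval_miss"]},
    {"id": 3, "errors": ["hallucination"]},
    {"id": 4, "errors": []},  # correct answers are worth reviewing too
]

counts = Counter(e for s in labelled_samples for e in s["errors"])
total = len(labelled_samples)
for error, n in counts.most_common():
    print(f"{error}: affects {n}/{total} samples -> ceiling of ~{n / total:.0%} improvement")
```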

Projects that include a model training phase should perform error analysis on both the dev and test datasets. However, considering that LLM-based systems, particularly chat-based systems, often do not require training, we frequently lack a dataset that can be used to evaluate the system’s performance and perform the error analysis. There are two main solutions:

Generate a synthetic dataset with LLMs

The benefit is that we can generate a high volume of data, which can provide more reliable evaluation metrics. On the other hand, LLMs are also often used as system evaluators, meaning we must be careful with the bias these models carry towards their own output.
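Here is a minimal sketch of synthetic test-set generation, where `call_llm` is a hypothetical helper; ideally the generator model differs from the evaluator model to limit self-preference bias.

```python
# Sketch of synthetic test-set generation: ask an LLM to write a question that
# a given chunk answers, then keep the chunk as the ground-truth context.
# `call_llm` is a hypothetical helper, not a real library function.
def generate_qa_pairs(chunks: list[str]) -> list[dict]:
    pairs = []
    for chunk in chunks:
        question = call_llm(
            "Write one question that can be answered using only this passage:\n\n"
            f"{chunk}\n\nQuestion:"
        )
        pairs.append({"question": question, "ground_truth_context": chunk})
    return pairs
```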

Collect usage data from users

While it can be expensive to collect large volumes of data from real users, this type of data gives us direct insight into how the system is actually being used. Moreover, getting SMEs involved in the improvement phase allows us to leverage their expertise, making it easier to identify subtle mistakes.

To collect usage data, we can record users’ sessions and implement a feedback mechanism. In both cases, it is beneficial to record data from the different steps of the system’s workflow. Fine-grained usage data becomes even more important when implementing advanced RAG techniques such as multiple retrievers, re-ranking models, agents, etc.
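A minimal sketch of such fine-grained usage logging, writing one JSON line per interaction; the field names are illustrative rather than a standard schema.

```python
# Sketch of fine-grained usage logging: one JSON line per request, recording
# the intermediate steps alongside the user's feedback. Field names are
# illustrative, not a standard schema.
import json, time, uuid

def log_interaction(path, question, retrieved_chunks, answer, feedback=None):
    record = {
        "session_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "question": question,
        "retrieved_chunks": retrieved_chunks,  # output of the retriever step
        "answer": answer,                      # output of the generation step
        "user_feedback": feedback,             # e.g. thumbs up/down, comment
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```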

Final thoughts

LLMs and pre-trained models greatly contribute to the democratisation of AI: deep and specialised knowledge is no longer required to use state-of-the-art models. On the other hand, it has become harder to identify promising techniques for improving LLM-based systems, which can turn the improvement process into guesswork. With that in mind, it is important to maintain a structured approach to how we evaluate the system and prioritise improvements.
