Key metrics for evaluating a RAG system

Metrics to evaluate RAG systems adapted to your specific use case

Daniel Puente Viejo
7 min read · Apr 25, 2024

This article focuses on the key metrics for evaluating a Retrieval-Augmented Generation (RAG) system. To begin, it’s important to define what a RAG system entails.

1. What is a RAG?

RAG, or Retrieval-Augmented Generation, is a system that stores documents in a database and allows users to ask questions about them. These questions can range from specific details, summaries, and comparisons to generating new text based on the stored documents.

To illustrate this, consider the following use case: Improving the efficiency with which lawyers search for information. Typically, a RAG system is developed using a database of the company’s legal documents. In this way, instead of spending a long time searching for information in lengthy documents, lawyers can simply consult the RAG system, which presents the precise information they are looking for in a very short time.

RAG systems operate through two primary phases:

The initial phase involves ingesting and processing data into a vectorized database. This process begins with collecting files and partitioning them into smaller chunks. Subsequently, embeddings are generated for each chunk using various models. These embeddings, along with the text and any associated metadata, are then stored in the vectorized database. Although the overview of this process may seem straightforward, it involves numerous detailed tasks related to preprocessing and chunking to ensure the database is professionally structured. The effectiveness of a RAG system is significantly influenced by how well this initial step is executed.
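As a rough, hedged illustration, the ingestion phase could look like the sketch below. It is not the exact pipeline described here: the fixed-size chunking, the `all-MiniLM-L6-v2` embedding model, and the use of `chromadb` as the vector store are assumptions made for the sake of the example.

```python
# Minimal ingestion sketch: chunk documents, embed the chunks, and store them
# together with their metadata in a vector database (all choices illustrative).
import chromadb
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap; real pipelines need smarter splitting."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
client = chromadb.Client()
collection = client.create_collection("legal_docs")

documents = {"contract_2023.pdf": "...full extracted text..."}  # hypothetical input

for filename, text in documents.items():
    chunks = chunk_text(text)
    embeddings = embedder.encode(chunks).tolist()
    collection.add(
        ids=[f"{filename}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"file": filename, "chunk": i} for i in range(len(chunks))],
    )
```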

Following the data ingestion, users can ask questions about the documents. At this stage, a search configuration is established, detailing the search algorithm, the maximum number of chunks to consider, and whether to incorporate metadata. All the relevant context identified during the search is then passed to a Large Language Model (LLM), such as GPT, along with the user’s initial query, and the LLM uses this information to generate a response to the user’s question. In short, this second phase is a retrieve-then-generate workflow: locate the relevant chunks, then ask the LLM to answer using them.
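Continuing the same hedged sketch (it reuses the `embedder` and `collection` objects from above; the `gpt-4o-mini` model name, the top-k value, and the prompt wording are placeholders rather than recommendations), the query phase could look roughly like this:

```python
# Minimal retrieve-then-generate sketch.
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer(question: str, k: int = 5) -> str:
    # 1. Retrieve the k most similar chunks from the vector database built above.
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=k)
    context = "\n\n".join(results["documents"][0])

    # 2. Pass the retrieved context plus the original question to the LLM.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```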

The quality of the response generated by the RAG system hinges on several critical factors, particularly the following:

  • Quality of Original Files and Preprocessing/Chunking
  • Search Strategy used
  • Large Language Model (LLM) used to generate the answer

2. Metrics

Improving the performance of the RAG system involves considering numerous factors. A straightforward yet effective method is to have a comprehensive and diverse database of questions related to the specific use case, coupled with a business team rating the quality of the answers generated by the system. However, this approach can be quite resource-intensive, as the business team would need to re-evaluate the responses for every single modification made to the system. Therefore, while this evaluation step is crucial, it’s necessary to find a quicker method to measure whether the system’s performance is improving or declining with each update. To facilitate this, two types of metrics are essential to monitor.

2.1 Dataset

Nonetheless, before explaining the metrics, it is very important to have an evaluation dataset with, at least, the following fields: (i) question, (ii) expected answer, (iii) files where the answer is found, and (iv) pages of the files where the answer is found. For example:
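The records below are purely illustrative placeholders; only the four-field structure matters.

```python
# Hypothetical evaluation records (questions, answers, files and pages are invented).
evaluation_dataset = [
    {
        "question": "What is the notice period for terminating the contract?",
        "expected_answer": "The contract can be terminated with 30 days' written notice.",
        "files": ["contract_2023.pdf"],
        "pages": [12],
    },
    {
        "question": "Which parties signed the service agreement?",
        "expected_answer": "Company A and Company B signed the agreement.",
        "files": ["service_agreement.pdf"],
        "pages": [1, 2],
    },
]
```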

Having a high-quality dataset, enriched with a variety of questions and sources verified by the business team, is crucial. The integrity and diversity of the dataset directly influence the reliability and accuracy of the outcomes.

2.2 Search Metrics

A crucial step in accurately responding to queries involves identifying the relevant documents and pages that contain the answers. When the correct sources are identified, the likelihood of accurately answering the question increases significantly. To facilitate this, three key metrics have been developed:

  • Docs Precision (Search)
  • Pages Precision (Search)
  • Positional Docs Precision (Search)

The first two metrics are similar: Docs Precision relates to identifying the correct documents as a whole, while Pages Precision refers to the specific pages within those documents. Pages Precision is particularly critical, as it is directly related to obtaining accurate answers to queries. Docs Precision, while also important, mainly indicates whether the relevant document has been located at all. Both metrics essentially measure the proportion of relevant documents or pages that were retrieved, which underscores the importance of a thoroughly vetted dataset.

For example, suppose that for a given question the 2 documents containing the answer have both been retrieved, but only 1 of the 3 relevant pages has been located. In that situation, the model will probably not answer the question perfectly, because it has not located all of the relevant and necessary information.
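A hedged sketch of how these two scores could be computed, assuming the retrieved sources and the ground truth are available as simple lists (the helper below is invented for this example and follows the article's definition, i.e. the proportion of relevant items that were found):

```python
def precision(retrieved: list, relevant: list) -> float:
    """Fraction of the relevant items that appear among the retrieved ones."""
    if not relevant:
        return 1.0
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant)


# Matches the example above: both documents found, but only 1 of the 3 pages.
docs_precision = precision(
    retrieved=["contract_2023.pdf", "annex_a.pdf"],
    relevant=["contract_2023.pdf", "annex_a.pdf"],
)                                                   # -> 1.0
pages_precision = precision(
    retrieved=[("contract_2023.pdf", 12)],
    relevant=[("contract_2023.pdf", 12), ("contract_2023.pdf", 13), ("annex_a.pdf", 2)],
)                                                   # -> 0.33
```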

The third metric, Positional Docs Precision, refines the first by also taking into account where in the search results each document appears: finding a document at the beginning of the results is valued more highly than finding it towards the end. To account for this, a penalty factor greater than 1 is applied; in this instance, 1.1 is used. This yields a vector of negative penalty values, each 1.1 times smaller in magnitude than the previous one. Imagine searching 10 chunks and needing to locate 2 documents: discovering one of those documents adds 0.5 (1/2) to the metric.

In other words, +0.5 is added each time a required file is found, and a penalty is added otherwise; the longer it takes for the documents to appear, the more penalties accumulate. The penalty vector as a whole sums to -1, while each correct document found contributes 1/N, with N the number of sources that need to be found, so together the documents add up to 1. The raw metric therefore lies in the range [-1, 1] and is finally normalized to the [0, 1] range.

The idea behind this metric is to give another interpretation to the Docs Precision metric, which in some cases gives 100% accuracy but does not give full information on the quality of that search.
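The description leaves some implementation freedom, but a minimal sketch of one possible reading, assuming penalties are applied at every position until all required documents have been found, could look like this:

```python
import numpy as np


def positional_docs_precision(retrieved_docs: list[str],
                              required_docs: set[str],
                              penalty_factor: float = 1.1) -> float:
    """One possible implementation: reward early hits, penalize late ones."""
    k, n = len(retrieved_docs), len(required_docs)

    # Geometric penalty vector: each entry 1.1x smaller in magnitude than the
    # previous one, normalized so that the whole vector sums to -1.
    raw = np.array([(1.0 / penalty_factor) ** i for i in range(k)])
    penalties = -raw / raw.sum()

    score, found = 0.0, set()
    for i, doc in enumerate(retrieved_docs):
        if found == required_docs:
            break                      # everything located, stop penalizing
        if doc in required_docs and doc not in found:
            found.add(doc)
            score += 1.0 / n           # each required document contributes 1/N
        else:
            score += penalties[i]      # early misses cost more than late ones

    return (score + 1.0) / 2.0         # map [-1, 1] to [0, 1]
```

With 10 retrieved chunks and 2 required documents found in the first two positions, the score is 1.0; if neither document is found at all, the raw score is -1, which normalizes to 0.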

2.3 Answer Metrics

Typically, locating the source greatly increases the chances of accurately answering the question. Nonetheless, this isn’t a guaranteed outcome, necessitating an alternative evaluation method. Moreover, employing more chunks will elevate the metrics, though this doesn’t always translate to better performance. Despite LLMs having the capacity to process extensive contexts, an overload of information can actually degrade response quality. Additionally, both the search process and response times are likely to lengthen with increased information.

Evaluating answers is more challenging because the generation step is non-deterministic, so improvements in these metrics do not guarantee better performance. We will introduce five metrics; if a majority of them improves over prior iterations, better outputs are likely, but this should always be confirmed with the business team. The metrics are as follows:

  • Docs Precision (Answer)
  • Pages Precision (Answer)
  • Cosine
  • Hallucination
  • Evaluation 0/5

The operation of the first two metrics aligns with the previously described methodology, with a key difference being the use of answer references instead of all context segments. This requires providing a clear prompt to the LLM, instructing it to always cite the sources of its assertions in a specific, parseable format to minimize fabrications.
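For instance, one hedged way to do this (the `[file, p. X]` citation format and the small parser below are assumptions, not a convention fixed by the article) is to instruct the LLM to cite in a regular pattern and then extract the references:

```python
import re

# Assumed prompt fragment asking the LLM to cite every claim in a fixed format.
CITATION_INSTRUCTION = (
    "After every statement, cite its source as [file_name, p. page_number]."
)

CITATION_PATTERN = re.compile(r"\[([^,\]]+),\s*p\.\s*(\d+)\]")


def extract_citations(answer_text: str) -> list[tuple[str, int]]:
    """Return the (file, page) pairs the model actually cited in its answer."""
    return [(f.strip(), int(p)) for f, p in CITATION_PATTERN.findall(answer_text)]


cited = extract_citations("The notice period is 30 days [contract_2023.pdf, p. 12].")
# cited -> [("contract_2023.pdf", 12)]; these pairs are then compared with the
# ground-truth files and pages using the same precision function as before.
```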

The Cosine similarity metric compares the embedding of the expected response with that of the LLM response. Although this metric does not always provide accurate information on its own, aggregating the average values of several instances can provide a rough estimate of the quality of the response.
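A small sketch of this comparison, reusing the assumed `embedder` from the ingestion example (any embedding model would work):

```python
import numpy as np


def answer_cosine(expected_answer: str, generated_answer: str) -> float:
    """Cosine similarity between the embeddings of the expected and generated answers."""
    a, b = embedder.encode([expected_answer, generated_answer])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```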

Another metric assesses the presence of hallucinations in binary form, which requires very precise instructions to the LLM. The idea is to present the context together with the response generated by the LLM, and to specify that if the response cannot be logically deduced from the given context, the metric must return 1, regardless of whether the response happens to be correct. This is because a correct answer that cannot be deduced from the provided context is still considered a hallucination. This strict criterion is intended to virtually eliminate hallucinations.

Keep in mind that there are two kinds of errors: a plausible-sounding answer that is not actually supported by the context, and an answer where the model has clearly misunderstood the question.

The first type of error is particularly perilous because a user unfamiliar with the subject might assume the incorrect answer is accurate. In contrast, the second type of error is easier to handle, as it becomes apparent that the model has misunderstood the query. By adopting this strategy, we aim to minimize the occurrence of the first.
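As a hedged example, such a binary hallucination check could be prompted along the following lines, reusing the OpenAI client from the earlier sketch; the wording and the judge model are placeholders:

```python
HALLUCINATION_PROMPT = """You are a strict evaluator.
Context:
{context}

Answer to evaluate:
{answer}

If the answer cannot be logically deduced from the context above, return 1,
even if the answer happens to be factually correct. Otherwise return 0.
Return only the digit."""


def hallucination_score(context: str, answer: str) -> int:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": HALLUCINATION_PROMPT.format(context=context, answer=answer),
        }],
    )
    return int(response.choices[0].message.content.strip())
```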

The last metric is straightforward: the LLM is asked to rate the generated answer on a scale from 0 to 5, given the expected response and a set of predefined evaluation criteria. Remember that the rules for assigning one score or another must be precisely specified and adapted to your use case.
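Following the same judge pattern, and again with a placeholder rubric and model, the 0-5 rating could be obtained like this:

```python
RATING_PROMPT = """Rate the candidate answer from 0 to 5 against the expected answer.
Use these (example) criteria: factual agreement, completeness, and conciseness.

Expected answer: {expected}
Candidate answer: {candidate}

Return only an integer from 0 to 5."""


def rating_score(expected: str, candidate: str) -> int:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": RATING_PROMPT.format(expected=expected, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())
```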

3. Conclusion

In summary, this article outlines a series of metrics that can be applied to a RAG project. While these metrics do not always guarantee improved performance, they provide insight into the model’s behavior, allowing us to assess, with a high degree of confidence, whether the system has improved compared to its previous iteration, without the need for the business team to review every single response.

Thanks for Reading!

Thank you very much for reading the article. If you liked it, don’t hesitate to follow me on LinkedIn.

#deeplearning #machinelearning #llm
