RAG: Part 7: Evaluation

Mehul Jain
8 min read · Apr 15, 2024


How do you know you can trust what you have built? When evaluating RAG, it's essential to consider metrics that assess both the retrieval and the generation components.


In this series, we have seen how chunking, embedding, retrieval, and summarization work. In this blog, I will cover various evaluation metrics for retrieval and generation.

But before we look at individual metrics, there are two broad evaluation categories that help in selecting the right metrics for a given use case: online and offline.

Online metrics

These metrics consider user interactions, such as clicks, dwell time, and explicit feedback. Online metrics are more realistic because they reflect how actual users experience an information retrieval system. Examples include click-through rate (CTR), user engagement metrics, and user satisfaction metrics.

Offline metrics

These are measured in an isolated environment before deploying a new information retrieval system. They look at whether a particular set of relevant results is returned when retrieving items with the system. Offline metrics are based on human relevance judgments and are used to predict the system's performance before deployment. Examples include recall@K, mean reciprocal rank (MRR), mean average precision at K (MAP@K), and normalized discounted cumulative gain (NDCG).

In RAG, evaluation metrics are used for both the retrieval and the generation stages.

Retrieval Evaluation Metrics

Retrieval accuracy, also known as precision, measures how well an information retrieval system retrieves the relevant documents requested by a user. It is calculated as the ratio of the number of relevant documents retrieved to the total number of documents retrieved.

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)

Retrieval accuracy is often evaluated in conjunction with recall, which measures the proportion of relevant documents retrieved out of all the relevant documents in the collection.

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents in the collection)
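
As a quick illustration, here is a minimal sketch of these two formulas over plain lists of document IDs; the IDs and relevance judgments are made up for illustration.

```python
# A minimal sketch of precision and recall over document IDs.

def precision(retrieved: list[str], relevant: set[str]) -> float:
    if not retrieved:
        return 0.0
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)

def recall(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant)

retrieved = ["d1", "d4", "d7", "d9"]   # what the retriever returned
relevant = {"d1", "d2", "d7"}          # human-judged relevant documents
print(precision(retrieved, relevant))  # 2 / 4 = 0.5
print(recall(retrieved, relevant))     # 2 / 3 ≈ 0.67
```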

In addition to precision and recall, there are several other evaluation metrics:

  1. Mean Reciprocal Rank (MRR): For each query, take the reciprocal of the rank at which the first relevant document appears; MRR is the average of this value over a set of queries.
  2. Mean Average Precision (MAP): The mean, over a set of queries, of the average precision computed across each query's relevant documents.
  3. Normalized Discounted Cumulative Gain (NDCG): Sums the graded relevance of the retrieved documents, discounted by their position in the ranked list, and normalizes by the score of the ideal ranking.
  4. Precision at K (Precision@K): The precision of the top K retrieved documents, where K is a user-defined cutoff.
  5. Recall at K (Recall@K): The recall of the top K retrieved documents, where K is a user-defined cutoff.
  6. Mean Reciprocal Rank at K (MRR@K): The same as MRR, but only the top K results are considered when looking for the first relevant document.
  7. Discounted Cumulative Gain at K (DCG@K): The unnormalized version of NDCG, accumulating position-discounted relevance gains up to the Kth document.
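
For reference, here is a minimal sketch of how several of these ranked-retrieval metrics can be computed for a single query; averaging the reciprocal ranks (or average precisions) over many queries gives MRR (or MAP). The document IDs and relevance judgments are again made up for illustration.

```python
import math

# Ranked-retrieval metrics for a single query.
# `ranking` is the ordered list of retrieved doc IDs, `relevant` the judged set.

def reciprocal_rank(ranking, relevant):
    for i, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / k

def recall_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / len(relevant)

def dcg_at_k(gains, k):
    # gains[i] is the graded relevance of the document at rank i + 1
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

ranking = ["d4", "d1", "d7", "d9", "d2"]
relevant = {"d1", "d2", "d7"}
print(reciprocal_rank(ranking, relevant))    # first hit at rank 2 -> 0.5
print(precision_at_k(ranking, relevant, 3))  # 2 / 3
print(recall_at_k(ranking, relevant, 3))     # 2 / 3
print(ndcg_at_k([0, 1, 1, 0, 1], 5))         # binary gains in ranked order
```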

Generation Evaluation Metrics

Generation evaluation metrics assess the quality of generated outputs. They are designed to measure various aspects of the generated text, such as fluency, coherence, semantic accuracy, and relevance to the input prompt. Some examples include:

  1. BLEU (Bilingual Evaluation Understudy): This metric measures the n-gram overlap between the generated text and a set of reference sentences from the precision side: how much of the generated text appears in the references. It is commonly used for machine translation and other text-generation tasks.
  2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): This metric measures the overlap between the generated text and a set of reference sentences from the recall side: how much of the reference text is covered by the generation. It is commonly used for summarization tasks.
  3. Meteor (Metric for Evaluation of Translation with Explicit ORdering): This metric measures the overlap between the generated text and a set of reference sentences, taking into account the order of the words. It is commonly used for machine translation and other text-generation tasks.
  4. BERTScore: This metric measures the similarity between the generated text and a set of reference sentences using a pre-trained language model. It is designed to capture both semantic and lexical similarity between the generated text and the references.
  5. ParaScore: This metric is an evaluation metric for paraphrase generation that combines the merits of reference-based and reference-free metrics. It specifically models lexical divergence between the generated text and the reference sentences.
  6. PPLX (Perplexity): Perplexity is the inverse of the geometric mean of the per-token probabilities the model assigns to a reference text, i.e. the exponential of the average negative log-likelihood. It is commonly used for LLMs and measures how well the model predicts the reference sentences; lower is better.
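
To make the intuition concrete, here is a simplified sketch of the ideas behind BLEU-style n-gram precision, ROUGE-style n-gram recall, and perplexity. Real evaluations should use established packages (for example nltk, rouge-score, or Hugging Face evaluate), which add smoothing, brevity penalties, and multi-reference handling.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n=1):
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities the model assigned to each
    # reference token. Lower perplexity means a better fit.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_overlap(candidate, reference, n=1))  # (0.833..., 0.833...)
print(perplexity([-0.1, -0.5, -2.3, -0.7]))      # hypothetical log-probs
```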

In RAG, as in most GenAI use cases, the major problem is that we usually don't have ground truth.

RAGAS is the solution:

Refer: doc, paper & prompts

It proposes a framework for evaluating Retrieval-Augmented Generation (RAG) models without relying on manual human annotations.
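
For reference, wiring a dataset into RAGAS typically looks something like the sketch below. It follows the package's commonly documented interface; exact imports, metric names, and required columns vary between ragas versions, so treat it as illustrative rather than definitive.

```python
# Illustrative only: column names and metric imports follow the commonly
# documented ragas interface and may differ in the version you install.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

samples = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy],  # no ground truth required
)
print(result)  # per-metric scores between 0 and 1
```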

The metrics below don't require ground truth (GT).

1) Faithfulness

This measures how well the answer matches the information given. It looks at whether everything said in the answer can be found in the original context. The score ranges from 0 to 1, with higher scores meaning better matches. If every claim in the answer can be found in the original context, the answer is considered faithful.

The authors used the two prompts below:

Prompt 1: Given a question and answer, create one or more statements from each sentence in the given answer. question: [question] answer: [answer]

Prompt 2: Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format. statement: [statement 1] … statement: [statement n]

This ensures that the generated answers remain true to the retrieved context, maintaining consistency and avoiding contradictions.
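
Putting the two prompts together, the score is simply the fraction of extracted statements that receive a "Yes" verdict. The sketch below assumes a hypothetical llm(prompt) helper that returns your model's text completion, and it checks statements one at a time for simplicity, whereas the paper's second prompt batches them.

```python
# Sketch of faithfulness = supported statements / total statements.
# `llm(prompt) -> str` is a hypothetical helper wrapping your LLM of choice.

def faithfulness_score(question: str, answer: str, context: str) -> float:
    # Prompt 1: break the answer into simple statements, one per line.
    statements = [
        s for s in llm(
            "Given a question and answer, create one or more statements "
            "from each sentence in the given answer.\n"
            f"question: {question}\nanswer: {answer}"
        ).splitlines()
        if s.strip()
    ]
    if not statements:
        return 0.0

    # Prompt 2 (simplified to one statement per call): Yes/No verdicts.
    supported = 0
    for statement in statements:
        verdict = llm(
            "Consider the given context and the following statement, then "
            "determine whether it is supported by the information present "
            "in the context. Answer Yes or No.\n"
            f"context: {context}\nstatement: {statement}"
        )
        supported += verdict.strip().lower().startswith("yes")

    return supported / len(statements)
```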

2) Answer Relevance

This checks how directly the answer addresses the question. Lower scores mean the answer is less relevant, often because it is incomplete or contains unnecessary information; higher scores mean the answer is more relevant. The metric is the average similarity between the original question and several artificial questions generated from the answer.

The authors used the prompt below:

Prompt: Generate a question for the given answer. answer: [answer]

This requires that the generated answers are directly pertinent to the user’s query, effectively addressing the core inquiry.

Answer Relevance = (1 / N) × Σ sim(Egi, Eo), for i = 1 … N

Where:
  • Egi is the embedding of the i-th question generated from the answer.
  • Eo is the embedding of the original question.
  • N is the number of generated questions.
  • sim is the cosine similarity between the two embeddings.
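
In code, the calculation looks roughly like the sketch below, assuming the same hypothetical llm(prompt) helper plus an embed(text) helper that returns a NumPy vector from your embedding model.

```python
import numpy as np

# Hypothetical helpers: llm(prompt) -> str, embed(text) -> np.ndarray.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(question: str, answer: str, n: int = 3) -> float:
    # Generate n artificial questions from the answer, then compare each
    # of them with the original question in embedding space.
    generated = [
        llm(f"Generate a question for the given answer.\nanswer: {answer}")
        for _ in range(n)
    ]
    e_o = embed(question)
    return sum(cosine(embed(g), e_o) for g in generated) / n
```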

3) Context Relevance

This metric measures how relevant the retrieved information is to both the question and context. Scores range from 0 to 1, with higher scores showing better relevance. The retrieved context should ideally only include important details to answer the question. To calculate this, we find sentences in the retrieved context that are relevant to the question.

The authors used the prompt below:

Prompt: Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase “Insufficient Information”. While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.

Context Relevance = (Number of relevant sentences extracted) / (Total number of sentences in the retrieved context)

Where the extracted relevant sentences form the set S.
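
A rough sketch of that ratio, again with the hypothetical llm(prompt) helper and naive sentence splitting on full stops:

```python
def context_relevance(question: str, context: str) -> float:
    # Extract candidate sentences with the prompt above.
    extracted = llm(
        "Please extract relevant sentences from the provided context that can "
        "potentially help answer the following question. If no relevant "
        "sentences are found, return the phrase 'Insufficient Information'.\n"
        f"question: {question}\ncontext: {context}"
    )
    total = [s for s in context.split(".") if s.strip()]
    if not total or "insufficient information" in extracted.lower():
        return 0.0
    relevant = [s for s in extracted.split(".") if s.strip()]
    return len(relevant) / len(total)
```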

4) Aspect Critique

This evaluates submissions based on specific criteria like correctness and harmlessness. The evaluation results are binary, showing if the submission meets the criteria or not.

Users can also define their own criteria:

  1. Bias: This assesses whether the system generates biased responses; measuring it makes it possible to add mechanisms that handle bias in both the retrieval and generation steps.
  2. Toxicity: This evaluates whether the system generates offensive or harmful responses, so that safeguards can be added to the generation process.
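
A binary critique of this kind can be sketched as a single Yes/No judgment per criterion, again with the hypothetical llm(prompt) helper; in practice, frameworks often sample several verdicts and take a majority vote to reduce noise.

```python
# Sketch of a binary aspect critique: one Yes/No verdict per criterion.
# `llm(prompt) -> str` is the same hypothetical helper as above.

def aspect_critique(submission: str, criterion: str) -> bool:
    verdict = llm(
        f"Does the submission meet the following criterion: {criterion} "
        "Answer Yes or No.\n"
        f"submission: {submission}"
    )
    return verdict.strip().lower().startswith("yes")

# User-defined criteria, for example:
# aspect_critique(answer, "Is the submission free of offensive or harmful content?")
```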

The metrics below require ground truth (GT).

5) Context Precision

Context Precision measures if all the important information in the context is ranked highly. Ideally, the most relevant parts should be at the top. This metric uses the question, ground truth, and contexts to calculate scores between 0 and 1. Higher scores mean better precision.

6) Context Recall

Context recall assesses how well the retrieved context matches the annotated answer, treated as the correct information. To calculate context recall, each sentence in the correct answer is checked to see if it’s found in the retrieved context.
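
Both ground-truth context metrics reduce to simple ratios once the per-chunk and per-sentence judgments are available. The sketch below makes two simplifying assumptions: each retrieved chunk already carries a 0/1 relevance judgment, and a ground-truth sentence counts as supported if it appears verbatim in the context (real implementations use an LLM for that attribution step).

```python
def context_precision(chunk_is_relevant: list[int]) -> float:
    # Average of precision@k taken at every position k that holds a relevant
    # chunk, so relevant chunks ranked near the top raise the score.
    precisions, hits = [], 0
    for k, rel in enumerate(chunk_is_relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def context_recall(ground_truth_sentences: list[str], context: str) -> float:
    # Fraction of ground-truth sentences that can be attributed to the context.
    if not ground_truth_sentences:
        return 0.0
    supported = sum(1 for s in ground_truth_sentences if s.lower() in context.lower())
    return supported / len(ground_truth_sentences)

print(context_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.83
```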

7) Context entity recall

This metric measures how well the retrieved context recalls entities compared to what’s in the ground truth. It’s useful for fact-based tasks like tourism help desks or historical question answering. It helps evaluate how well the retrieval mechanism finds relevant entities from the ground truth.

Context Entity Recall = |CE ∩ GE| / |GE|

Where:
  • GE is the set of entities present in the ground truth.
  • CE is the set of entities present in the retrieved context.
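
In code this is a single set operation once the entities have been extracted (for example with an NER model or an LLM, which is not shown here):

```python
def context_entity_recall(context_entities: set[str],
                          ground_truth_entities: set[str]) -> float:
    if not ground_truth_entities:
        return 0.0
    return len(context_entities & ground_truth_entities) / len(ground_truth_entities)

print(context_entity_recall(
    {"Eiffel Tower", "Paris", "1889"},   # entities found in the retrieved context
    {"Eiffel Tower", "Paris"},           # entities in the ground truth
))  # 2 / 2 = 1.0
```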

8) Answer Semantic similarity

Answer Semantic Similarity evaluates how closely the generated answer matches the ground truth in terms of meaning. Assessing semantic similarity between answers provides valuable insights into response quality. This evaluation uses a cross-encoder model to calculate the semantic similarity score.
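
A minimal sketch using the sentence-transformers CrossEncoder class is shown below; the model name is just one publicly available STS cross-encoder and is an assumption, not necessarily the model RAGAS ships with.

```python
from sentence_transformers import CrossEncoder

# An STS cross-encoder scores a (candidate, reference) pair directly.
# The model name here is an assumption; any STS-trained cross-encoder works.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

def answer_similarity(answer: str, ground_truth: str) -> float:
    return float(model.predict([(answer, ground_truth)])[0])

print(answer_similarity(
    "Paris is the capital of France.",
    "The capital of France is Paris.",
))  # close to 1.0 for near-paraphrases
```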

9) Answer Correctness

Answer Correctness evaluates how accurate the generated answer is compared to the ground truth. It considers two key aspects: semantic similarity and factual accuracy between the generated answer and the ground truth. These aspects are combined using a weighted scheme to calculate the answer correctness score. Users can also use a threshold value to round the score to binary if needed.
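
The combination itself is just a weighted sum once the two component scores are available. The sketch below uses an illustrative 0.75/0.25 split between factual accuracy (for example an F1 over claims shared with the ground truth) and semantic similarity, plus the optional rounding to binary.

```python
def answer_correctness(factual_f1, semantic_sim,
                       w_factual=0.75, w_semantic=0.25, threshold=None):
    # Weighted blend of factual accuracy and semantic similarity; the
    # 0.75/0.25 weights are illustrative, not a fixed rule.
    score = w_factual * factual_f1 + w_semantic * semantic_sim
    if threshold is not None:
        return 1.0 if score >= threshold else 0.0
    return score

print(answer_correctness(0.8, 0.9))                  # 0.825
print(answer_correctness(0.8, 0.9, threshold=0.85))  # rounded down to 0.0
```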

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.
