Evaluation of RAG Metrics using RAGAS

Harika Samala
Walmart Global Tech Blog
10 min read · Jul 5, 2024

In the AI domain, Large Language Models (LLMs) are hogging the limelight. These innovative marvels serve as the intelligence core for sophisticated chatbots and a range of applications in natural language processing (NLP). LLMs can answer user questions in various contexts by cross-referencing authoritative knowledge sources.

This blog explores the challenges encountered by LLMs and proposes a solution to mitigate them by introducing the RAG technique. We will delve into the specifics of the metrics used to evaluate the effectiveness of a RAG system, employing RAGAS (Retrieval Augmented Generation Assessment) for the analysis.

LLM Limitations:

The inherent nature of LLMs introduces unpredictability in their responses. Additionally, LLM training data is static, which places a cut-off date on the model's knowledge. This leads to the challenges elaborated below.

  1. Presenting false information when it does not have the answer.
  2. Presenting out-of-date or generic information when the user expects a specific, current response.
  3. Creating a response from non-authoritative sources.
  4. LLMs can accept only a restricted number of tokens, meaning they cannot accommodate the entire knowledge base as context.
  5. Generating erroneous responses caused by confusion in terminology, where various training sources employ the same terms to describe different concepts.

Overcoming Limitations of LLM With RAG:

The constraints mentioned earlier can be resolved through the implementation of Retrieval Augmented Generation (RAG).

RAG redirects the LLM to retrieve relevant information from authoritative, pre-determined knowledge sources. Organizations have greater control over the generated text output, and users gain insights into how the LLM generates the response.

RAG can be beneficial in various tasks like question answering, dialogue generation, summarization, and more NLP use cases.

RAG

Meta AI researchers introduced the RAG method to address knowledge-intensive tasks.

It is a method that combines an information retrieval component with a text generator model to address knowledge-intensive tasks. Its internal knowledge can be modified efficiently without needing to retrain the entire model.

This blog post delves into the evaluation of a RAG model using metrics from the RAGAS library, shedding light on its components and practical application.

RAG Components

Indexing Stage


The indexing stage is the first phase, where data is ingested into a vector store that contains text, metadata, and embeddings.
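As an illustration, here is a minimal sketch of the indexing stage using LangChain with FAISS as the vector store; the documents, metadata, and deployment name are hypothetical placeholders.

from langchain_community.vectorstores import FAISS
from langchain_openai import AzureOpenAIEmbeddings

# hypothetical knowledge-base snippets; in practice these come from your documents
docs = [
    "Refunds are processed within 5 business days.",
    "Gift cards purchased online never expire.",
]
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-ada-002")

# embed the texts and store text + metadata + embeddings in the vector store
vector_store = FAISS.from_texts(
    docs,
    embedding=embeddings,
    metadatas=[{"source": "faq"}] * len(docs),
)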

Querying Stage

Querying is the next stage, where the system uses the search term to scour the vector database for the chunks most relevant to the question, which are then passed to the LLM. This ensures that the LLM responds with the retrieved context taken into account.
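Continuing the sketch above, querying retrieves the top-k most similar chunks and hands them to the LLM as context; the prompt format is an assumption for illustration, and azure_model is the AzureChatOpenAI client set up later in this post.

# retrieve the chunks most similar to the user question
question = "Do gift cards expire?"
relevant_chunks = vector_store.similarity_search(question, k=3)
context = "\n".join(chunk.page_content for chunk in relevant_chunks)

# ask the LLM to answer using only the retrieved context
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
response = azure_model.invoke(prompt)
print(response.content)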

Metrics:

RAG metrics can be classified into components, each serving a distinct purpose:

1. Generation Metrics:

Metrics for evaluating how well the model generates a response, taking into account the retrieved information and the original input.

a. Answer Relevancy: How relevant the answer is to our question.

b. Faithfulness: Factual consistency of the generated answer with the context.

2. Retrieval Metrics:

Metrics for evaluating whether the retriever was able to find passages of text that contain information related to the given input or prompt.

a. Context Precision: In simple terms, how relevant the retrieved context is to the question asked.

b. Context Recall: Whether the retriever is able to retrieve all the relevant context pertaining to the ground truth.

3. Comprehensive Metrics:

From the aforementioned points, we derive metrics at the component level. However, for a comprehensive evaluation of the entire model from start to finish, assessing Answer Correctness and Answer Similarity offers a more thorough perspective.

A breakdown of the input expectations for each metric is provided below, facilitating a better understanding of the evaluation criteria.

Answer Relevancy: Question, Generated Answer
Faithfulness: Generated Answer, Context
Context Precision: Ground Truth, Context
Context Recall: Ground Truth, Context
Answer Similarity: Generated Answer, Ground Truth
Answer Correctness: Generated Answer, Ground Truth

Practical Application using FiQA (Financial Question Answering) Dataset:

To gain a hands-on understanding of these metrics, it is crucial to observe their performance on a dataset in practice. Ragas and its community have widely recognized FiQA as a standard introductory test dataset. Hence, we utilise the Financial Question Answering (FiQA) dataset; tailored for the financial domain, it provides a rich source of diverse examples for thorough evaluation.

We chose this dataset for the following reasons.

1. It contains highly specialized financial knowledge unlikely to be present in the training data of GPT models.

2. It was initially designed to assess Information Retrieval capabilities and thus provides well-annotated knowledge snippets that serve as standard answers (ground truth).

3. A subset called ragas_eval is part of the FiQA dataset and contains a contexts field. This field is mandatory for obtaining metrics like context recall and context precision. Hence, we utilise ragas_eval for this observation.

In this analysis, we specifically utilize the “ragas_eval” subset, comprising context, answers, questions, and ground truth.
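For reference, the subset can be loaded with the datasets library; the dataset name and config below follow the ragas quickstart and should be treated as assumptions if the dataset has since moved.

from datasets import load_dataset

# "explodinggradients/fiqa" with the "ragas_eval" config is the copy used in the ragas quickstart
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
print(fiqa_eval)  # expected fields: question, ground_truths, answer, contexts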

Example

The technical intricacies of how RAGAS computes each metric are discussed in detail below.

Answer Similarity:

Input: Generated Answer & Ground Truth

Scores the semantic similarity of the ground truth with the generated answer.

Technical Definition:

Matrix multiplication between the ground truth embeddings and the answer embeddings.

Default embeddings: OpenAI embeddings (Hugging Face embeddings are suggested for best results)

similarity = Matrix Multiplication(embedding_1_normalized, embedding_2_normalized.T)

Default threshold: 0.5

Model: Cross Encoder (we pass both sentences of a pair simultaneously to the Transformer network, unlike a bi-encoder where we pass the sentences separately)

If the embeddings are in close proximity, the answer similarity is higher.
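A minimal sketch of this computation, assuming each text has already been embedded into a single vector (normalising and taking the dot product is equivalent to cosine similarity):

import numpy as np

def answer_similarity(ground_truth_embedding, answer_embedding):
    # normalise both embeddings to unit length, then take the dot product
    gt = ground_truth_embedding / np.linalg.norm(ground_truth_embedding)
    ans = answer_embedding / np.linalg.norm(answer_embedding)
    return float(np.dot(gt, ans))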

Answer Correctness Using RAGAS:

Input: Generated Answer & Ground Truth

Accuracy of the generated answer when compared to the ground truth.

RAGAS passes an instruction to the LLM under the hood and then processes the calculation as required. The following is the instruction for the answer correctness metric.

Technical Definition:

For a low-scoring answer, these might be the extracted statements:

  • TP: [Einstein was born in 1879]
  • FP: [Einstein was born in Spain]
  • FN: [Einstein was born in Germany]

Formula:

The F1 score, representing factual similarity, is calculated using the formula:
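Based on the ragas documentation, the F1 here takes the standard form over the extracted statements:

F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))

For the example above, TP = 1, FP = 1 and FN = 1, giving F1 = 1 / (1 + 0.5 × 2) = 0.5.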

Here, TP, FP, and FN are the absolute numbers of statements classified as true positive, false positive, and false negative by the instruction above. Hence, the F1 score is completely dependent on the statements obtained through the LLM instruction.

Final Score Calculation: The overall score is computed as a weighted average of the F1 score and the similarity score already obtained from the answer similarity metric described above.

This meticulous approach ensures an assessment of answer correctness that considers both factual accuracy and overall similarity.
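A minimal sketch of the final score; the weights shown are assumptions based on the ragas defaults we observed, so verify them against the version you install.

f1_score = 0.5            # factual F1 from the TP/FP/FN example above
similarity_score = 0.9    # hypothetical semantic similarity score
factuality_weight = 0.75  # assumed default weight
similarity_weight = 0.25  # assumed default weight
answer_correctness = factuality_weight * f1_score + similarity_weight * similarity_score
print(answer_correctness)  # 0.6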

Example 1:

Example 2:

Answer Relevancy:

Input: Question & Generated Answer

How relevant is the generated answer to the prompt

Technical Definition:

For answers that are classified as noncommittal ("I do not know" style answers), the noncommittal flag is set to 1. In the calculation of the score, the similarity is multiplied by (1 - noncommittal), i.e. by 0, which forces the final score to 0. This scoring system is based on the idea that if an answer accurately addresses a question, it is highly likely that the original question can be reconstructed solely from the answer. Therefore, a score of 0 indicates that the answer is not relevant to the question.
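As we understand it from the ragas source, the metric asks the LLM to generate questions back from the answer, embeds them, and averages their cosine similarity with the original question. A minimal sketch with hypothetical inputs:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_question_emb, generated_question_embs, noncommittal):
    # average similarity between the original question and the questions
    # regenerated from the answer, zeroed out for noncommittal answers
    mean_sim = np.mean([cosine(original_question_emb, g) for g in generated_question_embs])
    return float(mean_sim * (1 - noncommittal))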

Example 1:

Example 2:

Faithfulness:

Input: Generated Answer & Context

Factual consistency of the generated answer against the given context.

Technical Definition:

Step 1: Break the generated answer into individual statements.

Statements:

  • Statement 1: “Einstein was born in Germany.”
  • Statement 2: “Einstein was born on 20th March 1879.”

Step 2: For each of the generated statements, verify if it can be inferred from the given context.

  • Statement 1: Yes
  • Statement 2: No

Step 3: Use the formula depicted above to calculate faithfulness.
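Restating the formula referenced above for this example:

faithfulness = (number of statements that can be inferred from the context) / (total number of statements) = 1 / 2 = 0.5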

Example 1:

Context Recall:

Input: Ground Truth & Context

Whether each statement from the ground truth can be found in the retrieved context.

Technical Definition:

Step 1: Break the ground truth answer into individual statements

  • Statement 1: “France is in Western Europe.”
  • Statement 2: “Its capital is Paris.”

Step 2: For each of the ground truth statements, verify if it can be attributed to the retrieved context.

  • Statement 1: Yes
  • Statement 2: No

Step 3: Use the formula depicted above to calculate context recall.
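Restating the formula referenced above for this example:

context recall = (ground truth statements attributable to the retrieved context) / (total ground truth statements) = 1 / 2 = 0.5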

Example 1:

Example 2:

Context Precision

Input: Ground Truth & Context

Whether all the ground-truth-relevant items present in the contexts are ranked higher or not; ideally, all the relevant chunks should appear at the top ranks.

Technical Definition:

Step 1: For each chunk in retrieved context, check if it is relevant or not relevant in the ground truth for the given question.

Step 2: Calculate precision @k for each chunk in the context.

Step 3: Calculate the mean of precision@k to arrive at the final context precision score.
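A minimal sketch of this calculation, assuming a hypothetical relevance flag per retrieved chunk in rank order (1 = relevant to the ground truth, 0 = not relevant):

relevance = [1, 0, 1]  # hypothetical top-3 retrieval result

precisions_at_relevant_ranks = []
relevant_so_far = 0
for k, rel in enumerate(relevance, start=1):
    relevant_so_far += rel
    if rel:  # precision@k contributes only at ranks that hold a relevant chunk
        precisions_at_relevant_ranks.append(relevant_so_far / k)

context_precision = sum(precisions_at_relevant_ranks) / max(sum(relevance), 1)
print(context_precision)  # (1/1 + 2/3) / 2 ≈ 0.83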

Example 1:

Example 2:

RAGAS Implementation:

Loading Libraries and Initial Setup

We need to import a few libraries and set up keys for implementing RAGAS. We are using Azure OpenAI for the LLM setup as below.

import os
import pandas as pd
from datasets import Dataset, load_dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from langchain_openai.chat_models import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

os.environ['OPENAI_API_KEY'] = '****************'

# the LLM that ragas uses under the hood for the LLM-based metrics
azure_model = AzureChatOpenAI(
    openai_api_version="***********",
    azure_endpoint="**************",
    azure_deployment="gpt-35-turbo-test",
    openai_api_type='azure',
    validate_base_url=False,
)

# init the embeddings for answer relevancy, answer correctness and answer similarity
azure_embeddings = AzureOpenAIEmbeddings(
    openai_api_version="******",
    azure_endpoint="********",
    azure_deployment="text-embedding-ada-002",
)

Preparing Dataset for Evaluation

# df is the ragas_eval subset of FiQA loaded into a pandas DataFrame
data_samples = {
    'question': list(df['question']),
    'answer': list(df['answer']),
    'contexts': list(df['contexts']),
    'ground_truths': list(df['ground_truths']),
}
dataset = Dataset.from_dict(data_samples)
rag_df = pd.DataFrame(dataset)
rag_eval_dataset = Dataset.from_pandas(rag_df)

Implementing RAGAS Evaluation

result = evaluate(
    rag_eval_dataset,
    metrics=[
        answer_correctness, answer_similarity,
        answer_relevancy, faithfulness,
        context_recall, context_precision,
    ],
    llm=azure_model,
    embeddings=azure_embeddings,
)

Once we call result.to_pandas(), we obtain a DataFrame of metrics for each row.
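For example, the per-row scores can be aggregated into the overall numbers discussed below (the column names are assumed to mirror the metric names):

df_metrics = result.to_pandas()
print(df_metrics[[
    "answer_correctness", "answer_similarity", "answer_relevancy",
    "faithfulness", "context_recall", "context_precision",
]].mean())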

Analysis of Overall Metrics

1. Context Recall (0.78):

Clarification: This metric assesses whether the retriever can retrieve the necessary context for a query. If, for instance, only half of a question is answered, it may lead to a lower recall score.

2. Context Precision (0.6):

Interpretation: The retriever is reasonably good at selecting relevant information from the context, although it may not be perfect.

Clarification: This metric gauges how well the retriever can prioritize and select relevant content from the context. A score of 0.6 suggests it is more often right than wrong.

3. Answer Relevancy (0.75):

Interpretation: Answers are generally relevant, but there might be some deviations or less-focused responses.

Clarification: This metric assesses how well the generated answers align with the context. A score of 0.75 suggests generally good relevancy but with some room for improvement.

4. Answer Similarity (0.88):

Interpretation: A high score indicates that the generated answers closely resemble the expected answers, displaying a good understanding of the query and context.

Clarification: This metric assesses the similarity between the generated answers and the expected ones. A high score implies a strong understanding and alignment.

5. Answer Correctness (0.48):

Interpretation: While some answers are correct, a significant portion may not be entirely accurate.

Clarification: This metric evaluates the correctness of the generated answers. A score of 0.48 suggests there is room for improvement, and some answers may be inaccurate.

Overall Analysis:

Strengths: The system excels in answer relevancy, context recall, and answer similarity, indicating a good understanding of queries and the ability to retrieve comprehensive and relevant data.

Areas for Improvement: There is room for improvement in filtering context to enhance metrics like answer correctness. Additionally, refining how the retriever interprets and uses context could lead to overall improvement.

References:

Ragas Source:

https://github.com/explodinggradients/ragas/tree/main/src/ragas/metrics

Ragas paper: https://arxiv.org/abs/2309.15217
