Improving RAG performance: A look at model evaluation tools in the cloud

Ng Shangru
AI Practice GovTech
15 min read · May 24, 2024
Image generated by Firefly 3

From the GovTech AI CapDev Team:

Introduction

This post is part of a larger series on improving RAG performance. You can read the main playbook here. Links to the other articles in the series will be updated when they are published.

Previously, we touched on how to deploy LLMs as well as how to use them for Retrieval Augmented Generation (RAG). Next, we’ll look at some of the ways to determine whether one implementation is better than another, as well as how to go about obtaining the results.

What is RAG Evaluation?

We could look at the evaluation of RAG systems from 3 different perspectives:

  1. System View: This perspective entails evaluating the overall performance and effectiveness of the RAG system as a whole. It involves examining how well the system integrates retrieval and generation components to produce coherent and relevant outputs.
  2. Component View: In this view, the focus shifts to evaluating the individual components that constitute the RAG system. This includes assessing the quality of the retrieval mechanism in retrieving relevant information and the proficiency of the generation model in producing coherent and fluent text.
  3. Safety: Safety evaluation addresses the ethical considerations and potential risks associated with deploying RAG systems. It involves examining factors such as bias, fairness, and the potential for harmful or misleading outputs, ensuring that the system operates responsibly and ethically.

System View

From a system view, we are interested in whether the RAG system can perform the desired target tasks satisfactorily. In the context of question answering, we want to know:

  1. Whether the system is able to provide answers that can be found in the knowledge base
  2. Whether the system is able to handle different kinds of user questions and answer them correctly
  3. Whether the system is able to identify that it does not have the answer, and not provide hallucinated answers
  4. Whether the system is able to identify that a question is ambiguous and hence cannot be answered

In general, evaluating an RAG application from this view requires a ground truth dataset consisting of an exhaustive list of questions and ground truth answers, which we match against the system’s generated answers. The performance of the system across the dataset gives us an indication of how comprehensively it covers the factual information in the knowledge base, as well as its ability to handle different types of questions. Some of the query types that are typically used are as follows (an illustrative dataset sketch follows the list):

  • Simple factual query

These types of queries involve the straightforward retrieval of simple facts from the knowledge base. For example: How long do we have to keep a backup of a database? The facts in such queries are usually directly found within the knowledge base. This mirrors common search engine use cases to retrieve factual information.

  • Temporal query

These types of queries require an analysis of the temporal aspect of the retrieved information. For example: Did Apple introduce AirTags before or after the launch of the 5th-generation iPad Pro?

  • Comparison query

These types of queries require a comparison of evidence from the retrieved information. For example: Did Google or Netflix report a higher revenue for the year 2023?

  • Inference query

These queries require that the answer be deduced through reasoning from the retrieved information. For example, given a knowledge base containing sales figures from various tech companies, an inference query might be: Which companies will be impacted the most by new US export restrictions on AI chips?
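
To make this concrete, here is a minimal, illustrative sketch of what such a ground truth dataset could look like, with each entry tagged by its query type. The field names and answers below are hypothetical and not taken from any specific framework:

# Illustrative only: field names and answers are hypothetical.
ground_truth_dataset = [
    {
        "query_type": "simple_factual",
        "question": "How long do we have to keep a backup of a database?",
        "ground_truth": "Backups must be retained for at least 90 days.",
    },
    {
        "query_type": "comparison",
        "question": "Did Google or Netflix report a higher revenue for the year 2023?",
        "ground_truth": "Google reported the higher revenue in 2023.",
    },
]

Scoring the system’s generated answers against each entry, and breaking the results down by query type, shows which categories of questions the system handles well and which it struggles with.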

Component View

While the system view aims to measure the performance of the RAG system according to its intended use cases, the component view tries to ensure the robustness of the system by verifying that the individual subsystems work well under the various expected scenarios. This provides a second level of assurance that the RAG system will continue to work well for cases that are not tested at the system level.

Further details on the component level evaluation can be found in Watson’s article (to be updated).

Safety

Besides the RAG system’s ability to provide the answers that users are interested in, another key aspect of the system is its robustness to adversarial attacks, as well as whether it is able to provide responses that are considered safe under a given set of guidelines.

Further details on the safety evaluation can be found in our playbook here.

Another excellent resource would be the blog post from our teammate Jessica on Fairness in LLMs here.

Tools for Evaluating Models on the cloud (GCC)

Given the wide range of evaluation metrics available, the first challenge a developer faces in evaluating an RAG pipeline is selecting the most appropriate metric to use. Once that has been decided, the next challenge is implementing the metrics. Most of these metrics are provided in open source frameworks like Hugging Face, Langchain, LlamaIndex, Ragas, and UniEval, and usually do not need to be coded from scratch. However, using the metrics might still require some time to set up, especially if they rely on LLMs to determine the relevancy or correctness of a search result or an answer (see the Ragas sketch after the list below). Fortunately, few-click, no-code solutions are available in GCC to reduce the effort in model evaluation. We look at three of these solutions and summarise their features based on the following:

  1. User Interface
  2. Type of metrics used for auto evaluation
  3. Support for manual evaluation
  4. Models used for evaluation
  5. Support for comparison of multiple models/pipelines
  6. Models available for use in evaluation
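
As an illustration of the setup effort involved with an open source framework, here is a minimal sketch of computing two LLM-based metrics with Ragas. It assumes the Ragas 0.1.x API and an OpenAI API key in the environment for the judge model; the data values are placeholders:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Placeholder single-row dataset; Ragas expects these column names.
eval_data = Dataset.from_dict({
    "question": ["How long do we have to keep a backup of a database?"],
    "answer": ["Backups must be retained for at least 90 days."],
    "contexts": [["Database backups are retained for a minimum of 90 days."]],
    "ground_truth": ["Backups must be retained for at least 90 days."],
})

# Each metric prompts an LLM behind the scenes, which is where most of the
# setup effort (API keys, rate limits, cost) comes from.
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)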

Google Vertex AI

Vertex AI provides two different tools for evaluating models:

  • Model Eval evaluates a model on metrics like AUC and LogLoss and provides results for those metrics
  • AutoSxS arbiter compares the responses from two models and determines which one gives the better answer

The following sections provide examples of how you can trigger the evaluation task. Note that the prerequisite for the scripts is to have the gcloud CLI set up and logged in, either via Application Default Credentials (ADC) or a credentials JSON.
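
For reference, a minimal sketch of the two authentication options (the service account file path is a placeholder):

from google.auth import default
from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/cloud-platform"]

# Option 1: Application Default Credentials
# (after running `gcloud auth application-default login`)
credentials, project_id = default(scopes=SCOPES)

# Option 2: a service account credentials JSON
credentials = service_account.Credentials.from_service_account_file(
    "path/to/service-account.json", scopes=SCOPES
)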

Model Eval

To run a model eval task, you can use the following script:

from google.auth import default
import vertexai
from vertexai.preview.language_models import (
    EvaluationTextClassificationSpec,
    TextGenerationModel,
)

# Set credentials for the pipeline components used in the evaluation task
credentials, _ = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])

PROJECT_ID = "<PROJECT>"
LOCATION = "us-central1"


def evaluate_model(
    project_id: str,
    location: str,
) -> object:
    """Evaluate the performance of a generative AI model."""
    vertexai.init(project=project_id, location=location, credentials=credentials)

    # Create a reference to a generative AI model
    model = TextGenerationModel.from_pretrained("text-bison@001")

    # Define the evaluation specification for a text classification task
    task_spec = EvaluationTextClassificationSpec(
        ground_truth_data=[
            "gs://cloud-samples-data/ai-platform/generative_ai/llm_classification_bp_input_prompts_with_ground_truth.jsonl"
        ],
        class_names=["nature", "news", "sports", "health", "startups"],
        target_column_name="ground_truth",
    )

    # Evaluate the model
    eval_metrics = model.evaluate(task_spec=task_spec)
    print(eval_metrics)

    return eval_metrics


evaluate_model(project_id=PROJECT_ID, location=LOCATION)

You can expect to get a result like this — note the last line where the metric values are shown.

AutoSxS

To run a model comparison task using the AutoSxS arbiter, you can use the following script:

import os

from google.cloud import aiplatform

parameters = {
    'evaluation_dataset': 'gs://capdev_model_eval/autosxs_dataset.jsonl',
    'id_columns': ['id'],
    # Evaluation task in the form {task}@{version}. task can be one of
    # [summarization, question_answer]. version is an integer with three
    # digits or "latest", e.g. summarization@001 or question_answer@latest.
    'task': 'question_answer@001',
    'autorater_prompt_parameters': {
        'inference_instruction': {'column': 'question'},
        'inference_context': {'column': 'context'},
    },
    'response_column_a': 'model_a_response',
    'response_column_b': 'model_b_response',
}

aiplatform.init(
    project='<PROJECT>',
    location='us-central1',
    staging_bucket='gs://capdev_model_eval',
)
aiplatform.PipelineJob(
    display_name='autosxs_test_PIPELINEJOB_DISPLAYNAME',
    pipeline_root=os.path.join('gs://capdev_model_eval', 'autosxs_test_PIPELINEJOB_DISPLAYNAME'),
    template_path='https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/autosxs-template/default',
    parameter_values=parameters,
).run()

Note that AutoSxS does not currently support loading custom models (even if the model is hosted in Google’s Vertex AI Model Garden). The only supported model is text-bison, so it is easier to just bring your own pre-generated model responses.

For the two supported task types, you can use these templates as a base for the input dataset. Note that the file format needs to be in JSONL instead of JSON, and the file has to be uploaded to GCS.
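
If your data is in another format, here is a minimal sketch of writing it out as JSONL and uploading it to GCS, reusing the bucket from the script above; the record values are placeholders:

import json

from google.cloud import storage

records = [
    {"id": "1", "question": "What causes earthquakes?", "context": "...",
     "model_a_response": "...", "model_b_response": "..."},
]

# JSONL: one JSON object per line, not a single JSON array.
with open("autosxs_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

storage.Client().bucket("capdev_model_eval").blob(
    "autosxs_dataset.jsonl"
).upload_from_filename("autosxs_dataset.jsonl")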

Sample dataset for Summarisation task:

{"id": "1", "document": "The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water. Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct. Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town...","response_a":"Damage assessment ongoing in Newton Stewart and surrounding areas. Repair work underway in Hawick and Peeblesshire. Trains disrupted on west coast mainline due to Lamington Viaduct damage. Flooding affects businesses and householders in Newton Stewart.","response_b":"Newton Stewart and surrounding areas assessing damage. Hawick and Peeblesshire roads affected by standing water. Disruption on west coast mainline due to Lamington Viaduct damage. Flooding impacts businesses and householders in Newton Stewart."}
{"id": "2", "document": "A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel. As they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames...","response_a":"Fire alarm at Holiday Inn in Hope Street prompts evacuation. Two buses in car park caught fire.","response_b":"Holiday Inn in Hope Street evacuated due to fire alarm. Two parked buses engulfed in flames."}
{"id": "3", "document": "Ferrari appeared in a position to challenge until the final laps, when the Mercedes stretched their legs to go half a second clear of the red cars. Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen. ...","response_a":"Ferrari initially challenges but Mercedes pulls ahead in final laps. Sebastian Vettel to start third.","response_b":"Ferrari challenges but Mercedes leads in final laps. Sebastian Vettel starts third."}

Sample dataset for Q&A task:

{"id": 1, "question": "What causes earthquakes?", "context": "Earthquakes are caused by the sudden release of energy along fault lines in the Earth's crust.", "model_a_response": "Earthquakes occur due to the movement of tectonic plates.", "model_b_response": "Earthquakes are caused by the sudden release of energy along fault lines in the Earth's crust."}
{"id": 2, "question": "What is the capital of France?", "context": "France is a country located in Western Europe.", "model_a_response": "The capital of France is Paris.", "model_b_response": "Paris is the capital city of France."}
{"id": 3, "question": "Who wrote the novel 'Pride and Prejudice'?", "context": "'Pride and Prejudice' is a classic novel published in 1813.", "model_a_response": "'Pride and Prejudice' was written by Jane Austen.", "model_b_response": "The author of 'Pride and Prejudice' is Jane Austen."}

The arbiter will respond with a choice for each entry and an explanation of why it was chosen. Here are some sample results:

Summarisation:

Q&A:

AWS Bedrock

At the time of writing, this is not yet available in the SG region; testing was done in the N. Virginia region.

AWS Bedrock provides a simple tool to perform model evaluation based on metrics like accuracy and toxicity. It also leverages AWS SageMaker to add human-in-the-loop workflows for manual evaluation, with evaluators provided either by the user or by AWS via Amazon Mechanical Turk.
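
The walkthrough below uses the web console, but the same automatic evaluation job can also be created programmatically. Here is a rough sketch using the boto3 create_evaluation_job API; the role ARN, S3 paths, and model identifier are placeholders, and the exact configuration fields may need adjusting for your account:

import boto3

# N. Virginia, since the feature is not yet available in the SG region.
bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_evaluation_job(
    jobName="rag-qa-auto-eval",
    roleArn="arn:aws:iam::123456789012:role/bedrock-eval-role",  # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "qa_dataset",
                        "datasetLocation": {"s3Uri": "s3://my-eval-bucket/qa_dataset.jsonl"},
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Toxicity"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)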

Setting up an evaluation task

Step 1: Select an evaluation task

Step 2: Configure the evaluation parameters (automatic evaluation in this example)

Note the available evaluation task types.

Step 3: Create and wait for evaluation to complete

Step 4: View results

The web console provides a summary view of the evaluation results for metrics like accuracy and toxicity.

A more detailed per-entry evaluation result can be found in the output file.

For human-based evaluation, your human evaluators will receive an annotation task to manually provide evaluation results, instead of them being system-generated and tabulated, like so:

Azure AI Studio

Note that Azure AI Studio is different from Azure OpenAI Studio. As of writing, this tool is still in Public Preview.

AI Studio provides an evaluation service to measure how well your model performs, primarily on Q&A use cases.

Setting up an evaluation task

Step 1: Select the type of evaluation task

Step 2: Select the metrics to measure

Step 3: Upload dataset and map columns

Step 4: Run the evaluation

Step 5: View Results

For evaluation on Q&A without context, the tool provides an assessment of how relevant and grounded the model’s answers are against the source data.

For evaluation on Q&A with context, the tool also provides a summary of the metric scores obtained.

There is also the option of setting up a manual evaluation task, where users manually evaluate each entry and provide a thumbs up or thumbs down.

Dataset Requirements

The dataset should contain, at a minimum, the question, answer, and ground truth columns. A context column may also be required, depending on the evaluation task selected.

Sample Dataset

question,ground_truth,context,answer
What are the efforts made by the Singapore government to raise awareness of the various support schemes available for seniors?,"The Singapore government undertakes a wide range of communications efforts to raise awareness of the various support schemes available for seniors. These include printing information in newspapers like The Straits Times, Lianhe Zaobao, radio capsules, and summary graphics in digital display panels across the Housing and Development Board (HDB) estates and condominium developments. The government also conducts face-to-face interactions between Silver Generation Ambassadors and the seniors to explain the schemes and what they are eligible for.","{'7': 'So, the Government undertakes a wide range of communications efforts to raise awareness of the various support schemes available, including what I mentioned just now, the cost-of-living support. For example, the most recent cost-of-living support has been printed in newspapers, like The Straits Times, Lianhe Zaobao and on radio capsules, and we have also put them in summary graphics in digital display panels across the Housing and Development Board (HDB) estates and condominium developments.', '11': 'Part of it, we will go through the Silver Generation Office. The face-to-face interactions between our Silver Generation Ambassadors and the seniors would be helpful, to explain to them some of these schemes and what they are eligible for.'}"," The Singapore government makes extensive efforts to raise awareness of the various support schemes available for seniors through multiple channels. Some of these efforts include printing information about the schemes in newspapers such as The Straits Times and Lianhe Zaobao, broadcasting it on radio capsules, and displaying summaries in digital panels across Housing and Development Board (HDB) estates and condominium developments. Additionally, the government goes through the Silver Generation Office to facilitate face-to-face interactions between Silver Generation Ambassadors and seniors to explain the schemes and determine their eligibility."
What are the government's efforts to raise awareness and improve communication about the various support schemes available for seniors in Singapore?,"The government undertakes a wide range of communications efforts to raise awareness of the various support schemes available, including cost-of-living support. These efforts include printing information in newspapers, broadcasting on radio capsules, and displaying summary graphics in digital display panels across Housing and Development Board estates and condominium developments. The Silver Generation Office also plays a role in engaging and extending support to seniors as needed.","{'7': 'So, the Government undertakes a wide range of communications efforts to raise awareness of the various support schemes available, including what I mentioned just now, the cost-of-living support. For example, the most recent cost-of-living support has been printed in newspapers, like The Straits Times, Lianhe Zaobao and on radio capsules, and we have also put them in summary graphics in digital display panels across the Housing and Development Board (HDB) estates and condominium developments.', '10': 'Beyond cash support and subsidies, the Government will continue to work together with our community partners, such as the Silver Generation Office, to jointly engage and extend support to our seniors as needed.'}"," The Singaporean Government engages in a comprehensive range of communication efforts to inform seniors about the support schemes available to them. These efforts include printing information about cost-of-living support in newspapers such as The Straits Times and Lianhe Zaobao, producing radio capsules, and displaying summaries in digital panels across Housing and Development Board (HDB) estates and condominium developments. Additionally, the Government collaborates with community partners like the Silver Generation Office to extend support and engage with seniors as needed."
What are the government's plans to address the challenges of an ageing society in Singapore?,"The government has embarked on a Healthier SG strategy to encourage Singaporeans to lead healthier lifestyles, which will help delay the onset of frailty for all seniors. They are also expanding the network of Active Ageing Centres and transforming their operating model to support healthy ageing for seniors. Furthermore, the Ministry of Health plans to expand aged care capacity and enhance home-based care to enable seniors to stay at home for as long as possible.","{'2': 'Encouraging seniors to adopt active lifestyles and practice good health habits is the most sustainable and effective way to address the challenge of an ageing population with rapidly rising number of people with chronic and other illnesses. To do so, we need to strengthen our ageing infrastructure in the community. Many of our initiatives are laid out in a recent 2023 Action Plan for Successful Ageing.', '3': 'The Government recognises the need to address the challenges of an ageing society, which is associated with a growing care burden and potentially more elderly caregivers. This will require a systematic approach. First, the Government has embarked on a Healthier SG strategy to encourage Singaporeans to lead healthier lifestyles. This will help delay the onset of frailty for all seniors. Second, the Government is undertaking further steps to empower seniors to live and age well in the community.', '4': 'MOH will expand aged care capacity and enhance home-based care to enable seniors to stay at home for as long as possible. MOH also plans to collaborate with stakeholders, such as Agency for Integrated Care and community care providers, to provide social and community support, to enable seniors to age in communities.'}"," The Singaporean Government is addressing the challenges of an ageing society through a multi-faceted approach. Here are some of their key plans as outlined in the 2023 Action Plan for Successful Ageing and other initiatives:

1. Encouraging active lifestyles and good health habits among seniors: The Government recognizes the importance of preventing chronic illnesses and delaying the onset of frailty in seniors. To achieve this, they have launched the Healthier SG strategy to encourage healthier lifestyles for all Singaporeans.
2. Expanding aged care capacity and enhancing home-based care: The Ministry of Health (MOH) aims to expand aged care capacity and enhance home-based care to enable seniors to age in place and stay at home for as long as possible.
3. Collaborating with stakeholders: MOH plans to work with organizations such as the Agency for Integrated Care and community care providers to provide social and community support for seniors, enabling them to age in their communities.

These initiatives aim to strengthen the ageing infrastructure in Singapore and support seniors in living healthy and active lives while ageing in place in their communities."

Conclusion

The tools available on the cloud have made it easier to obtain a basic understanding of how models compare against each other on a common set of metrics, but they are limited in providing a comprehensive evaluation across the entire RAG pipeline (i.e. the component view is missing). Notably, traditional metrics such as ROUGE and BLEU are also absent, so supplementary tests may still be required.
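
If you do need those traditional metrics, they are easy to compute locally as a supplementary check, for example with the Hugging Face evaluate library. A minimal sketch, where the predictions and references are placeholders:

import evaluate  # pip install evaluate rouge_score sacrebleu

predictions = ["Backups must be retained for at least 90 days."]
references = ["Database backups are kept for a minimum of 90 days."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")

print(rouge.compute(predictions=predictions, references=references))
# sacrebleu expects a list of reference lists per prediction
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))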

There is currently no standardised way of evaluating across the CSPs, so the choice of tooling should depend on the use case at hand. For example, if you want to test your model against GPT-4, only Azure is available. Likewise, only AWS currently integrates with SageMaker to provide a human-in-the-loop evaluation mechanism.

Each of these tools also provides different metrics when used for automatic evaluation, so this should also be considered when selecting a tool.

Thanks for reading!

Appendix

The table below provides a summary view of the tested options:
