Beyond recall: Evaluating Gemini with Vertex AI Auto SxS

Ivan Nardini
Google Cloud - Community
9 min read · Feb 9, 2024
Evaluating Gemini with Vertex AI Auto SxS

Evaluating large language models (LLMs) remains a challenging task. One reason is the lack of comprehensive metrics that capture the extensive capabilities of LLMs. Furthermore, there is a need for metrics that quantify the level of agreement between human preferences and LLM performance.

Suppose you are building a summarization feature for your web-based newspaper using LLMs. To identify the most effective LLM, you can use task-specific metrics from the ROUGE family. For instance, ROUGE-L uses the longest common subsequence (LCS) of words to compute a score between 0 and 1, with higher scores indicating higher similarity between the automatically produced summary and the reference. But because ROUGE-L relies on word overlap, it doesn’t capture semantic similarity or fluency, and it is prone to bias towards longer summaries.
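To make the LCS idea concrete, here is a minimal sketch of a ROUGE-L style score. It is a simplification and not the official ROUGE implementation: it assumes lowercased whitespace tokenization and reports a plain F1 of LCS-based precision and recall. The function names and the example sentences are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the river flooded the town"
paraphrase = "water from the stream covered the village"

print(rouge_l_f1(reference, reference))             # identical text
print(round(rouge_l_f1(paraphrase, reference), 3))  # similar meaning, little overlap
```

The paraphrase describes the same event but shares only the word "the" (twice) with the reference, so its score is low. This is the kind of gap between word overlap and meaning that motivates a different evaluation approach.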

So, for each newspaper article, you might want to have human reviewers assign a score to specific aspects, including overall fluency, coherence, coverage, factual accuracy, and more. However, this approach can be costly and time-consuming. To illustrate, consider the following summaries from the XSum dataset:

Figure 2. Document samples

Start a timer and stop it after you’ve reviewed each summary and chosen your preferred summary. Imagine having to do this multiple times for each summary pair you review. How long might it take?

To eliminate this time-consuming, repetitive review effort, you need an automated method to scale the evaluation of LLMs and verify their alignment with human preferences. One possible approach is to use an “LLM-as-a-judge” pattern to assist humans in evaluating LLMs.

As part of the Model Evaluation for Generative AI (GenAI), Vertex AI introduces AutoSxS, an on-demand evaluation tool that compares two LLMs side by side.

This blog post provides a comprehensive overview of AutoSxS, followed by a step-by-step guide on getting started with AutoSxS. By the end of this reading, you will have a better understanding of how to use AutoSxS to evaluate LLMs in your framework for operationalized GenAI applications.

Introduction to Vertex AI AutoSxS

According to the official documentation, Automatic side-by-side (AutoSxS) is a model-assisted evaluation tool that compares two large language models (LLMs) side by side. AutoSxS utilizes the autorater to determine the better response to a prompt.

Figure 3. AutoSxS in a picture

The autorater model is a large language model (LLM) developed by Google to assess the quality of model responses based on a given inference prompt.

AutoSxS allows users to evaluate the performance of any GenAI model. It supports summarization and QA tasks with a 4,096 input token limit. AutoSxS also provides tooling to scrutinize evaluation results, including explanations and confidence scores for each decision.

AutoSxS evaluates responses based on predefined criteria. The following summarization criteria are designed to make language evaluation more objective and to improve response quality:

Figure 4. Summarization criteria

At this point, you should have a better comprehension of AutoSxS. If you want to know more, check out the detailed documentation. Now, let’s dive into how you can get started with AutoSxS.

Get started with Vertex AI AutoSxS

Imagine you’re in the process of comparing Gemini-Pro with another LLM to determine the best choice for the summarization feature of your web-based newspaper. As you can see in the diagram, utilizing AutoSxS for model evaluation involves a three-step process.

Figure 5. Get started with Vertex AI AutoSxS

First, you must prepare the evaluation dataset, which means collecting the prompts, contexts, and generated responses required for running the evaluation. You must also convert the evaluation dataset to either JSONL format or a tabular format. Here’s an example of the JSONL format for the summarization use case using the XSum dataset.

{"id": "1", "document": "The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed. Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water. Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct. Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town..."}
{"id": "2", "document": "A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel. As they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames..."}
{"id": "3", "document": "Ferrari appeared in a position to challenge until the final laps, when the Mercedes stretched their legs to go half a second clear of the red cars. Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen. ..."}

Each row in the evaluation dataset represents a single example and contains several fields. The id and document fields identify each unique example and hold the newspaper article to summarize. There might also be data fields that serve as prompts and contexts for filling out AutoSxS prompt templates; in this case, these fields are not included. Instead, the dataset contains pre-generated predictions, that is, the responses each LLM generated for the task. These are the response_a and response_b fields, which hold the two competing article summaries. Finally, there might be a human preference field, which is used to compare the autorater’s preferences with validated human preference data.
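As an illustration of the row layout described above, the following sketch assembles examples with id, document, response_a, and response_b fields and serializes them as JSONL. The helper names build_example and to_jsonl are hypothetical, and the article and summary strings are placeholders that you would replace with your own data.

```python
import json


def build_example(example_id, document, response_a, response_b):
    """Assemble one evaluation row with the fields described in this post."""
    return {
        "id": str(example_id),
        "document": document,
        "response_a": response_a,
        "response_b": response_b,
    }


def to_jsonl(examples):
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in examples)


# Placeholder content; substitute real articles and model summaries.
examples = [
    build_example(1, "<article text 1>", "<model A summary 1>", "<model B summary 1>"),
    build_example(2, "<article text 2>", "<model A summary 2>", "<model B summary 2>"),
]

jsonl_text = to_jsonl(examples)
print(jsonl_text, end="")
```

The resulting text can be written to a .jsonl file and uploaded to the location you later pass to the evaluation pipeline.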

For optimal evaluation, the documentation advises utilizing examples that mirror real-world scenarios in which you deploy your LLM-based application. Additionally, you must have a minimum number of evaluation examples in your dataset. It is recommended to use at least 400 examples to obtain reliable aggregate metrics.

Once your evaluation dataset is created, you have the option to upload the dataset to a Google Cloud bucket, or save it as a BigQuery table. After uploading or saving your dataset, you can define the parameters for your model evaluation pipeline.

AutoSxS uses Vertex AI Pipelines and a predefined pipeline template, which includes all essential steps for running a model evaluation job. In the following example code, you can see some of the supported pipeline parameters:

parameters = {
    'evaluation_dataset': 'gs://path/to/evaluation_dataset.jsonl',
    'id_columns': ['id', 'document'],
    'task': 'summarization',
    'autorater_prompt_parameters': {
        'inference_context': {'column': 'document'},
        'inference_instruction': {'template': "Summarize the following text: {{ document }}."},
    },
    'response_column_a': 'response_a',
    'response_column_b': 'response_b',
}

The evaluation_dataset parameter specifies the location of the evaluation dataset, which in this case is the Cloud Storage URI of a JSONL file. The id_columns parameter lists the fields (here, id and document) that uniquely identify each evaluation example. The task parameter defines the type of task to evaluate, such as summarization or question_answer, in the format {task}@{version}. In this example, you have a summarization task with the version omitted. AutoSxS also lets you configure the autorater’s behavior, including the inference instruction that guides task completion and the inference context used for reference during task execution. Finally, you have to provide the names of the columns containing the pre-generated predictions so that the evaluation metrics can be calculated.
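Before submitting a job, it can help to confirm that the columns named in the parameters actually exist in the dataset rows. The following is a hypothetical local pre-flight check, not part of AutoSxS or the Vertex AI SDK. It assumes the parameter keys shown above and that the rows have already been loaded as Python dictionaries.

```python
def find_column_problems(parameters, rows):
    """Return human-readable problems; an empty list means none were found."""
    required = list(parameters["id_columns"])
    required.append(parameters["response_column_a"])
    required.append(parameters["response_column_b"])
    # Columns referenced by the autorater prompt, e.g. {'column': 'document'}.
    for spec in parameters.get("autorater_prompt_parameters", {}).values():
        if "column" in spec:
            required.append(spec["column"])
    required = list(dict.fromkeys(required))  # de-duplicate, keep order

    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        for column in required:
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
        # The id columns together are expected to identify a row uniquely.
        row_id = tuple(str(row.get(c)) for c in parameters["id_columns"])
        if row_id in seen_ids:
            problems.append(f"row {index}: duplicate id columns {row_id}")
        seen_ids.add(row_id)
    return problems


parameters = {
    "id_columns": ["id", "document"],
    "autorater_prompt_parameters": {
        "inference_context": {"column": "document"},
        "inference_instruction": {"template": "Summarize the following text: {{ document }}."},
    },
    "response_column_a": "response_a",
    "response_column_b": "response_b",
}

# Placeholder rows; the second one deliberately lacks response_b.
rows = [
    {"id": "1", "document": "<article text 1>", "response_a": "<a1>", "response_b": "<b1>"},
    {"id": "2", "document": "<article text 2>", "response_a": "<a2>"},
]

for problem in find_column_problems(parameters, rows):
    print(problem)
```

Catching a missing or misnamed column locally is cheaper than discovering it after a pipeline run has started.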

Once you define your parameters, you are ready to run an AutoSxS evaluation job. Submitting an AutoSxS model evaluation pipeline job on Vertex AI is a straightforward process. Create a pipeline job by passing the defined parameters and the pipeline template. Then, submit a pipeline run using the Vertex AI Python SDK, as shown in the following example:

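from google.cloud import aiplatform

# Initialize the Vertex AI SDK. project_id, the location, and the bucket
# paths are placeholders to replace with your own values.
aiplatform.init(project=project_id, location='location', staging_bucket='gs://to/path/stage_bucket')

# Define the evaluation pipeline job from the AutoSxS template.
job = aiplatform.PipelineJob(
    display_name='autosxs-evaluation-pipeline-job',
    pipeline_root='gs://path/to/pipeline_bucket',
    template_path=(
        'https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/2.8.0'),
    parameter_values=parameters,
)

# Run the job. run() blocks until the pipeline finishes and returns None,
# so keep the PipelineJob object itself to inspect the results afterwards.
job.run()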

After the evaluation pipeline successfully runs, you can review the evaluation results in the Vertex AI Pipelines UI by looking at artifacts generated by the pipeline itself.

Figure 6. Vertex AI AutoSxS — Without Human Preference

You can also retrieve those results programmatically using the Vertex AI Python SDK.

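import pandas as pd
from google.protobuf.json_format import MessageToDict

# Find the pipeline task that produced the autorater judgments.
for details in job.task_details:
    if details.task_name == 'autosxs-arbiter':
        break

# Read the judgments table from the artifact URI.
# Reading a gs:// URI with pandas requires the gcsfs package.
judgments_uri = MessageToDict(details.outputs['judgments']._pb)['artifacts'][0]['uri']
judgments_df = pd.read_json(judgments_uri, lines=True)
# print_autosxs_judgments is a display helper that is not shown in this post.
print_autosxs_judgments(judgments_df, n=3)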

Below you have an example of AutoSxS Judgments output.

Figure 7. Vertex AI AutoSxS results sample

You can see that the summary from model B (the Gemini-Pro summary) was preferred over the summary from model A due to its better coverage and coherence.

AutoSxS also provides aggregated metrics as an additional evaluation result. These win-rate metrics are calculated from the judgments table as the percentage of times the autorater preferred a specific model’s response. They are useful for quickly identifying the best model in the context of the evaluated task. Again, you can retrieve the win-rate metrics using the Vertex AI Python SDK.

from google.protobuf.json_format import MessageToDict

for details in job.task_details:
    if details.task_name == 'autosxs-metrics-computer':
        break

win_rate_metrics = MessageToDict(details.outputs['autosxs_metrics']._pb)['artifacts'][0]['metadata']
print_aggregated_metrics(win_rate_metrics)

# AutoSxS Autorater prefers 88% of time Model B over Model A

In this case, the AutoSxS autorater preferred the responses from Model B (Gemini-Pro) over Model A (OSS model) 88% of the time.
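The win rate itself is simple arithmetic. As a rough sketch, assuming each judgment reduces to a label "A" or "B" for the preferred response (the actual column names in the judgments table may differ), the computation looks like this. The list of choices below is made up so that it reproduces an 88% win rate. It is not the real evaluation output.

```python
from collections import Counter


def win_rates(choices):
    """Fraction of judgments in which each model's response was preferred."""
    if not choices:
        raise ValueError("at least one judgment is required")
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in ("A", "B")}


# Illustrative data only: 22 of 25 judgments prefer model B.
choices = ["B"] * 22 + ["A"] * 3

rates = win_rates(choices)
print(f"Model A win rate: {rates['A']:.0%}")
print(f"Model B win rate: {rates['B']:.0%}")
```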

At this point you might wonder: how can you validate these AutoSxS judgments against human preferences? AutoSxS does support human preferences. To use this feature, you must provide some additional information and parameters in the AutoSxS pipeline.

With respect to the evaluation dataset, you must add a column, which contains the human preference with respect to the LLM-generated responses. Here’s an example.

{ # same data as shown above, "actual": "A"}
{ # same data as shown above, "actual": "B"}
{ # same data as shown above, "actual": "A"}
...

With respect to the AutoSxS pipeline, you must specify the human preference column in its parameters. The rest of the evaluation process remains the same. The Vertex AI Python SDK lets you run the evaluation pipeline job as shown in the following example:

from google.cloud import aiplatform

# set parameters
parameters = {
    # same parameters as above,
    'human_preference_column': 'actual',
}

# same data as shown above
...

# run evaluation pipeline job
job = aiplatform.PipelineJob(
    ...
    parameter_values=parameters,
)
job.run()

Notice that the pipeline remains the same except the aggregated metrics are different. The pipeline now returns additional measurements utilizing human-preference data that you provide.

Figure 8. Vertex AI AutoSxS — Human Preference

You can retrieve additional metrics as shown in this example:

import pprint
from google.protobuf.json_format import MessageToDict

for details in job.task_details:
    if details.task_name == 'autosxs-metrics-computer':
        break

human_aligned_metrics = {k: round(v, 3) for k, v in MessageToDict(
    details.outputs['autosxs_metrics']._pb
)['artifacts'][0]['metadata'].items()}
pprint.pprint(human_aligned_metrics)

In addition to well-known metrics like accuracy, precision, and recall, you receive both the human preference scores and the autorater preference scores, which let you compare the two sets of evaluations. To simplify this comparison, Cohen’s Kappa is also provided. Cohen’s Kappa measures the level of agreement between the autorater and the human raters while correcting for agreement that would occur by chance. It ranges from -1 to 1, where 0 represents agreement equivalent to random choice, 1 indicates perfect agreement, and negative values indicate less agreement than expected by chance. In this case, it appears that the autorater and the human preferences diverge. This can happen, and the discrepancy requires further investigation!
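For intuition, Cohen’s Kappa can be computed by hand as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance given how often each rater uses each label. The sketch below is a plain-Python illustration with made-up preference labels. It is not how AutoSxS computes the metric internally, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_1) != len(rater_2):
        raise ValueError("both raters must label the same examples")
    n = len(rater_1)
    if n == 0:
        raise ValueError("at least one example is required")

    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labels independently with their own label frequencies.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = set(counts_1) | set(counts_2)
    expected = sum(counts_1[label] * counts_2[label] for label in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight examples.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
autorater = ["A", "B", "B", "B", "A", "A", "A", "B"]

print(cohens_kappa(human, autorater))
```

Here the raters agree on 6 of 8 examples (p_o = 0.75) and each uses both labels half of the time (p_e = 0.5), so the kappa is 0.5. Values below 0 are possible when raters agree less often than chance would predict.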

Conclusions

How do you scale LLM evaluation and verify alignment with human preferences at the same time?

This article shows how you can leverage Vertex AI Model Evaluation AutoSxS to assess the performance of Large Language Models (LLMs) with human preferences in a summarization task.

AutoSxS’s capabilities extend beyond evaluating LLM alignment with human preferences. It can also be used to compare customized first-party (1P) models with foundational 1P models, which offers valuable insights into their relative performance on summarization and question-answering tasks.

In conclusion, AutoSxS offers a flexible tool for evaluating LLMs. With AutoSxS’s capabilities, you can refine your GenAI applications and achieve improved performance.

What’s Next

Do you want to know more about Vertex AI Model Evaluation AutoSxS and how to use it? Check out the following resources!

Documentation

GitHub samples

YouTube video

Thanks for reading

I hope you enjoyed the article. If so, please clap or leave your comments. Also let’s connect on LinkedIn or X to share feedback and questions 🤗

Special thanks to Irina Sigler, Terrie Pugh and the entire Vertex AI Model Evaluation team for their collaboration, support and contribution.

Ivan Nardini
Google Cloud - Community

Developer relations engineer at @GoogleCloud who is passionate about Machine Learning Engineering. The Lead of MLOps.community’s Engineering Lab.