VAQ #2— How to evaluate LLMs with custom criteria using Vertex AI AutoSxS

Ivan Nardini
Google Cloud - Community

co-authored with Irina Sigler

Figure #1 — The Vertex AI Q&A (VAQ) series logo

One of the most common methods for comparing large language models (LLMs) is the LLM-as-Judge approach. Explored in the paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” this method uses another LLM to assess and score the quality of generated responses.

In the GenAI Model Evaluation framework on Vertex AI, Google Cloud introduced a managed and scalable LLM-as-Judge service that employs an autorater model to compare two LLMs on specific tasks such as Q&A and Summarization. To learn more about AutoSxS, check out the “Beyond recall: Evaluating Gemini with Vertex AI Auto SxS” blog post.

Figure 2 — Vertex AI GenAI Evaluation framework

In the past few months, one of the most popular VAQs (Vertex AI Asked Questions) about AutoSxS has been the following:

Can Vertex AI AutoSxS be used to evaluate custom tasks?

Indeed, the answer is yes! This article provides a practical demonstration of how to use Vertex AI AutoSxS with a Translation task, showcasing how to evaluate and compare the performance of LLMs on custom tasks using specific evaluation criteria. By the end of this article, you will gain a clear understanding of how to leverage Vertex AI AutoSxS to implement custom evaluation routines with your GenAI applications.

Evaluating Gemini on Vertex AI on a translation task

Let’s imagine you’ve developed a GenAI application at a news media company. It’s designed to streamline internal communications by translating emails. To keep things simple, let’s focus on a single translation scenario: English to Italian. Currently, the application relies on the Gemini 1.5 Pro model on Vertex AI. However, the new Gemini 1.5 Flash model promises improved efficiency, so you want to run an A/B test to see how its translation performance stacks up against the existing model. To do this, you’ve gathered a representative sample of translations, similar to the JSONL example below.

{"business_unit":"editorial","content":"Subject: Urgent: Fact-Checking Needed - Mayor Wilson Article\n\nFrom: Ivan Nardini <inardini@cymbalmedia.com>\n\nTo: Editorial Team <editorial@cymbalmedia.com>\n\nHi Team,\n\nWe need an urgent fact-check on the Mayor Wilson piece slated for tomorrow's front page...","model_a_response":"## Oggetto: Urgente: Verifica dei Fatti Necessaria - Articolo Sindaco Wilson\n\nDa: Ivan Nardini <inardini@cymbalmedia.com>\n\nA: Team Editoriale <editorial@cymbalmedia.com>\n\nCiao Team,\n\nAbbiamo bisogno di una verifica urgente dei fatti per l'articolo sul Sindaco Wilson previsto per la prima pagina di domani...","model_b_response":"## Oggetto: Urgente: Verifica dei fatti necessaria - Articolo Sindaco Wilson\n\nDa: Ivan Nardini <inardini@cymbalmedia.com>\n\nA: Redazione <editorial@cymbalmedia.com>\n\nCiao a tutti,\n\nabbiamo bisogno di una verifica urgente dei fatti per l'articolo sul Sindaco Wilson previsto per la prima pagina di domani..."}
{"business_unit":"advertising","content":"Subject: Cymbal Advertising: Elevate Your Brand with Our Fall Campaign Solutions\n\nDear [Recipient Name],\n\nI hope this email finds you well.\n\nMy name is Ivan Nardini, and I'm an Advertising Specialist at Cymbal, a leading name in news media, reaching millions of readers daily across our print and digital platforms...","model_a_response":"## Oggetto: Cymbal Advertising: Fai Ascendere il Tuo Brand con le Nostre Soluzioni per le Campagne Autunnali\n\nGentile [Nome del Destinatario],\n\nSpero che questa email ti trovi bene.\n\nMi chiamo Ivan Nardini e sono un Advertising Specialist presso Cymbal, un nome leader nel settore dei media, che raggiunge milioni di lettori ogni giorno attraverso le nostre piattaforme cartacee e digitali...","model_b_response":"## Oggetto: Pubblicit\u00e0 su Cymbal: Fai crescere il tuo brand con le nostre soluzioni per la campagna autunnale\n\nGentile [Nome del destinatario],\n\nspero che questa email ti trovi bene.\n\nMi chiamo Ivan Nardini e sono uno specialista della pubblicit\u00e0 presso Cymbal, un nome leader nel settore dei media, che raggiunge milioni di lettori ogni giorno attraverso le nostre piattaforme cartacee e digitali..."}

Here, business_unit represents the company unit from which the email was collected, content is the original email text you want to translate, and model_a_response and model_b_response correspond to the translations produced by Gemini 1.5 Pro and Gemini 1.5 Flash, respectively.
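If you want a quick sanity check before running any evaluation, you can load the JSONL sample with pandas and confirm it contains these columns. Below is a minimal sketch; the dataset path is a hypothetical placeholder, and reading directly from Cloud Storage requires the gcsfs package.

import pandas as pd

# Hypothetical path: replace with your Cloud Storage URI or local JSONL file.
EVALUATION_DATASET_URI = "gs://your-bucket/autosxs/translation_eval.jsonl"

# Load the JSONL evaluation dataset and verify the expected columns exist.
eval_df = pd.read_json(EVALUATION_DATASET_URI, lines=True)

expected_columns = {"business_unit", "content", "model_a_response", "model_b_response"}
missing = expected_columns - set(eval_df.columns)
if missing:
    raise ValueError(f"Evaluation dataset is missing columns: {missing}")

print(eval_df[["business_unit", "content"]].head())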

To compare LLMs on a specific task in a business context, you start by defining evaluation criteria. Given the nature of this translation task, you may define custom evaluation criteria such as the following:

  • Correctness — The translation accurately conveys the meaning, nuances, and style of the content text in the Italian language, while maintaining fluency and appropriateness for the intended audience and purpose.
  • Grammatical correctness — The translation adheres to the grammatical rules of the Italian language, including syntax, morphology, and word choice.
  • Mistranslation — The translation contains errors where the meaning of the content text is not accurately conveyed in the Italian language.
  • Fluency — The translation reads smoothly and naturally in the Italian language, adhering to its grammatical structures and idiomatic expressions.
  • Loanwording — The translation leaves individual words or phrases of the content text untranslated in the Italian text.

Then, for each criterion, you define a custom score so the autorater can assess model responses point by point against the criteria you set. In this scenario, you may define a custom score within the range of 1 to 5, where 5 denotes the highest-quality translation (most correct, fluent, etc.) and 1 indicates the lowest quality (least correct, fluent, etc.).

With the new AutoSxS custom task, it is quite straightforward to implement an A/B evaluation test and assess model performance based on your specific evaluation criteria, score rubric, and evaluation dataset. For the evaluation criteria, you define a list of criteria, each represented as a dictionary with the following keys:

  • name: the criterion’s name.
  • definition: the criterion’s definition.
  • evaluation_steps: an optional step-by-step guide for the AutoRater on how to assess whether the criterion is fulfilled.

Below are the evaluation criteria you would define for this translation scenario.

EVALUATION_CRITERIA = [
    {
        'name': 'Correctness',
        'definition': 'The translation accurately conveys the meaning, nuances, and style of the content text in the Italian language, while maintaining fluency and appropriateness for the intended audience and purpose.',
        'evaluation_steps': [
            'Compare the responses containing translated emails.',
            'Assess which translated email accurately reflects the meaning of the content text in the Italian language.',
            'Consider which translated email maintains the original style and tone.',
            'Ensure the translated email is fluent and natural-sounding in the Italian language.',
        ],
    },
    {
        'name': 'Mistranslation',
        'definition': 'The translation contains errors where the meaning of the content text is not accurately conveyed in the Italian language.',
        'evaluation_steps': [
            'Compare the responses containing translated emails.',
            'Identify any discrepancies in meaning between translated emails and the content text.',
            'Determine if the discrepancies are significant enough to be considered mistranslations.',
            'Consider if cultural references or nuances are lost or misinterpreted in the translation.',
        ],
    },
    ...
]

Once you’ve defined the evaluation criteria, you can establish a scoring rubric. This score_rubric is a dictionary that assigns a numerical score based on the degree to which each criterion is fulfilled. Let’s look at an example of a score rubric tailored to our translation scenario.

SCORE_RUBRIC = {
    '5': 'Response is an excellent translation that accurately conveys the meaning and nuances of the content text while maintaining fluency and naturalness in the Italian language. It adheres to grammar and style conventions, ensuring the content is appropriate and relevant.',
    '4': 'Response is a good translation that accurately conveys the meaning of the content text with minor issues in fluency, style or cultural appropriateness.',
    ...
    '1': 'Response is a poor translation that fails to convey the meaning of the content text, is full of errors, and is not suitable.'
}

Notice that choosing the right precision for the score is important: a low-precision integer scale (such as 1 to 5) tends to yield more consistent autorater judgments than a fine-grained or continuous scale when evaluating large language models (LLMs).

It is important to highlight that, although the LLM-as-judge pattern is one of the most commonly used, there is still ongoing discussion about its consistency. One way to make its evaluations more usable and reliable is to provide few-shot examples. Adding few-shot examples tends to improve the consistency of scores, especially when combined with a detailed grading rubric, and consequently improves the grading results.

With AutoSxS, you can pass few-shot examples as an additional JSONL file where each record has the following schema:

[
    {
        "business_unit": "digital/online",
        "content": """
Subject: Urgent: Website Downtime Impact on Subscription Revenue - Action Required

...
""",
        "response_a": """
Oggetto: Urgente: Interruzione del sito web e impatto sulle entrate da abbonamento - Azioni necessarie

...
""",
        "response_b": """
Oggetto: Urgente: Impatto del downtime del sito web sulle entrate degli abbonamenti - Azioni richieste
...
""",
        "response_a_score": 2,
        "response_b_score": 4,
        "winner": "B",
        "explanation": "Both A and B provide accurate and fluent translations of the original English email. However, B demonstrates a slightly better grasp of natural language flow and idiomatic expressions, making it the slightly better translation overall.",
    }
]

where response_a and response_b are the two Italian translations of an English email (content) for each business unit (business_unit), along with their scores (response_a_score and response_b_score), the winning response (winner), and an explanation of the reasons behind the autorater preference.

Notice that you have to provide simple yet representative examples that reflect the task, together with calibration scores based on your predefined scoring rubric; these scores guide the AutoRater’s evaluation. It is also strongly recommended to provide numerous examples spanning the full range of scores and criteria to ensure comprehensive rating. Remember, however, that more examples consume more of the tokens sent to the autorater, reducing your available context length. For instance, if your model has an 8,000-token context limit and your few-shot examples total 2,000 tokens, your remaining context is reduced to 6,000 tokens.
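To keep your few-shot file within that budget, you can roughly estimate its token footprint before submitting the job. The sketch below is only a back-of-the-envelope check based on a characters-per-token heuristic, not the autorater’s actual tokenizer, and FEW_SHOT_EXAMPLES_PATH is a hypothetical local copy of the file.

import json

# Hypothetical local copy of the few-shot examples JSONL file.
FEW_SHOT_EXAMPLES_PATH = "few_shot_examples.jsonl"

# Rough heuristic: ~4 characters per token for English text. The autorater's
# tokenizer will differ, so treat this only as an order-of-magnitude estimate.
CHARS_PER_TOKEN = 4

with open(FEW_SHOT_EXAMPLES_PATH) as f:
    records = [json.loads(line) for line in f if line.strip()]

total_chars = sum(len(json.dumps(record)) for record in records)
estimated_tokens = total_chars // CHARS_PER_TOKEN

print(f"{len(records)} few-shot examples, ~{estimated_tokens} tokens")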

After you define the few-shot examples and upload them to Cloud Storage along with the evaluation dataset, you can combine the evaluation criteria and score rubric into the parameters of an AutoSxS evaluation job and submit a new evaluation as shown below.

from pathlib import Path
from uuid import uuid4

from google.cloud import aiplatform
from google_cloud_pipeline_components.preview import model_evaluation
from kfp import compiler


def generate_uuid() -> str:
    # Short unique suffix to keep pipeline job names distinct.
    return uuid4().hex[:8]


display_name = f"autosxs-custom-evaluate-{generate_uuid()}"

# EVALUATION_DATASET_URI, FEW_SHOT_EXAMPLES_DATASET_URI, INSTRUCTION and
# PIPELINE_ROOT_URI are defined earlier in your notebook (Cloud Storage URIs
# and the autorater instruction template).
PIPELINE_PARAMETERS = {
    "evaluation_dataset": EVALUATION_DATASET_URI,
    "id_columns": ["content"],
    "task": "custom@001",
    "response_column_a": "model_a_response",
    "response_column_b": "model_b_response",
    "autorater_prompt_parameters": {
        "instruction": {"template": INSTRUCTION},
        "content": "content",
    },
    "experimental_args": {
        "custom_task_definition": {
            "evaluation_criteria": EVALUATION_CRITERIA,
            "autorater_prompt_parameter_keys": [
                "instruction",
                "content",
            ],
            "few_shot_examples_config": {
                "dataset": FEW_SHOT_EXAMPLES_DATASET_URI,
                "autorater_prompt_parameters": {
                    "instruction": {"template": INSTRUCTION},
                    "content": "content",
                },
                "response_a_column": "response_a",
                "response_b_column": "response_b",
                "choice_column": "winner",
                "explanation_column": "explanation",
                "response_a_score_column": "response_a_score",
                "response_b_score_column": "response_b_score",
            },
            "score_rubric": SCORE_RUBRIC,
        }
    },
}

# Compile the AutoSxS pipeline locally, then submit it as a Vertex AI Pipelines job.
pipeline_path = Path(".")  # local folder where the compiled pipeline is written
template_uri = str(pipeline_path / "pipeline.yaml")
compiler.Compiler().compile(
    pipeline_func=model_evaluation.autosxs_pipeline,
    package_path=template_uri,
)

pipeline_job = aiplatform.PipelineJob(
    job_id=display_name,
    display_name=display_name,
    template_path=template_uri,
    pipeline_root=PIPELINE_ROOT_URI,
    parameter_values=PIPELINE_PARAMETERS,
    enable_caching=False,
)
pipeline_job.run()

Regarding parameters, you provide the evaluation dataset parameters, such as the path to your dataset (evaluation_dataset), the task type (task), and unique identifiers (id_columns). You also provide the autorater guidance for the autorater’s prompt (autorater_prompt_parameters) and the custom task details (experimental_args). Finally, you specify the model parameters, which in this case are just the prediction column names (response_column_a, response_column_b) because the predictions are pre-generated. Refer to the documentation for detailed configuration options and additional parameters for features like exporting judgments and checking alignment with human preferences, as sketched below.
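For example, assuming your evaluation dataset also includes a column of human preferences, you could enable the human-alignment check and export judgments to BigQuery by extending the parameter dictionary. The parameter names below come from the documented AutoSxS pipeline interface, while the column and table names are hypothetical, so verify both against your installed pipeline components version.

# Optional additions to PIPELINE_PARAMETERS, assuming your evaluation dataset
# has an "actuals" column with human preferences ("A"/"B") and you want the
# judgments exported to a BigQuery table.
PIPELINE_PARAMETERS.update(
    {
        "human_preference_column": "actuals",  # hypothetical column name
        "judgments_format": "bigquery",
        "bigquery_destination_prefix": "your-project.autosxs_results",  # hypothetical destination
    }
)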

Concerning the evaluation job, you can optionally compile the AutoSxS Model Evaluation pipeline locally and then submit it as a Vertex AI Pipelines job, which you can monitor in the Vertex AI UI as shown below.

Figure 3 — Vertex AI AutoSxS with custom task pipeline run

The AutoSxS evaluation pipeline job produces several artifacts, including judgments and AutoSxS win-rate metrics. By analyzing these outputs, you can gain a comprehensive understanding of both individual example performance and overall model comparison.

You can access the evaluation results in Vertex AI Pipelines by examining the artifacts generated by the AutoSxS pipeline with the Vertex AI SDK for Python, as shown below.

import pandas as pd

# Retrieve the completed pipeline job (pass its resource name or job ID).
pipeline_job = aiplatform.PipelineJob.get(...)

# Judgments (example-level results)
for details in pipeline_job.task_details:
    if details.task_name == "online-evaluation-pairwise":
        break

judgments_uri = details.outputs["judgments"].artifacts[0].uri
judgments_df = pd.read_json(judgments_uri, lines=True)

# Aggregate metrics (win rates)
for details in pipeline_job.task_details:
    if details.task_name == "model-evaluation-text-generation-pairwise":
        break

win_rate_metrics = details.outputs["autosxs_metrics"].artifacts[0].metadata

Below is an example of the judgments table, which provides example-level results, including inference prompts, model responses, autorater decisions, rating explanations, and confidence scores.

Figure 4 — Vertex AI AutoSxS judgment results
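If you prefer to explore the judgments programmatically rather than in the console, a quick look at the DataFrame loaded above (judgments_df) shows which columns your pipeline version produced before you start filtering on them.

# List the available columns, then preview a few judgments.
print(judgments_df.columns.tolist())
print(judgments_df.head())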

The aggregate metrics offer an overview of model performance, such as the autorater model A win rate, which indicates the percentage of times the autorater preferred model A’s response.
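As a minimal sketch, you can read the win rate from the metadata dictionary retrieved earlier. The metric key used below (autosxs_model_a_win_rate) is an assumption based on the pipeline’s metrics artifact, so print the full dictionary first to confirm the exact name.

# Print all aggregate metrics, then read the model A win rate.
# The key name is assumed; verify it against your pipeline version.
print(win_rate_metrics)
model_a_win_rate = win_rate_metrics.get("autosxs_model_a_win_rate")
print(f"Model A win rate: {model_a_win_rate}")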

Interestingly, in this scenario, the AutoSxS autorater prefers Model A’s translations over Model B’s only 50% of the time, meaning Gemini 1.5 Flash performs on par with Gemini 1.5 Pro. This is an encouraging result: you can consider switching to Gemini 1.5 Flash for this translation task, potentially reducing the latency and cost of your GenAI application!

Conclusions

One of the most frequent questions I have received about Vertex AI AutoSxS is:

Can Vertex AI AutoSxS be used to evaluate custom tasks?

Absolutely! The new AutoSxS custom task enables you to create your own evaluation criteria and tasks. This article showcased how Vertex AI AutoSxS can be used to evaluate and compare the performance of LLMs on customized translation tasks by defining specific evaluation criteria.

Please note that there are several parameters to configure when evaluating a custom task. To ensure that the task performs as expected, you may want to consider tuning parameters with prompt engineering based on a test dataset. And if you want to know how, keep following me, as more exciting content is coming your way 🤗.

What’s Next

Do you want to know more about Vertex AI GenAI Evaluation framework and how to use it? Check out the following resources!

Documentation

Github samples

Vertex AI Q&A (VAQ) Series

This article is part of the Vertex AI Q&A (VAQ) series, an interactive FAQ (frequently asked questions) series where I answer questions you have about Vertex AI, based on my personal experience and what I have learned.

Thanks for reading

I hope you enjoyed the article. If so, 𝗙𝗼𝗹𝗹𝗼𝘄 𝗺𝗲 and 👏 this article or leave a comment. Also, let’s connect on LinkedIn or X to share feedback and the questions 🤗 about Vertex AI you would like answered.


Ivan Nardini
Google Cloud - Community

Customer Engineer at @GoogleCloud who is passionate about Machine Learning Engineering. Lead of the MLOps.community Engineering Lab.