Navigating Google Cloud’s Vertex AI Auto SxS - A Technical Deep Dive

An innovative tool for AI model evaluation

Jitendra Gupta
Google Cloud - Community
9 min read · Jan 27, 2024


Dive into Google Cloud’s Vertex AI Auto SxS, a cutting-edge tool designed for the efficient evaluation of large language models in the cloud.

Introduction

Welcome to the world where cloud computing meets advanced AI: a realm dominated by tools like Google Cloud’s Vertex AI Auto SxS. In this article, we’ll embark on an exploratory journey into this powerful tool, designed to revolutionize how we approach model evaluations in AI. As the demand for sophisticated AI solutions skyrockets, tools like Auto SxS become indispensable for developers and data scientists.

Let’s dive in and unravel the capabilities and nuances of Vertex AI Auto SxS.

Understanding Vertex AI Auto SxS


What is Vertex AI Auto SxS?

Vertex AI Auto SxS (automatic side-by-side) is a sophisticated tool for evaluating and comparing AI models, particularly large language models (LLMs). It stands at the forefront of model evaluation, offering an automated, efficient, and objective way to assess the performance of different AI models.

Role in Model Evaluations

In the landscape of AI, where models are continually evolving and improving, Vertex AI Auto SxS plays a critical role. It provides a platform for developers to compare models side-by-side, using a set of standardized criteria. This not only streamlines the evaluation process but also ensures consistency and accuracy in the assessment of different models.

Key Features

Description of the Autorater System

The centerpiece of Vertex AI Auto SxS is its autorater system. This innovative component acts as an impartial judge, evaluating the responses generated by different models based on a prompt. The autorater analyzes these responses using a range of criteria, such as accuracy, relevance, and coherence, to determine which model performs better in a given scenario.

Comparison Mechanism for Model Outputs

Auto SxS’s comparison mechanism is straightforward yet powerful. Two models receive the same input prompt and generate their responses. The autorater then compares these responses, assessing each based on predefined criteria. This process not only highlights the strengths and weaknesses of each model but also provides valuable insights into areas for improvement.
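
Conceptually, the flow looks like the following sketch. This is an illustration only, not the service’s actual implementation; the three callables are stand-ins:

# Conceptual sketch of the side-by-side flow (illustration only).
def side_by_side(prompt, model_a, model_b, autorater):
    response_a = model_a(prompt)  # both models receive the identical prompt
    response_b = model_b(prompt)
    # The autorater judges the pair against criteria such as accuracy,
    # relevance, and coherence, and returns its verdict.
    return autorater(prompt, response_a, response_b)

# Stand-in example: a toy "autorater" that naively prefers the longer response.
verdict = side_by_side(
    "Summarize the benefits of renewable energy.",
    model_a=lambda p: "Solar and wind power are growing quickly worldwide.",
    model_b=lambda p: "Renewables are growing.",
    autorater=lambda p, a, b: {"choice": "A" if len(a) > len(b) else "B"},
)
print(verdict)  # {'choice': 'A'}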

Deep Dive into Auto SxS Functionalities

The Autorater Mechanism: How It Works

The autorater in Vertex AI Auto SxS is itself an advanced language model, trained to understand and evaluate the nuances of language so that it can judge the quality of responses from other models. Because it applies pre-set standards rather than subjective human judgment, its evaluations remain consistent and unbiased.

Criteria Used for Evaluation

The evaluation criteria used by the autorater are meticulously designed to cover various aspects of a model’s response. These include:

Accuracy: How well the response aligns with the facts or data provided.

Relevance: The appropriateness of the response to the given prompt.

Coherence: The logical flow and clarity of the response.

Model Comparison Process

Input Prompt Handling

In the model comparison process, both models receive identical input prompts. These prompts are designed to test various capabilities of the models, ranging from simple information retrieval to complex problem-solving tasks.

Response Generation and Comparison

Upon receiving the prompt, each model generates its response, which is then fed into the autorater. The autorater evaluates these responses side-by-side, providing a comparative analysis that highlights the strengths and weaknesses of each model in relation to the specific prompt.

Setting Up for Success: Evaluation Datasets

Types of Datasets Supported

Vertex AI Auto SxS can work with various types of datasets, including BigQuery tables and JSONL files stored in Cloud Storage. The choice of dataset largely depends on the specific needs of the evaluation and the models being tested.

Best Practices for Dataset Creation

Aim for Real-World Representation: Ensure your dataset closely mimics actual scenarios that the models will face in real-life applications.

Careful Selection of Prompts and Data: Choose prompts and data that reflect the typical challenges and tasks your models will handle in practical situations.

Dataset Requirements

Format and Structure

The format of your evaluation dataset is crucial for the effective functioning of Auto SxS. Typically, datasets should be structured with clearly defined columns, such as ID columns for unique example identification, data columns containing prompt details, and response columns holding model-generated responses.

Example of Dataset Entries

An ideal dataset entry might include:

  • Context: The background information or scenario for the prompt.
  • Question: A specific question or task posed to the models.
  • Model Responses: Pre-generated responses from the models being evaluated.

For instance, if you’re evaluating models on their ability to understand and summarize news articles, a dataset entry might look like this:

  • Context: “Recent studies show a significant increase in renewable energy adoption globally, driven by advancements in solar and wind energy technologies.”
  • Question: “Summarize the key developments in renewable energy technologies as mentioned in the context.”
  • Model A Response: “Global renewable energy usage is on the rise, primarily due to new innovations in solar and wind power.”
  • Model B Response: “Advancements in technology are leading to increased adoption of renewable energy sources, especially solar and wind energy.”
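
To make the required structure concrete, here is a minimal sketch that writes the entry above as one JSONL row. The column names (example_id, input_text, summary_instruction, and the two response columns) are illustrative; whatever names you use must match the ID columns, prompt parameters, and response columns you pass to the evaluation pipeline later on.

import json

example = {
    "example_id": "news-summarization-001",
    "input_text": ("Recent studies show a significant increase in renewable "
                   "energy adoption globally, driven by advancements in solar "
                   "and wind energy technologies."),
    "summary_instruction": ("Summarize the key developments in renewable "
                            "energy technologies as mentioned in the context."),
    "model_a_response": ("Global renewable energy usage is on the rise, "
                         "primarily due to new innovations in solar and wind "
                         "power."),
    "model_b_response": ("Advancements in technology are leading to increased "
                         "adoption of renewable energy sources, especially "
                         "solar and wind energy."),
}

# Each example is a single JSON object on its own line of the JSONL file
# that you upload to Cloud Storage.
with open("my-evaluation-data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")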

Integration with Vertex AI

Using the API and SDK

Step-by-Step Guide on Integration

Integrating Auto SxS with your models is streamlined through Vertex AI’s API and Python SDK. The process involves:

  1. Defining prompt parameters for your models.
  2. Setting up the autorater with the necessary instructions and context.
  3. Utilizing the API or SDK to send requests for model evaluations.

Examples of API Calls and Python SDK Usage

API Call: An example API call to Vertex AI might look like a POST request to the pipelineJobs endpoint, with parameters such as the model names, task types, and dataset paths.

POST https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/pipelineJobs
Content-Type: application/json
Authorization: Bearer YOUR_ACCESS_TOKEN

{
  "displayName": "my-auto-sxs-evaluation-job",
  "runtimeConfig": {
    "gcsOutputDirectory": "gs://my-output-directory",
    "parameterValues": {
      "evaluation_dataset": "gs://my-bucket/my-evaluation-data.jsonl",
      "id_columns": ["example_id"],
      "task": "summarization@001",
      "autorater_prompt_parameters": {
        "inference_instruction": {
          "column": "summary_instruction"
        },
        "inference_context": {
          "column": "input_text"
        }
      },
      "response_column_a": "model_a_response",
      "response_column_b": "model_b_response",
      "model_a": "projects/YOUR_PROJECT_ID/locations/us-central1/models/MODEL_A_ID",
      "model_a_prompt_parameters": {
        "prompt": {
          "column": "model_a_prompt"
        }
      },
      "model_b": "projects/YOUR_PROJECT_ID/locations/us-central1/models/MODEL_B_ID",
      "model_b_prompt_parameters": {
        "prompt": {
          "column": "model_b_prompt"
        }
      }
    }
  },
  "templateUri": "https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/2.8.0"
}

In this example:

  • Replace YOUR_PROJECT_ID with your Google Cloud project ID.
  • YOUR_ACCESS_TOKEN should be your OAuth 2.0 access token.
  • MODEL_A_ID and MODEL_B_ID are the IDs of the models you're comparing.
  • my-output-directory and my-bucket/my-evaluation-data.jsonl should be replaced with your Cloud Storage bucket and file paths.
  • example_id, summary_instruction, input_text, model_a_response, model_b_response, model_a_prompt, and model_b_prompt are columns in your evaluation dataset.

To execute this API call, you can use curl in your command line:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/us-central1/pipelineJobs"

This call sets up a pipeline job in Vertex AI for evaluating two models using a specified dataset, with the task set to “summarization@001”. The job will output its results to the specified GCS output directory.

Python SDK Usage: Using Vertex AI’s Python SDK, you can script these evaluations by defining the necessary parameters and executing the evaluation as a batch process.

First, ensure you have installed the Vertex AI SDK for Python. You can install or update it using the following command:

pip install google-cloud-aiplatform

Then, use the following Python script to set up your pipeline job:

import os
from google.cloud import aiplatform

# Define your parameters and replacements
pipelinejob_displayname = "my-auto-sxs-evaluation-job"
project_id = "YOUR_PROJECT_ID"
location = "us-central1"  # Supported location
output_dir = "gs://YOUR_OUTPUT_DIR"  # Cloud Storage URI for output
evaluation_dataset = "gs://YOUR_BUCKET/YOUR_DATASET.jsonl"  # Path to your dataset
task = "summarization@001"  # Replace with your task
id_columns = ["example_id"]  # Replace with your ID columns
autorater_prompt_parameters = {
    "inference_instruction": {"column": "instruction_col"},
    "inference_context": {"column": "context_col"},
}
response_column_a = "response_a_col"
response_column_b = "response_b_col"
model_a = "projects/YOUR_PROJECT_ID/locations/us-central1/models/MODEL_A_ID"
model_b = "projects/YOUR_PROJECT_ID/locations/us-central1/models/MODEL_B_ID"
model_a_prompt_parameters = {"prompt": {"column": "model_a_prompt"}}
model_b_prompt_parameters = {"prompt": {"column": "model_b_prompt"}}

# Initialize the AI Platform client
aiplatform.init(project=project_id, location=location, staging_bucket=output_dir)

# Create and run the pipeline job
pipeline_job = aiplatform.PipelineJob(
    display_name=pipelinejob_displayname,
    template_path="https://us-kfp.pkg.dev/ml-pipeline/llm-rlhf/autosxs-template/2.8.0",
    parameter_values={
        "evaluation_dataset": evaluation_dataset,
        "id_columns": id_columns,
        "task": task,
        "autorater_prompt_parameters": autorater_prompt_parameters,
        "response_column_a": response_column_a,
        "response_column_b": response_column_b,
        "model_a": model_a,
        "model_a_prompt_parameters": model_a_prompt_parameters,
        "model_b": model_b,
        "model_b_prompt_parameters": model_b_prompt_parameters,
    },
    pipeline_root=os.path.join(output_dir, pipelinejob_displayname),
)

pipeline_job.run()

In this script:

  • Replace YOUR_PROJECT_ID, YOUR_OUTPUT_DIR, YOUR_BUCKET/YOUR_DATASET.jsonl, MODEL_A_ID, and MODEL_B_ID with your specific project ID, Cloud Storage paths, and model IDs.
  • Adjust task, id_columns, autorater_prompt_parameters, response_column_a, and response_column_b according to your evaluation setup.

This code initializes the Vertex AI environment with your project and location, sets up the pipeline job with the specified parameters, and runs the job. The results of the evaluation will be stored in the specified output directory.
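
Note that run() blocks until the pipeline finishes. If you would rather fire the job off asynchronously and poll it yourself, the same SDK object also supports submit(); a minimal sketch, continuing from the script above:

# Instead of the blocking run() above, submit the job asynchronously.
pipeline_job.submit()
print(pipeline_job.resource_name)  # handy for locating the job in the console

# ...do other work, then block until the pipeline finishes and check its state.
pipeline_job.wait()
print(pipeline_job.state)  # e.g. PipelineState.PIPELINE_STATE_SUCCEEDED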

Viewing Evaluation Results in Vertex AI Pipelines

After running your model evaluation job using Vertex AI Auto SxS, you can access the results through Vertex AI Pipelines. The evaluation results are encapsulated in several key artifacts generated by the AutoSxS pipeline:

  1. Judgments Table: Created by the AutoSxS arbiter, this table provides example-level metrics. It helps you gauge model performance for each example and includes information like inference prompts, model responses, autorater decisions, rating explanations, and confidence scores. This data can be stored either as JSONL in Cloud Storage or as a BigQuery table. Key columns in the Judgments table include:
  • ID columns to identify unique evaluation examples.
  • Inference instruction and context used for generating model responses.
  • Responses from Model A and Model B.
  • The ‘choice’ column indicating the model with the superior response.
  • Confidence scores and explanations provided by the autorater.

  2. Aggregate Metrics: Produced by the AutoSxS metrics component, these metrics offer an overview of the evaluation, such as the win rates for each model, i.e., the percentage of times each model was preferred by the autorater. These metrics are particularly useful for understanding the overall performance of the models relative to each other.

  3. Human-Preference Alignment Metrics: If your evaluation includes human-preference data, AutoSxS will also provide metrics that compare the autorater’s decisions with human preferences. These include the win rates according to human preferences and the autorater, along with statistical measures like accuracy, precision, recall, F1 score, and Cohen’s Kappa. These metrics help in understanding how closely the autorater’s decisions align with human judgments.

By examining these results, especially the row-based data and autorater explanations, you can gain a comprehensive understanding of each model’s performance and how they compare to human preferences. This in-depth analysis aids in making informed decisions about model improvements and selections.
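
As a starting point for that analysis, here is a minimal sketch that loads a judgments file into pandas and computes win rates locally. The GCS path is a placeholder (the exact artifact URI is shown on the pipeline’s detail page in the console), reading gs:// paths with pandas requires the gcsfs package, and the human_choice column is hypothetical:

import pandas as pd

# Placeholder path: copy the real judgments artifact URI from the
# pipeline's detail page in the Cloud console.
judgments_uri = "gs://YOUR_OUTPUT_DIR/path/to/judgments.jsonl"

# One judgment per line (JSONL); gs:// reads require gcsfs.
judgments = pd.read_json(judgments_uri, lines=True)

# Win rates: how often the autorater preferred each model.
print(judgments["choice"].value_counts(normalize=True))

# Hypothetical: if your dataset carried a human-preference column, check
# how often the autorater agrees with the human raters.
if "human_choice" in judgments.columns:
    agreement = (judgments["choice"] == judgments["human_choice"]).mean()
    print(f"Autorater/human agreement: {agreement:.1%}")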

Best Practices and Tips (Maximizing Auto SxS Efficiency)

Recommendations for Efficient Usage

To get the most out of Auto SxS, consider:

Regularly updating your evaluation datasets to reflect current trends and data.

Using a diverse range of prompts that cover various aspects of your model’s intended use case.

Avoiding Common Pitfalls

Common pitfalls include:

Overlooking the importance of varied and comprehensive datasets.

Neglecting to fine-tune the autorater’s criteria to match specific evaluation needs.

Advanced Features and Customization

Exploring Lesser-Known Functionalities

Auto SxS also offers advanced features such as:

  • Customizable evaluation criteria for specialized tasks.
  • Detailed analysis reports that provide deeper insights into model performance.

Customizing Evaluations for Specific Needs

You can tailor Auto SxS evaluations by:

  • Adjusting the autorater’s criteria based on the specific nuances of your models.
  • Creating custom prompts that closely align with your models’ real-world applications.

Conclusion

In this deep dive into Google Cloud’s Vertex AI Auto SxS, we’ve uncovered its robust capabilities in evaluating and comparing AI models. This tool is more than just a technical asset; it’s a catalyst for innovation and efficiency in AI model development and deployment. As AI continues to evolve, tools like Auto SxS will undoubtedly play a pivotal role in shaping the future of AI and cloud computing.

About me: I am a GCP Cloud Architect with over a decade of experience in the IT industry and a multi-cloud certified professional. If you have any questions, you can reach me on LinkedIn or on Twitter @jitu028; send me a DM and I’ll be happy to help!

You can also schedule 1:1 discussion with me on https://www.topmate.io/jitu028 for any Cloud related support.

Appreciate the technical knowledge shared? You can support my work by buying me a book at the link below.

https://www.buymeacoffee.com/jitu028
