Start Evaluating Your LLMs Today: A Practical Look at Vertex AI’s Rapid Evaluation Framework

Vipin Nair · Published in AITech · Apr 16, 2024 · 5 min read

(Header image created with Imagen 2)

Introduction

Large Language Models (LLMs) are having their moment. They can write surprisingly creative stories, translate between languages like a pro, answer your trickiest questions, and even generate code. But as these powerful AI tools find their way into more and more products, one question becomes crucial: How do we know if they’re really doing a good job?

That’s where a reliable evaluation framework comes in. Think of it like a report card for your LLM. It helps you compare different models, figure out if your prompts are guiding the LLM correctly, and generally make sure everything’s up to scratch before you roll out an LLM-powered application.

Challenges of LLM Evaluation

Imagine asking an LLM to write a poem. It does… But is the poem any good? The answer to such a question is often not very clear-cut. Evaluating an LLM’s output can be tricky. Here’s why:

  • The Creativity Conundrum: What makes something creative? It’s subjective! One reader might be dazzled by the unexpected word choices, while another finds it nonsensical. There’s no single “right” answer.
  • Beyond Just Grammar: LLMs can churn out perfectly grammatical text that still misses the mark. Did it capture the right tone? Does it make sense in the broader context? Is it actually interesting to read?
  • The Bias Blindspot: The data LLMs are trained on can be riddled with societal biases. An evaluation framework needs to be alert to these potential biases to ensure that the LLM isn’t inadvertently perpetuating harmful stereotypes.

Vertex AI’s Rapid Evaluation Framework

Vertex AI’s Rapid Evaluation Framework provides a structured and efficient approach to evaluating Large Language Models (LLMs). It streamlines the evaluation process by automating repeated runs across multiple model configurations and prompt templates, and by integrating with Vertex AI Experiments for easy tracking.

Core Functionalities

  • Automated Metric Computation: The Framework simplifies the process by automatically calculating relevant metrics, eliminating the need for manual implementation. This includes both model-based metrics (e.g., assessing summarization quality, coherence, fluency, and safety) and computation-based metrics (e.g., BLEU and ROUGE for similarity assessment).
  • Experiment Management: Seamless integration with Vertex AI Experiments facilitates the tracking of evaluation settings and results across multiple runs. This supports robust comparisons, prompt tuning, and model selection.

Key Features

  • Comprehensive Text Generation Assessment: The Framework supports the evaluation of various text generation tasks with metrics that capture aspects such as summarization effectiveness, logical coherence, linguistic fluency, and output safety.
  • Reference-Based Evaluation: Traditional textual similarity metrics like BLEU and ROUGE are offered for cases where ground-truth datasets are available.
  • Evaluating Tool Integration: When LLMs infer the use of external tools, the Framework can assess the accuracy of these tool calls and the correct identification of necessary parameters.

Step-by-Step Guide: Evaluating LLMs with Vertex AI’s Rapid Evaluation Framework

1. Installation

Start by installing the Rapid Evaluation component from the Vertex AI SDK:

pip install --upgrade google-cloud-aiplatform[rapid_evaluation]
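
Before creating any evaluation objects, initialize the SDK with your Google Cloud project and region. A minimal sketch; the project ID and region below are placeholders:

import vertexai

# Placeholders: replace with your own project ID and a supported region
vertexai.init(project="your-project-id", location="us-central1")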

2. Prepare Your Evaluation Dataset

Structure: Your evaluation dataset should be a Pandas DataFrame with three columns: instruction, context, and reference.

  • instruction: Describes the LLM task for each sample (e.g., "Summarize the following article").
  • context: Contains the input text for each sample (e.g., the article to be summarized).
  • reference: Provides the ideal output (e.g., the expected summary).

Code Example:

import pandas as pd

# ... (Your code for defining instruction, context, and reference)

eval_dataset = pd.DataFrame(
    {
        "context": context,
        "instruction": [instruction] * len(context),
        "reference": reference,
    }
)
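
For illustration, the elided variables might look like this for a small summarization set. The articles and reference summaries are hypothetical placeholders, not taken from the notebook:

# Hypothetical sample data (placeholders, not from the notebook)
instruction = "Summarize the following article"

context = [
    "The city council voted to approve a 40 km network of protected bike lanes downtown...",
    "Researchers described a battery chemistry that retains 90% of its capacity after 2,000 charge cycles...",
]

reference = [
    "The council approved a 40 km downtown network of protected bike lanes.",
    "A new battery chemistry keeps 90% of its capacity after 2,000 charge cycles.",
]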

3. Select Evaluation Metrics

Determine the metrics that align with your evaluation goals. The framework supports:

  • Model-based metrics: Assess fluency, coherence, safety, etc.
  • Computation-based metrics: Like BLEU and ROUGE for similarity to reference.

Code Example:

metrics = [
    "rouge_1",
    "rouge_l_sum",
    "bleu",
    "fluency",
    "coherence",
    "safety",
    # ... (Add other metrics as needed)
]

4. Create an Evaluation Task

Use the EvalTask class to define your evaluation task:

# EvalTask is part of the Rapid Evaluation module of the Vertex AI SDK
from vertexai.preview.evaluation import EvalTask

experiment_name = "eval-model-generation-settings"

summarization_eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=metrics,
    experiment=experiment_name,
)

5. Define the Model Configurations

Provide a set of model generation configurations to be used in the evaluation task.

from vertexai.generative_models import GenerativeModel

generation_config_1 = {"max_output_tokens": 128, "temperature": 0.3}
generation_config_2 = {"max_output_tokens": 256, "temperature": 0.5}
generation_config_3 = {"max_output_tokens": 500, "temperature": 0.7}

# The same Gemini model, evaluated under three different generation settings
gemini_1 = GenerativeModel("gemini-1.0-pro-001", generation_config=generation_config_1)
gemini_2 = GenerativeModel("gemini-1.0-pro-001", generation_config=generation_config_2)
gemini_3 = GenerativeModel("gemini-1.0-pro-001", generation_config=generation_config_3)

models = {
    "gemini-setting-1": gemini_1,
    "gemini-setting-2": gemini_2,
    "gemini-setting-3": gemini_3,
}
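
Before running the full evaluation, it can be worth sanity-checking a single configuration with one direct call; a quick sketch (the article text is a placeholder):

# Quick sanity check of one configuration (illustrative; article text is a placeholder)
response = gemini_1.generate_content(
    "Summarize the following article. Article: The city council voted to approve a 40 km "
    "network of protected bike lanes downtown... Summary:"
)
print(response.text)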

6. Experiment Setup and Execution

Iterate through model configurations, executing the evaluation task for each:

# PromptTemplate comes from the Rapid Evaluation utilities (see the linked notebook);
# run_id is a unique identifier for this batch of runs, defined in the notebook.
prompt_template = PromptTemplate("{instruction}. Article: {context}. Summary:")

for model_name, model in models.items():
    experiment_run_name = f"eval-{model_name}-{run_id}"
    eval_result = summarization_eval_task.evaluate(
        model=model,
        prompt_template=prompt_template,
        experiment_run_name=experiment_run_name,
    )
    # ... (Store results for comparison, e.g. eval_result.summary_metrics
    #      and eval_result.metrics_table)
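
To make the comparison concrete, here is one way to flesh out the loop above so each run’s summary metrics end up in a single DataFrame. The bookkeeping (the all_summaries dict, the placeholder run_id, and the comparison_df name) is my own addition, not from the notebook:

# Illustrative bookkeeping, not from the notebook: collect each run's summary
# metrics so the configurations can be compared side by side.
run_id = "demo-run"  # placeholder; use any unique identifier per batch of runs
all_summaries = {}

for model_name, model in models.items():
    eval_result = summarization_eval_task.evaluate(
        model=model,
        prompt_template=prompt_template,
        experiment_run_name=f"eval-{model_name}-{run_id}",
    )
    all_summaries[model_name] = eval_result.summary_metrics  # dict of metric name -> score

comparison_df = pd.DataFrame(all_summaries).T  # one row per model configuration
print(comparison_df)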

Note: The code snippets above are based on a notebook in the GCP Generative AI GitHub repo. Please refer to that notebook (link) for complete implementation details.

7. Analyze and Compare Results

Vertex AI conveniently logs results within Experiments. The accompanying notebook demonstrates how to visualize and analyze these results to compare the performance of different models and parameter settings, including utility functions that chart relative model performance.
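
If you prefer to plot the comparison yourself rather than use the notebook’s helpers, a minimal matplotlib sketch along these lines works on the comparison_df built above. The metric key names (e.g. "bleu/mean") are my assumption about how summary_metrics is keyed; print the DataFrame columns to confirm the exact names:

import matplotlib.pyplot as plt

# Bar chart of selected summary metrics per model configuration (illustrative).
# The "<metric>/mean" column names are assumed; check comparison_df.columns first.
metrics_to_plot = ["rouge_l_sum/mean", "bleu/mean", "fluency/mean", "coherence/mean"]
comparison_df[metrics_to_plot].plot(kind="bar", figsize=(10, 5))
plt.title("Relative model performance across generation settings")
plt.ylabel("Score")
plt.tight_layout()
plt.show()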

What’s Next

If you would like to learn more about the Vertex AI Rapid Evaluation Framework and how to use it, check out the following resources:

  • Documentation
  • GitHub samples
