LLM Comparator: A tool for human-driven LLM evaluation

People + AI Research @ Google
5 min read · May 14, 2024

By Minsuk Kahng, Ryan Mullins, Ludovic Peran

LLM Comparator’s interface features a table on the left displaying input prompts, model outputs, score distributions, and rationales. A column on the right showcases visualizations like score distributions, metrics, clusters, and custom functions. Users can interact with both the table and visualizations to drill down into specific data points and filter results.
LLM Comparator’s user interface

Evaluating a generative AI model, like Gemma 2 (whose upcoming release was announced today at Google I/O), requires assessing its outputs. That’s pretty straightforward for models that analyze quantitative datasets to predict the weather or the price of corn, because you can see how accurate their predictions are. But when you’re dealing with a model that responds to users’ written prompts, responses need to be more than just accurate. They also have to be clear, respectful, and contain an appropriate level of detail. That’s why we’re proud to introduce the LLM Comparator, the latest tool in Google’s Responsible Generative AI Toolkit, which helps developers deploy their models with safety and responsibility top of mind.

Whom it’s for

The LLM Comparator is an interactive, visual tool for side-by-side evaluations of the quality and safety of an LLM’s responses. It not only provides quantitative assessments but also qualitative descriptions of the results, and it enables developers to point and click their way down to particular responses so they can understand the language behind the metrics.

This tool is primarily for developers who are trying to see if a model tweaked this way or that is actually better than its prior version. It can also be used to evaluate two different models, or even different prompting strategies. The aim is to help developers get the information they need to make LLMs more helpful and safe.

Here’s how it works

Many teams developing LLMs have standard sets of prompts they use to evaluate the current build of the model. The issue is that, ideally, you want to generate lots of responses and apply uniform standards when testing their quality and safety, and at scale that can be very difficult. So the LLM Comparator leverages the idea of having another LLM evaluate the results, and it provides the user with tools to explore, validate, and draw conclusions from this analysis.
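Conceptually, the judge is just another model call: it sees the prompt and the two responses side by side, returns a score indicating which response is better, and explains its choice. The sketch below is only an illustration of that idea under our own assumptions, not the LLM Comparator’s actual implementation; `call_judge_model`, `judge_pair`, and the prompt text are hypothetical placeholders, and the score range simply echoes the one used in the demo’s charts.

```python
import json

# A minimal sketch of the LLM-as-judge idea behind automatic side-by-side
# evaluation. `call_judge_model` is a hypothetical stand-in for whatever API
# you use to query a judge LLM (e.g., Gemini Pro); none of this is the
# LLM Comparator's actual code.

JUDGE_PROMPT = """You are comparing two AI assistants' answers to the same prompt.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Rate which response is better on a scale from -1.5 (B is much better)
to +1.5 (A is much better), and briefly explain why.
Reply as JSON: {{"score": <number>, "rationale": "<one sentence>"}}"""


def judge_pair(prompt, response_a, response_b, call_judge_model):
    """Asks a judge LLM to compare two responses to the same prompt."""
    reply = call_judge_model(JUDGE_PROMPT.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    # Expected shape: {"score": 0.5, "rationale": "A is more detailed."}
    return json.loads(reply)
```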

To illustrate the power of the LLM Comparator, we’ve built a demo that compares the performance of two versions of the Gemma 7B instruction-tuned model: 1.0 and 1.1. For the responses, we use a sample of the Chatbot Arena Conversations dataset, a publicly available dataset of 33,000 conversations collected on the LMSYS Chatbot Arena.

Categories, simple-language explanations, and custom functions

In our demo case, the LLM evaluator (also known as a “judge,” as described by the original academic researchers behind the LLM Comparator concept) — here, Gemini Pro — says Gemma 1.1 is overall significantly better than Gemma 1.0.

But what exactly makes the new Gemma model better, beyond the better scores on standard benchmarks? Before the LLM Comparator, answering this crucial question would have been very difficult. The LLM Comparator enables users to analyze performance via categories created by developers and shows, side by side, how well each model did within each category.

This is a first step toward a more granular understanding of performance. If developers are creating an LLM focused on, say, scientific research or the humanities, this can be crucial information that complements aggregate scores.

The distribution is a bar chart ranging from +1.50 to -1.50 in intervals of 0.5. In this dataset, the 383 outputs for which model A is better are encoded in blue, covering the range +0.25 to +1.50. The 174 outputs for which model B is better are encoded in orange, covering the range -0.25 to -1.50. The 343 outputs for which A and B are about the same are encoded in gray, covering the range +0.25 to -0.25. The population mean is 0.20. The metrics-by-prompt-category chart is a table with two columns: average score and win rate.
LLM Comparator’s score distribution shows that Gemma 1.1 (here model A) is better than Gemma 1.0 61.6% of the time, as indicated in the first row of the second panel.
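To make the chart’s numbers concrete, here is a small sketch, under our own assumptions rather than the tool’s actual code, of how an average score and win rate could be computed per prompt category from the judge’s scores. The ±0.25 tie band mirrors the gray region above, and counting ties as half a win reproduces the 61.6% figure from 383 wins, 174 losses, and 343 ties; the LLM Comparator’s exact aggregation may differ.

```python
from collections import defaultdict

def summarize_by_category(records, tie_threshold=0.25):
    """Aggregates per-example judge scores into per-category metrics.

    `records` is a list of dicts like {"category": "Coding", "score": 0.75},
    where positive scores favor model A and negative scores favor model B.
    Scores within +/- tie_threshold are treated as "about the same".
    """
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record["score"])

    summary = {}
    for category, scores in by_category.items():
        wins_a = sum(1 for s in scores if s > tie_threshold)
        ties = sum(1 for s in scores if abs(s) <= tie_threshold)
        summary[category] = {
            "avg_score": sum(scores) / len(scores),
            # Ties counted as half a win: with 383 wins, 174 losses, and
            # 343 ties overall, this yields (383 + 343 / 2) / 900 = 61.6%.
            "win_rate_a": (wins_a + 0.5 * ties) / len(scores),
        }
    return summary
```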


But in what way are the responses in a category better in one model than the other? Another crucial question! Because Gemini Pro is an LLM, it can present its analysis in plain terms: for example, Gemma 1.1 is more detailed, more accurate, and better organized.

Screenshot of the LLM Comparator’s user interface. It shows the plain-text explanation provided by Gemini Pro of why Gemma 1.1 is better than 1.0. The rationale summary is visualized as a table with in-line horizontal bar charts that visualize the counts for which model A and model B were better for different cluster labels. For the largest cluster label rationale, “provides more detail”, model A is better for 131 examples and model B is better for 60 examples.
Plain-text explanations by another LLM, here Gemini Pro, provide a qualitative analysis that helps developers gain a much finer understanding of the differences between the two models being compared.

If you want to investigate further, simply click to explore the prompts and responses themselves. Ultimately, it’s important to have a human in the loop, especially when issues may be delicate, nuanced, or related to safety.

Another reason humans are needed in the loop: Developers may have particular interests that the AI evaluator may not have anticipated. So, the LLM Comparator lets users define custom functions to check for specific elements in the model responses. Thanks to this feature, our developers discovered that Gemma 1.0 was beginning too many responses with “Sure!” (It no longer does.)
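The kind of check involved is easy to picture. As a rough illustration only (the function below and its name are ours, not the LLM Comparator’s custom-function interface), such a check might look like this:

```python
import re

def starts_with_sure(response: str) -> bool:
    """Flags responses that open with "Sure" (e.g., "Sure!", "Sure, here...")."""
    return re.match(r"\s*sure\b", response, flags=re.IGNORECASE) is not None

# Tallying the flag separately for each model's responses surfaces the pattern.
responses_a = ["Sure! Here is a summary...", "The capital of France is Paris."]
responses_b = ["Here is a summary...", "Paris is the capital of France."]
print(sum(map(starts_with_sure, responses_a)),
      sum(map(starts_with_sure, responses_b)))
```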


These custom functions can, of course, uncover more important issues. What’s the average reading level of the responses marked as clear? Do they often contain bulleted lists to structure the ideas? Are they using the sorts of pleasantries that convey respect? Are they overdoing it with the pleasantries? And even, what is the breakdown of the pronouns used?
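Each of these questions can, in principle, be expressed as a small function over the response text. The two sketches below, for the bulleted-list and pronoun examples, are again purely illustrative and are not part of the LLM Comparator:

```python
import re
from collections import Counter

def has_bulleted_list(response: str) -> bool:
    """True if the response structures ideas with bullet points."""
    return bool(re.search(r"^\s*[-*\u2022]\s+\S", response, flags=re.MULTILINE))

def pronoun_breakdown(response: str) -> Counter:
    """Counts a handful of personal pronouns in the response."""
    pronouns = {"i", "we", "you", "he", "she", "they"}
    words = re.findall(r"[a-z']+", response.lower())
    return Counter(word for word in words if word in pronouns)
```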

Screenshot of the LLM Comparator’s user interface. It shows answers from the two models being compared as well as the ‘custom function’ feature. A table on the left shows responses from model A in a column under a blue header, responses from model B in a column under an orange header, and a distribution of scores for each row. A purple chip on model B shows the use of a third LLM to determine whether or not the output from model B starts with “Sure,” using a binary true/false label.
The custom function feature, displayed on the right, allowed us to identify differences in language patterns, such as chattiness and structure, between Gemma 1.1 and 1.0.

To summarize

As model performance increases, it is important to help developers assess model capabilities beyond the aggregated scores of benchmarks.

Bringing a human-in-the-loop approach to automated side-by-side evaluation enables the granular analysis required to evaluate models. We believe the LLM Comparator is critically helpful in this regard, and we’re excited to share it with developers. You can find it and more tools in our newly updated Responsible Generative AI Toolkit.

This work is the result of a large group effort, including:

Lucas Dixon, Michael Terry, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Tolga Bolukbasi, David Weinberger, Surya Bhupatiraju, Kathleen Kenealy, Reena Jana, Devki Trivedi


People + AI Research (PAIR) is a multidisciplinary team at Google that explores the human side of AI.