LLM Comparator: A tool for human-driven LLM evaluation
By Minsuk Kahng, Ryan Mullins, Ludovic Peran
Evaluating a Generative AI model, like Gemma 2 (whose upcoming release was announced today at Google I/O), requires assessing its outputs. That's pretty straightforward for models that analyze quantitative datasets to predict the weather or the price of corn, because you can see how accurate their predictions are. But when you're dealing with a model that responds to users' written prompts, responses need to be more than just accurate. They also have to be clear, respectful, and contain an appropriate level of detail. That's why we're proud to introduce the LLM Comparator, the latest tool in Google's Responsible Generative AI Toolkit, which helps developers deploy their models with safety and responsibility top of mind.
Whom it’s for
The LLM Comparator is an interactive and visual tool for side-by-side evaluations of the quality and safety of an LLM's responses. It provides not only quantitative assessments but also qualitative descriptions of the results, and it enables developers to point and click their way down to particular responses so they can understand the language behind the metrics.
This tool is primarily for developers who are trying to see if a model tweaked this way or that is actually better than its prior version. It can also be used to evaluate two different models, or even different prompting strategies. The aim is to help developers get the information they need to make LLMs more helpful and safe.
Here’s how it works
Many teams developing LLMs have standard sets of prompts they use to evaluate the current build of the model. The challenge is that you ideally want to generate lots of responses and apply uniform standards when testing their quality and safety, which is very difficult at scale. So the LLM Comparator leverages the idea of having another LLM evaluate the results, and it provides the user with tools to explore, validate, and draw conclusions from this analysis.
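The core of this idea can be sketched in a few lines. The snippet below is a minimal, illustrative version of side-by-side "LLM as judge" rating: it builds a comparison prompt and parses a verdict. The prompt template, verdict format, and function names are our assumptions for illustration, not LLM Comparator's actual API, and no model is called here.

```python
# Minimal sketch of side-by-side "LLM as judge" rating.
# The prompt template and the "A"/"B"/"TIE" verdict format are
# illustrative assumptions, not LLM Comparator's actual interface.

JUDGE_TEMPLATE = """Compare two responses to the same prompt.
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Reply with one line containing "A", "B", or "TIE", then a short rationale."""


def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    """Fill the comparison template for one prompt and two model responses."""
    return JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b)


def parse_verdict(judge_output: str) -> str:
    """Extract the winner token from the first line of the judge's reply."""
    first_line = judge_output.strip().splitlines()[0].strip().upper()
    for token in ("TIE", "A", "B"):
        if first_line.startswith(token):
            return token
    return "UNPARSED"


# Example with a canned judge reply (no model call is made here):
canned_reply = "B\nResponse B is more detailed and better organized."
assert parse_verdict(canned_reply) == "B"
```

In the real tool, a verdict like this is collected for many prompts, and the aggregate, plus the rationales, is what gets visualized.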
To illustrate the power of the LLM Comparator, we've built a demo that compares the performance of two versions of the Gemma 7B instruction-tuned model, 1.0 and 1.1. For the responses we use a sample of the Chatbot Arena Conversations dataset, a publicly available dataset of 33,000 conversations collected on the LMSYS Chatbot Arena.
Categories, simple language explanations and custom functions
In our demo case, the LLM evaluator (also known as a "judge," as described by the original academic researchers behind the LLM Comparator concept), here Gemini Pro, finds that Gemma 1.1 is significantly better overall than Gemma 1.0.
But what exactly makes the new Gemma model better, beyond the better scores on standard benchmarks? Before the LLM Comparator, answering this crucial question would have been very difficult. The LLM Comparator enables users to analyze performance via categories created by developers and shows, side-by-side, how well each model did within each category.
This is a first step toward a more granular understanding of performance. If developers are creating an LLM focused on, say, scientific research or the humanities, this can be crucial information that complements aggregate scores.
But in what way are the responses in a category better in one model than the other? Another crucial question! Because Gemini Pro is an LLM, it can present its analysis in plain terms: for example, Gemma 1.1 is more detailed, more accurate, and better organized.
If you want to investigate further, simply click to explore the prompts and responses themselves; ultimately, it's important to have a human in the loop, especially when issues are delicate, nuanced, or related to safety.
Another reason humans are needed in the loop: Developers may have particular interests that the AI evaluator may not have anticipated. So, the LLM Comparator lets users define custom functions to check for specific elements in the model responses. Thanks to this feature, our developers discovered that Gemma 1.0 was beginning too many responses with “Sure!” (It no longer does.)
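A custom function along these lines can be very simple. Here is a hypothetical check in the spirit of the one described above; the function name and interface are our own illustration, not LLM Comparator's actual custom-function API.

```python
# Hypothetical custom check in the spirit of LLM Comparator's
# user-defined functions; the name and interface are illustrative.

def starts_with_sure(response: str) -> bool:
    """Flag responses that open with the filler word 'Sure!'."""
    return response.lstrip().startswith("Sure!")


# Toy sample of responses from one model version:
responses = [
    "Sure! Here is a summary of the article.",
    "The capital of France is Paris.",
]
rate = sum(starts_with_sure(r) for r in responses) / len(responses)
# rate == 0.5 for this toy sample
```

Run across a full evaluation set, a rate like this makes a verbal tic visible at a glance, per model and per category.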
These custom functions can, of course, uncover more important issues. What’s the average reading level of the responses marked as clear? Do they often contain bulleted lists to structure the ideas? Are they using the sorts of pleasantries that convey respect? Are they overdoing it with the pleasantries? And even, what is the breakdown of the pronouns used?
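Two of the checks above can be sketched as small text functions. These are illustrative assumptions about what such custom functions might look like, not LLM Comparator's built-ins.

```python
import re
from collections import Counter

# Illustrative custom checks; not LLM Comparator's built-in functions.

def has_bulleted_list(response: str) -> bool:
    """Detect Markdown-style bullets ('-' or '*') at the start of a line."""
    return any(re.match(r"^\s*[-*]\s+", line)
               for line in response.splitlines())


def pronoun_breakdown(response: str) -> Counter:
    """Count occurrences of a few English pronouns, case-insensitively."""
    pronouns = {"i", "you", "he", "she", "it", "we", "they"}
    words = re.findall(r"[a-z']+", response.lower())
    return Counter(w for w in words if w in pronouns)


sample = "- You can try it.\n- They said it works."
assert has_bulleted_list(sample)
counts = pronoun_breakdown(sample)
# counts["it"] == 2, counts["you"] == 1, counts["they"] == 1
```

Checks like these run over every response, so the breakdowns can be compared side-by-side across models and categories, just like the judge's scores.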
To summarize
As model performance increases, it is important to help developers assess model capabilities beyond the aggregated scores of benchmarks.
Bringing a human-in-the-loop approach to automated side-by-side evaluation enables the granular analysis required to evaluate models. We believe the LLM Comparator is critically helpful in this regard, and we're excited to share it with developers. You can find it, and more tools, in our newly updated Responsible Generative AI Toolkit.
This work is the result of a large group, including:
Lucas Dixon, Michael Terry, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Tolga Bolukbasi, David Weinberger, Surya Bhupatiraju, Kathleen Kenealy, Reena Jana, Devki Trivedi