Towards Unbiased Evaluation of Large Language Models
How benchmark leakage and data contamination undermine LLM evaluation
“Our new LLM beats GPT in every benchmark!”
It is becoming increasingly common to hear bold claims like this, as the hype around LLMs is huge. New models are released every week, and everyone is currently trying to compete with GPT-4, which is still the most powerful LLM.
Benchmarking is a critical part of evaluating progress in large language models.
Benchmarks like MMLU and HellaSwag are the standard for assessing language models on skills like reasoning and comprehension. The scores provide a snapshot of progress, with new state-of-the-art results heralded as breakthroughs. LLMs are usually evaluated in a zero-shot setting, without explicit training on the test set, to gauge their general abilities.
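To make the zero-shot setup concrete, here is a minimal sketch of how a multiple-choice benchmark in the style of MMLU is typically scored: the model sees only the question and the lettered options, with no worked examples, and its first answer letter is compared against the key. The `query_model` callable and the toy item are assumptions for illustration, not part of any specific benchmark harness.

```python
# Minimal sketch of zero-shot multiple-choice evaluation (MMLU-style).
# `query_model` is a placeholder for whatever API or local model you use.

def format_prompt(question: str, choices: list[str]) -> str:
    """Build a zero-shot prompt: question plus lettered options, no examples."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def evaluate(dataset, query_model) -> float:
    """Return accuracy: fraction of items where the model's letter matches the key."""
    correct = 0
    for item in dataset:
        prompt = format_prompt(item["question"], item["choices"])
        prediction = query_model(prompt).strip()[:1].upper()  # first letter of the reply
        correct += prediction == item["answer"]
    return correct / len(dataset)

# Toy usage with a stub model that always answers "B":
toy_dataset = [
    {
        "question": "Which planet is known as the Red Planet?",
        "choices": ["Venus", "Mars", "Jupiter", "Saturn"],
        "answer": "B",
    }
]
print(evaluate(toy_dataset, query_model=lambda prompt: "B"))  # prints 1.0
```

The important point is that nothing in this loop prevents the benchmark questions themselves from having appeared in the model's training data, which is exactly the problem discussed below.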
This article shows how easy it is to manipulate benchmark results and offers suggestions to maintain evaluation integrity.
The Trouble with Benchmarks
Often, benchmarks don’t reflect usefulness in real-life scenarios. Google’s newest model, Gemini Ultra, scores 90.04% on MMLU. While this is an impressive score, a closer look at the evaluation methodology shows that it was obtained with CoT@32…