The Benchmark Trap: Why LLM Metrics Mislead and Evals Enlighten

Navigating the Pitfalls of Current Benchmarks and the Importance of Robust Evaluation for AI Systems

Ribhu Lahiri
8 min read · May 12, 2024


Generated by Bing Image Creator (DALL·E 3)

“We have just launched our newest large language model, trained on the best curated dataset with state-of-the-art GPUs, and we have improved on MMLU by n points, surpassing the previous best.” Does this sound familiar? It reads like the press release of an AI company unveiling its shiny new LLM to the world. As the arms race heats up, announcements like this were bound to multiply. But there’s one big problem with this line. No, I’m not talking about the obvious x-risk of larger, uninterpretable, and unaligned models. I’m not talking about the climate impact of data-centre emissions. I want to illustrate an issue that’s talked about far less but is actually much easier to control and act on.

I’m talking about benchmarks and evals. And I want to start with the elephant, or rather the whale, in the room: the MMLU. It is treated as the gold standard of LLM benchmarks. Whenever a company releases its newest offering, it boasts about how many points it has gained on it. It’s almost as if scoring 100 would mean we have achieved AGI. But that’s far from the truth. Look at the following questions:

[Example MMLU questions]
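If you want to browse these questions yourself, here is a minimal sketch that prints a few MMLU test items. It assumes the Hugging Face `datasets` library and the public `cais/mmlu` dataset, with field names (`question`, `subject`, `choices`, `answer`) taken from that dataset’s card:

```python
# Minimal sketch: inspect a few MMLU test questions locally.
# Assumes `pip install datasets` and the public Hugging Face
# dataset "cais/mmlu" (config "all" combines all 57 subjects).
from datasets import load_dataset

# Load the combined test split across all subjects.
mmlu = load_dataset("cais/mmlu", "all", split="test")

# Print the first three questions with their answer choices;
# the correct choice is marked with an asterisk.
for row in mmlu.select(range(3)):
    print(f"[{row['subject']}] {row['question']}")
    for i, choice in enumerate(row["choices"]):
        marker = "*" if i == row["answer"] else " "
        print(f"  {marker} {chr(65 + i)}. {choice}")
    print()
```

Skimming even a handful of items this way makes it easier to judge whether the questions really probe the kind of general capability the headline scores imply.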