LLM Leaderboard Illusion
I stopped believing in AI benchmarks a long time ago. I have read too many papers that claimed things that were not true, the “Sparks of AGI” paper being the clearest example. I kept wondering: if models are improving on so many benchmarks so fast, why did my own productivity barely go up when I used them? So, finally, we have a paper that shows exactly what scam is going on in the LLM evaluation space.
Table Of Contents
- Understanding Benchmarking
- Importance Of Correct Benchmarking Before We Reach AGI
- The Ongoing Scam Of Benchmarking
- The Leaderboard Illusion
- Final Thoughts
Understanding Benchmarking
Benchmarking is a critical step in releasing large-scale AI models, which will be used by millions of people. Every benchmark is supposed to measure a model’s ability on a specific set of topics. Some benchmarks are designed to probe a model’s capabilities, whereas others are meant to expose its limitations.
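At its core, a benchmark of this kind is just a fixed set of questions with known answers, scored by accuracy. Here is a minimal sketch of that idea; `toy_model` is a hypothetical stand-in for a real LLM call, and the two questions are illustrative, not drawn from any actual benchmark.

```python
def toy_model(question: str, choices: list[str]) -> str:
    # A real benchmark would query an LLM here; we hard-code
    # answers purely to make the scoring loop runnable.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, choices[0])

def benchmark_accuracy(items) -> float:
    """Fraction of benchmark items the model answers correctly."""
    correct = sum(
        toy_model(question, choices) == answer
        for question, choices, answer in items
    )
    return correct / len(items)

# A tiny, made-up "benchmark" of (question, choices, gold answer) items.
items = [
    ("What is 2 + 2?", ["3", "4", "5"], "4"),
    ("Capital of France?", ["Paris", "Rome"], "Paris"),
]
print(benchmark_accuracy(items))  # prints 1.0
```

Real benchmarks differ mainly in scale and in how answers are extracted from free-form model output, but the scoring loop is essentially this.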
General Knowledge and Reasoning Benchmarks
- MMLU (Massive Multitask Language Understanding) is designed to test a model’s knowledge across 57 academic and professional…