Benchmarking Our Path to AGI: Measuring AI Progress in 2025
What is the state of play with AI in early 2025? Are we in an S-curve of diminishing returns, or actually at the early stages of an exponential takeoff?
Well, the main method we use to gauge progress is “evals”, or evaluations: choosing metrics that measure how well any given AI model performs. Remember, the models themselves are initially just trained to predict the next token. Whether those tokens are poetry, math, code, or a mix of everything partly, though not entirely, determines the capabilities that come out at the end of training.
To compare model performance across time, researchers have come up with benchmarks that take some form of question and answer, so performance can be measured as a percentage from 0 to 100. Others measure preferences by comparing model outputs head-to-head, usually reported via metrics such as win rate.
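To make the two scoring styles concrete, here is a minimal Python sketch, purely illustrative: the data, the function names, and the `prefer` judge are all hypothetical stand-ins. A Q&A benchmark reduces to the percentage of correct answers, and a preference benchmark reduces to a head-to-head win rate (ties ignored for simplicity).

```python
# Illustrative sketch only; dataset, outputs, and judge are made up.

def benchmark_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Question-and-answer style: fraction of items answered correctly, as a 0-100 score."""
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

def win_rate(model_a_outputs: list[str], model_b_outputs: list[str], prefer) -> float:
    """Preference style: fraction of head-to-head comparisons where model A is preferred.
    `prefer(a, b)` stands in for a human or LLM judge returning True if A's output wins."""
    wins = sum(prefer(a, b) for a, b in zip(model_a_outputs, model_b_outputs))
    return 100.0 * wins / len(model_a_outputs)

# Toy example (hypothetical data):
preds, golds = ["4", "Paris", "7"], ["4", "Paris", "9"]
print(benchmark_accuracy(preds, golds))  # 66.7 -> "the model scores ~67% on this benchmark"

judge = lambda a, b: len(a) > len(b)     # deliberately silly judge, just to show the mechanics
print(win_rate(["longer answer", "hi"], ["short", "a much longer reply"], judge))  # 50.0
```

In practice the judge is a human rater or another model, and arena-style leaderboards aggregate many such pairwise outcomes into a single rating, but the underlying arithmetic is as simple as the sketch above.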
Currently, we’re in a kind of trench warfare between a handful of AI labs. For the longest time, OpenAI generally owned all the major benchmarks. Whenever other labs released new models, they had to cherry-pick some esoteric benchmark to show the public, and investors, that they were state-of-the-art (“SOTA”) in something.
Perhaps the one that still carries the biggest emotional weight is Chatbot Arena, where the public can blind-compare models simply by choosing which response they prefer. You could say this is the people’s champion, in that…