
Towards Unbiased Evaluation of Large Language Models

7 min read · Dec 9, 2023


Image by author. (AI-assisted)

“Our new LLM beats GPT in every benchmark!”

Bold claims like this are becoming increasingly common as the hype around LLMs keeps growing. New models appear every week, and right now everyone is trying to compete with GPT-4, which is still the most powerful LLM.

Benchmarking is a critical part of evaluating progress in large language models.

Benchmarks like MMLU and HellaSwag are the standard for assessing language models on skills like reasoning and comprehension. The scores provide a snapshot of progress, with new state-of-the-art results heralded as breakthroughs. LLMs are usually evaluated in a zero-shot setting, without explicit training on the test set, to gauge their general abilities.
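
To make that setup concrete, here is a minimal sketch of zero-shot multiple-choice scoring, the kind of protocol MMLU-style benchmarks use. The prompt format and the `generate` function are illustrative placeholders I am assuming for the example, not the official evaluation harness.

```python
# Minimal sketch of zero-shot scoring on an MMLU-style multiple-choice question.
# `generate` is a placeholder for whatever model API you use.

def generate(prompt: str) -> str:
    """Stand-in for a call to an LLM; replace with your model's API."""
    raise NotImplementedError

def zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Build a zero-shot prompt: the question and options, no solved examples."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return (
        f"Question: {question}\n{options}\n"
        "Answer with a single letter (A, B, C, or D):"
    )

def is_correct(question: str, choices: list[str], gold: str) -> bool:
    """Exact-match scoring: the first A-D letter in the output is the prediction."""
    output = generate(zero_shot_prompt(question, choices))
    predicted = next((ch for ch in output.upper() if ch in "ABCD"), None)
    return predicted == gold
```

Accuracy on the benchmark is then just the fraction of questions for which the extracted letter matches the gold answer.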

This article shows how easy it is to manipulate benchmark results and offers suggestions to maintain evaluation integrity.

The Trouble with Benchmarks

Often, benchmarks don’t reflect usefulness in real-life scenarios. Google’s newest model, Gemini Ultra, scores 90.04% on MMLU. While this is an impressive score, a closer look at the evaluation methodology shows it was obtained with CoT@32: chain-of-thought prompting with 32 sampled generations per question, rather than the standard few-shot setting used to report other models’ scores.
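
To see why the protocol matters, here is a rough sketch contrasting a single-answer evaluation with a CoT@32-style one. This is a simplified majority-vote version, not the exact uncertainty-routed procedure described in the Gemini report, and `generate` is again a placeholder for any sampling-capable model API.

```python
# Illustrative contrast: one greedy answer per question vs. a CoT@32-style
# protocol that samples 32 chain-of-thought generations and majority-votes
# the final letter. `generate` is a placeholder for a model call.
from collections import Counter

def generate(prompt: str, temperature: float = 0.0) -> str:
    """Stand-in for an LLM call; replace with your model's API."""
    raise NotImplementedError

def extract_letter(output: str) -> str | None:
    """Take the first A-D letter in the output as the predicted answer."""
    return next((ch for ch in output.upper() if ch in "ABCD"), None)

def single_answer(prompt: str) -> str | None:
    """Standard setting: one greedy completion per question."""
    return extract_letter(generate(prompt))

def cot_at_32(prompt: str, k: int = 32) -> str | None:
    """CoT@32-style: sample k reasoning chains and keep the most common answer."""
    cot_prompt = prompt + "\nLet's think step by step before giving the final letter."
    votes = Counter(
        letter
        for _ in range(k)
        if (letter := extract_letter(generate(cot_prompt, temperature=0.7))) is not None
    )
    return votes.most_common(1)[0][0] if votes else None
```

The point is that two leaderboard numbers are not comparable when one comes from a single greedy answer and the other from 32 sampled reasoning chains per question.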

Written by Donato Riccio

AI Engineer specialized in Large Language Models.