LLM Benchmarks: How can we say that LLaMa-2 is the best?

Patryk M. Wieczorek
5 min read · Jul 27, 2023


Photo by Jason Dent on Unsplash

When one of the bigger players in the AI field releases a new Large Language Model, a wave of excitement ripples through the tech world, especially if it turns out to be the best.

But how do we know that the new LLM is the best?

Well, we can always ask the model some questions and ask ourselves (or some of our friends) if we like its answers better, but… I might love it and my friend Andrew might hate it.

And it would be completely subjective.

This is where metrics on benchmarks come in, providing an objective measure of model performance.

What is an LLM benchmark?

By an LLM benchmark, we usually mean a dataset prepared to measure a model’s performance on a specific task.

Some examples of tasks:

  • Code generation
  • Common knowledge
  • Reasoning
  • Math

To benchmark a model, we have to decide on one of these prompting approaches:

  • Few-shot prompting — where the model gets example questions alongside their solutions inside its prompt (see the sketch below).
  • Zero-shot prompting — where the model is prompted with the question only.
Source: Author
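
To make the difference concrete, here is a minimal Python sketch of both approaches. The example questions and the prompt template are made up purely for illustration; real benchmarks use their own templates and many more examples (the leaderboard uses 25 for ARC, for instance).

```python
# Toy illustration of zero-shot vs. few-shot prompting.
# The questions and template below are invented for this example.

examples = [
    ("What is the boiling point of water at sea level?", "100 °C"),
    ("Which planet is closest to the Sun?", "Mercury"),
]
question = "Which gas do plants absorb during photosynthesis?"

# Zero-shot: the model only sees the question.
zero_shot_prompt = f"Question: {question}\nAnswer:"

# Few-shot: k (question, solution) pairs are prepended before the question.
few_shot_prompt = "".join(
    f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples
) + f"Question: {question}\nAnswer:"

print(zero_shot_prompt)
print("---")
print(few_shot_prompt)
```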

A perfect example is Hugging Face’s leaderboard that ranks open-source LLMs. This curated list objectively scores each model, offering a valuable resource for anyone looking to find the best available LLM. I highly recommend checking it out before proceeding. You can find it here.

It looks like this:

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Models are ranked by the average of their performance on 4 datasets:

  • ARC (25-shot)
  • HellaSwag (10-shot)
  • MMLU (5-shot)
  • TruthfulQA (0-shot)

25-shot means 25 pairs of (question, solution) from the dataset are inserted into the prompt for each question.

Let’s explore them one by one.

AI2 Reasoning Challenge — ARC

Introduced in early 2018 in the paper “Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”.

Hugging Face describes it as “A set of grade-school science questions”.

In the paper we can find that it contains 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering.
The questions are designed to be answerable with reasoning and knowledge that a typical 8th grader would be expected to possess.

The dataset weighs 681 MB and is divided into two sets of questions:

  • ARC-Easy
  • ARC-Challenge

Example questions:

Some questions from ARC-Challenge train dataset

Each entry contains a question, multiple answer choices, and the correct answer.
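
If you want to poke at the data yourself, a short sketch like the one below should work, assuming the `datasets` library is installed and that "ai2_arc" (with the "ARC-Challenge" config) is the dataset id on the Hugging Face Hub; the field names are from memory, so double-check them against the dataset card.

```python
# Sketch: load ARC-Challenge and print one example.
# Assumes: pip install datasets, and that "ai2_arc" / "ARC-Challenge"
# is the Hub id and config for this benchmark.
from datasets import load_dataset

arc = load_dataset("ai2_arc", "ARC-Challenge", split="train")

sample = arc[0]
print(sample["question"])
for label, text in zip(sample["choices"]["label"], sample["choices"]["text"]):
    print(f"  {label}. {text}")
print("Correct answer:", sample["answerKey"])
```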

HellaSwag

This benchmarking dataset was released with the paper “HellaSwag: Can a Machine Really Finish Your Sentence?” in May 2019.

Its name is related to a previously existing dataset called SWAG. HellaSwag is more challenging for models to achieve high performance on.

This dataset is designed to evaluate models’ capabilities in the area of commonsense reasoning, particularly their ability to predict or complete a sentence in a way that makes sense.

The dataset weighs 71.5 MB.

Example questions:

Some questions from HellaSwag train dataset

Each element of the dataset is very well explained by the authors here.
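
You can explore it the same way. The sketch below assumes "hellaswag" is the dataset id on the Hub and that each row exposes a context ("ctx"), four candidate endings, and the index of the correct ending ("label"); treat the field names as my recollection rather than a guarantee.

```python
# Sketch: load HellaSwag and print one context with its candidate endings.
# Assumes "hellaswag" is the Hub id and the field names are as recalled.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="train")

sample = hellaswag[0]
print("Context:", sample["ctx"])
for i, ending in enumerate(sample["endings"]):
    print(f"  {i}. {ending}")
print("Correct ending index:", sample["label"])
```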

Massive Multitask Language Understanding — MMLU

Published in early 2021 in the paper Measuring Massive Multitask Language Understanding, this benchmark was designed to make evaluation more challenging and more similar to how we evaluate humans.

The purpose of MMLU is to measure a model’s understanding and proficiency in various expert knowledge domains.

It contains questions from 57 categories; some examples are:

  • Elementary mathematics
  • Abstract Algebra
  • Marketing
  • Nutrition
  • Moral disputes
  • US history

It has been observed that a human expert can achieve over 90% accuracy in their field, while GPT-4 achieved 86.4% overall (using 5-shot).

The dataset is 8.48GB.

Example questions:

Some example questions from MMLU train datasets

This structure is very straightforward and is intuitively understandable.
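
To browse it yourself, a sketch along these lines should work, assuming "cais/mmlu" is the dataset id on the Hub with one config per subject (plus an "all" config); the field names ("question", "choices", "answer") are from memory, so verify them against the dataset card.

```python
# Sketch: load one MMLU subject and print a question with its choices.
# Assumes "cais/mmlu" is the Hub id and "abstract_algebra" is one of its configs.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

sample = mmlu[0]
print(sample["question"])
for i, choice in enumerate(sample["choices"]):
    print(f"  {i}. {choice}")
print("Correct choice index:", sample["answer"])
```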

TruthfulQA

Released with the paper TruthfulQA: Measuring How Models Mimic Human Falsehoods (first posted in September 2021 and published at ACL 2022), this is a benchmark to measure the truthfulness of a language model’s generated answers to questions.

This dataset is extremely interesting because the authors created questions that some humans might answer falsely due to misconceptions or false beliefs.

In order to score well, models must avoid generating false answers learned from imitating incorrect human texts present in pretraining data.

TruthfulQA measures two separate tasks:

  • Choosing the correct answer in a multiple-choice question
  • Generating an answer to a question with no proposed answer choices

This dataset is the smallest, weighing just 1.15 MB.

Example questions:

Questions from TruthfulQA generation dataset
Questions from TruthfulQA multiple-choice dataset
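
Both tasks can be explored directly. The sketch below assumes "truthful_qa" is the dataset id on the Hub with "generation" and "multiple_choice" configs (and only a validation split); the field names are recalled from memory, so check the dataset card if they differ.

```python
# Sketch: load both TruthfulQA configs and print one example from each.
# Assumes "truthful_qa" is the Hub id with the configs named below.
from datasets import load_dataset

gen = load_dataset("truthful_qa", "generation", split="validation")
mc = load_dataset("truthful_qa", "multiple_choice", split="validation")

g = gen[0]
print("Question:", g["question"])
print("Best answer:", g["best_answer"])
print("Some incorrect answers:", g["incorrect_answers"][:2])

m = mc[0]
print("Question:", m["question"])
print("Choices:", m["mc1_targets"]["choices"])
print("Labels:", m["mc1_targets"]["labels"])  # 1 marks the correct choice
```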

I recommend skimming through the whole dataset. The authors did a marvelous job finding areas where everyday humans would struggle to correctly answer all the questions.

It’s also helpful to clear your own misconceptions : )

Comparison

Let’s focus on what each dataset tries to measure and how it goes about that task.

Source: Author

We can see that GPT-4 is nearly at human performance for most tasks, while open-source models are still far behind it.

Will open-source models overtake commercial giants? Let me know what you think.

Thanks for reading! If you want to remember it better, I recommend you read this article not just once, but at least thrice :)

PS. Load the datasets yourself and explore them using the Hugging Face datasets library.

If you found this article helpful, consider following me for more.

It really encourages me to write.

Disclaimer: If you have found something that I can improve, I would be grateful if you’d reach out to me directly at patryk@wieczorek.dev.
