LLM Evaluation metrics explained

ROUGE, BLEU, Perplexity, MRR, and BERTScore: the maths, with examples

Mehul Gupta
Data Science in your pocket


Evaluating LLMs has been an important point of discussion ever since Generative AI came into the limelight. In this post, I will cover some of the most important metrics (other than Accuracy and F1-score) that are used for evaluating LLMs and setting benchmarks.

My debut book: LangChain in your Pocket is out now!

This post will take a deep dive into the following metrics:

Perplexity

BLEU

ROUGE (ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU)

MRR

BERTScore

This is a veryyy longggg possttt🤪

Perplexity

Don’t confuse it with ChatGPT’s rival, Perplexity. Here, perplexity is a key metric used to evaluate how well a language model predicts a sequence of words in its answer. Also, it doesn’t require a Ground Truth!

Perplexity is a measure of how “perplexed” or “confused” a model is when it predicts the next word in a sequence.

A lower perplexity means the model is less perplexed, implying it is better at predicting the next word.

Conversely, a higher perplexity indicates more confusion, meaning the model struggles to predict the next word correctly.

So,

If Perplexity=1, the model perfectly predicted the sequence with 100% accuracy.

Perplexity=10 can be interpreted as the model having, on average, 10 equally likely options at each point in the sequence. Hence, it is highly confused.

How is it calculated?

exp(- (1/N) * Σ log(P(w_i | w_1, w_2, …, w_i-1)))

P(w_i | w_1, w_2, …, w_i-1) = the conditional probability assigned by the language model to the i-th word (w_i) in the sequence, given the previous words (w_1, w_2, …, w_i-1) as context.

N = the number of words in the sequence

Assume the LLM predicted “The cat sat on the mat.” for some prompt.

Step 1: Calculate probabilities for each word given the previous words. For this example, let’s assume the below values:

P(“The”) = 0.5
P(“cat”|”The”) = 0.4
P(“sat”|”The cat”) = 0.3
P(“on”|”The cat sat”) = 0.4
P(“the”|”The cat sat on”) = 0.5
P(“mat”|”The cat sat on the”) = 0.6

Step 2: Apply log and add these probabilities together

log(P(“The”)) +
log(P(“cat”|”The”)) +
log(P(“sat”|”The cat”)) +
log(P(“on”|”The cat sat”)) +
log(P(“the”|”The cat sat on”)) +
log(P(“mat”|”The cat sat on the”)) = abc

Step 3: Average the above log summation (divide by the total number of words), negate it, and apply the exponential

exp(-abc/6) ≈ 2.28 (with the probabilities assumed above, abc ≈ -4.93)

Hence, Perplexity ≈ 2.28, meaning the model effectively chooses among about 2.28 equally likely words at each step of the sequence.
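To make the arithmetic concrete, here is a minimal Python sketch that reproduces the calculation using the probabilities assumed in Step 1:

```python
import math

# Assumed conditional probabilities from Step 1
probs = [0.5, 0.4, 0.3, 0.4, 0.5, 0.6]

# Perplexity = exp(-(1/N) * sum(log P(w_i | context)))
avg_log_prob = sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(-avg_log_prob)

print(round(perplexity, 3))  # 2.276
```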

Note: To calculate perplexity, you need the model’s token prediction probabilities. Hence, it isn’t useful when you’re calling a hosted API that doesn’t expose token probability (or log-probability) scores.

For models loaded from Hugging Face, you can use the evaluate package, as in the sketch below (just change the model name).
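A minimal sketch, assuming the evaluate library’s built-in perplexity measurement (gpt2 is just a placeholder model id; the model runs locally, so its weights are downloaded from the Hub):

```python
# pip install evaluate transformers torch
import evaluate

# Load the perplexity measurement
perplexity = evaluate.load("perplexity", module_type="metric")

results = perplexity.compute(
    model_id="gpt2",  # change to any causal LM on the Hugging Face Hub
    predictions=["The cat sat on the mat."],
)

print(results["mean_perplexity"])  # average perplexity across all inputs
print(results["perplexities"])     # per-sentence perplexities
```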

Next up we have

BLEU score

Another very popular metric, BLEU (Bilingual Evaluation Understudy), evaluates how closely the output resembles the Ground Truth (hence, unlike Perplexity, a Ground Truth is required) and is mainly used for Machine Translation problems.

BLEU Score = BP * exp(∑(weightᵢ* log(pᵢ)))

Where

BP= Brevity Penalty

weightᵢ = weight assigned to Precision of each n-gram

pᵢ = Precision for each n-gram

Let’s understand the maths and each of the terms used, with an example:

Assume,

Reference Sentence: “The cat is on the mat.”

Candidate Sentence: “The cat sat on the mat.”

Step 1: Calculate n-gram precision.

Assuming we go with n=2

Hence, we need to calculate 1-gram & 2-gram precision. If n=3, we would calculate 1-gram, 2-gram & 3-gram precision.

1-gram Precision

Reference: The, cat, is, on, the, mat

Generated: The, cat, sat, on, the, mat

Matches: The, cat, on, the, mat (5)

= Number of matches / Total unigrams in Generated = 5/6 ≈ 0.83

2-gram Precision

Reference: The cat, cat is, is on, on the, the mat

Generated: The cat, cat sat, sat on, on the, the mat

Matches: The cat, on the, the mat (3)

=Number of matches / Total bigrams in Generated= 3 / 5 = 0.6

Step 2: Brevity Penalty (BP)

Brevity Penalty is used to penalize short candidate sentences.

BP = 1 if candidate length ≥ reference length,

BP = exp(1 - (reference length / candidate length)) if candidate length < reference length.

In this case, both the reference and candidate sentences have 6 words, so: BP = 1

Now we have everything required to calculate BLEU, i.e. the BP term and the n-gram Precision values. Weights are usually equal for every n-gram, so assume weightᵢ = 0.5 (the weights should sum to 1).

BLEU = 1 * exp(0.5*log(5/6) + 0.5*log(0.6)) ≈ 0.71
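A minimal sketch that reproduces this calculation from scratch (whitespace tokenisation and equal weights are simplifying assumptions, and it assumes every n-gram precision is non-zero):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, candidate, max_n=2):
    ref, cand = reference.split(), candidate.split()

    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(matches / len(ngrams(cand, n)))

    # Brevity penalty
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))

    weights = [1.0 / max_n] * max_n  # equal weights summing to 1
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(round(bleu("The cat is on the mat.", "The cat sat on the mat."), 3))  # 0.707
```

For real evaluations, a library implementation such as NLTK’s sentence_bleu (with weights=(0.5, 0.5) for this setup) or sacrebleu is preferable, since they handle tokenisation and smoothing for you.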

ROUGE Score

Similar to the BLEU score, the ROUGE Score is also a metric that requires Ground Truth, and it has multiple versions as well. Also, it isn’t a single score: we calculate Recall, Precision & F1, which are called ROUGE-Recall, ROUGE-Precision & ROUGE-F1. Some commonly used versions of the ROUGE Score are:

ROUGE-N: Measures the overlap of n-grams (like bigrams, trigrams) between the generated text and the reference text.

ROUGE-L: Focuses on the longest common subsequence (LCS) of words, highlighting sequence similarity.

ROUGE-W: Weights the LCS based on its length, giving more credit to longer matches.

ROUGE-S: Looks at skip-bigrams, pairs of words that maintain their order but may have gaps in between.

ROUGE-SU: Extends ROUGE-S by also counting unigrams (single words), so some credit is given even when no skip-bigram matches exist.

Let’s quickly understand the mathematics behind each of these variants

Assume,

Ground Truth (Reference): The cat sat on the mat

Generated Text: The cat lay on the mat

Note: We will be using the same example for all ROUGE explanations

ROUGE-N

ROUGE-1 (unigrams):

  • Unigrams in Reference: {The, cat, sat, on, the, mat}
  • Unigrams in Generated: {The, cat, lay, on, the, mat}
  • Overlap: {The, cat, on, the, mat}
  • Precision: 5/6 (generated unigrams matched / total generated unigrams)
  • Recall: 5/6 (reference unigrams matched / total reference unigrams)
  • F1 Score: 5/6 (since precision and recall are equal)

ROUGE-2 (bigrams):

  • Bigrams in Reference: {The cat, cat sat, sat on, on the, the mat}
  • Bigrams in Generated: {The cat, cat lay, lay on, on the, the mat}
  • Overlap: {The cat, on the, the mat}
  • Precision: 3/5
  • Recall: 3/5
  • F1 Score: 3/5

ROUGE-L

For ROUGE-L, we need to calculate Longest Common Subsequence i.e. a sequence that appears in the same order in both sequences but not necessarily consecutively. It’s used to measure the similarity between two sequences by identifying the longest subsequence common to both.

  • Longest Common Subsequence (LCS): “The cat on the mat”
  • Precision: 5/6 (LCS length / generated text length)
  • Recall: 5/6 (LCS length / reference text length)
  • F1 Score: 5/6
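In practice you rarely compute these by hand. A minimal sketch with the rouge-score package (pip install rouge-score) reproduces the ROUGE-1, ROUGE-2 and ROUGE-L numbers above:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat"
generated = "The cat lay on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(reference, generated)  # (target, prediction)

for name, s in scores.items():
    print(name, round(s.precision, 3), round(s.recall, 3), round(s.fmeasure, 3))
# rouge1 0.833 0.833 0.833
# rouge2 0.6   0.6   0.6
# rougeL 0.833 0.833 0.833
```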

ROUGE-S

ROUGE-S measures the overlap of skip-bigrams between the generated text and the reference text.

Skip-bigrams are pairs of words that appear in the same order in both texts, but they do not have to be consecutive. This allows for a more flexible measure of text similarity.

  • Skip-bigrams in Reference: {The cat, The sat, The on, The the, The mat, cat sat, cat on, cat the, cat mat, sat on, sat the, sat mat, on the, on mat, the mat}
  • Skip-bigrams in Generated: {The cat, The lay, The on, The the, The mat, cat lay, cat on, cat the, cat mat, lay on, lay the, lay mat, on the, on mat, the mat}
  • Overlap: {The cat, The on, The the, The mat, cat on, cat the, cat mat, on the, on mat, the mat}
  • Precision: 10/15
  • Recall: 10/15
  • F1 Score: 10/15
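Skip-bigrams are easy to enumerate with itertools.combinations. A minimal sketch reproducing the 10/15 overlap above (note this version allows gaps of any size, whereas ROUGE-S is often run with a maximum gap length, e.g. ROUGE-S4):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    # Every ordered pair of tokens, with any number of words skipped in between
    return Counter(combinations(tokens, 2))

ref = "The cat sat on the mat".split()
gen = "The cat lay on the mat".split()

ref_sb, gen_sb = skip_bigrams(ref), skip_bigrams(gen)
overlap = sum((ref_sb & gen_sb).values())  # multiset intersection -> 10

precision = overlap / sum(gen_sb.values())  # 10/15
recall = overlap / sum(ref_sb.values())     # 10/15
f1 = 2 * precision * recall / (precision + recall)
print(overlap, round(precision, 3), round(recall, 3), round(f1, 3))  # 10 0.667 0.667 0.667
```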

ROUGE-SU

ROUGE-SU is an extension of ROUGE-S that includes both skip-bigrams and unigrams (individual words) in its evaluation. This helps to give some credit even when there are no matching word pairs, which ROUGE-S alone might miss.

ROUGE-S focuses on pairs of words that appear in both the generated and reference texts. If no such pairs exist, ROUGE-S gives a score of zero, even if the sentences are somewhat similar.

Example:

  • Reference Sentence: “The police killed the gunman.”
  • Generated Sentence: “gunman the killed police”

In this case, the generated sentence is the exact reverse of the reference, so there are no word pairs appearing in the same order in both, and ROUGE-S would score it as 0.

In the case of ROUGE-SU, apart from skip-bigrams, unigram matches are also counted, hence ROUGE-SU won’t be 0 in the above case.

ROUGE-W

ROUGE-W (ROUGE-Weighted Longest Common Subsequence) improves on the basic LCS method by giving more weight to consecutive word matches, rewarding sequences that match more closely in order. The idea is simple: Assign more weightage to longer sub-parts of the LCS

The Problem with Basic LCS: Basic LCS measures the length of the longest sequence of words that appear in the same order in both texts but doesn’t differentiate between consecutive and scattered matches.

Example:

  • Reference Text (X): [A B C D E F G]
  • Generated Text 1 (Y1): [A B C D H I K]
  • Generated Text 2 (Y2): [A H B K C I D]

Both Y1 and Y2 have the same LCS length of 4 (A B C D), but Y1 should be considered better because it has consecutive matches.

ROUGE-W improves this by assigning higher weights to longer consecutive matches. Here, we need to calculate Weighted LCS. Let’s understand how WLCS is calculated:

For Y1:

Matches: A B C D (consecutive 4 matches)

Weighted LCS: f(4)

For Y2:

Matches: A B C D (not consecutive; each matching letter stands alone, hence 4 sub-parts of length 1 in this subsequence)

Weighted LCS: f(1)+f(1)+f(1)+f(1)

Where f() is the weight assignment function.

Note: Choose a function such that f(x+y)>f(x)+f(y)

So, if f(x)=x²,

Y1 Weighted LCS=4²=16

Y2 Weighted LCS=1²+1²+1²+1²=4

ROUGE-W Recall: f⁻¹(WLCS(X,Y)) / len(X)

ROUGE-W Precision: f⁻¹(WLCS(X,Y)) / len(Y)

Where f⁻¹ is the inverse of f(), i.e. √x here. So Y1 gets a ROUGE-W recall of √16/7 = 4/7 ≈ 0.57, while Y2 gets only √4/7 = 2/7 ≈ 0.29, rewarding the consecutive matches.
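A minimal sketch of the weighted-LCS dynamic program with f(x) = x², applied to the toy example above (the recall/precision lines follow the formulas just given):

```python
def wlcs(x, y, f):
    # Weighted LCS: a consecutive run of k matches contributes f(k)
    # instead of k, so longer runs are rewarded.
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # weighted LCS score so far
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of the current consecutive run
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0
    return c[m][n]

f = lambda k: k ** 2        # weighting function, satisfies f(x+y) > f(x) + f(y)
f_inv = lambda v: v ** 0.5  # its inverse

X = "A B C D E F G".split()
for Y in ("A B C D H I K".split(), "A H B K C I D".split()):
    score = wlcs(X, Y, f)
    recall = f_inv(score) / len(X)     # ROUGE-W recall
    precision = f_inv(score) / len(Y)  # ROUGE-W precision
    print(score, round(recall, 3), round(precision, 3))
# Y1: WLCS = 16 -> recall = precision = 4/7 ≈ 0.571
# Y2: WLCS = 4  -> recall = precision = 2/7 ≈ 0.286
```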

MRR

The easiest of the lot: the Mean Reciprocal Rank (MRR) metric checks how good the LLM is at putting the right answer near the top of a ranked list. Hence, it is a good metric for Classification or Retrieval tasks.

Consider a simple example:

  1. You ask a question: “What’s the capital of France?”
  2. LLM answers (ranked):

1st: London

2nd: Paris

3rd: Berlin

The correct answer, “Paris,” is in the 2nd position.

To calculate the Reciprocal Rank (RR) for this answer:

  • Reciprocal Rank = 1/Position of correct answer
  • So, RR = 1/2 = 0.5

If you do this for several questions, you calculate the RR for each one and then find the average. That average is the MRR. If the assistant often gets the right answer in the top spots, the MRR will be high, close to 1. If it rarely does, the MRR will be lower, closer to 0.
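A minimal sketch, assuming each query comes with a ranked list of candidate answers and a known correct answer (the second query is a made-up extra example):

```python
def mean_reciprocal_rank(ranked_results, correct_answers):
    reciprocal_ranks = []
    for ranking, correct in zip(ranked_results, correct_answers):
        if correct in ranking:
            reciprocal_ranks.append(1.0 / (ranking.index(correct) + 1))  # ranks are 1-based
        else:
            reciprocal_ranks.append(0.0)  # correct answer never retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

rankings = [
    ["London", "Paris", "Berlin"],  # correct answer "Paris" at rank 2 -> RR = 1/2
    ["Madrid", "Rome", "Lisbon"],   # correct answer "Lisbon" at rank 3 -> RR = 1/3
]
answers = ["Paris", "Lisbon"]

print(round(mean_reciprocal_rank(rankings, answers), 3))  # (0.5 + 0.333) / 2 ≈ 0.417
```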

BERTScore

BERTScore uses a pre-trained BERT model to evaluate the quality of generated text compared to a reference text. It measures the semantic similarity between the two texts using BERT embeddings.

1. Tokenization and Embedding Generation

Tokenization: Both the generated text and the reference text are tokenized into subword units (such as WordPieces in BERT) and converted into token IDs that the BERT model can understand.

BERT Embeddings: The token IDs of both texts are fed into a pre-trained BERT model. BERT converts each token into a high-dimensional vector (embedding) that captures its contextual meaning based on the surrounding tokens.

2. Cosine Similarity Calculation

Cosine Similarity: After obtaining the embeddings for the generated and reference texts, cosine similarity is calculated between each pair of embeddings (token-wise).

Similarity Matrix: A similarity matrix is created where each element represents the cosine similarity between the corresponding tokens from the generated and reference texts.

3. Best Matching Strategy

Best Matching: BERTScore uses a greedy algorithm to find the best matching between tokens in the generated text and tokens in the reference text. The goal is to maximize the overall similarity score.

Greedy Matching: Each token in the generated text is matched to its most similar token in the reference text (and each reference token to its most similar generated token for recall); the same reference token can be the best match for more than one generated token, as the example below shows.

4. Precision, Recall, and F1 Score Calculation

BERT-Precision: Average similarity score for each token in the generated text to the closest token in the reference text.

BERT-Recall: Average similarity score for each token in the reference text to the closest token in the generated text.

BERT-F1 Score: Harmonic mean of precision and recall.

Let’s illustrate this with an example:

Reference text: “The cat sat on the mat.”

Generated text: “A cat was sitting on the mat.”

STEP 1: Tokenization

  • Reference tokens: [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”]
  • Generated tokens: [“A”, “cat”, “was”, “sitting”, “on”, “the”, “mat”, “.”]

STEP 2: Generate embeddings for these tokens using the BERT model

STEP 3: Create a similarity matrix using cosine similarity between each pair of token embeddings (one row per generated token, one column per reference token)

STEP 4: Get best matching pairs from reference and generated text. Assume, you got

(“A”, “The”), (“cat”, “cat”), (“was”, “sat”), (“sitting”, “sat”), (“on”, “on”), (“the”, “the”), (“mat”, “mat”)

STEP 5: Calculating BERTScore:

  • BERT-Precision: Average similarity for tokens in generated text.
  • BERT-Recall: Average similarity for tokens in reference text.
  • BERT-F1: 2 * (Precision * Recall) / (Precision + Recall)
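A minimal sketch using the bert-score package (pip install bert-score); lang="en" picks the package’s default English model, and the exact numbers depend on which underlying model gets downloaded:

```python
# pip install bert-score
from bert_score import score

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]

# Returns per-sentence precision, recall and F1 as tensors
P, R, F1 = score(candidates, references, lang="en", verbose=False)

print(f"BERT-Precision: {P.mean().item():.3f}")
print(f"BERT-Recall:    {R.mean().item():.3f}")
print(f"BERT-F1:        {F1.mean().item():.3f}")
```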

With this, we will wrap up. There are many other metrics you can explore apart from the ones mentioned here. Hope to see you soon!
