LLM — Evaluation

This article is a practical guide to measuring model performance and making sure that a model gives good results after fine-tuning.

Pelin Balci
5 min read · Aug 1, 2023

Evaluation is simpler when we work with numeric values in machine-learning tasks: we can easily calculate accuracy, precision, recall, or mean squared error. However, measuring the performance of generated text is more challenging.

Let’s look at these two sentences:

original: I do love to go to school. 
prediction: I do not love to go to school.

There is only a one-word difference, but the meaning completely changes. Yet if we calculate recall or precision by simply counting overlapping words, we get a high score.

Remember that while fine-tuning we split the dataset into train, validation, and test sets. We will make these calculations on the test set.

Terminology

Let’s look at the terminology we will use for the calculations. It is not so hard:

  • unigram: one word
  • bigram: two words
  • n-gram: a group of n words

Here are the two calculation methodologies we will talk about today: Rouge and Bleu metrics.

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. [Wikipedia]

Rouge is case insensitive. It is used for text summarization. It can compare a generated summary to one or more reference summaries. And don’t forget: the order of the words does not matter for the Rouge score. 🎈

Rouge-1 Recall = unigram matches / unigrams in reference

Rouge-1 Precision = unigram matches / unigrams in output

Rouge-1 F1 = 2 x (precision x recall) / (precision + recall)

Let’s apply these formulas to the two sentences from the beginning: the reference "I do love to go to school" and the prediction "I do not love to go to school".

Although the meaning completely changes, the scores stay almost as high as they would be for a perfect prediction: every reference word appears in the prediction, so recall is 1.0 and precision is only slightly below 1.0.
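
Here is a minimal sketch of that unigram calculation (my own illustration, using Python's Counter for the clipped matches):

from collections import Counter

reference = "I do love to go to school".lower().split()
prediction = "I do not love to go to school".lower().split()

# Clipped unigram matches: each word counts at most as often as it appears
# in the reference.
matches = sum((Counter(reference) & Counter(prediction)).values())  # 7

recall = matches / len(reference)      # 7 / 7 = 1.0
precision = matches / len(prediction)  # 7 / 8 = 0.875
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.93
print(precision, recall, f1)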

Let’s calculate the scores with bigrams this time: the formulas are the same, but the matches are counted over two-word sequences (see the sketch below).
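
A quick bigram version of the same sketch (again my own illustration, reusing the sentences above):

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "I do love to go to school".lower().split()
prediction = "I do not love to go to school".lower().split()

ref_bigrams = Counter(ngrams(reference, 2))    # 6 bigrams
pred_bigrams = Counter(ngrams(prediction, 2))  # 7 bigrams
matches = sum((ref_bigrams & pred_bigrams).values())  # 5 shared bigrams

recall = matches / sum(ref_bigrams.values())      # 5 / 6 ≈ 0.83
precision = matches / sum(pred_bigrams.values())  # 5 / 7 ≈ 0.71
print(precision, recall)

The bigram scores drop more than the unigram ones because the inserted "not" breaks the "do love" bigram.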

Another common way to count the matches is the Longest Common Subsequence (LCS); the metrics are then called Rouge-L scores. You can read more about LCS on Wikipedia. Here is a small example:

In this example, the longest common subsequence between AGCAT and GAC is 2.
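
A short dynamic-programming sketch (my own, for illustration) confirms this:

def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

print(lcs_length("AGCAT", "GAC"))  # 2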

Is there any library that we can use?

Yes!🎉

Google’s rouge-score package provides these metrics:

!pip install rouge-score

from rouge_score import rouge_scorer

# 'rouge1', 'rouge2', and 'rougeL' are computed in one pass; stemming makes
# the matching a bit more forgiving (e.g. "running" vs "run").
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# The first argument is the reference (target), the second is the prediction.
scores = scorer.score('I love you', 'I love you so much')

print(scores)
{'rouge1': Score(precision=0.6, recall=1.0, fmeasure=0.7499999999999999), 
'rouge2': Score(precision=0.5, recall=1.0, fmeasure=0.6666666666666666),
'rougeL': Score(precision=0.6, recall=1.0, fmeasure=0.7499999999999999)}
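
A quick sanity check on the rouge1 numbers: the prediction "I love you so much" has 5 unigrams, 3 of which ("I", "love", "you") also appear in the reference "I love you", so precision = 3/5 = 0.6 and recall = 3/3 = 1.0.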

BLEU SCORE

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. [Wikipedia]

Bleu is used for text translation; it compares the model’s output to one or more human-generated reference translations.

Bleu = brevity penalty x geometric mean of the modified n-gram precisions across a range of n-gram sizes (typically 1 to 4)

We can use HuggingFace’s evaluation library:

!pip install evaluate

import evaluate
bleu = evaluate.load("bleu")

# Each prediction is scored against a list of one or more reference translations.
predictions = ["I am happy that I am drinking tea", "foo bar foobar"]
references = [
    ["I am very happy to say that I am drinking a warm cup of tea"],
    ["foo bar foobar"]
]

results = bleu.compute(predictions=predictions, references=references)
print(results)
{'bleu': 0.25874458643907805, 
'precisions': [1.0, 0.6666666666666666, 0.42857142857142855, 0.2],
'brevity_penalty': 0.5292133415000504,
'length_ratio': 0.6111111111111112,
'translation_length': 11,
'reference_length': 18}

Let’s read ChatGPT’s interpretation of these results ✨

1. BLEU Score: 0.2587 The overall BLEU score for the translations is approximately 0.2587. As mentioned before, the BLEU score ranges from 0 to 1, and a score closer to 1 indicates a better translation quality. In this case, the score is relatively low, suggesting that the candidate translations are quite different from the reference translations.

2. Precisions:

- 1-gram precision: 1.0

- 2-gram precision: 0.6667

- 3-gram precision: 0.4286

- 4-gram precision: 0.2

These values represent the modified n-gram precision scores for each n-gram order (1 to 4). A higher precision indicates better overlap between the candidate and reference translations for that particular n-gram order. In this case, the 1-gram precision is perfect (1.0), but it decreases for higher n-gram orders.

3. Brevity Penalty: 0.5292 The brevity penalty is a factor that penalizes translations that are shorter than their references; without it, n-gram precision alone would favor very short outputs. A brevity penalty closer to 1 means that the candidate and reference translations have similar lengths. In this case, the penalty is 0.5292, indicating that the candidate translations are shorter than the reference translations.

4. Length Ratio: 0.6111 The length ratio is the ratio of the total length of the candidate translations to the total length of the reference translations. In this case, the candidate translations are approximately 61.11% of the length of the reference translations.

5. Translation Length: 11 The total length of the candidate translations is 11 words.

6. Reference Length: 18 The total length of the reference translations is 18 words.
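
(A quick check on the lengths: the two predictions contain 8 + 3 = 11 words, and the two references contain 15 + 3 = 18 words, which gives the length ratio 11 / 18 ≈ 0.6111.)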

The brevity penalty can be calculated from the lengths of the reference and predicted sentences:

import numpy as np

reference_length = 15
translation_length = 3

# The penalty is 1 when the translation is at least as long as the reference;
# otherwise it is exp(1 - reference_length / translation_length).
brevity_penalty = min(1, np.exp(1 - (reference_length / translation_length)))
# brevity_penalty : 0.018
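
Putting the pieces together, the reported Bleu value can be reproduced from the printed precisions and the brevity penalty: Bleu is the brevity penalty times the geometric mean of the modified n-gram precisions. Here is a small sanity check (my own sketch, not part of the linked notebook):

import numpy as np

# Values copied from the bleu.compute() output above.
precisions = [1.0, 0.6666666666666666, 0.42857142857142855, 0.2]
translation_length = 11
reference_length = 18

brevity_penalty = min(1, np.exp(1 - reference_length / translation_length))
bleu = brevity_penalty * np.exp(np.mean(np.log(precisions)))

print(round(brevity_penalty, 4))  # 0.5292
print(round(bleu, 4))             # 0.2587, matches the score reported above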

You can reach the whole code and many other examples from: https://github.com/pelinbalci/LLM_Notebooks/blob/main/LLM_Evaluation.ipynb

Happy learning!😍💕
