Evaluating Language Competence of Llama 2-based models: The BLEU Score

Llama 2, machine translation, sacreBLEU

Geronimo

Introduction

In the era of training and fine-tuning Large Language Models (LLMs) at home, evaluating their language capabilities is crucial. The choice of benchmarking method depends on the specific use case. This article shows how to benchmark the language capabilities of a Llama 2-based LLM using the BLEU score, with a focus on English-to-German translations.

Understanding BLEU

BLEU, which stands for Bilingual Evaluation Understudy, is a metric that describes the resemblance between two texts. We use BLEU to measure how close the translations generated by Llama LLMs are to a reference sequence, typically a human-generated translation. BLEU evaluates textual similarity based on overlapping n-grams between the candidate and the reference.

In simpler terms, think of BLEU as a translation quality assessment tool. It quantifies the agreement between the machine-generated translation and the provided reference. The higher the BLEU score, the better the translation, with a perfect score of 1.0 indicating a flawless match.

Here, we use SacreBLEU which “provides hassle-free computation of shareable, comparable, and reproducible BLEU scores” using the HF evaluate library.
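To get a feel for the metric, here is a minimal example of a sacreBLEU computation with the evaluate library; the German sentences are invented purely for illustration:

# Toy sacreBLEU example: an identical sentence scores 100, a paraphrase scores lower.
import evaluate

sacrebleu = evaluate.load("sacrebleu")

reference = [["Der Hund schläft auf dem Sofa."]]  # one list of references per prediction

print(sacrebleu.compute(predictions=["Der Hund schläft auf dem Sofa."], references=reference)["score"])  # 100.0
print(sacrebleu.compute(predictions=["Der Hund liegt auf der Couch."], references=reference)["score"])   # much lower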

Computing BLEU Score for Llama2-based Models

Let’s go through this step by step, we will:

  • load the dataset and generate prompts for English-to-German translation (5-shot)
  • prompt Llama to translate the input
  • score the output against a reference

To kickstart the process, you must load the Llama2-based model through a Hugging Face pipeline. Here’s how you can do it:

import transformers, evaluate, torch
from datasets import load_dataset
from tqdm import tqdm

pipeline = transformers.pipeline(
    "text-generation",
    model="models/llama2-7b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Load dataset

Following model setup, you need to load the dataset for evaluation. In our example, we employ the WMT20 MLQE Task 1 dataset for English to German translation:

ds = load_dataset(path="wmt20_mlqe_task1", name="en-de", split="test")
ds = ds["translation"]

ds_examples = ds[0:5]
ds_predict = ds[5:]

prompt_template = "English: {en}\nGerman: {de}"
prompt_examples = "\n\n".join([prompt_template.format(**row) for row in ds_examples])

The “test” split contains 1,000 pairs of English/German translations. There is a lot of other data in there; for our purpose, we only need the translation feature.

The first row in the dataset looks like this:

{
"de": "Der Sultan ernennt Richter und kann Begnadigungen und Pendelstrafen gew\u00e4hren.",
"en": "The Sultan appoints judges, and can grant pardons and commute sentences."
}

In the code above, we set the first 5 entries aside to generate the 5-shot prompt (ds_examples), which is then used to prompt Llama for the translation of the remaining 995 English sentences (ds_predict).

The five examples prefixed to each prompt will therefore be:

English: The Sultan appoints judges, and can grant pardons and commute sentences.
German: Der Sultan ernennt Richter und kann Begnadigungen und Pendelstrafen gewähren.

English: Antisemitism in modern Ukraine Antisemitism and Special Relativity
German: Antisemitismus in der modernen Ukraine Antisemitismus und besondere Relativität

English: Morales continued his feud with Buddy Rose, defeating him by disqualification.
German: Morales setzte seine Fehde mit Buddy Rose fort und besiegte ihn durch Disqualifikation.

English: American Maury Tripp attended the Jamboree from Saratoga, California.
German: Der Amerikaner Maury Tripp besuchte das Jamboree aus Saratoga, Kalifornien.

English: He bowled a series of bouncers at Viv Richards at Brisbane and claimed 3/77 and 5/92 in the Third Test at Melbourne.
German: Er boomte eine Reihe von Bouncern bei Viv Richards in Brisbane und behauptete 3 / 77 und 5 / 92 im dritten Test in Melbourne.

In case you wonder “why five?”: in a small experiment, Llama needed a few examples, and with this translation task and the 7B model, performance saturated at five provided examples.
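If you want to reproduce this for yourself, here is a rough sketch of such an experiment. It assumes the pipeline, prompt_template, and the gen_config sampling parameters introduced in the next section, and scores only a small subset of the data to keep runtime manageable:

# Rough sketch: vary the number of few-shot examples and compare BLEU scores.
# Assumes pipeline, ds, prompt_template (defined above) and gen_config (defined below).
bleu = evaluate.load("sacrebleu")

def bleu_for_n_shots(n_shots, n_eval=100):
    examples = "\n\n".join(prompt_template.format(**row) for row in ds[0:n_shots])
    preds, refs = [], []
    # evaluation rows start at index 10 so they never overlap the example pool
    for row in ds[10:10 + n_eval]:
        prompt = examples + "\n\n" + prompt_template.format(en=row["en"], de="")[:-1]
        out = pipeline(prompt, **gen_config)[0]["generated_text"][len(prompt)+1:]
        preds.append(out.split("\n")[0])
        refs.append([row["de"]])
    return bleu.compute(predictions=preds, references=refs)["score"]

for n in [1, 2, 3, 5, 8]:
    print(f"{n}-shot:", bleu_for_n_shots(n))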

Sampling parameters

Now, we are all set to prompt Llama2 to generate English to German translations using the following sampling parameters:

gen_config = {
    "temperature": 0.7,
    "top_p": 0.1,
    "repetition_penalty": 1.18,
    "top_k": 40,
    "do_sample": True,
    "max_new_tokens": 100,
    "pad_token_id": pipeline.tokenizer.eos_token_id,
}

gen_config specifies the sampling strategy for text generation using a language model. Here’s a breakdown of the key parameters:

  • Temperature controls the randomness of the generated text. A higher temperature (e.g., 0.7) increases randomness, leading to more diverse but potentially less coherent text. Lower values (e.g., 0.2) make the generation more deterministic and focused.
  • Top-p Sampling (Nucleus Sampling): top_p (or nucleus) sampling is a probabilistic strategy that selects the most likely tokens while avoiding overly repetitive or predictable text. A value of 0.1 means that, at each step, only the smallest set of tokens whose cumulative probability reaches 10% is considered.
  • Repetition Penalty: repetition_penalty discourages the model from repeating the same token within a short span of text. A value of 1.18 increases the penalty for repetition, making the model less likely to produce repetitive sequences.
  • Top-k Sampling: top_k sampling selects the top k most likely tokens at each step. Setting it to 40 means the model will consider the top 40 tokens based on their probabilities.
  • Do Sample: do_sample set to True indicates that the model should use sampling rather than greedy decoding. Sampling involves selecting tokens based on their probabilities, as opposed to always selecting the most likely token.
  • max_new_tokens sets a limit on the number of tokens that can be generated in a single sequence. In this case, it’s set to 100, which means that the generated text won’t exceed 100 tokens, which is enough for the short samples in this dataset.

The specific sampling parameters above are from the “llama-precise” preset in Oobabooga’s text-generation-webui and, like all the cool LLM stuff these days, originated somewhere in LocalLLaMa. Most importantly, I found these settings to generate consistent translations with little variance.
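Before running the full evaluation loop, it can help to sanity-check these settings on a single prompt. A minimal example, reusing prompt_examples and gen_config from above (the English sentence is made up):

# Quick sanity check of the sampling parameters on one invented sentence.
test_sentence = "The weather in Berlin was sunny and warm yesterday."  # made-up example
prompt = prompt_examples + "\n\n" + "English: " + test_sentence + "\nGerman:"
output = pipeline(prompt, **gen_config)[0]["generated_text"]
print(output[len(prompt):].strip().split("\n")[0])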

Predict ..

Proceed to iterate through the dataset to generate predictions and gather references:

predictions = []
references = []
for row in tqdm(ds_predict):
    prompt = prompt_examples + "\n\n" + prompt_template.format(en=row["en"], de="")[:-1]
    prediction = pipeline(prompt, **gen_config)[0]["generated_text"][len(prompt)+1:]

    # keep only the first generated line; the model tends to continue with further examples
    if "\n" in prediction:
        prediction = prediction.split("\n")[0]
    predictions.append(prediction)
    references.append([row["de"]])

.. and score

Finally, evaluate the generated translations utilizing the sacreBLEU metric from the HF evaluate library:

sacrebleu = evaluate.load("sacrebleu")
sacrebleu_results = sacrebleu.compute(predictions=predictions, references=references)

print(sacrebleu_results["score"])

This code is kept very simple for illustration purposes, but it is also inefficient; you can find a version using batched inference here (roughly 10x faster).
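For reference, a rough sketch of what batched inference with the pipeline API can look like is shown below; it is not the exact code behind the link above, the batch_size of 8 is an assumed value you would tune to your GPU, and the actual speedup depends on hardware:

# Rough sketch of batched inference with the same pipeline (not the exact
# code behind the link above). Decoder-only models need left padding and a
# pad token when batching.
pipeline.tokenizer.pad_token_id = pipeline.tokenizer.eos_token_id
pipeline.tokenizer.padding_side = "left"

prompts = [
    prompt_examples + "\n\n" + prompt_template.format(en=row["en"], de="")[:-1]
    for row in ds_predict
]

predictions, references = [], []
outputs = pipeline(prompts, batch_size=8, return_full_text=False, **gen_config)
for out, row in zip(outputs, ds_predict):
    predictions.append(out[0]["generated_text"].strip().split("\n")[0])
    references.append([row["de"]])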

Performance of the Llama2 base models

Results (see linked code): 995 English-to-German translations, 5-shot prompts; the 70B model was loaded in 4-bit due to VRAM constraints.

First, please note that sacreBLEU reports scores between 0 and 100, where 0 means no resemblance and 100 means identical sentences.

What do these numbers mean? Is 34.2 good, is 27.2 bad? BLEU scores are highly dependent on the text being evaluated; the absolute numbers should not be interpreted independently of context. There is, however, a rough guideline for BLEU score interpretation:

  • < 10: Almost useless
  • 10–19: Hard to get the gist
  • 20–29: The gist is clear, but has significant grammatical errors
  • 30–40: Understandable to good translations
  • 40–50: High quality translations
  • 50–60: Very high quality, adequate, and fluent translations
  • > 60: Quality often better than human

Conclusion

By following these steps, you can effectively benchmark the language competence of Llama2-based models across diverse fine-tuning scenarios. The BLEU score offers a robust methodology for evaluating translation quality and appraising LLMs’ linguistic capabilities.

For access to the code presented here, along with a version that incorporates batched inference, please refer to my GitHub repository.

If you have any feedback, additional ideas, or questions, feel free to leave a comment here or reach out on Twitter.
