Evaluating Language Competence of Llama 2-based models: Belebele Benchmark

Llama 2, Meta AI’s Belebele benchmark for NLU evaluation

10 min read · Oct 2, 2023

Cute llama taking a multiple-choice test (source: Image Creator from Microsoft Bing)

In a previous story, we discussed how to benchmark the language translation abilities of Large Language Models (LLMs) using the BLEU score. In this follow-up tutorial, we’ll explore a new dataset for evaluating language proficiency: Belebele, recently released by Meta AI.

The Belebele dataset covers 122 language variants, each with the same 900 multiple-choice questions and four answer options per question, making it a powerful tool for evaluating LLMs’ language competence. We’ll focus on how to leverage this benchmark for Llama 2-based models using Hugging Face’s Transformers library.

Load model

Before we dive into the benchmarking process, ensure you have the necessary dependencies installed and access to a Llama 2-based model.

Here’s how to get started and load Llama 2 7B:

import transformers
import torch
from datasets import load_dataset
from tqdm import tqdm

pipeline = transformers.pipeline(
    "text-generation",
    model="models/llama2-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Load dataset

To perform evaluations using the Belebele benchmark, we first need to load the dataset. The Hugging Face Datasets library simplifies this process:

ds = load_dataset(path="facebook/belebele", name="eng_Latn", split="test")

A typical entry in the dataset looks like this:

{
  "link": "https://en.wikibooks.org/wiki/Accordion/Right_hand",
  "question_number": 2,
  "flores_passage": "Make sure your hand is as relaxed as possible while 
    still hitting all the notes correctly - also try not to make much 
    extraneous motion with your fingers. This way, you will tire yourself 
    out as little as possible. Remember there's no need to hit the keys 
    with a lot of force for extra volume like on the piano. On the accordion, 
    to get extra volume, you use the bellows with more pressure or speed.",
  "question": "When playing the accordion, which of 
    the following will help to increase the volume?",
  "mc_answer1": "More speed",
  "mc_answer2": "More force",
  "mc_answer3": "Less pressure",
  "mc_answer4": "Less finger motion",
  "correct_answer_num": "1",
  "dialect": "eng_Latn",
  "ds": "2023-05-03"
}

The Belebele paper was rather short on explaining how exactly they prompted the LLMs:

Examples are sampled from the English training set and prompted to the model (following the template P: <passage> \n Q: <question> \n A: <mc answer 1> \n B: <mc answer 2> \n C: <mc answer 3> \n D: <mc answer 4> \n Answer: <Correct answer letter>)

The template above did not work for me: the models generated what looked like random choices. The following format is what eventually made it work:

{passage}
Question: {question}
Answer A: {mc_answer1}
Answer B: {mc_answer2}
Answer C: {mc_answer3}
Answer D: {mc_answer4}
Correct answer:
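
For a side-by-side comparison, both templates can be written as Python format strings. This is only a sketch: `PAPER_TEMPLATE` is a reading of the template quoted from the paper (the exact whitespace is an assumption), `ADAPTED_TEMPLATE` mirrors the format above, and `sample_row` is a shortened, hypothetical row that reuses the dataset's field names.

```python
# Template as quoted from the Belebele paper; exact whitespace is an assumption
PAPER_TEMPLATE = (
    "P: {flores_passage}\n"
    "Q: {question}\n"
    "A: {mc_answer1}\n"
    "B: {mc_answer2}\n"
    "C: {mc_answer3}\n"
    "D: {mc_answer4}\n"
    "Answer:"
)

# Format used in this tutorial
ADAPTED_TEMPLATE = (
    "{flores_passage}\n"
    "Question: {question}\n"
    "Answer A: {mc_answer1}\n"
    "Answer B: {mc_answer2}\n"
    "Answer C: {mc_answer3}\n"
    "Answer D: {mc_answer4}\n"
    "Correct answer:"
)

# Hypothetical, shortened row that uses the dataset's field names
sample_row = {
    "flores_passage": "On the accordion, to get extra volume, "
                      "you use the bellows with more pressure or speed.",
    "question": "When playing the accordion, which of the following "
                "will help to increase the volume?",
    "mc_answer1": "More speed",
    "mc_answer2": "More force",
    "mc_answer3": "Less pressure",
    "mc_answer4": "Less finger motion",
}

paper_prompt = PAPER_TEMPLATE.format(**sample_row)
adapted_prompt = ADAPTED_TEMPLATE.format(**sample_row)

print(paper_prompt)
print()
print(adapted_prompt)
```

Printing both prompts for the same row makes it easy to see what differs: the adapted format spells out "Question", "Answer A", and "Correct answer" instead of single-letter labels.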

We will use a few-shot prompting approach. A 5-shot prompt consists of five examples (including the correct answer) inserted before the actual question being asked. To achieve this, we format the first five rows of the dataset as examples:

# Select the first five rows of the dataset for example prompts
ds_examples=ds.select(range(0,5))
ds_prompts=ds.select(range(5,len(ds)))

prompt_template="""{flores_passage}
Question: {question}
Answer A: {mc_answer1}
Answer B: {mc_answer2}
Answer C: {mc_answer3}
Answer D: {mc_answer4}
Correct answer: {correct_answer}"""

# Prepare example prompts for 5-shot prompting
choices=["A","B","C","D"]
prompt_examples = "\n\n".join([ prompt_template.format(**d,correct_answer=choices[int(d["correct_answer_num"])-1]) for d in ds_examples])

prompt_examples now contains the first five rows of the dataset formatted according to our template:

Make sure your hand is as relaxed as possible while still hitting all the notes correctly - also try not to make much extraneous motion with your fingers. This way, you will tire yourself out as little as possible. Remember there's no need to hit the keys with a lot of force for extra volume like on the piano. On the accordion, to get extra volume, you use the bellows with more pressure or speed.
Question: According to the passage, what would not be considered an accurate tip for successfully playing the accordion?
Answer A: For additional volume, increase the force with which you hit the keys
Answer B: Keep unnecessary movement to a minimum in order to preserve your stamina
Answer C: Be mindful of hitting the notes while maintaining a relaxed hand
Answer D: Increase the speed with which you operate the bellows to achieve extra volume
Correct answer: A

Make sure your hand is as relaxed as possible while still hitting all the notes correctly - also try not to make much extraneous motion with your fingers. This way, you will tire yourself out as little as possible. Remember there's no need to hit the keys with a lot of force for extra volume like on the piano. On the accordion, to get extra volume, you use the bellows with more pressure or speed.
Question: When playing the accordion, which of the following will help to increase the volume?
Answer A: More speed
Answer B: More force
Answer C: Less pressure
Answer D: Less finger motion
Correct answer: A

One of the most common problems when trying to convert a movie to DVD format is the overscan. Most televisions are made in a way to please the general public. For that reason, everything you see on the TV had the borders cut, top, bottom and sides. This is made to ensure that the image covers the whole screen. That is called overscan. Unfortunately, when you make a DVD, it's borders will most likely be cut too, and if the video had subtitles too close to the bottom, they won't be fully shown.
Question: Why do the images on television have their borders cut?
Answer A: To allow for subtitles
Answer B: So the image fills the entire screen
Answer C: To allow for simple conversion into other formats
Answer D: To cut subtitles too close to the bottom
Correct answer: B

One of the most common problems when trying to convert a movie to DVD format is the overscan. Most televisions are made in a way to please the general public. For that reason, everything you see on the TV had the borders cut, top, bottom and sides. This is made to ensure that the image covers the whole screen. That is called overscan. Unfortunately, when you make a DVD, it's borders will most likely be cut too, and if the video had subtitles too close to the bottom, they won't be fully shown.
Question: According to the passage, which of the following problems might one encounter when converting a movie to DVD format?
Answer A: An image that doesn’t fill the entire screen
Answer B: Partially cut subtitles
Answer C: An image that fills the entire screen
Answer D: Cut borders
Correct answer: B

The American plan relied on launching coordinated attacks from three different directions. General John Cadwalder would launch a diversionary attack against the British garrison at Bordentown, in order to block off any reinforcements. General James Ewing would take 700 militia across the river at Trenton Ferry, seize the bridge over the Assunpink Creek and prevent any enemy troops from escaping. The main assault force of 2,400 men would cross the river nine miles north of Trenton, and then split into two groups, one under Greene and one under Sullivan, in order to launch a pre-dawn attack.
Question: Where was there a British garrison located?
Answer A: Assunpink Creek
Answer B: Trenton
Answer C: Bordentown
Answer D: Princeton
Correct answer: C
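
To experiment with a different number of shots, the same construction can be wrapped in a small function. The following is a sketch under assumptions: `format_row`, `build_k_shot_prompt`, and `toy_row` are hypothetical helpers, the toy rows are placeholders rather than dataset content, and rows are expected as a plain list of dicts (for example `list(ds_examples)`).

```python
CHOICES = ["A", "B", "C", "D"]

TEMPLATE = """{flores_passage}
Question: {question}
Answer A: {mc_answer1}
Answer B: {mc_answer2}
Answer C: {mc_answer3}
Answer D: {mc_answer4}
Correct answer: {correct_answer}"""


def format_row(row, with_answer):
    """Format one row; leave the answer slot empty for the question being asked."""
    answer = CHOICES[int(row["correct_answer_num"]) - 1] if with_answer else ""
    return TEMPLATE.format(**row, correct_answer=answer).strip()


def build_k_shot_prompt(example_rows, target_row, k=5):
    """Join k solved examples and the unsolved target question with blank lines."""
    shots = [format_row(row, with_answer=True) for row in example_rows[:k]]
    return "\n\n".join(shots + [format_row(target_row, with_answer=False)])


def toy_row(n, answer_num):
    """Hypothetical placeholder row with the dataset's field names."""
    return {
        "flores_passage": f"Passage {n}.",
        "question": f"Question {n}?",
        "mc_answer1": "w",
        "mc_answer2": "x",
        "mc_answer3": "y",
        "mc_answer4": "z",
        "correct_answer_num": str(answer_num),
    }


examples = [toy_row(i, answer_num=(i % 4) + 1) for i in range(5)]
prompt = build_k_shot_prompt(examples, toy_row(99, answer_num=2), k=2)
print(prompt)
```

With `k=2`, the prompt contains two solved examples followed by the target question, which ends in an open `Correct answer:` for the model to complete.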

Generate and parse choices

To evaluate the performance of a Llama 2-based model, generate a response for each prompt and parse the chosen answer from it:

# Parse the model response and extract the model's choice (1-4), or None if it cannot be parsed
def parse_choice(response):
    choices=["A","B","C","D"]

    # An empty response (e.g. the model emitted a newline first) has no choice to parse
    if not response:
        return None
    if len(response)==1:
        return choices.index(response[0]) + 1 if response[0] in choices else None
    elif response[0] in choices and not response[1].isalpha():
        return choices.index(response[0]) + 1
    else:
        return None

# sampling parameters: llama-precise
gen_config = {
    "temperature": 0.7,
    "top_p": 0.1,
    "repetition_penalty": 1.18,
    "top_k": 40,
    "do_sample": True,
    "max_new_tokens": 5,
    "pad_token_id": pipeline.tokenizer.eos_token_id,
}

# Loop through prompts and evaluate model responses
q_correct = q_total = 0
for rowNo, row in enumerate(tqdm(ds_prompts)):        
    # Construct the prompt by combining the example prompts and the current row's question
    prompt=(prompt_examples + "\n\n" + prompt_template.format(**row, correct_answer="")).strip()

    # Generate a response from the model
    response=pipeline(prompt, **gen_config)[0]["generated_text"][len(prompt):]
    if "\n" in response:
        response=response.split("\n")[0]

    # Parse the model's choice and compare it to the correct answer
    choice=parse_choice(response.strip())
    if choice==int(row["correct_answer_num"]):
        q_correct+=1 
    q_total+=1

print(f"{q_total} questions, {q_correct} correct ({round(q_correct/q_total*100,1)}%)")
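
When accuracy looks close to chance, it helps to know whether the model picked wrong letters or produced output that could not be parsed at all. The sketch below is a variation on the parsing and counting above, not the code used for the results in this article: `parse_choice_regex` and `score` are hypothetical helpers, the regular expression only approximates the rule "a letter A-D that is not followed by another letter", and the responses are made up for illustration.

```python
import re

# A-D at the start of the response, not followed by another ASCII letter
CHOICE_RE = re.compile(r"([ABCD])(?![A-Za-z])")


def parse_choice_regex(response):
    """Return 1-4 for a leading A-D answer letter, or None if nothing is parseable."""
    match = CHOICE_RE.match(response.strip())
    return "ABCD".index(match.group(1)) + 1 if match else None


def score(responses, correct_answer_nums):
    """Count correct, wrong, and unparseable responses separately."""
    counts = {"correct": 0, "wrong": 0, "unparsed": 0}
    for response, answer_num in zip(responses, correct_answer_nums):
        choice = parse_choice_regex(response)
        if choice is None:
            counts["unparsed"] += 1
        elif choice == int(answer_num):
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Made-up model outputs and answer keys, for illustration only
responses = ["A", "B.", "C: Bordentown", "Answer", "", " D\n"]
answers = ["1", "2", "4", "1", "3", "4"]

counts = score(responses, answers)
print(counts)
```

A high `unparsed` count points to a prompt format problem rather than a comprehension problem, which is worth checking before comparing models.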

The sampling parameters in gen_config are from the “llama-precise” preset in Oobabooga’s text-generation-webui and, like all the cool LLM stuff these days, originated somewhere in r/LocalLLaMA. Most importantly, I found these settings to generate consistent results with little variance.
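
If fully repeatable output matters more than matching the preset, greedy decoding is a common alternative. The following is a sketch, assuming the pipeline forwards `do_sample=False` to generation as current Transformers releases do; `make_greedy_config` is a hypothetical helper, and the effect of this setting on the scores in this article has not been measured.

```python
def make_greedy_config(pad_token_id=None, max_new_tokens=5):
    """Build generation keyword arguments for deterministic (greedy) decoding."""
    config = {
        "do_sample": False,  # always pick the most likely next token
        "max_new_tokens": max_new_tokens,
    }
    # Only set the pad token when the caller provides one
    if pad_token_id is not None:
        config["pad_token_id"] = pad_token_id
    return config


greedy_config = make_greedy_config()
print(greedy_config)
```

In the evaluation loop, `**make_greedy_config(pipeline.tokenizer.eos_token_id)` could be passed in place of `**gen_config`. The sampling-only settings (temperature, top_p, top_k) are left out because they have no effect without sampling.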

And that’s it already. Find the complete code on GitHub, including a faster version using batched inference.

Performance of Llama 2 base models

To make this a bit more interesting than one accuracy number per model, let’s look at some of the questions and at which models could answer them and which could not.
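
Grouping questions this way only requires knowing, for each model, which questions it answered correctly. The sketch below shows one way to do that with sets; `bucket_questions` is a hypothetical helper, and the question IDs are toy data, not the actual results.

```python
def bucket_questions(correct_by_model, all_ids):
    """Group question IDs by the exact set of models that answered them correctly."""
    buckets = {}
    for qid in sorted(all_ids):
        solvers = frozenset(
            model
            for model, correct_ids in correct_by_model.items()
            if qid in correct_ids
        )
        buckets.setdefault(solvers, []).append(qid)
    return buckets


# Toy data, not actual results: IDs of correctly answered questions per model
correct_by_model = {
    "7B": {0, 1},
    "13B": {0, 1, 2, 5},
    "70B": {0, 2, 3, 5},
}
buckets = bucket_questions(correct_by_model, all_ids=range(6))

easy = buckets.get(frozenset({"7B", "13B", "70B"}), [])
harder = buckets.get(frozenset({"13B", "70B"}), [])
hard = buckets.get(frozenset({"70B"}), [])
impossible = buckets.get(frozenset(), [])
print(easy, harder, hard, impossible)
```

Mixed patterns, such as a question that the 7B and 13B models get right but the 70B model misses, land in their own buckets. The four categories in this section add up to 820 questions, while 895 questions remain after setting aside the five examples, so the remaining 75 presumably fall into such mixed buckets.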

Easy questions — correctly answered by all llamas (395 questions)

Army ant colonies march and nest in different phases as well. In the nomadic phase, army ants march at night and stop to camp during the day. The colony begins a nomadic phase when available food has decreased. During this phase, the colony makes temporary nests that are changed everyday. Each of these nomadic rampages or marches lasts for approximately 17 days.
Question: According to the passage, what is true of an army ant colony entering a nomadic phase?
Answer A: They nest during the night
Answer B: They have a low supply of food
Answer C: They make nests that are changed after 17 days
Answer D: They march during the day

The correct answer is B: according to the passage, the colony begins a nomadic phase when available food has decreased.

A bit harder — Questions only mastered by 13B and 70B models (249 questions)

Sample 1:

After its adoption by Congress on July 4, a handwritten draft signed by the President of Congress John Hancock and the Secretary Charles Thomson was then sent a few blocks away to the printing shop of John Dunlap. Through the night between 150 and 200 copies were made, now known as “Dunlap broadsides”. The first public reading of the document was by John Nixon in the yard of Independence Hall on July 8. One was sent to George Washington on July 6, who had it read to his troops in New York on July 9. A copy reached London on August 10. The 25 Dunlap broadsides still known to exist are the oldest surviving copies of the document. The original handwritten copy has not survived.
Question: Whose signature appeared on the handwritten draft?
Answer A: John Dunlap
Answer B: George Washington
Answer C: John Nixon
Answer D: Charles Thomson

Sample 2:

The Colonists, seeing this activity, had also called for reinforcements. Troops reinforcing the forward positions included the 1st and 3rd New Hampshire regiments of 200 men, under Colonels John Stark and James Reed (both later became generals). Stark’s men took positions along the fence on the north end of the Colonist’s position. When low tide opened a gap along the Mystic River along the northeast of the peninsula, they quickly extended the fence with a short stone wall to the north ending at the water’s edge on a small beach. Gridley or Stark placed a stake about 100 feet (30 m) in front of the fence and ordered that no one fire until the regulars passed it.
Question: According to the passage, when did Stark’s men extend their fence?
Answer A: While the Colonists called for reinforcements
Answer B: After the regulars passed the stake
Answer C: During low tide
Answer D: While troops assumed forward positions

Hard questions — only 70B model succeeded (134 questions)

Sample 1:

Virtually all computers in use today are based on the manipulation of information which is coded in the form of binary numbers. A binary number can have only one of two values, i.e. 0 or 1, and these numbers are referred to as binary digits — or bits, to use computer jargon.
Question: According to the passage, which of the following is an example of a five bit binary number?
Answer A: 1010
Answer B: 12001
Answer C: 10010
Answer D: 110101

Sample 2:

Asynchronous communication encourages time for reflection and reaction to others. It allows students the ability to work at their own pace and control the pace of instructional information. In addition, there are fewer time restrictions with the possibility of flexible working hours. (Bremer, 1998) The use of the Internet and the World Wide Web allows learners to have access to information at all times. Students can also submit questions to instructors at any time of day and expect reasonably quick responses, rather than waiting until the next face-to-face meeting.
Question: Which of the following is not a benefit of asynchronous communication for students?
Answer A: The use of internet as a resource
Answer B: Face-to-face access to instructors at any time of day
Answer C: Flexible working hours
Answer D: Pace control

Llama-impossible — all models failed (42 questions)

Sample 1:

Unless you are a diplomat, working overseas generally means that you will have to file income tax in the country you are based in. Income tax is structured differently in different countries, and the tax rates and brackets vary widely from one country to another. In some federal countries, such as the United States and Canada, income tax is levied both at the federal level and at the local level, so the rates and brackets can vary from region to region.
Question: What is likely to remain consistent about income tax across various countries?
Answer A: Rates
Answer B: Structure
Answer C: Where you file
Answer D: Brackets

Sample 2:

There are many different film formats that have been used over the years. Standard 35 mm film (36 by 24 mm negative) is much the commonest. It can usually be replenished fairly easily if you run out, and gives resolution roughly comparable to a current DSLR. Some medium-format film cameras use a 6 by 6 cm format, more precisely a 56 by 56 mm negative. This gives resolution almost four times that of a 35 mm negative (3136 mm2 versus 864).
Question: According to the passage, which negative size reflects the film format used most commonly?
Answer A: 6 x 6 cm negative
Answer B: 56 x 56 mm negative
Answer C: 35 mm negative
Answer D: 36 x 24 mm negative

Summary

In this article, we have explored how to use the Belebele benchmark with Hugging Face’s Transformers library. Belebele serves as an invaluable resource for evaluating language models across multilingual and cross-lingual NLU tasks. By following the steps outlined in this guide, you can harness the power of Belebele to assess your language models’ text comprehension capabilities.

If you have any feedback, additional ideas, or questions, feel free to leave a comment here or reach out on Twitter. Happy benchmarking!

Written by Geronimo