Understanding BLEU and ROUGE score for NLP evaluation

Sthanikam Santhosh
Apr 16, 2023


As natural language processing (NLP) continues to advance, the need for evaluating NLP models becomes increasingly important.

NLP evaluation metrics allow researchers and practitioners to assess the performance of NLP models objectively and compare them to make informed decisions.

Two commonly used metrics in the field of NLP evaluation are the BLEU and ROUGE scores. In this blog post, we will take a deep dive into these two metrics and understand their significance in evaluating NLP models.

BLEU (Bilingual Evaluation Understudy) Score:

BLEU score is a widely used metric for machine translation tasks, where the goal is to automatically translate text from one language to another. It was proposed as a way to assess the quality of machine-generated translations by comparing them to a set of reference translations provided by human translators.

How does BLEU score work?

BLEU score measures the similarity between the machine-translated text and the reference translations using n-grams, which are contiguous sequences of n words. The most common n-grams used are unigrams (single words), bigrams (two-word sequences), trigrams (three-word sequences), and so on.

BLEU score calculates the precision of n-grams in the machine-generated translation by comparing them to the reference translations. The precision is then modified by a brevity penalty to account for translations that are shorter than the reference translations.

The formula for BLEU score is as follows:

BLEU = BP * exp(∑ wn * log(pn))

Where:

  • BP (Brevity Penalty) is a penalty term that adjusts the score for translations that are shorter than the reference translations. With c the length (in words) of the machine-generated translation and r the effective reference length, BP = 1 if c > r, and BP = exp(1 - r/c) otherwise.
  • pn is the modified (clipped) precision of n-grams: the number of n-grams in the machine-generated translation that also appear in the reference translations (each n-gram counted at most as often as it occurs in any single reference), divided by the total number of n-grams in the machine-generated translation.
  • wn is the weight given to each n-gram order; the weights are typically uniform, wn = 1/N, with N usually set to 4.

BLEU score ranges from 0 to 1, with higher values indicating better translation quality. A perfect translation would have a BLEU score of 1, while a completely incorrect translation would have a BLEU score of 0.
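To make the formula concrete, here is a minimal, self-contained sketch of the calculation for a single candidate and a single reference, using uniform n-gram weights and no smoothing. The function and variable names are illustrative only and do not come from any library; production BLEU implementations (for example sacrebleu or Hugging Face evaluate) handle multiple references, tokenization, and smoothing more carefully.

from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    # Illustrative single-reference BLEU; not a drop-in replacement for a library.
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        # Clipped (modified) precision: each candidate n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision collapses the score
    # Geometric mean of the n-gram precisions with uniform weights wn = 1/max_n.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if the candidate is at least as long as the reference,
    # exp(1 - r/c) otherwise.
    c, r = len(cand_tokens), len(ref_tokens)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * geo_mean

print(simple_bleu("hello there general kenobi", "hello there general kenobi"))  # 1.0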

Significance of BLEU score:

BLEU score is widely used in machine translation tasks as it provides a simple and effective way to assess the quality of machine-generated translations compared to reference translations. It is easy to calculate and interpret, making it a popular choice for evaluating machine translation models. However, it has some limitations. BLEU score heavily relies on n-grams and may not capture the overall meaning or fluency of the translated text accurately. It may also penalize translations that are longer than the reference translations, which can be unfair in some cases.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score:

ROUGE score is a set of metrics commonly used for text summarization tasks, where the goal is to automatically generate a concise summary of a longer text. ROUGE was designed to evaluate the quality of machine-generated summaries by comparing them to reference summaries provided by humans.

How does ROUGE score work?

ROUGE score measures the similarity between the machine-generated summary and the reference summaries using overlapping n-grams, word sequences that appear in both the machine-generated summary and the reference summaries. The most common n-grams used are unigrams, bigrams, and trigrams. ROUGE score calculates the recall of n-grams in the machine-generated summary by comparing them to the reference summaries.

The basic formula for ROUGE-N recall is as follows:

ROUGE-N (recall) = (number of overlapping n-grams) / (total number of n-grams in the reference summaries)

Where:

  • Overlapping n-grams are the n-grams that appear in both the machine-generated summary and the reference summaries. Because the denominator is the number of n-grams in the references, this is a recall-oriented measure; in practice the corresponding precision (dividing by the candidate's n-gram count) and F1-score are reported as well.

ROUGE score ranges from 0 to 1, with higher values indicating better summary quality. Like BLEU score, a perfect summary would have a ROUGE score of 1, while a completely incorrect summary would have a ROUGE score of 0.
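As a concrete illustration of the recall calculation, here is a minimal sketch of ROUGE-N recall for a single candidate and a single reference. The names are illustrative and this is not how a library such as rouge_score implements it; real implementations also apply tokenization, optional stemming, and report precision and F1 as well.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Each reference n-gram is matched at most as many times as it occurs in the candidate.
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total_ref = sum(ref_counts.values())
    return overlap / total_ref if total_ref else 0.0

# 5 of the 6 reference unigrams appear in the candidate -> recall ≈ 0.83
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=1))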

ROUGE comes in several variants, the most common being ROUGE-N, ROUGE-L, and ROUGE-S.

ROUGE-N: ROUGE-N measures the overlap of n-grams (contiguous sequences of n words) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the n-gram overlap. For example, ROUGE-1 (unigram) measures the overlap of single words, ROUGE-2 (bigram) measures the overlap of two-word sequences, and so on. ROUGE-N is often used to evaluate the grammatical correctness and fluency of generated text.

ROUGE-L: ROUGE-L measures the longest common subsequence (LCS) between the candidate text and the reference text. It computes the precision, recall, and F1-score based on the length of the LCS. ROUGE-L is often used to evaluate the semantic similarity and content coverage of generated text, as it rewards the longest sequence of words that appears in both texts in the same order, without requiring the words to be contiguous or to form exact n-gram matches.

ROUGE-S: ROUGE-S measures the skip-bigram overlap between the candidate text and the reference text, where a skip-bigram is any ordered pair of words from a sentence, allowing gaps between them (implementations usually cap the gap with a maximum skip distance). It computes the precision, recall, and F1-score based on the skip-bigram overlap. ROUGE-S is often used to evaluate the coherence of generated text, since it rewards pairs of words that occur in the same order in both texts even when they are not adjacent.

In summary, ROUGE-N measures the overlap of n-grams, ROUGE-L measures the longest common subsequence, and ROUGE-S measures the skip-bigram overlap between the candidate and reference text.
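For ROUGE-L specifically, the key ingredient is the longest common subsequence. A minimal sketch (illustrative names, single candidate and single reference, whitespace tokenization) might look like this:

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# LCS is "the cat on the mat" (length 5), so precision = recall = F1 ≈ 0.83
print(rouge_l("the cat sat on the mat", "the cat is on the mat"))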

Significance of ROUGE score:

ROUGE score is widely used in text summarization tasks as it provides a way to objectively assess the quality of machine-generated summaries compared to reference summaries. It takes into account the overlap of n-grams, which helps in capturing the important content of the summary. ROUGE score is also flexible as it allows the use of different n-gram lengths based on the task requirements. However, similar to BLEU score, ROUGE score also has limitations. It may not fully capture the semantic meaning or coherence of the summary, and it relies solely on the n-gram overlap, which may not always be an accurate measure of summary quality.

Let’s explore how we can utilize the Hugging Face evaluate library to evaluate the quality of generated text using popular metrics such as BLEU and ROUGE.

Command to install the evaluate library:

pip install evaluate

Code for calculating BLEU score:

import evaluate

# Define the candidate predictions and reference sentences
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"], ["foo bar foobar"]]

# Load the BLEU evaluation metric
bleu = evaluate.load("bleu")

# Compute the BLEU score
results = bleu.compute(predictions=predictions, references=references)

# Print the results
print(results)

In the above code,

  • predictions: This is a list of candidate predictions or generated sentences that you want to evaluate.
  • references: This is a list of reference sets; each element is itself a list of one or more ground-truth (gold standard) sentences for the corresponding candidate prediction.
  • bleu = evaluate.load("bleu"): This loads the BLEU evaluation metric from the evaluate library. You can specify the desired metric, such as "bleu", "rouge", depending on your evaluation needs.
  • results = bleu.compute(predictions=predictions, references=references): This computes the BLEU score using the compute function of the loaded BLEU metric. It takes the candidate predictions and reference sentences as inputs and calculates the BLEU score.

Results:

{'bleu': 1.0,
'precisions': [1.0, 1.0, 1.0, 1.0],
'brevity_penalty': 1.0,
'length_ratio': 1.1666666666666667,
'translation_length': 7,
'reference_length': 6}
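The compute call also accepts a few optional arguments. In the versions of the evaluate BLEU metric I have used, max_order sets the largest n-gram order (default 4) and smooth enables smoothing so that short outputs with no higher-order matches do not collapse to 0; check the metric card for your installed version before relying on these names.

# Optional arguments (verify against your installed evaluate version):
#   max_order - largest n-gram order used in the score (default 4)
#   smooth    - apply smoothing for short sentences
results = bleu.compute(
    predictions=predictions,
    references=references,
    max_order=2,
    smooth=True,
)
print(results)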

Code for calculating ROUGE score:

import evaluate

# Load the ROUGE evaluation metric
rouge = evaluate.load('rouge')

# Define the candidate predictions and reference sentences
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]

# Compute the ROUGE score
results = rouge.compute(predictions=predictions, references=references)

# Print the results
print(results)

Results:

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

The ‘sum’ in ROUGE-Lsum refers to summary-level scoring: the text is split into sentences and an LCS-based score is aggregated across those sentences, whereas plain ROUGE-L treats each text as a single sequence and computes one LCS over it.
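To see the difference in practice, separate sentences with newline characters: in the Hugging Face implementation (which wraps Google's rouge_score package), ROUGE-Lsum splits on "\n" before scoring. A small hedged example, assuming the default compute signature:

import evaluate

rouge = evaluate.load("rouge")

# Multi-sentence texts: "\n" marks sentence boundaries for ROUGE-Lsum,
# while ROUGE-L computes one LCS over the whole text.
predictions = ["the cat sat on the mat .\nit was a sunny day ."]
references = ["the cat is on the mat .\nthe day was sunny ."]

results = rouge.compute(predictions=predictions, references=references)
print(results)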

Conclusion:

In the field of NLP evaluation, BLEU and ROUGE are commonly used metrics for assessing the quality of machine-generated translations and summaries, respectively: BLEU is primarily used for machine translation, ROUGE for text summarization. Both rely on n-gram overlap with reference texts, BLEU being precision-oriented (with a brevity penalty) and ROUGE recall-oriented. They give researchers and practitioners a simple, reproducible, quantitative way to compare models, but they have limitations in capturing the overall meaning, fluency, and coherence of the output.

It is therefore important to keep the specific requirements of the task, and the limitations of these metrics, in mind when using BLEU and ROUGE for NLP evaluation.
