BLEU: Explained

Kiran Kumar
5 min read · Apr 24, 2024


BLEU (Bilingual Evaluation Understudy) is a method for evaluating machine-generated translations and summaries, proposed by Kishore Papineni et al. in 2002.

BLEU was proposed as a quick, inexpensive, and language-independent method that correlates highly with human evaluation, which was the standard at the time. Given that human evaluation takes time, it was a bottleneck for evaluating ideas on how to build machine translation systems.

The main idea for measuring translation performance, as stated in the paper, is that “the closer a machine translation is to a professional human translation, the better it is.” This requires (1) a metric to measure closeness and (2) a corpus of high-quality human reference translations.

In a nutshell, BLEU is the ratio of the number of n-grams (sequences of n consecutive words) in the machine-generated text that are present in the references to the total number of n-grams in the machine-generated text.

In this article, I use the task of translation because the authors use it, but the same ideas work for summaries as well.

Main ideas and calculation in BLEU

The baseline BLEU Metric

Here the base metric is calculated by matching position-independent n-grams, for various values of n (n = 1, 2, …, N), from the machine-generated sentence (the candidate) against a set of reference sentences. One source sentence can have multiple references that are all perfectly good translations, depending on the choice and order of words. The more matches, the better the translation.

Modified n-gram Precision

Precision is computed as the total number of n-grams from a candidate that occur in the references, divided by the total number of n-grams in the candidate. However, over-generating the same n-gram can produce a high precision for translations that are actually bad. To tackle this, the match count of each candidate n-gram is clipped at the maximum number of times that n-gram occurs in any single reference; i.e., the counts of repeated n-grams are clipped.

Formula 1 (clipped count for an n-gram in a candidate sentence):
$\mathrm{Count}_{clip}(\text{n-gram}) = \min\big(\mathrm{Count}(\text{n-gram}),\ \mathrm{MaxRefCount}(\text{n-gram})\big)$

Formula 2 (count of n-grams in the candidate): $\mathrm{Count}(\text{n-gram}')$ is the number of times n-gram′ appears in the candidate; the denominator below sums it over all n-grams in the candidate.

Formula 3 (candidate-level modified n-gram precision):
$p_n = \frac{\sum_{\text{n-gram} \in C} \mathrm{Count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram}' \in C} \mathrm{Count}(\text{n-gram}')}$
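
To make the clipping concrete, here is a minimal Python sketch of Formulas 1-3. The helper names (ngrams, modified_precision) are my own, not from the paper; the example at the bottom is the degenerate case the paper itself uses.

```python
from collections import Counter

def ngrams(tokens, n):
    """List of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Candidate-level modified n-gram precision (Formulas 1-3).
    candidate: list of tokens; references: list of token lists."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    clipped = 0
    for gram, count in cand_counts.items():
        # Formula 1: clip by the max count of this n-gram in any single reference
        max_ref = max(Counter(ngrams(ref, n))[gram] for ref in references)
        clipped += min(count, max_ref)
    # Formula 3: clipped matches divided by total candidate n-grams (Formula 2)
    return clipped / sum(cand_counts.values())

# Without clipping, this pathological candidate would score 7/7.
cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision(cand, refs, 1))  # 2/7: 'the' appears at most twice in any reference
```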

Calculation at the corpus level

To calculate the modified n-gram precision at the corpus level, the sentence-wise n-gram matches are first counted and clipped, and these counts are summed over all sentences. This number is then divided by the total number of candidate n-grams in the corpus.

Formula 4 (corpus-level modified n-gram precision, obtained by summing the numerator and denominator independently over all candidate sentences in the candidate corpus and then taking the ratio):
$p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{clip}(\text{n-gram})}{\sum_{C' \in \text{Candidates}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}$
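
Under the same assumptions as the sketch above, the corpus-level version just moves the single ratio outside the per-sentence loop:

```python
def corpus_modified_precision(candidates, references_list, n):
    """Formula 4: sum clipped matches and candidate n-gram counts
    over all sentences first, then take one ratio at the end."""
    total_clipped = total = 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = Counter(ngrams(cand, n))
        for gram, count in cand_counts.items():
            max_ref = max((Counter(ngrams(ref, n))[gram] for ref in refs), default=0)
            total_clipped += min(count, max_ref)
        total += sum(cand_counts.values())
    return total_clipped / total if total else 0.0
```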

Combining Precisions from multiple n-gram sizes

Precision scores calculated for different n-gram sizes need to be combined into one score. In BLEU this is done based on the insight that the scores decrease roughly exponentially as n increases: every bi-gram match implies uni-gram matches, every 3-gram match implies bi-gram matches, and so on. If the scores were combined linearly, the uni-gram score would dominate the overall score. To account for this, BLEU uses a weighted sum of the logarithms of the precisions calculated at the various values of n.

Formula 5 (total precision as a weighted sum of logarithms of the individual precisions):
$\log P = \sum_{n=1}^{N} w_n \log p_n, \quad\text{i.e.}\quad P = \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$

Common implementation

In the common implementation of the metric, all the weights are equal and sum to 1; consequently, the final precision score equals the geometric mean of the individual n-gram precisions.

Formula 6 (uniform weights, equal and summing to 1): $w_n = \frac{1}{N}$ for $n = 1, \dots, N$

Recall the basic properties of logarithms ($\log a + \log b = \log ab$ and $k \log a = \log a^k$) and the definition of the geometric mean.

Then P can be expressed as the geometric mean of all n-gram precisions.

Formula 7 (final precision as the geometric mean of the individual precisions when the weights are equal and sum to 1):
$P = \exp\left(\sum_{n=1}^{N} \tfrac{1}{N} \log p_n\right) = \left(\prod_{n=1}^{N} p_n\right)^{1/N}$
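
The equivalence between the weighted-log form (Formula 5) and the geometric mean (Formula 7) is easy to check numerically. A small sketch, assuming every precision is strictly positive (see point 7 further below):

```python
import math

def combine_precisions(precisions, weights=None):
    """Formula 5: exp of a weighted sum of log precisions.
    Uniform weights (Formula 6) reduce it to the geometric mean (Formula 7)."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

p = [0.8, 0.6, 0.4, 0.3]
print(combine_precisions(p))         # ~0.49 via the weighted-log form
print(math.prod(p) ** (1 / len(p)))  # same value as the plain geometric mean
```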

Brevity Penalty

By clipping the match counts, candidate sentences that are too long are already penalized in BLEU. A multiplicative brevity penalty factor is used to penalize candidates that are too short and might otherwise end up with a high score. For this, the reference length, defined as the length of the reference closest in length to the candidate, is compared with the length of the candidate. At the corpus level, the reference length is the sum of these per-sentence reference lengths. The brevity penalty factor is 1 when the candidate length is greater than the reference length and decreases exponentially as the candidate gets shorter.

Formula 8 (brevity penalty for effective reference length r and candidate/corpus length c):
$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$
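
A direct transcription of Formula 8, plus the “closest reference length” rule described above. Note the tie-breaking toward the shorter reference is a common implementation choice (NLTK does this), not something the formula itself fixes:

```python
import math

def brevity_penalty(c, r):
    """Formula 8: no penalty when the candidate is longer than the
    effective reference length; exponential decay when it is shorter."""
    return 1.0 if c > r else math.exp(1 - r / c)

def closest_ref_length(candidate, references):
    """Effective reference length: the reference length closest to the
    candidate's length; ties go to the shorter reference."""
    c = len(candidate)
    return min((len(ref) for ref in references), key=lambda r: (abs(r - c), r))
```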

The final BLEU score and its logarithm are then defined as

Formula 9 (final BLEU metric, where BP is given by Formula 8 and P by Formula 5/7):
$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$

Formula 10 (logarithm of the BLEU score):
$\log \mathrm{BLEU} = \min\left(1 - \frac{r}{c},\ 0\right) + \sum_{n=1}^{N} w_n \log p_n$
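
Putting the pieces together gives a toy end-to-end corpus BLEU. This is again a sketch built on the helpers above, without the smoothing that production implementations offer:

```python
def bleu(candidates, references_list, max_n=4):
    """Formula 9: BP times the geometric mean of the corpus-level
    modified precisions for n = 1..max_n. Returns 0 if any p_n is 0."""
    precisions = [corpus_modified_precision(candidates, references_list, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = sum(len(cand) for cand in candidates)
    r = sum(closest_ref_length(cand, refs)
            for cand, refs in zip(candidates, references_list))
    return brevity_penalty(c, r) * combine_precisions(precisions)
```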

Typically N = 4 or 5 is used, but the optimal value may vary from use case to use case.

Some important points to remember when using BLEU

  1. BLEU scores range between 0 and 1, with 1 being the score for a perfect translation, for example when the candidate is exactly equal to a reference.
  2. BLEU is mostly used as a corpus-level metric, meaning it is applied to text with multiple sentences; the scores are not reliable for evaluating individual sentences.
  3. BLEU does not differentiate between function words like ‘the’, ‘on’, ‘is’ and words that are important to the topic of the text, so missing a word like ‘on’ gets the same penalty as missing the word denoting the topic.
  4. BLEU does not consider the meaning of the text or the grammar of the sentence; n-gram matching captures word order only locally.
  5. BLEU is affected by pre-processing steps such as normalization and tokenization applied to the candidate and reference sentences.
  6. BLEU scores tend to increase, and to become more reliable, as the number of available references grows.
  7. Since the metric contains a geometric mean, if any one precision score is zero the overall score collapses drastically; this sometimes happens when N is too high (see the NLTK sketch after this list).
  8. BLEU does not directly account for recall; it uses the brevity penalty as a proxy for recall.
  9. Translations with BLEU scores above 0.3 (30%) are generally found to be understandable.
  10. Translations with BLEU scores above 0.5 (50%) are generally found to be good and fluent.
  11. BLEU is also used to compute self-BLEU, a score that assesses the diversity of generated text.
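
In practice you rarely hand-roll the metric; NLTK ships an implementation that also exposes smoothing, which is the usual fix for the zero-precision collapse in point 7. A quick check (the token lists here are made up for illustration):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis with two references, all pre-tokenized.
refs = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
hyps = ["the cat sat on the mat".split()]

# No 4-gram overlaps here, so the unsmoothed score collapses toward zero (point 7).
print(corpus_bleu(refs, hyps))
# Smoothing keeps the higher-order precisions from zeroing out the score.
print(corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1))
```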

For a worked example of the BLEU score calculation, refer to this document from Google.
