BLEU: Explained
BLEU (Bilingual Evaluation Understudy) is a method to evaluate machine-generated translations and summaries, proposed by Kishore Papineni et al. in 2002.
BLEU was proposed as a quick, inexpensive, and language-independent metric that correlates highly with human evaluation, which was the standard at the time. Since human evaluation takes time, it was a bottleneck for evaluating ideas on how to build machine translation systems.
The main idea for measuring translation performance, as stated in the paper, is: “The closer a machine translation is to a professional human translation, the better it is.” This requires (1) a metric to measure closeness and (2) a corpus of high-quality human reference translations.
In a nutshell, BLEU is the ratio of the number of n-grams (sequences of n consecutive words) in the machine-generated text that also appear in the references to the total number of n-grams in the machine-generated text.
In this article, I use the task of translation because the authors use it, but the same ideas work for summaries as well.
Main ideas and calculation in BLEU
The baseline BLEU Metric
Here the base metric counts position-independent n-gram matches, for n = 1, 2, …, N, between the machine-generated sentence (the candidate) and a set of reference sentences (one sentence can have multiple references that are all perfectly good translations, depending on the choice of words and the order of the words). The more matches, the better the translation.
Modified n-gram Precision
Precision is computed as the number of n-grams from the candidate that occur in the references divided by the total number of n-grams in the candidate. However, over-generating the same n-gram can produce a high precision for a translation that is actually bad. To tackle this, the match count of each candidate n-gram is clipped at the maximum number of times that n-gram occurs in any single reference; i.e., the counts of repeated n-grams are clipped.
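As an illustration, modified n-gram precision with clipping can be sketched as follows (the function and variable names here are my own, not from the paper):

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision for one candidate against its references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    # The match count of each n-gram is clipped at the maximum number of
    # times it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0
```

On the paper’s classic example, the degenerate candidate “the the the the the the the” against the reference “the cat is on the mat” gets a clipped unigram precision of 2/7, because “the” appears at most twice in the reference.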
Calculation at the level of corpus
To calculate modified n-gram precision at the corpus level, the clipped n-gram match counts are first computed sentence by sentence and summed over all sentences. This sum is then divided by the total number of candidate n-grams in the corpus.
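The corpus-level aggregation could be sketched like this (a rough, self-contained version; the names are mine):

```python
from collections import Counter

def corpus_modified_precision(candidates, references_list, n):
    """Corpus-level modified n-gram precision: sum the clipped matches over
    all sentences, then divide by the total candidate n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    clipped_total = 0
    cand_total = 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = ngrams(cand)
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped_total += sum(min(cnt, max_ref[gram])
                             for gram, cnt in cand_counts.items())
        cand_total += sum(cand_counts.values())
    return clipped_total / cand_total if cand_total else 0.0
```

Note that the division happens once, at the end, rather than averaging per-sentence precisions; longer sentences therefore contribute proportionally more n-grams.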
Combining Precisions from multiple n-gram sizes
Precision scores calculated for different n-gram sizes need to be combined into one score. BLEU does this based on the insight that precision decays roughly exponentially as n increases: every bigram match implies unigram matches, every trigram match implies bigram matches, and so on. If the scores were combined linearly, the unigram score would dominate the overall score. To account for this, BLEU uses a weighted sum of the logarithms of the precisions calculated at each value of n.
Common implementation
In the common implementation of the metric, all weights are equal and sum to 1; consequently, the final precision score equals the geometric mean of the individual n-gram precisions.
Recall the basic properties of logarithms (log(a · b) = log a + log b) and the definition of the geometric mean of p_1, …, p_N, namely (p_1 · p_2 · … · p_N)^(1/N). With equal weights w_n = 1/N, the combined log-precision is
log P = Σ_{n=1..N} w_n · log p_n = (1/N) · Σ_{n=1..N} log p_n
so P = (p_1 · p_2 · … · p_N)^(1/N), i.e. the geometric mean of all n-gram precisions.
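A small sketch makes the equivalence concrete: exponentiating the equally weighted mean of log precisions gives exactly the geometric mean (the function name is mine):

```python
import math

def combine_precisions(precisions):
    """Combine per-n precision scores with equal weights: exponentiate the
    mean of the logs, which equals the geometric mean of the precisions."""
    n = len(precisions)
    log_sum = sum(math.log(p) for p in precisions)  # requires every p > 0
    return math.exp(log_sum / n)
```

Note the `p > 0` requirement: the logarithm of zero is undefined, which is why a single zero precision is handled as a special case (it makes the whole score zero).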
Brevity Penalty
By clipping the match counts, BLEU already penalizes candidate sentences that are too long. A multiplicative brevity penalty factor is used to penalize short sentences that might otherwise end up with a high score. For this, the reference length, defined as the length of the reference closest in length to the candidate, is compared with the length of the candidate. At the corpus level, the reference length is the sum of these closest reference lengths over all candidate sentences. The brevity penalty factor is 1 when the candidate length is greater than the reference length and decreases exponentially as the candidate gets shorter.
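Following the definition above, the brevity penalty can be sketched as (function name is mine):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 when the candidate is longer than the (closest-length)
    reference; otherwise it decays exponentially with the shortfall."""
    if candidate_len > reference_len:
        return 1.0
    # For candidate_len == reference_len this is exp(0) = 1 as well.
    return math.exp(1 - reference_len / candidate_len)
```

For example, a candidate half as long as its reference is penalized by a factor of e^(1 − 2) ≈ 0.37.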
The final BLEU score is the brevity penalty multiplied by the exponentiated weighted sum of log precisions:
BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )
and its logarithm is
log BLEU = min(1 - r/c, 0) + Σ_{n=1..N} w_n · log p_n
where c is the candidate length and r the reference length. Typically N = 4 is used (sometimes 5), but the optimal value may vary from use case to use case.
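Putting the pieces together, here is a minimal from-scratch sketch of corpus-level BLEU (the structure and names are mine; for real evaluations an established implementation such as NLTK’s or sacreBLEU is preferable):

```python
import math
from collections import Counter

def bleu(candidates, references_list, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram precisions
    for n = 1..max_n, multiplied by the brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for cand, refs in zip(candidates, references_list):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for ref in refs:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        if clipped == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precision_sum += math.log(clipped / total) / max_n

    # Brevity penalty: total candidate length vs. the summed lengths of the
    # closest-length reference for each sentence.
    c = sum(len(cand) for cand in candidates)
    r = sum(min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
            for cand, refs in zip(candidates, references_list))
    return (1.0 if c > r else math.exp(1 - r / c)) * math.exp(log_precision_sum)
```

A candidate that exactly matches a reference scores 1.0, while the degenerate all-“the” candidate scores 0.0 because it has no matching bigrams at all.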
Some important points to remember when using BLEU
- BLEU scores range between 0 and 1, with 1 being the score for a perfect translation, for example when the candidate is exactly equal to a reference.
- BLEU is mostly used as a corpus-based metric, meaning it is applied to text with multiple sentences. The scores are not reliable for evaluating individual sentences.
- BLEU does not differentiate between function words like ‘the’, ‘on’, ‘is’ and words that are important to the topic of the text, so missing a word like ‘on’ gets the same penalty as missing the word denoting the topic.
- BLEU does not consider the meaning of the text, the grammar of the sentence, or word order beyond n-gram matches.
- BLEU is affected by pre-processing steps such as normalization and tokenization applied to the candidate and reference sentences.
- BLEU scores tend to increase, and become more accurate, as the number of available references grows.
- Since the metric contains a geometric mean, if any one precision score is zero the overall score drops drastically (to zero); this sometimes happens when N is too high.
- BLEU does not account for recall; it uses the brevity penalty as a proxy for recall.
- Translations with BLEU scores above 0.3 (30%) are generally found to be understandable.
- Translations with BLEU scores above 0.5 (50%) are generally found to be good and fluent.
- BLEU is also used to calculate the self-BLEU score, which assesses the diversity of generated text.
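To illustrate the geometric-mean pitfall noted above: a candidate shorter than n words contains no n-grams of that size at all, so its n-gram precision is zero and the whole score collapses. A tiny check:

```python
# A 3-word candidate cannot form any 4-gram, so p_4 = 0 and the
# geometric mean (and hence the BLEU score) is zero regardless of how
# good the unigram, bigram, and trigram matches are.
candidate = ["the", "cat", "sat"]
fourgrams = [tuple(candidate[i:i + 4]) for i in range(len(candidate) - 4 + 1)]
print(len(fourgrams))  # prints 0
```

This is one reason smoothed variants of BLEU are often preferred for short texts.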
For a worked example of the BLEU score calculation, refer to this document from Google.