Two minutes NLP — Learn the ROUGE metric by examples

ROUGE-N, ROUGE-L, ROUGE-S, pros and cons, and ROUGE vs BLEU

Fabio Chiusano
NLPlanet
5 min read · Jan 19, 2022

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package specifically designed for evaluating automatic summarization, but it can also be used for machine translation. The metrics compare an automatically produced summary or translation against reference (high-quality, human-produced) summaries or translations.

In this article, we cover the main metrics used in the ROUGE package.

ROUGE-N

ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.

Consider the reference R and the candidate summary C:

  • R: The cat is on the mat.
  • C: The cat and the dog.

ROUGE-1

Using R and C, we are going to compute the precision, recall, and F1-score of the matching n-grams. Let’s start computing ROUGE-1 by considering 1-grams only.

ROUGE-1 precision can be computed as the ratio of the number of unigrams in C that also appear in R (the words “the”, “cat”, and “the”) over the number of unigrams in C.

ROUGE-1 precision = 3/5 = 0.6

ROUGE-1 recall can be computed as the ratio of the number of unigrams in R that also appear in C (the words “the”, “cat”, and “the”) over the number of unigrams in R.

ROUGE-1 recall = 3/6 = 0.5

Then, ROUGE-1 F1-score can be directly obtained from the ROUGE-1 precision and recall using the standard F1-score formula.

ROUGE-1 F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.6 * 0.5) / (0.6 + 0.5) ≈ 0.55
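
As a sanity check, here is a minimal Python sketch that reproduces these ROUGE-1 numbers by hand (illustrative code only, not the official ROUGE package):

```python
from collections import Counter

reference = "the cat is on the mat".split()
candidate = "the cat and the dog".split()

# Multiset intersection clips repeated words: "the" matches twice, "cat" once.
overlap = Counter(candidate) & Counter(reference)
matches = sum(overlap.values())                      # 3

precision = matches / len(candidate)                 # 3/5 = 0.6
recall = matches / len(reference)                    # 3/6 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # ~0.55

print(precision, recall, round(f1, 2))               # 0.6 0.5 0.55
```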

ROUGE-2

Let’s now compute ROUGE-2, this time considering 2-grams (bigrams).

Remember our reference R and candidate summary C:

  • R: The cat is on the mat.
  • C: The cat and the dog.

ROUGE-2 precision is the ratio of the number of 2-grams in C that also appear in R (only the 2-gram “the cat”) over the number of 2-grams in C.

ROUGE-2 precision = 1/4 = 0.25

ROUGE-2 recall is the ratio of the number of 2-grams in R that also appear in C (only the 2-gram “the cat”) over the number of 2-grams in R.

ROUGE-2 recall = 1/5 = 0.20

Therefore, the F1-score is:

ROUGE-2 F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.25 * 0.20) / (0.25 + 0.20) ≈ 0.22
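
The same computation generalizes to any n. Here is a small sketch of a generic helper (rouge_n is a hypothetical name, not a library function) that reproduces the ROUGE-2 numbers above:

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """Illustrative ROUGE-N over whitespace-tokenized, lowercased text."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    matches = sum((cand & ref).values())  # clipped n-gram overlap
    if matches == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return {"precision": precision, "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}

print(rouge_n("the cat and the dog", "the cat is on the mat", n=2))
# {'precision': 0.25, 'recall': 0.2, 'f1': 0.222...}
```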

ROUGE-L

ROUGE-L is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence should indicate more similarity between the two sequences.

We can compute ROUGE-L recall, precision, and F1-score just like we did with ROUGE-N, but this time the numerator is the length of the LCS rather than the count of matching n-grams.

Remember our reference R and candidate summary C:

  • R: The cat is on the mat.
  • C: The cat and the dog.

The LCS is “the cat the”, a shared subsequence of length 3 (remember that the words need not be consecutive), which appears in both R and C.

ROUGE-L precision is the ratio of the length of the LCS over the number of unigrams in C.

ROUGE-L precision = 3/5 = 0.6

ROUGE-L recall is the ratio of the length of the LCS over the number of unigrams in R.

ROUGE-L recall = 3/6 = 0.5

Therefore, the F1-score is:

ROUGE-L F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.6 * 0.5) / (0.6 + 0.5) ≈ 0.55
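
Here is a minimal sketch of this computation, using a textbook dynamic-programming LCS (again illustrative code, not the official implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

reference = "the cat is on the mat".split()
candidate = "the cat and the dog".split()

lcs = lcs_length(reference, candidate)               # 3 ("the cat ... the")
precision = lcs / len(candidate)                     # 3/5 = 0.6
recall = lcs / len(reference)                        # 3/6 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # ~0.55
```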

ROUGE-S

ROUGE-S allows us to add a degree of leniency to the exact n-gram matching performed by ROUGE-N and ROUGE-L. ROUGE-S is a skip-gram co-occurrence metric: it counts in-order word pairs from the reference text that appear in the model output, even when they are separated by one or more other words.

Consider the new reference R and candidate summary C:

  • R: The cat is on the mat.
  • C: The gray cat and the dog.

If we consider the 2-gram “the cat”, the ROUGE-2 metric would match it only if it appears in C exactly, but this is not the case since C contains “the gray cat”. However, using ROUGE-S with unigram skipping, “the cat” would match “the gray cat” too.

We can compute ROUGE-S precision, recall, and F1-score in the same way as the other ROUGE metrics.
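
Concretely, the original ROUGE paper defines ROUGE-S over skip-bigrams: in-order word pairs, often with a cap on how many words may sit between them. Here is a minimal sketch (skip_bigrams and max_gap are illustrative names, not a library API):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_gap=None):
    """All in-order token pairs, optionally limiting the gap between them."""
    return Counter(
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if max_gap is None or j - i - 1 <= max_gap
    )

reference = "the cat is on the mat".split()
candidate = "the gray cat and the dog".split()

ref_sb, cand_sb = skip_bigrams(reference), skip_bigrams(candidate)
matches = sum((ref_sb & cand_sb).values())    # 3, including ("the", "cat")
                                              # despite the intervening "gray"
precision = matches / sum(cand_sb.values())   # 3/15 = 0.2
recall = matches / sum(ref_sb.values())       # 3/15 = 0.2
```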

Pros and Cons of ROUGE

These are the tradeoffs to take into account when using ROUGE.

  • Pros: it correlates positively with human evaluation, is inexpensive to compute, and is language-independent.
  • Cons: ROUGE does not handle different words that have the same meaning (synonyms), as it measures syntactic matches rather than semantics.

ROUGE vs BLEU

In case you don’t know the BLEU metric already, I suggest that you read the companion article Learn the BLEU metric by examples to get a grasp on it.

In general:

  • BLEU focuses on precision: how many of the words (and/or n-grams) in the candidate model output appear in the human reference.
  • ROUGE focuses on recall: how many of the words (and/or n-grams) in the human reference appear in the candidate model output.

These measures are complementary, as is often the case in the precision-recall tradeoff.

Computing ROUGE with Python

Implementing the ROUGE metrics in Python is easy thanks to the Python rouge library, which provides ROUGE-1, ROUGE-2, and ROUGE-L. Although ROUGE-S is present in the original ROUGE paper, it seems to have been used less and less over time.
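
Assuming the rouge package is installed (pip install rouge), basic usage looks roughly like this; the reported scores should match the values we computed by hand, up to rounding:

```python
from rouge import Rouge

rouge = Rouge()
# First argument is the candidate (hypothesis), second is the reference.
scores = rouge.get_scores("the cat and the dog", "the cat is on the mat")
print(scores)
# [{'rouge-1': {'r': 0.5, 'p': 0.6, 'f': 0.54...},
#   'rouge-2': {'r': 0.2, 'p': 0.25, 'f': 0.22...},
#   'rouge-l': {'r': 0.5, 'p': 0.6, 'f': 0.54...}}]
```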
