Installation & Working process of ROUGE-1.5.5

Janagama Prabhakar
4 min read · Jul 3, 2019


An evaluation metric for the summarization task.

In this post, you will learn how to install and work with ROUGE, an evaluation metric for summarization. There are many versions of the ROUGE metric, so how do you get it working on your system with Python? I will share what worked for me, based on my experience; my method may not be the best, but it worked. I'll give a short introduction to ROUGE and then focus on the installation and working process.

What is ROUGE and why?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of metrics for evaluating text summarization as well as machine translation. It works by comparing a produced summary (system-generated summary) against a reference summary (human-produced).
For example:

System-generated summary: The programmer got an error in the code
Human-generated summary: The programmer got an error code

If we consider the individual words, the number of overlapping words between the two summaries is 6. By itself, though, raw overlap is not a useful metric. In ROUGE, we turn the overlap into a quantitative score by computing precision and recall.

Recall

It is simply how much of the reference summary the system summary captures:
(number_of_overlapping_words) / (total_words_in_reference_summary)
In the above example, recall: 6/6 = 1.0

A machine-generated summary (system summary) can be extremely long and still capture every word in the reference summary. But many of the words in the system summary may be useless, making the summary unnecessarily verbose. This is where precision comes into play.

Precision

How much of the system summary was in fact relevant or needed?
Precision is measured as:
(number_of_overlapping_words) / (total_words_in_system_summary)
For the above example, precision: 6/8 = 0.75
In the same way, we calculate precision, recall and F-score for bigrams, trigrams, and n-grams in general (see https://www.aclweb.org/anthology/W04-1013 for more info).
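
To make the arithmetic above concrete, here is a minimal Python sketch (my own illustration, not part of ROUGE-1.5.5 or pyrouge) that reproduces the unigram (ROUGE-1) numbers by counting overlapping words:

# count word overlap between the two example summaries (ROUGE-1)
system = "The programmer got an error in the code".lower().split()
reference = "The programmer got an error code".lower().split()

overlap = sum(min(system.count(w), reference.count(w)) for w in set(reference))
recall = overlap / len(reference)        # 6/6 = 1.0
precision = overlap / len(system)        # 6/8 = 0.75
f_score = 2 * precision * recall / (precision + recall)
print(recall, precision, f_score)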

Installation of ROUGE

Among the versions of the ROUGE metric, ROUGE-1.5.5 worked for me.
Process of installing ROUGE-1.5.5:
→ Download the ROUGE-1.5.5 directory from https://github.com/andersjo/pyrouge/tree/master/tools/ROUGE-1.5.5
→ Then run these commands:

git clone https://github.com/bheinzerling/pyrouge
cd pyrouge
python setup.py install
pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5/directory
python -m pyrouge.test

If everything goes okay, the test run ends with a success message like this:

Ran 11 tests in 6.322s
OK

If you get this message without any errors, the installation is complete.

Working with ROUGE-1.5.5

ROUGE-1.5.5 itself is a Perl script; pyrouge is the Python module that wraps it and computes the ROUGE scores for you. Remember to set your ROUGE path (the absolute path to the ROUGE-1.5.5 directory, which contains the Perl script) and run the test before anything else.

Run with Python code

In my case, I have my system outputs organised as follows:

I have a reference folder for the original (human) summaries: each txt file contains a single article's summary on one line, and the file name carries an ID. The decoded folder, where I keep the system outputs, follows the same format: each txt file is a machine-generated summary, and its file name carries the ID that pairs it with its true summary in the reference folder, as the sketch below shows.
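
As a concrete illustration (the IDs and file names here are just placeholders), the layout looks roughly like this:

decoded/
    000000_decoded.txt      (system summary for article 000000)
    000001_decoded.txt
    ...
reference/
    000000_reference.txt    (reference summary for article 000000)
    000001_reference.txt
    ...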
The ID is important when using pyrouge to compute ROUGE scores. Based on the code from the official documentation, I changed the file name patterns to match my case:

from pyrouge import Rouge155

r = Rouge155()
# set directories
r.system_dir = 'decoded/'
r.model_dir = 'reference/'

# define the file name patterns
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

# use the default parameters to run the evaluation
output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

You will see a lot of log info come out:

2017-12-18 11:21:36,865 [MainThread ] [INFO ] Writing summaries.
2017-12-18 11:21:36,868 [MainThread ] [INFO ] Processing summaries. Saving

Then, after some processing (the txt files are converted into the format ROUGE expects), you will see the default parameters and finally a table with the results.

Generally we report ROUGE-2 Average_F and ROUGE-L Average_F scores.
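
If you want those two numbers programmatically, they should already be in output_dict. The snippet below is a small sketch that assumes the usual pyrouge key names (rouge_2_f_score, rouge_l_f_score); print the whole dict once to confirm the exact keys on your install:

# pick the commonly reported scores out of the parsed results
print(output_dict.get('rouge_2_f_score'))   # ROUGE-2 Average_F
print(output_dict.get('rouge_l_f_score'))   # ROUGE-L Average_F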

A problem we may encounter: illegal division by zero

I encountered this error:

Now starting ROUGE eval…
Illegal division by zero at /home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl line 2455.
subprocess.CalledProcessError: Command '['/home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/home/lily/zl379/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmpuu0bqmes/rouge_conf.xml']' returned non-zero exit status 255

So, I checked line 2455 in ROUGE-1.5.5.pl:

$$score=wlcsWeightInverse($$hit/$$base,$weightFactor);

The error says "Illegal division by zero", which means that $$base can be zero here. The usual cause is that some of the txt files are empty, or contain only markup such as <br> or other strange characters. Filtering such files out solved the problem for me. To spot the strange characters, print the text in the terminal (console) instead of inspecting the .txt file in an editor.
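
Here is a minimal sketch of the kind of filtering I mean (my own helper, not part of pyrouge). It assumes the decoded/ and reference/ folders described above, flags empty or markup-only files, and strips tags such as <br> from the rest:

import os
import re

def clean_folder(folder):
    # strip HTML-like tags and flag files that end up empty
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with open(path, encoding='utf-8') as f:
            text = f.read()
        cleaned = re.sub(r'<[^>]+>', ' ', text).strip()
        if not cleaned:
            print('Empty or markup-only file, fix or remove it:', path)
            continue
        with open(path, 'w', encoding='utf-8') as f:
            f.write(cleaned)

clean_folder('decoded/')
clean_folder('reference/')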

NOTE:
I worked with English and Telugu datasets, and ROUGE mostly works well for both. If you are dealing with Telugu (or some other language) data and still get the above error, convert the data to WX format before evaluation. For more information on how to convert data into WX format, refer to https://github.com/ltrc/indic-wx-converter.

Thanks for reading!
