A benchmark comparison of extractive summarisation systems

Marcia Oliveira
16 min read · May 23, 2017

The Information Age we live in, fuelled by the advent of the World Wide Web and the concomitant rise in computational capabilities, has fundamentally changed the way humans access and consume information. Having a plethora of information available just a click away offers boundless opportunities for new businesses to emerge and transforms the way knowledge-based work is conducted. However, despite its undeniable advantages, the quantity of information available for consumption can easily become overwhelming, hindering our ability to keep up, make sense of it and, ultimately, convert that information into useful knowledge. It’s a real challenge to keep track of what’s important when you’re overloaded with information from multiple sources and your time is a limited resource.

Being aware of these challenges, our goal at Skim Technologies is to continuously develop smart technologies that help businesses and knowledge workers thrive in the Information Age. How? By leveraging Machine Learning (ML) and Natural Language Processing (NLP) to identify, extract, summarise, and promptly deliver relevant content from news, blog posts, articles, and web pages in general.

In this blog post, we will focus on our text summarisation technology by briefly explaining how it works and how it compares to other automatic summarisation systems. If you’re not already familiar with automatic summarisation, I encourage you to read my previous post, where I provide a brief overview of the topic.

Summarisation meets Machine Learning

At Skim Technologies, we learned that the majority of web pages our users are interested in consist mostly of text. Given this, we set out to devise an automatic text summarisation system able to read through a piece of text, pinpoint the key takeaways, and wrap them up in a digestible, easy-to-read summary (which we call a skim). The idea behind our summarisation is not to replace the original document per se, but to help you quickly identify the articles that are important and interesting to you, so you can make better use of the time you spend consuming information.

Types of summaries:
Since there are different types of text summaries, at the beginning of our project we had to determine which kind would best fit our users’ needs. The answer was generic single-document summaries, which, as the terms imply, are generated from the content of a single document (as opposed to multiple documents) and concisely capture all the topics addressed in that document (as opposed to a specific topic), making them more general. Regarding the way the summary is presented, we chose an extractive approach (as opposed to an abstractive approach), which can be intuitively understood as a sequence of “copy-and-paste” operations, since it relies on selecting relevant passages, or sentences, in the source text and concatenating them to form a summary. We opted for this strategy for three main reasons: (i) extractive summarisation is well-established in the academic community, (ii) it is generally able to produce grammatically and semantically correct summaries, since it uses natural language phrases taken directly from the source text, and (iii) most of the time it works well in practice for article-type web pages, producing coherent, fairly readable, and meaningful summaries.

Gold Standard Corpus:
After clearly defining the type of summaries we would like our system to generate, we drew on ML and NLP advances to build our solution. As you may well know, one of the main ingredients of ML is data, so the next natural step was to collect data to feed an ML algorithm in order to create an extractive text summarisation model. Since we framed summarisation as a supervised learning problem, we had to collect examples of good summaries to help the ML algorithm learn how to summarise. To do so, we set up annotation exercises with specialised human annotators to obtain an annotated corpus. Each annotator was asked to read a set of web articles and, for each article, select the best summary sentences so that the final summary had approximately 100 words. The web articles used in the annotation exercises were selected via stratified random sampling from a collection of web pages comprised of text articles and blog posts. We chose stratified random sampling over simple random sampling to ensure diversity of websites and avoid source bias.

After running the annotation exercises, the average Inter-Annotator Agreement (IAA) for all annotators was 0.43, as measured by Krippendorff’s alpha coefficient. This value is below the minimum level of agreement required to draw reliable conclusions from the annotation data, which, according to Krippendorff, should be 0.667. To mitigate this problem, we computed the IAA for all possible combinations of annotators and selected the combination yielding the highest IAA. The maximum IAA obtained was 0.587, still below 0.667. However, as Krippendorff also points out, acceptable values vary with the type of application, because achieving the recommended levels can be very challenging in practice, especially for a task as complex and subjective as summarisation.
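For readers who want to run this kind of agreement check themselves, here is a minimal sketch of computing Krippendorff’s alpha in Python. It assumes the open-source krippendorff package, and the small annotation matrix is purely illustrative, not our actual data.

```python
# A minimal sketch of computing Krippendorff's alpha, assuming the
# open-source `krippendorff` package (pip install krippendorff).
# The annotation matrix below is purely illustrative.
import numpy as np
import krippendorff

# One row per annotator, one column per sentence of an article:
# 1 = selected for the summary, 0 = not selected, np.nan = not rated.
reliability_data = np.array([
    [1, 0, 1, 0, 0, 1, np.nan],
    [1, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 0],
])

# Sentence selection is a binary (nominal) judgement.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```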

Having collected the annotated data, the next step was to generate average reference summaries from the annotators’ input in order to obtain a gold standard corpus (i.e., a collection of text summaries representing the majority opinion of annotators on what the most representative sentences of an article are). The average reference summary of a given article comprises the summary sentences selected by at least two annotators. To capture the implicit importance of each sentence for the summary, we ranked these sentences by the number of annotators who selected them.
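A minimal sketch of this aggregation step is shown below; the function and the example data are illustrative rather than our production code.

```python
# Illustrative sketch of building an average reference summary:
# keep sentences picked by at least two annotators, ranked by votes.
from collections import Counter

def build_reference_summary(annotations, min_votes=2):
    """annotations: one set of selected sentence indices per annotator."""
    votes = Counter(idx for selected in annotations for idx in selected)
    # most_common() orders by vote count, i.e., by implicit importance.
    return [idx for idx, n in votes.most_common() if n >= min_votes]

# Three annotators and the sentence indices each selected:
print(build_reference_summary([{0, 2, 5}, {0, 5, 7}, {0, 1, 5}]))
# -> [0, 5]: the only sentences chosen by two or more annotators
```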

Machine Learning:
We tackled extractive text summarisation as a supervised regression problem rather than as a typical classification problem (i.e., class 1 if the sentence should enter the summary, and class 0 otherwise), so as to harness the rich information we had available regarding sentence importance. To build our summarisation model, we started by randomly splitting the collected gold standard documents into training, validation, and test sets. The held-out documents belonging to the test set were only used for assessing the generalisation performance of our extractive summarisation model, as presented in the next section. Since our input data is unstructured text, we preprocessed the corpus using NLP techniques and derived some structure from it by defining and computing several domain-driven features. The preprocessed dataset was then fed into a wide range of ML algorithms with different hyperparameter settings, and the best cross-validated model was selected. The result is an automatic extractive summarisation model that delivers coherent and meaningful summaries in most cases and is very efficient to run.
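To make the pipeline more concrete, here is a heavily simplified sketch. The three features are toy stand-ins for our domain-driven features, the training data is a random placeholder, and the model and hyperparameter grid are hypothetical; only the overall shape (regression on sentence importance, then greedy selection up to a 100-word budget) mirrors what we described.

```python
# Simplified sketch of the modelling step; NOT our production pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

def sentence_features(sentences, title):
    """Toy stand-ins for domain-driven features (hypothetical)."""
    title_words = set(title.lower().split())
    feats = []
    for pos, sent in enumerate(sentences):
        words = sent.lower().split()
        overlap = len(title_words & set(words)) / max(len(title_words), 1)
        feats.append([pos / len(sentences),   # relative position in article
                      len(words),             # sentence length in words
                      overlap])               # word overlap with the title
    return np.array(feats)

# X: one feature row per sentence across training articles;
# y: importance score = number of annotators who selected the sentence.
X = np.random.rand(500, 3)          # placeholder training features
y = np.random.randint(0, 4, 500)    # placeholder importance scores

# Cross-validated model selection over a hypothetical grid.
search = GridSearchCV(GradientBoostingRegressor(),
                      {"n_estimators": [100, 300], "max_depth": [2, 3]},
                      cv=5)
model = search.fit(X, y).best_estimator_

def summarise(sentences, title, word_budget=100):
    """Greedily pick the highest-scoring sentences within the word budget."""
    scores = model.predict(sentence_features(sentences, title))
    chosen, used = [], 0
    for i in np.argsort(scores)[::-1]:        # highest-scoring first
        n_words = len(sentences[i].split())
        if used + n_words <= word_budget:
            chosen.append(i)
            used += n_words
    return [sentences[i] for i in sorted(chosen)]  # restore article order
```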

Benchmark Evaluation

Evaluation is an essential step when developing new systems, since it enables monitoring progress, identifying weaknesses, measuring success, and comparing performance against other systems that offer a similar type of service. At Skim Technologies we strive to continuously assess our systems to make sure we’re on the right track.

The aim of this section is to report our latest benchmark evaluation study, where we compare our summarisation model with similar extractive summarisation systems, using the gold standard test corpus we collected.

Evaluation setup:
Four benchmarks were evaluated on our test corpus, which comprises 80 web pages harvested from a diversity of news sources (e.g., BBC, Wired, The Huffington Post, Crowdfund Insider, CIO, How Design, Jurist) and blogs (e.g., Conversable Economist, Mad Data Scientist, The Marketing Student). All web pages are written in English, which means we only test the ability of these systems to perform monolingual summarisation.

The benchmarks included in our evaluation are the following:

- Indico
- Aylien
- Smmry
- Gensim (TextRank)

The first three benchmarks are commercial APIs, while the fourth is an open-source Python library. Note that we don’t cover all the summarisation systems out there, mainly because of paid access or a lack of descriptive documentation.

With the exception of the Indico API, which automatically determines the best summary length for a given article, we had to set the desired summary length for each system. For both Smmry and Aylien, we set the number of sentences to 4. In Gensim, the parameter indicating the desired summary length is defined in terms of word count, which we set to 100 words to ensure consistency with our summarisation model. Before proceeding with the evaluation, we confirmed that all systems produced, on average, summaries with approximately the same number of words.
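For the open-source baseline, the call is simple enough to show. This sketch assumes gensim’s summarization module as it shipped in gensim 2.x/3.x, and article.txt is a hypothetical plain-text file holding the article body.

```python
# Sketch of invoking the Gensim (TextRank) baseline with a 100-word budget,
# assuming gensim 2.x/3.x, where gensim.summarization was available.
from gensim.summarization import summarize

with open("article.txt") as f:   # hypothetical plain-text article body
    text = f.read()

# word_count caps the summary length, matching our own ~100-word budget.
print(summarize(text, word_count=100))
```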

To assess how “good” the summaries generated by each system are, we chose ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation), an N-gram recall-based statistic that has been shown to correlate well with human judgments in summarisation evaluation tasks. In simple terms, ROUGE-N measures the overlap between sub-phrases in the system-generated summaries and sub-phrases in the gold standard reference summaries (i.e., human-generated summaries). The length of these sub-phrases, or sub-sequences, is dictated by N. The higher the overlap between both kinds of summaries, the better, with 1 being the maximum (and best) possible ROUGE value. The main reason for choosing an objective and easy-to-compute statistic such as ROUGE-N as our evaluation measure is that asking humans to manually assess dozens of summaries is time-consuming, expensive, and impractical at scale. Besides, ROUGE-N is the de facto measure for the evaluation of summarisation models, being commonly reported in scientific articles on summarisation.
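To make the definition concrete, here is a self-contained sketch of ROUGE-N recall. For real evaluations one would rely on an established implementation (such as the original ROUGE toolkit) rather than this toy version.

```python
# Toy implementation of ROUGE-N recall: the fraction of reference
# n-grams that also appear in the candidate summary (with clipping).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    # Each reference n-gram is matched at most as many times
    # as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))
# 3 of the 5 reference bigrams overlap -> 0.6
```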

Results:
The results of our benchmark evaluation, in terms of ROUGE-2 (i.e., bigram overlap), are depicted in the bar plot of Figure 1. ROUGE is computed for each test document, and here we report the aggregated ROUGE score for each summarisation system. We selected the median instead of the arithmetic mean as our aggregation measure because the distribution of the ROUGE scores for each system was slightly skewed. In such cases, the median is usually preferred to the mean as a measure of central tendency, due to its robustness to outliers.
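Figure 1 also shows error bars around each median. A simple way to obtain error bars for a median, sketched below with illustrative numbers, is a bootstrap estimate (the exact procedure behind Figure 1 may differ).

```python
# Illustrative bootstrap estimate of the uncertainty around a median
# ROUGE score; the per-document scores are randomly generated stand-ins.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(2, 3, size=80)   # stand-in for 80 per-document ROUGE-2 scores

boot_medians = [np.median(rng.choice(scores, size=scores.size, replace=True))
                for _ in range(10_000)]
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(scores):.3f}, "
      f"95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")
```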

Figure 1: Median ROUGE-2 scores, with the corresponding error bars, for the Skim API, Indico, Aylien, Gensim (TextRank) and Smmry summarisation systems. These results were obtained on the Skim Technologies test corpus, comprised of 80 documents.

For completeness, in Figure 2 we also present the median ROUGE scores for different sub-phrase lengths: unigrams (ROUGE-1), bigrams (ROUGE-2), trigrams (ROUGE-3), and four-grams (ROUGE-4), for each summarisation system under analysis. Table 1 provides the detailed median ROUGE-N scores.

Table 1: Median scores for different variants of ROUGE (ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4) and for different summarisation systems.

Figure 2: Median scores for different variants of ROUGE (ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4) and for different summarisation systems. These results were obtained on the Skim Technologies test corpus, comprised of 80 documents.

As both figures show, the Skim Technologies summariser consistently outperforms all other summarisation systems under evaluation in terms of ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4. The second-best-performing system is the one provided by Indico, which achieves a median ROUGE-2 of 0.57. The Aylien summarisation API ranks third in our evaluation, followed by the open-source Gensim summariser, which is based on the popular TextRank algorithm. Smmry achieves a performance similar to that of Gensim, obtaining slightly lower values of ROUGE-2, ROUGE-3, and ROUGE-4.

Sample Outputs:
Although ROUGE-N is a straightforward and convenient way of assessing the performance of different summarisation systems, it is unable to measure other important evaluation dimensions, such as the readability of a system-generated summary. For this reason, we present here the text summaries generated by our summarisation system and by the benchmarks for an out-of-sample web article (not included in the gold standard corpus). The choice of example is subjective, and it would be unwise to extrapolate conclusions from a single case, but space is limited and we think this example illustrates some typical traits we have noticed.

Original BBC Article

Airborne dust is normally seen as an environmental problem, but the lack of it is making air pollution over China considerably worse. A new study suggests less dust means more solar radiation hits the land surface, which reduces wind speed. That lack of wind in turn leads to an accumulation of air pollution over heavily populated parts of China. The researchers found that reduced dust levels cause a 13% increase in human-made pollution in the region.

Hundreds of millions of people across China continue to be impacted by air pollution from factories and coal-fired power plants. Studies suggest that the dirty air contributes to 1.6 million deaths a year, about 17% of all mortalities. But this new research says that the human-induced pollution is being made worse or better by naturally occurring dust that blows in from the Gobi desert.

Using models to simulate 150 years of wind and dust patterns in the region, the researchers found that the dust deflects significant amounts of sunlight. Without it, more heat from the Sun hits the land. Differences in the temperatures between land and sea cause the winds to blow. Without the dust, the land warms up more and that changes the temperature differential with the sea leading to weaker breezes — and more air pollution.

[Image caption: Researchers say there’s a link between the amount of dust in the air and the levels of air pollution]

“There are two dust sources. One is the Gobi and the other is the highlands of north-west China, but we found the Gobi had much more influence,” said lead author Yang Yang, from the Pacific Northwest National Laboratory in Washington State, US. “Less dust in the atmosphere causes more solar radiation to reach the surface. It weakens the temperature difference between the land and the sea and impacts the circulation of the winds and causes a stagnation over eastern China and that causes an accumulation of air pollution.”

The decreases in dust emissions are considerable, varying by almost a third. The impact on wind speeds is quite small by comparison, a reduction of barely more than one-tenth of one mile per hour. However, when this takes place on a large scale over a wide region, the small change in speed means a 13% increase in the amount of air pollution over eastern China during the winter.

Another study has recently shown a link between declining Arctic sea ice and a major air pollution event in China in 2013. The authors of the new study believe that both theories could be true. “Our study has the same mechanism: the weakening of winds causes more pollution, and what is behind this needs to be studied,” said Yang Yang. “We have two views on this kind of weakening of wind. They found the sea ice, we found the dust-wind interaction can also lead to weakening of the wind. I think both of them are important.”

The researchers believe that the study may inform broader questions about how natural and human-created aerosols interact. Many parts of the world, in addition to China, are now suffering from increased levels of air pollution and understanding how dust, winds and emissions work together may help limit some of the worst impacts of dirty air. One of the key lessons from this study is that the absence of dusty conditions could mean the air you are breathing is worse for you, not better.

“You’re damned if you do, damned if you don’t,” said Prof Lynn Russell from the Scripps Institution of Oceanography in California. “Dust emissions can impair visibility, but they are not so harmful in terms of air quality,” she told BBC News.
“If it’s not a dusty year, you may be happy and spending more time outdoors because you don’t have this dust in the way, but you are actually going out to spend more time in more toxic air.”

The study has been published in the journal Nature Communications.

Skim API

Airborne dust is normally seen as an environmental problem, but the lack of it is making air pollution over China considerably worse. A new study suggests less dust means more solar radiation hits the land surface, which reduces wind speed. That lack of wind in turn leads to an accumulation of air pollution over heavily populated parts of China. The researchers found that reduced dust levels cause a 13% increase in human-made pollution in the region. Hundreds of millions of people across China continue to be impacted by air pollution from factories and coal-fired power plants.

Studies suggest that the dirty air contributes to 1.6 million deaths a year, about 17% of all mortalities.

Indico

Airborne dust is normally seen as an environmental problem, but the lack of it is making air pollution over China considerably worse. A new study suggests less dust means more solar radiation hits the land surface, which reduces wind speed. The researchers found that reduced dust levels cause a 13% increase in human-made pollution in the region. Hundreds of millions of people across China continue to be impacted by air pollution from factories and coal-fired power plants.

“There are two dust sources.

Aylien

Without the dust, the land warms up more and that changes the temperature differential with the sea leading to weaker breezes — and more air pollution. It weakens the temperature difference between the land and the sea and impacts the circulation of the winds and causes a stagnation over eastern China and that causes an accumulation of air pollution.” Another study has recently shown a link between declining Arctic sea ice and a major air pollution event in China in 2013.

Many parts of the world, in addition to China, are now suffering from increased levels of air pollution and understanding how dust, winds and emissions work together may help limit some of the worst impacts of dirty air.

Gensim

A new study suggests less dust means more solar radiation hits the land surface, which reduces wind speed. It weakens the temperature difference between the land and the sea and impacts the circulation of the winds and causes a stagnation over eastern China and that causes an accumulation of air pollution.”

Many parts of the world, in addition to China, are now suffering from increased levels of air pollution and understanding how dust, winds and emissions work together may help limit some of the worst impacts of dirty air.

Smmry

Airborne dust is normally seen as an environmental problem, but the lack of it is making air pollution over China considerably worse. Without the dust, the land warms up more and that changes the temperature differential with the sea leading to weaker breezes — and more air pollution. “There are two dust sources. One is the Gobi and the other is the highlands of north-west China, but we found the Gobi had much more influence,” said lead author Yang Yang, from the Pacific Northwest National Laboratory in Washington State, US. “Less dust in the atmosphere causes more solar radiation to reach the surface. It weakens the temperature difference between the land and the sea and impacts the circulation of the winds and causes a stagnation over eastern China and that causes an accumulation of air pollution.” The decreases in dust emissions are considerable, varying by almost a third.

Many parts of the world, in addition to China, are now suffering from increased levels of air pollution and understanding how dust, winds and emissions work together may help limit some of the worst impacts of dirty air.

In this example, we observe that the summary produced by our summarisation system is readable, coherent, and conveys the gist of the original article. The Indico summary is very similar to ours, except at the end, where it picks a truncated sentence that breaks the flow of the text and makes it less readable. The remaining three benchmarks (Aylien, Gensim, Smmry) seem to have a different notion of what constitutes a “relevant” sentence than the Skim API and Indico summarisers, and also generate more diverse summaries among themselves (apart from the last sentence, which is exactly the same for all three). However, my subjective impression is that these systems tend to generate less clear, less fluent, and less articulate summaries. An exception is the summary generated by Smmry, which is readable and coherent, showing that lower ROUGE scores do not necessarily imply poor readability.

Limitations:
We acknowledge that the evaluation presented here still has room for improvement, namely in terms of the data used. We rely on our own collected test corpus and assess each system on a small number of documents (i.e., 80 web pages). Although we did our best to collect an unbiased corpus, the data may still carry inherent biases. To control for these biases, we plan to extend this evaluation by running experiments on publicly available corpora, such as the CNN/Daily Mail dataset, in order to ensure reproducibility of results.

Final Remarks

In this blog post, we offered a glimpse of our journey into automatic text summarisation by explaining how our current summarisation technology works and how it compares with alternative summarisation systems. Based on the evaluation reported here, which relied on article-style web pages, our extractive summarisation model outperforms well-known extractive summarisation systems, such as those provided by Indico, Aylien, Smmry, and Gensim, in terms of ROUGE-N (for N in {1, 2, 3, 4}).

Nevertheless, extractive summarisation still has a long way to go before it can match a human’s ability to consistently generate high-quality, coherent summaries for any type of document (for instance, literary books) or web page (e.g., product reviews, tutorials). This is intrinsically related to the difficulty of the summarisation task itself and the current inability of computers to master natural language as humans do. However, recent developments in automatic text summarisation, fostered by the emergence of deep learning, look encouraging and very promising. See, for instance, the work of Narayan et al. and Nallapati et al. on extractive summarisation, and the work of See et al., Zhou et al., and Paulus et al. on abstractive summarisation. Given the successful advances in automatic summarisation using Recurrent Neural Networks (RNNs), we want to further investigate the commercial applications of RNNs in summarisation against our current system, which already serves our users with coherent and readable summaries.

Originally published at blog.skimtechnologies.com on May 23, 2017.

Marcia Oliveira

Lead data scientist at Deeper Insights. PhD in Network Science. Python enthusiast.