Text Summarization, Part 2 — State of the Art and Datasets

Mohamed Bamouh
Besedo Engineering Blog
Mar 9, 2022

This chapter is the second entry in a blog post series about Automatic Text Summarization. In the first post, we introduced this sub-task of Natural Language Processing, enumerating the different methods used to tackle it (Extractive and Abstractive) and the metrics available to evaluate their performance (ROUGE and BLEU scores).

This chapter delves deeper into the domain: it enumerates the most notable methods and datasets for automatic text summarization and gives more technical details about each highly performant architecture.

Some familiarity with Transformers, especially BERT [8], is required to fully grasp the main idea behind each architecture.

The common point between all the models described below is that they take tokens (semantic or syntactic units such as words or sub-words) as input. Those tokens are then converted to vectors.
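As a minimal illustration (using the Hugging Face transformers library and the bert-base-uncased checkpoint, which are our choices here rather than anything prescribed by the papers), here is how a sentence is split into tokens and turned into one contextual vector per token:

```python
# pip install transformers torch
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "Automatic summarization condenses a document into a shorter version."
inputs = tokenizer(text, return_tensors="pt")  # tokens -> integer ids
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```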

Extractive Summarization

Reminder: Automatic Text Summarization via the Extractive method forms a summary by selecting the most pertinent sentences from the text and concatenating them.

BertSumExt

BertSumExt [1] is a neural network based on a modified BERT encoder, designed for extractive text summarization.

The modification mainly consists of inserting a [CLS] token (a special classification token that comes to represent the whole sentence) before each sentence of the text and a [SEP] token (a special separator marking the boundary between two sentences) after it, and assigning alternating segment embeddings to consecutive sentences. The objective is to obtain sentence-level contextual representations that are fed to a classifier for binary classification.

In other words, the model's input consists of tokens, with a special [CLS] token representing each sentence. For each of these "sentence-level" tokens, the model outputs the probability that the corresponding sentence is part of the extracted summary.

Original BERT architecture [1]
BertSum architecture [1]
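To make this input format concrete, here is a rough sketch of how a BertSum-style input could be prepared, with a [CLS] token before each sentence, a [SEP] token after it, and alternating segment ids. This is an illustrative approximation using the stock BERT tokenizer, not the authors' implementation:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "The storm hit the coast overnight.",
    "Thousands of homes lost power.",
    "Repairs are expected to take several days.",
]

input_ids, segment_ids, cls_positions = [], [], []
for i, sent in enumerate(sentences):
    ids = (
        [tokenizer.cls_token_id]
        + tokenizer.encode(sent, add_special_tokens=False)
        + [tokenizer.sep_token_id]
    )
    cls_positions.append(len(input_ids))   # index of this sentence's [CLS] token
    input_ids.extend(ids)
    segment_ids.extend([i % 2] * len(ids))  # alternating segment embeddings per sentence

# The contextual vectors at cls_positions are what the binary classifier scores
# to decide whether each sentence belongs in the extractive summary.
print(cls_positions)
print(segment_ids[:15])
```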

HIBERT

HIBERT (Hierarchical Bidirectional Transformers for Document Summarization) [2] is also a BERT-inspired transformer. It relies on two transformer encoders:

  • The first one builds a contextual representation of each sentence (one document = many sentences),
  • The second one works at the document level and outputs the probability of each sentence being part of the summary.

While similar to BertSumExt (see the previous section), the main difference is that two neural networks are used: one groups "word-level" tokens into "sentence-level" representations, and the other computes the probability of those "sentence-level" representations being part of the extracted summary.

HIBERT Architecture [2]
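The following is a minimal PyTorch sketch of this two-level idea, with illustrative sizes, mean-pooling standing in for HIBERT's actual sentence representation, and randomly generated "token embeddings". It only shows how the two encoders are stacked, not the paper's full model or pre-training:

```python
import torch
import torch.nn as nn

d_model = 256  # hidden size (illustrative)
sent_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
doc_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
classifier = nn.Linear(d_model, 1)  # scores each sentence for inclusion in the summary

# Toy document: 5 sentences of 12 already-embedded tokens each
token_embeddings = torch.randn(5, 12, d_model)

# Level 1: encode tokens within each sentence, pool into one vector per sentence
sentence_vectors = sent_encoder(token_embeddings).mean(dim=1)       # (5, d_model)

# Level 2: encode the sequence of sentence vectors across the document
doc_context = doc_encoder(sentence_vectors.unsqueeze(0))             # (1, 5, d_model)

probs = torch.sigmoid(classifier(doc_context)).squeeze(-1)           # (1, 5)
print(probs)  # one inclusion probability per sentence
```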

MatchSum

According to the paper introducing this method [3], there are two approaches to extractive summarization in the literature:

  • Sentence-level: score sentences one by one, independently or not;
  • Summary-level: consider the semantics of the entire summary.

The summary-level approach presented in the paper, called MatchSum [3], proceeds as follows:

  • Prune unnecessary sentences from the document to reduce the number of candidate summaries, using a sentence scorer (like BertSumExt).
  • Extract multiple candidate summaries using an extractor based on a Sentence-Level Score and a Summary-Level Score; both are based on the ROUGE score (a metric we introduced in Part 1).

— Pearl-Summary: low Sentence-Level Score but high Summary-Level Score

— Best Summary: maximum Summary-Level Score

  • Match the document against each candidate summary with a cosine-similarity layer during the inference phase and select the best-matching candidate as the final summary.

This way, the extracted summary is semantically closest to the original text.

MatchSum [3]
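The matching step can be sketched as follows. Here we use the [CLS] vector of an off-the-shelf bert-base-uncased model as the text embedding, whereas MatchSum fine-tunes a Siamese BERT with a margin-based loss, so this only illustrates the cosine-similarity selection, not a faithful reproduction:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # [CLS] vector as the text embedding, in the spirit of MatchSum's Siamese encoder
        return encoder(**inputs).last_hidden_state[:, 0, :]

document = ("The storm hit the coast overnight. Thousands of homes lost power. "
            "Officials said repairs could take several days.")
candidates = [
    "The storm hit the coast overnight. Thousands of homes lost power.",
    "Repairs are expected to take several days.",
]

doc_vec = embed(document)
scores = [torch.cosine_similarity(doc_vec, embed(c)).item() for c in candidates]
best = candidates[scores.index(max(scores))]
print(scores, best)
```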

Abstractive Summarization

Reminder: Automatic Text Summarization via the Abstractive method consists of forming a summary the same way a human would, by understanding the text and writing a shorter, condensed version of it with minimal data loss.

BertSumAbs

BertSumAbs [1] comes from the same paper that introduced BertSumExt. It also relies on a modified BERT, this time to generate summaries following the abstractive method.

BertSumAbs makes use of the encoder-decoder architecture for generating summaries in an auto-regressive manner (token by token).

The encoder’s purpose is to transform tokens into vectors, which contain relevant syntactic and semantic information.

The decoder’s purpose is to generate tokens from the vectors produced by the encoder, according to the task at hand (summarization, translation, generation).

The encoder-decoder architecture in the context of Automatic Translation

BERTSum is used as the encoder, and the decoder is a 6-layer, randomly initialized Transformer. The encoder and the decoder use different optimizers during training.
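A minimal sketch of that training setup is shown below, using the Hugging Face EncoderDecoderModel as a stand-in (here both encoder and decoder are warm-started from BERT for simplicity, unlike the paper's randomly initialized 6-layer decoder) and with illustrative learning rates rather than the paper's exact schedules:

```python
import torch
from transformers import EncoderDecoderModel

# Warm-started BERT encoder + BERT-initialised decoder (a simplification of BertSumAbs,
# which pairs the BERTSum encoder with a randomly initialised 6-layer decoder).
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Two separate optimizers, mirroring the paper's trick of fine-tuning the pretrained
# encoder more gently than the freshly initialised decoder.
encoder_optimizer = torch.optim.Adam(model.encoder.parameters(), lr=2e-5)
decoder_optimizer = torch.optim.Adam(model.decoder.parameters(), lr=1e-4)
```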

BART

BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension) [4] is a neural network designed for text generation.

The initial parameters of the model are obtained during a self-supervised pre-training phase, in which the model is trained to reconstruct documents to which noise has been introduced.

The noise may take multiple forms, ranging from removing tokens to permuting sentences.

Types of Noise used for BART [4]

BART is based on a Transformer type architecture, comprising both a bidirectional encoder and an auto-regressive decoder, like BertSumAbs. As a result, BART is often presented as a generalization of BERT and GPT2, whose architectures are respectively based on the encoder and the decoder.
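For a quick taste of BART as a summarizer, here is a minimal example using the publicly available facebook/bart-large-cnn checkpoint (a BART model fine-tuned on CNN/DailyMail) through the Hugging Face transformers library; the generation parameters are illustrative:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = ("The storm hit the coast overnight, leaving thousands of homes without power. "
           "Officials said repairs could take several days as crews clear fallen trees.")

inputs = tokenizer(article, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```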

PEGASUS

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) [5] is a specially designed and pre-trained neural network for the automatic text summarization task.

Like BART, PEGASUS is based on the complete architecture of the Transformer, combining both encoder and decoder for text generation. The main difference between the two methods is how self-training is performed.

The initial PEGASUS parameters are obtained by filling gaps in text from which specific sentences and tokens have been removed.

Pegasus Architecture [5]

Since this self-supervised task is particularly difficult, even for humans, the model is forced to learn latent concepts about the syntactic and semantic aspects of the language, which substantially increases its ability to summarize the information present in any piece of text.
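PEGASUS checkpoints are also available through the Hugging Face transformers library (the sentencepiece package is required for the tokenizer). A minimal example with the google/pegasus-xsum checkpoint, using illustrative generation parameters:

```python
# pip install transformers sentencepiece
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-xsum")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

article = ("The storm hit the coast overnight, leaving thousands of homes without power. "
           "Officials said repairs could take several days as crews clear fallen trees.")

inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```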

GSum

GSum [6] is a framework for automatic text summarization based on guidance signals. These signals can be keywords or phrases entered manually or selected by an algorithm, or even summaries obtained via the extractive method.

The reason for using such signals is to gain a level of control over the syntactic and semantic aspects of the generated summary's content. These signals are fed to the model alongside the document to summarize.

  • According to the paper [6], “it can be difficult to control the content of summaries; it is hard to pick in advance which aspects of the original content an abstractive system may touch upon.”
GSum Architecture [6]

The guidance signals are either manually set, automatically predicted from the source document X, or predicted using both X and the reference summary Y (oracle extraction).

  • Oracle extraction can be done using extractive methods like MatchSum.
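GSum itself encodes the guidance signal with a dedicated encoder and lets the decoder attend to it through an extra cross-attention block. As a very rough, hypothetical illustration of the guided-summarization idea (and explicitly not the GSum architecture), one can simply prepend guidance keywords to the document fed to an off-the-shelf seq2seq summarizer:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

guidance = "power outage ; repair time"  # keywords chosen manually or by an extractive model
article = ("The storm hit the coast overnight, leaving thousands of homes without power. "
           "Officials said repairs could take several days as crews clear fallen trees.")

# Crude conditioning: concatenate guidance and document. GSum instead encodes them
# separately and lets the decoder attend to both through two cross-attention blocks.
inputs = tokenizer(guidance + " | " + article, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```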

Pointer-Generator Networks

This method [7], which predates the emergence of Transformer-type architectures, is based on recurrent neural networks (RNNs) and attention weights to generate summaries.

The neural network's cost function is altered in two ways: it penalizes tokens according to their occurrence frequency, to avoid unnecessary repetitions in the summary, and it handles rare or out-of-vocabulary tokens by favouring, in some instances, copying a token from the source text rather than generating it from the model's vocabulary.
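The copy mechanism at the heart of the model can be summarized in a few lines: the final distribution mixes the decoder's vocabulary distribution with the attention distribution over the source tokens, weighted by a generation probability p_gen. Here is a minimal single-step sketch, with hypothetical names and toy sizes:

```python
import torch

def final_distribution(p_gen, vocab_dist, attn_dist, src_ids, vocab_size):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on source copies of w."""
    copy_dist = torch.zeros(vocab_size)
    copy_dist.scatter_add_(0, src_ids, attn_dist)  # route attention weights to the tokens they point at
    return p_gen * vocab_dist + (1 - p_gen) * copy_dist

vocab_size = 10
vocab_dist = torch.softmax(torch.randn(vocab_size), dim=0)  # decoder's generation distribution
attn_dist = torch.softmax(torch.randn(4), dim=0)            # attention over a 4-token source
src_ids = torch.tensor([2, 7, 7, 3])                        # vocabulary ids of the source tokens

print(final_distribution(0.7, vocab_dist, attn_dist, src_ids, vocab_size))
```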

Datasets

All of the methods above rely on some form of Machine Learning/Deep Learning, such as RNNs, Transformers, or a combination of both.

These models are also trained in a supervised setting: they rely on input (text) / output (summary) pairs to learn the probability that a sentence belongs in the summary in the extractive case, and the likelihood of the next word in the abstractive case.

Many datasets provide such text-summary pairs; here are some examples:

CNN / Dailymail (2015)

This is the reference dataset, used to evaluate almost all text summarization models in the literature. It consists of CNN and DailyMail news articles paired with their summaries.

Reddit TIFU (2018)

It consists of posts written by a wide variety of users on an online discussion platform (Reddit). Reddit TIFU differs from most other NLP datasets in that it may contain informal language and slang, whereas the vocabulary of typical NLP datasets is less casual.

Webis-TLDR-17 / Reddit TL;DR (2017)

The dataset consists of 4 million content-summary pairs extracted from Reddit posts written between 2006 and 2016. Like Reddit TIFU, the content is more casual and conversational than news articles.

XSum (2018)

The dataset consists of one-sentence summaries of newspaper articles.
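Most of these datasets can be downloaded through the Hugging Face datasets library. The snippet below loads small slices of CNN/DailyMail and XSum (the configuration and field names are those exposed on the Hub at the time of writing and may change):

```python
# pip install datasets
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")
xsum = load_dataset("xsum", split="validation[:1%]")

print(cnn_dm[0]["article"][:200])
print(cnn_dm[0]["highlights"])   # the reference summary
print(xsum[0]["document"][:200])
print(xsum[0]["summary"])        # the one-sentence reference summary
```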

Selected Methods

Below is a comparison table of all the presented methods across the datasets. The figures come from the papers that introduced each model:

Comparison Table (Purple: Extractive Methods / Blue: Abstractive Methods)

From the table, we notice that the abstractive methods are generally associated with better performance.

  • GSum, in particular, provides the best results for CNN / DailyMail and Reddit TIFU Long datasets.
  • PEGASUS-large provides the best results for the XSum dataset.
  • BART also seems to provide consistent and satisfactory results on news-based datasets.

We can also notice that the table contains several gaps due to missing data. The most apparent one is that the extractive methods ignore the Reddit TIFU dataset, which is essential for evaluating models on text data with more informal vocabulary.

Moreover, the papers don’t seem to use BLEU as an evaluation metric. ROUGE on its own is already a solid metric: ROUGE-1 and ROUGE-2 measure unigram and bigram overlap, ROUGE-L measures the longest common subsequence, and each can be reported in terms of recall, precision, or F1.

Therefore, to make a fairer comparison between the different methods, we need to benchmark the various techniques ourselves and evaluate them against all the datasets mentioned above.
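For reference, ROUGE scores can be computed with the rouge-score package (Google's reference implementation); a small example with made-up reference and candidate summaries:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "Thousands of homes lost power after the storm; repairs may take days."
candidate = "The storm left thousands without power and repairs could take several days."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(name, f"precision={score.precision:.3f}",
          f"recall={score.recall:.3f}", f"f1={score.fmeasure:.3f}")
```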

Conclusion

That’s it for this chapter! We hope you enjoyed this overview of the methods that constitute the state of the art in Automatic Text Summarization.

The next chapter will focus on our research work as we train the AI models for the selected methods and obtain/interpret our results.

Stay tuned!

References

[1] Yang Liu and Mirella Lapata. Text Summarization with Pretrained Encoders. 2019. arXiv:1908.08345

[2] Xingxing Zhang, Furu Wei and Ming Zhou. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. 2019. arXiv:1905.06566

[3] Ming Zhong et al. Extractive Summarization as Text Matching. 2020. arXiv:2004.08795

[4] Mike Lewis et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. 2019. arXiv:1910.13461

[5] Jingqing Zhang et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. 2020. arXiv:1912.08777

[6] Zi-Yi Dou et al. GSum: A General Framework for Guided Neural Abstractive Summarization. 2021. arXiv:2010.08014

[7] Abigail See, Peter J. Liu and Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. 2017. arXiv:1704.04368

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv:1810.04805
