Text Summarization, Part 3 — Data Pipeline and Results

Mohamed Bamouh
Besedo Engineering Blog
7 min read · Mar 15, 2022

This chapter is the third entry of a blog post series about Automatic Text Summarization.

The first chapter introduced the reader to this sub-task of Natural Language Processing by enumerating the different methods to achieve this task (Abstractive, Extractive) and the available metrics (ROUGE, BLEU) to evaluate their performance.

The second chapter gave more details about the technical aspect of multiple state-of-the-art models, in addition to enumerating some datasets used to estimate those models’ parameters.

In this chapter, we’re going to present the data pipeline used to train our summarization models, as well as showcase the results of the training procedure on our test sets.

Note: This article is a direct continuation of chapter 2, so please read that chapter beforehand.

Selected Models

From the benchmark results obtained in chapter 2, we selected the methods that rely on the Transformer architecture. Their black-box nature lets us build a generic, scalable, and modular data pipeline around them, without drastically changing our code architecture when switching between models.

We also selected those methods due to their relative simplicity compared to more complex frameworks, like GSum or MatchSum, which rely on multiple models and algorithms to reach their full potential.

Extractive Summarization: BertSumExt (using BERT as the encoder, limited to 512 tokens per text) and BertSumExt with Longformer as the encoder, to handle long texts that exceed 512 tokens

Abstractive Summarization: BART and PEGASUS

Selected Datasets

The datasets used to fine-tune/train and evaluate our models are:

  • CNN/DailyMail
  • XSum
  • Reddit TIFU
  • Reddit TL;DR

More information about these datasets (links, descriptions, licenses) can be found in the second chapter.

Tools used

Python is the language used to build this project from beginning to end.

To code our neural network models and the necessary functions to train them, PyTorch was used.

The HuggingFace library was used to load the appropriate datasets, tokenizers, and models, with weights initialized from self-supervised pre-training on large text corpora. This gives us solid token embeddings from the start and lets us focus on fine-tuning the models for the summarization task.

The pre-trained checkpoints are:

  • BART: facebook/bart-large
  • PEGASUS: google/pegasus-large
  • BERT (for BertSumExt): bert-base-uncased
  • Longformer (for BertSumExt): allenai/longformer-base-4096
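
As a rough illustration, these checkpoints can be loaded through HuggingFace's transformers library. This is only a minimal sketch: the BertSumExt sentence-classification head is a separate module built on top of the loaded encoder and is not shown here.

```python
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer

# Abstractive models: full encoder-decoder checkpoints.
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

pegasus_tokenizer = AutoTokenizer.from_pretrained("google/pegasus-large")
pegasus_model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-large")

# Extractive models: only the encoder is loaded here; BertSumExt adds its own
# sentence-level classification layers on top of these weights.
bert_encoder = AutoModel.from_pretrained("bert-base-uncased")
longformer_encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")
```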

For the orchestration of the data pipeline and the version control of large files (especially models and datasets), DVC was used.

To keep track of the training and evaluation metrics (loss, validation loss, ROUGE…), TensorBoard was used.

Data Pipeline steps

1 — Extract

The data is loaded from the hard drive or downloaded via the Internet.
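
For the public datasets, this step boils down to a call to the HuggingFace datasets library, which downloads the data once and caches it on disk. The identifiers and configurations below are the ones published on the HuggingFace hub and are given as an illustration, not as our exact internal setup.

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")    # news articles + highlights
xsum = load_dataset("xsum")                        # BBC articles + one-sentence summaries
reddit_tifu = load_dataset("reddit_tifu", "long")  # Reddit posts + TL;DR summaries
```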

2 — Preprocess

The data is preprocessed (selection of relevant features, text cleaning, filtering…), then divided into training, validation, and test sets.

While this step is straightforward for datasets used to train Abstractive summarization models, additional steps are required for preprocessing datasets used to train Extractive Summarization models.

First, we segment the text into sentences using Stanford CoreNLP. Each sentence will then be classified as belonging to the extracted summary or not.

A common feature of all the datasets mentioned in chapter 2 is that the provided summary is written by a human. Since extractive summarization is a classification problem, we need binary labels for each segmented sentence of each text in the dataset.

Indeed, Abstractive summarization datasets only need pairs of raw text (document, summary) to be exploitable by virtually any abstractive model based on an encoder-decoder architecture.

Extractive datasets need labels for each sentence (0 — the sentence is not part of the summary, 1 — the sentence is part of the summary) to be exploitable since classification models predict labels, not raw text.

To obtain these labels, we convert our inherently abstractive datasets into extractive ones by selecting the combination of sentences from the original document that is most similar to the human-written summary (using ROUGE). This procedure is largely inspired by Yang Liu's paper ("Fine-tune BERT for Extractive Summarization") [3].

From Abstractive dataset to Extractive dataset
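
Below is a simplified sketch of this greedy labeling step, in the spirit of [3]. The rouge_score package, the ROUGE-1 + ROUGE-2 criterion, and the cap of three selected sentences are assumptions for illustration, not necessarily the exact settings we used.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)


def _rouge(candidate: str, reference: str) -> float:
    scores = _scorer.score(reference, candidate)
    return scores["rouge1"].fmeasure + scores["rouge2"].fmeasure


def greedy_oracle_labels(sentences, human_summary, max_sentences=3):
    """Greedily select the sentences closest to the human-written summary
    and return one binary label (0/1) per sentence."""
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_candidate = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            candidate = " ".join(sentences[j] for j in sorted(selected + [i]))
            score = _rouge(candidate, human_summary)
            if score > best_score:
                best_score, best_candidate = score, i
        if best_candidate is None:   # no remaining sentence improves the score
            break
        selected.append(best_candidate)
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

Each (document, summary) pair can then be run through such a function once, and the resulting label vector stored alongside the segmented sentences.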

3 — Tokenize

The sentences and words are segmented into tokens (semantic or syntactic units, such as words or subwords), then a vocabulary is built to encode text into tokens and vice versa.
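
Here is what this looks like with a HuggingFace tokenizer, assuming the BART checkpoint (the toy texts and the length limits are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

documents = ["The quick brown fox jumps over the lazy dog. It then runs away."]
summaries = ["A fox jumps over a dog and runs away."]

inputs = tokenizer(documents, max_length=1024, truncation=True,
                   padding="longest", return_tensors="pt")
targets = tokenizer(summaries, max_length=128, truncation=True,
                    padding="longest", return_tensors="pt")

# inputs.input_ids / inputs.attention_mask feed the encoder;
# targets.input_ids serve as the decoder labels during training.
print(inputs.input_ids.shape, targets.input_ids.shape)
```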

4 — Train

The parameters of the models are initialized, then updated by minimizing a cost function according to the update strategies defined by the optimizers.

In order to ensure stable convergence of the cost function towards a good minimum, several processes and best practices have been implemented; here is a non-exhaustive list:

Gradient Clipping

Gradients are clipped to a specified maximum value to prevent them from exploding.
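
In PyTorch this is a single call before the optimizer step; a minimal sketch with a toy model (the clipping threshold is an assumption):

```python
import torch

model = torch.nn.Linear(10, 1)               # toy stand-in for the summarizer
loss = model(torch.randn(4, 10)).mean()
loss.backward()

# Cap the global L2 norm of the gradients; 1.0 is an illustrative threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```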

Gradient Accumulation

Gradient accumulation gives us an effectively unlimited batch size by postponing the model's weight update for a set number of steps, during which the gradients are computed and added up.

At the end of those steps, the weight update is performed using the mean of the accumulated gradients.
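
Here is a minimal sketch of such an accumulation loop, with a toy model and data standing in for the real summarizer and data loader (the accumulation factor is illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
train_loader = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(32)]

accumulation_steps = 8   # effective batch size = 8 * 2 = 16 samples per update

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(train_loader):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Dividing by accumulation_steps makes the summed gradients equal
    # the mean gradient over the large "virtual" batch.
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # weights only move once every accumulation_steps batches
        optimizer.zero_grad()
```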

Use of AdamW

In PyTorch, it may be judicious to use AdamW, as it corrects a subtle oversight in the implementation of Adam by decoupling weight decay from L2 regularization [1].
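
Switching optimizers is a one-line change; the learning rate and weight decay below are placeholders, not our actual hyperparameters:

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder for the summarization model

# AdamW applies weight decay directly to the weights instead of folding it
# into the gradients as L2 regularization.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```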

Learning rate variation

Instead of being fixed as a hyperparameter, the learning rate evolves alongside the model during the entire training phase.

Following Vaswani et al.'s work on Transformers [2], the learning rate lr is updated using the following formula.

lr = factor · min(iteration^(-0.5), iteration · warmup^(-1.5))

Where factor is a user-defined constant, iteration is the index of the current training iteration, and warmup is the number of iterations needed for the learning rate to ramp up from 0 to its initial value lr_initial.
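
This schedule can be implemented with a LambdaLR scheduler; the sketch below follows the Noam-style formula from [2], with illustrative values for factor and warmup (not the ones we actually used):

```python
import torch

model = torch.nn.Linear(10, 1)                              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0)   # base lr of 1: the lambda returns the actual rate

factor, warmup = 2e-3, 10_000                               # illustrative constants

def noam_lr(iteration: int) -> float:
    iteration = max(iteration, 1)                           # avoid 0 ** -0.5 on the first call
    return factor * min(iteration ** -0.5, iteration * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# scheduler.step() is then called once per training iteration, after optimizer.step().
```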

Regular evaluation and validation steps

In order to monitor the model's performance during training, the cost function and the ROUGE-1, ROUGE-2, and ROUGE-L metrics are computed on the validation set and stored on disk at regular intervals.
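
A sketch of this periodic logging, assuming the rouge_score package and PyTorch's TensorBoard writer (the log directory and the helper name are made up for the example):

```python
from rouge_score import rouge_scorer
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/summarization")   # hypothetical log directory
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def log_validation(step, val_loss, predictions, references):
    """Average ROUGE F1 over the validation set and push everything to TensorBoard."""
    rouge = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)
        for key in rouge:
            rouge[key] += scores[key].fmeasure / len(predictions)

    writer.add_scalar("validation/loss", val_loss, step)
    for key, value in rouge.items():
        writer.add_scalar(f"validation/{key}", value, step)
```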

5 — Predict

Abstractive Summarization: Summaries are generated by predicting the next token in an autoregressive way. The generation parameters used at prediction time are the same as for any text generation problem (beam search, length penalty, maximum/minimum length…).
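
With a HuggingFace model, these generation parameters map directly onto model.generate. The checkpoint and the settings below are illustrative assumptions, not our tuned values:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

text = "The factory opened in 1998. It employs 2,000 people. Production doubled last year."
inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    num_beams=4,          # beam search
    length_penalty=2.0,   # values > 1 favor longer summaries
    min_length=10,
    max_length=60,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```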

Extractive Summarization: The sentences whose indices maximize the score given by the binary classifier are selected and concatenated to form the summary. The number of selected sentences is defined by the user.
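
Selecting the summary from the classifier's sentence scores is then a simple top-k operation; in this sketch the scores are made up, whereas in practice they come from the BertSumExt classification head:

```python
import torch


def extract_summary(sentences, scores, num_sentences=3):
    """Keep the highest-scoring sentences and restore document order."""
    k = min(num_sentences, len(sentences))
    top = torch.topk(scores, k=k).indices
    return " ".join(sentences[i] for i in sorted(top.tolist()))


sentences = [
    "The factory opened in 1998.",
    "It employs 2,000 people.",
    "Officials said production doubled last year.",
]
scores = torch.tensor([0.91, 0.20, 0.75])   # dummy classifier scores
print(extract_summary(sentences, scores, num_sentences=2))
```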

Results

After fine-tuning our models with relevant hyperparameters for each of the selected datasets and testing their performance on the appropriate test sets, we get the following table:

Results of the benchmark (Blue rows = Abstractive Methods / Purple rows = Extractive Methods)

We observe that the abstractive methods display the best performance across all datasets.

However, we mustn't forget that ROUGE is computed between the generated/extracted summary and the human-written summary. This may disadvantage models based on the extractive method, whose summaries are made of snippets taken from the original text and therefore depend on how condensed the information already is in the original text.

This explains why the extractive methods provide better results on datasets from the press domain (CNN/DailyMail, XSum) than on datasets from discussion boards (Reddit TIFU, Reddit TL;DR).

In other words, in the former, the information is presented in a digestible, fluid, and clear format. Since the articles are written by professionals, the information is already condensed around a few key points, which makes it easy for extractive methods to select a handful of sentences that carry the meaning of the entire text.

This is usually not the case in datasets using more casual writing (Reddit TIFU), where the information is sparser.

Since the Reddit TL;DR dataset is too large for training (4 million rows), it was used purely to evaluate our models during inference.

Model inference results on Reddit TL;DR (Blue rows = Abstractive Methods / Purple rows = Extractive Methods)

We notice at first glance that the results are close to each other. Taking into account the observations from the first table, it is not surprising that the abstractive models trained on Reddit TIFU provide the better results in terms of ROUGE score.

The drastic drop in ROUGE scores between the first table and the second one can be attributed to the difficulty our models have in generalizing their vocabulary to the varied lexical fields of the communities that make up the Reddit TL;DR dataset, which range from video games to sports, science, and general discussion.

After manually reading parts of the dataset, we found other reasons for both the decline in results and their similarity despite the use of different models:

  • The lack of standardization of summaries in terms of their length and/or content.
  • The relative subjectivity of the text and summaries through the use of humor and sarcasm.
  • The inclusion of users' personal feelings as a summary, rather than an objective summary of their messages.

These factors significantly hurt the models' capacity to generalize to broader concepts.

Conclusion

We conclude that abstractive methods generally provide the best results for any type of text, but that the extractive methods are still more useful when we want to minimize potential divergences from the original meaning of the text due to their robustness.

For news articles and blog posts, I’d opt for the extractive method to get a comprehensible summary, whereas, in the case of forum posts and messages, I’d opt for the abstractive method.

References

[1] AdamW and Super-convergence is now the fastest way to train neural nets. (Link)

[2] Ashish Vaswani et al. “Attention Is All You Need”. CoRR abs/1706.03762 (2017). arXiv:1706.03762

[3] Yang Liu. “Fine-tune BERT for Extractive Summarization”. arXiv:1903.10318
