Understanding the BERTSUM Family: BERTSUM, BERTSUMABS, and BERTSUMEXTABS

Gundluru Chadrasekhar · Scavs.ai · Jun 19, 2020

Before going through the introduction, I recommend reading this blog for a better understanding of abstractive summarization and of some advanced techniques for generating good summaries.

Introduction

Bidirectional Encoder Representations from Transformers (BERT) advanced a wide range of natural language processing tasks. Soon after the release of the paper describing the model, the team also open-sourced the code and made available for download versions of the model already pre-trained on massive datasets. This is a momentous development, since it lets anyone building a machine learning model that involves language processing use this powerhouse as a readily available component, saving the time, energy, knowledge, and resources that would otherwise go into training a language model from scratch. It can also be usefully applied to text summarization, in both extractive and abstractive models.

For more details about BERT, check out this beautifully written blog. From here on, I assume you are familiar with BERT.

BERTSUM:

Well, we know that extractive summarization is the selection of important sentences from a document. Consider the task of extractive summarization on a document containing the sentences [sent1, sent2, ..., sentm].

We can then frame extractive summarization as a binary classification task: each sentence is assigned a label indicating whether or not it should be included in the summary, under the assumption that the summary sentences represent the most important content of the document.

In BERTSUM, a [CLS] token is inserted before each sentence, and the vector of the i-th [CLS] symbol from the top layer is used as the representation of the i-th sentence.

ŷᵢ = σ(Wₒ hᵢ + bₒ)

where hᵢ is the [CLS] vector for the i-th sentence from the top layer of the Transformer. The paper stacks L inter-sentence Transformer layers on top of BERT, experiments with L = 1, 2, and 3, and finds that L = 2 performs best. The loss is the binary cross-entropy of the prediction ŷᵢ against the gold label yᵢ. This model is named BERTSUM.
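To make this concrete, here is a minimal PyTorch sketch of such an extractive head, assuming we already have one [CLS] vector per sentence from BERT's top layer. The class name, layer sizes, and mask handling are my own illustrative choices, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class ExtractiveHead(nn.Module):
    """BERTSUM-style extractive head (sketch): scores each per-sentence
    [CLS] vector h_i with sigmoid(W h_i + b_o) after a small inter-sentence
    Transformer (L = 2 layers worked best in the paper)."""
    def __init__(self, hidden_size=768, num_inter_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=8,
                                           batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_inter_layers)
        self.classifier = nn.Linear(hidden_size, 1)  # W, b_o

    def forward(self, cls_vectors, cls_mask):
        # cls_vectors: (batch, num_sentences, hidden), one vector per [CLS]
        # cls_mask:    (batch, num_sentences), 1 for real sentences, 0 for padding
        h = self.inter_sentence(cls_vectors,
                                src_key_padding_mask=(cls_mask == 0))
        scores = torch.sigmoid(self.classifier(h)).squeeze(-1)  # y_hat_i
        return scores * cls_mask  # zero out padded positions
```

Training minimizes nn.BCELoss() between these scores and the gold 0/1 labels; at inference, sentences are ranked by score and the highest-scoring ones are selected as the summary.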

BERTSUMABS:

Abstractive text summarization is the task of generating a headline or a short summary consisting of a few sentences that capture the salient ideas of an article or a passage. We use the adjective ‘abstractive’ to denote a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document.

Abstractive summarization is a very different problem from machine translation (MT). Unlike in MT, the target (summary) is typically very short and does not depend much on the length of the source (document). It is also considered a more complicated task than extractive summarization.

BERTSUMABS is trained for abstractive summarization using a standard encoder-decoder framework. Here, the encoder is the pre-trained BERTSUM and the decoder is a 6-layer Transformer trained from scratch.
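As a rough sketch of this architecture (plain bert-base-uncased stands in for the BERTSUM document encoder, and the class name, hyperparameters, and omission of decoder positional embeddings are simplifications, not the paper's exact implementation):

```python
import torch.nn as nn
from transformers import BertModel

class AbstractiveSummarizer(nn.Module):
    """BERTSUMABS-style encoder-decoder (sketch): pre-trained BERT encoder,
    randomly initialized 6-layer Transformer decoder."""
    def __init__(self, vocab_size=30522, hidden_size=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")  # pre-trained
        decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_size, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)  # from scratch
        self.embed = nn.Embedding(vocab_size, hidden_size)   # decoder input embeddings
        self.generator = nn.Linear(hidden_size, vocab_size)  # projects to the vocabulary

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the source document with the pre-trained encoder.
        memory = self.encoder(input_ids=src_ids,
                              attention_mask=src_mask).last_hidden_state
        # Decode the summary tokens with a causal (left-to-right) mask.
        tgt = self.embed(tgt_ids)  # (decoder positional embeddings omitted for brevity)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal,
                           memory_key_padding_mask=(src_mask == 0))
        return self.generator(out)  # logits over the vocabulary
```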

There is, however, a mismatch between the encoder and the decoder: BERTSUM is pre-trained while the decoder must be trained from scratch. This can make fine-tuning unstable; the encoder might overfit the data while the decoder underfits, or vice versa.

To avoid this, BERTSUMABS uses two separate optimizers for the encoder and the decoder, each with its own learning rate and warmup schedule, so that the pre-trained encoder is updated with smaller, smoother steps while the randomly initialized decoder learns faster.
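Concretely, the paper fine-tunes with Adam but uses a much smaller learning rate and longer warmup for the pre-trained encoder than for the from-scratch decoder. A sketch continuing the model above (the parameter split and scheduler wiring are my own; the learning rates and warmup steps are the ones reported in the paper):

```python
import torch

model = AbstractiveSummarizer()  # the sketch model from above

# Split parameters: pre-trained encoder vs. everything trained from scratch.
enc_params = [p for n, p in model.named_parameters() if n.startswith("encoder")]
dec_params = [p for n, p in model.named_parameters() if not n.startswith("encoder")]

# Paper settings: lr_enc = 2e-3 with 20,000 warmup steps,
#                 lr_dec = 0.1  with 10,000 warmup steps.
opt_enc = torch.optim.Adam(enc_params, lr=2e-3, betas=(0.9, 0.999))
opt_dec = torch.optim.Adam(dec_params, lr=0.1, betas=(0.9, 0.999))

def noam_factor(step, warmup):
    # Multiplier: linear warmup, then inverse-square-root decay.
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5)

sched_enc = torch.optim.lr_scheduler.LambdaLR(opt_enc, lambda s: noam_factor(s, 20000))
sched_dec = torch.optim.lr_scheduler.LambdaLR(opt_dec, lambda s: noam_factor(s, 10000))

# In the training loop: loss.backward(), then step both optimizers and both schedulers.
```

The effect is that the encoder is fine-tuned gently and is not disturbed while the decoder is still unstable early in training.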

BERTSUMEXTABS:

In addition to these two strategies, there is a two-stage fine-tuning approach: BERTSUMEXTABS first fine-tunes the encoder on the extractive summarization task and then fine-tunes it on the abstractive summarization task, since using extractive objectives can boost the performance of abstractive summarization.
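A hedged sketch of that two-stage schedule, reusing the classes from the earlier snippets; train_extractive, train_abstractive, and the data variables are hypothetical helpers standing in for the actual training loops:

```python
import torch
from transformers import BertModel

# Stage 1: fine-tune the BERT encoder together with the extractive head
# (the BERTSUM setup above) using binary cross-entropy, and keep the encoder.
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder = train_extractive(encoder, ExtractiveHead(), extractive_data)  # hypothetical helper
torch.save(encoder.state_dict(), "bertsum_encoder.pt")

# Stage 2: initialize the abstractive model's encoder from stage 1 and
# fine-tune encoder + decoder on the abstractive objective (BERTSUMEXTABS).
abs_model = AbstractiveSummarizer()
abs_model.encoder.load_state_dict(torch.load("bertsum_encoder.pt"))
train_abstractive(abs_model, abstractive_data)                          # hypothetical helper
```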

Code walkthrough to summarize a news article in 2 minutes using the T5 transformer

Want to know more about abstractive summarization? Check out this blog on state-of-the-art techniques for summarization.

Go over T5 Model Summary


Conclusion

I hope you now have a good intuition for each of these models. Check out the paper Text Summarization with Pre-trained Encoders for a more solid understanding of how BERT is fine-tuned for summarization.

References

  1. Applied AI Course, www.appliedaicourse.com
  2. Yang Liu and Mirella Lapata, "Text Summarization with Pretrained Encoders", https://arxiv.org/pdf/1908.08345.pdf
  3. BertSum code repository, https://github.com/nlpyang/BertSum
