Build an Automatic Abstractive Text Summarizer in Ten Minutes

using the Transformers, PyTorch, and SentencePiece libraries in Python.

Dhiraj Prakash
CodeX
4 min read · Sep 21, 2021



Text summarization is the process of shortening a long text while retaining its key elements and meaning. Summarizing text manually takes a lot of time, so automating the process with state-of-the-art NLP models is an attractive solution. NLP approaches this problem in two ways:

Extractive Text Summarization: This method aims to identify the most important sentences in a given text, which are then extracted and grouped to form a shorter version of the text.

Abstractive Text Summarization: This method aims to create a brief and concise summary of a source text that captures the main points. The produced summaries may include additional phrases and sentences not found in the original text.

We are going to use Google’s PEGASUS model for our task at hand.


PEGASUS stands for Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-sequence models. Its authors designed a self-supervised pre-training objective called gap-sentence generation to train transformer models.

Pegasus Architecture

Pegasus adapts the Transformer architecture: it uses an Encoder-Decoder model for seq2seq learning. The input text is fed into the Encoder in parallel, which produces a context vector. A context vector is simply a numerical representation of the text, since machines cannot work with raw words directly. This vector is then fed to the Decoder, which decodes it and produces the summary. To learn more about Transformers, you can read the article “The Illustrated Transformer” by Jay Alammar.

To know more about Pegasus, you can read this article.

Here is a step by step outline of the procedure we are going to follow:

  1. Install the required libraries.
  2. Import the required libraries.
  3. Initialize the model.
  4. Feed input to the model.
  5. Obtain the summarized text.

The complete code can be found here.

Install the required libraries

Let’s install all the libraries required for the task.

Uncomment the code to install the required libraries. This code is for the CPU version. If you want to run it on a GPU, visit the PyTorch website and install the GPU build instead.

Import the required libraries

After the required packages have been installed, we will need to import them using the following lines of code.

The transformers library provides a variety of pre-trained AI models for various tasks. We are using the Pegasus model for our text summarization.

torch is an open-source machine learning library. It provides a wide range of algorithms for deep learning.
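Assuming the standard Transformers API, the imports for this task look roughly like the following:

```python
# Pegasus model and tokenizer classes from Hugging Face Transformers;
# torch is used later to choose between CPU and GPU.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
```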

Initialize the model

After the required libraries are imported, we can initialize the model.

We are using the model which is pre-trained on the XSUM dataset.

XSum is a dataset of BBC articles paired with single-sentence summaries, built for training and evaluating extreme abstractive summarization. To know more about XSUM, you can read the following paper.

Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units. These smaller units are called tokens. We are using the Pegasus tokenizer for this task.

The model configuration is stored and retrieved using the from_pretrained method.

The model is then initialized and can be accessed by using the model variable.

Feed input to the model

src_text is the source text which has to be summarized. You can replace this text with the text you want the model to summarize. The input cannot be fed all at once. We have to divide it into small batches and feed it into the model. We then create another variable named tgt_text that will store the output produced by the model.

Obtain the summarized text

After the model has summarized the text, it is stored in the tgt_text variable. We can see the output by calling the print function on it.

And your abstractive text summarizer is ready, without your having to write hundreds of lines of code. We used a pre-trained version of the model; it can also be fine-tuned for a specific purpose if you require it.

Summary

Here’s a summary of the step-by-step process we followed to perform abstractive text summarization.

  1. Installed the required libraries.
  2. Imported the required libraries.
  3. Initialized the model.
  4. Fed the input to the model in batches.
  5. Obtained the summarized text.
