Abstractive Text Summarization

Mitesh Dewda
Globant
Aug 18, 2022

There are two main approaches to automatic text summarization: abstractive and extractive. The main difference between them is how information is extracted from the document and how the summary is generated. I have explained extractive summarization in a separate article.

In this article we are going to discuss abstractive summarization. This method tries to rewrite and reformulate the text of the original document, which is closer to how a human summarizes.

Abstractive summarization concentrates on the most critical information in the original text and creates a new set of sentences for the summary. This technique entails identifying the key pieces, interpreting the context, and re-creating them in a new way. Because it requires both extracting relevant information from a document and automatically generating coherent text, abstractive summarization is considered a harder problem than extractive summarization. It works well with deep learning models like seq2seq and LSTM, along with popular Python packages (spaCy, NLTK, etc.) and frameworks (TensorFlow, Keras).

Ways to do Abstractive Summarization

So now that we know what abstractive summarization is, how do we do it? There are many ways of doing abstractive summarization; in fact, entire papers have been written surveying them. But we are not going to cover them all here, because if we did, this article itself would need a summarization!! 😜

We have used Hugging Face's Transformers library to perform abstractive summarization. Transformers provides thousands of pre-trained models, which can be used for text summarization as well as for a wide variety of NLP tasks such as text classification, question answering, translation, speech recognition, optical character recognition, etc.

What is Transformers?

The name reminds us of the famous science fiction movie Transformers, but unfortunately we will not see any of those cool robots changing shape here 🤖. We will see a transformation, though: not of robots, but of all those lengthy texts and documents whose summaries we want to generate.

Hugging Face Transformers is a very popular library of pre-trained models that have revolutionized NLP in the last few years and gained widespread attention in the ML community. Pre-trained means that these models were trained on some dataset (a large amount of raw text) in a self-supervised fashion, where the training objective is computed automatically from the model's inputs.

Summarization Process

The simplest way to summarize using the Transformers library is to use the summarization pipeline with an existing summarization model. We can import pipeline from transformers and pass the "summarization" task as a string argument. As we don't specify any model, the pipeline will use the default model, sshleifer/distilbart-cnn-12-6.
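A minimal sketch of this step (the input string here is just an illustrative placeholder):

```python
from transformers import pipeline

# Create a summarization pipeline; since no model is specified,
# the default sshleifer/distilbart-cnn-12-6 is downloaded and used.
summarizer = pipeline("summarization")

text = "..."  # the lengthy text we want to summarize
summary = summarizer(text)
print(summary[0]["summary_text"])
```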

In the above code, pipeline() is a simple API that abstracts away most of the complex code behind generating a summary. It takes a single parameter, the task, which defines the pipeline returned. In the above example we have used the "summarization" task, which returns a SummarizationPipeline.

Using Trained Models

We are going to use Google’s T5 Model from Transformers along with its tokenizer. As per the Transformers official documentation — “T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format.”

Fully understood, right!!! 😄

In simple terms, this is a Text-to-Text Transfer Transformer, where all NLP tasks are reframed into a unified text-to-text format in which the input and output are always text strings.

[Figure: the T5 text-to-text framework. Source: Google AI Blog]

Below is the process to generate a summary using the T5ForConditionalGeneration model:

  • First, we import the model and tokenizer, initialize the model architecture and weights, and then initialize the tokenizer, as shown below.
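A sketch of this step, using the t5-base checkpoint discussed next:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Initialize the model architecture and load the pre-trained weights,
# then initialize the matching tokenizer.
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
```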

In the above code we have used the from_pretrained() method to load the pre-trained t5-base model. The T5 model comes in different sizes: t5-small, t5-base, t5-large, t5-3b, and t5-11b. The variants differ in the number of parameters they contain, and consequently in the quality of the summaries they generate and the resources they require.

We have also initialized a tokenizer. Tokenizers translate the text into data that can be processed by the model. Models only understand numbers, so the tokenizer converts the provided input into numerical data.

The first time we execute this code it will take a while, because the t5-base model weights will be downloaded along with the tokenizer's vocabulary and the model configuration.

  • We then create a variable to store the input text and use the tokenizer created above to encode the text into tokens, as shown below.
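A sketch of this step; the article text is abbreviated here, and the "summarize: " prefix is the task prefix T5 expects in front of its input:

```python
# The input text to summarize; the original example was a passage
# about junk food, abbreviated here.
article = "Junk food is fried food found in the market in packets. ..."

# Encode the text into token IDs; the "summarize: " prefix tells T5
# which text-to-text task to perform.
inputs = tokenizer.encode(
    "summarize: " + article,
    return_tensors="pt",  # return PyTorch tensors
    max_length=512,       # cap the input at 512 tokens
    truncation=True,      # truncate anything beyond that
)
```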

In the above code we have used the encode() method and passed the following parameters:

  • return_tensors: If set, returns tensors instead of a list of Python integers. We pass "pt" for PyTorch; "tf" (TensorFlow) and "np" (NumPy) are also available.
  • max_length: Controls the maximum length used by the truncation/padding parameters.
  • truncation: Activates and controls truncation.

Together, the max_length and truncation parameters ensure that the encoded input does not exceed 512 tokens, the default maximum input length for T5 tokenization in Transformers.

The final step is to use the model.generate() method to generate the summary and decode it back into readable text.
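A sketch of this final step; the generation parameters shown (beam search, length bounds) are common choices, not values mandated by the article:

```python
# Generate the summary token IDs, then decode them back into text.
summary_ids = model.generate(
    inputs,
    max_length=150,      # upper bound on summary length
    min_length=40,       # lower bound on summary length
    num_beams=4,         # beam search width
    length_penalty=2.0,  # mildly favor longer sequences
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```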

Output

“Junk food is fried food found in the market in packets. It is high in calories, high in cholesterol, low in healthy nutrients, low in sodium mineral, high in sugar, starch, unhealthy fat and lack of dietary fibers.”

Limitations

Even though the above process generates a well-formed summary of a lengthy article, it has a few limitations.

  • If the input text is too long and tokenization crosses the default limit of 512 tokens, we don't have any control over the generated summary. Changing the parameter values in the model.generate() method doesn't have any effect on the generated output.
  • So we can't summarize a very long document directly. If we want to, we need to break the content into chunks, summarize each chunk, and finally combine those summaries into a full summary of the input (see the sketch after this list). This makes the process tedious.
  • And since abstractive summarization doesn't just extract sentences that are already present but generates its own sentences using the trained model, the generated summary sometimes contains words that don't match the context of the input. This happens because of the dataset the model was trained on.
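A rough sketch of the chunk-and-combine workaround mentioned above, reusing the model and tokenizer from the earlier snippets; the helper name and the word-based chunking are illustrative assumptions, not part of the original article:

```python
def summarize_long(text, chunk_words=400):
    # Naive word-based chunking; a real implementation would split on
    # sentence boundaries and count actual tokens instead of words.
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]

    partial_summaries = []
    for chunk in chunks:
        ids = tokenizer.encode("summarize: " + chunk,
                               return_tensors="pt",
                               max_length=512, truncation=True)
        out = model.generate(ids, max_length=120, num_beams=4,
                             early_stopping=True)
        partial_summaries.append(
            tokenizer.decode(out[0], skip_special_tokens=True))

    # Stitching the partial summaries together is exactly the tedious
    # part described above.
    return " ".join(partial_summaries)
```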

Pre-trained models can be retrained on a custom dataset, but training a model is very costly in terms of memory, processing power, and processing time, and requires very powerful GPUs. Cloud solutions are available, but they also involve a huge cost.

Conclusion

Together, the extractive and abstractive summarization articles cover all the basics required to summarize any piece of text. Although these are not the only ways to summarize text, they will definitely help you create a summary of a lengthy text without losing the quality of the information. In this fast-paced world, that can save a lot of time.
