Summarizing Lengthy Articles

Mitesh Dewda
Published in Globant
6 min read · Aug 18, 2022

Automatic Summary Generation

In general, summary generation is the process of explaining the main idea of something in our own words, including only the main points. If we have read a book or a story and someone asks us what it is about, we don’t read the whole book or story to them; rather, we present the key points in order to provide its context.


This human way of creating summaries inspired the concept of Automatic Summarization. As per Wikipedia -

Automatic Summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.

This includes video summarization, audio summarization, image summarization, text summarization, etc. In this article we are only going to discuss text summarization.

Text Summarization

Text summarization is the practice of condensing long publications into manageable paragraphs or sentences. In text summarization, a lengthy document, article, blog post, etc. is transformed into a concise and fluent summary without losing the key information or the overall meaning.


Why Text Summarization?

Text summarization can be useful in scenarios such as the following -

Customers’ feedback — it helps summarize and shorten lengthy text in user feedback.

Summarizing blogs — the discussions and comments that follow a blog post are a good source of information for determining which parts of the blog are critical and interesting.

Scientific paper summarization — a considerable amount of information, such as cited papers and conference details, can be leveraged to identify important sentences in the original paper.

Along with the above scenarios, it can be used to summarize presentations, emails, large PDFs, online articles, Wikipedia content, or any other large document.

Types of Text Summarization

There are two ways to summarize lengthy documents or text -

  1. Extractive summarization
  2. Abstractive summarization

Extractive Summarization

The name itself suggests the approach followed. In this type of summarization, the important sentences or phrases are identified in the original text and extracted. Those extracted sentences or phrases are then combined to create a summary.


So how do we identify which sentences or phrases are important and which are not? The sentences are ranked on the basis of a few features. Below is one such process -

I have used Python here, but the same can be done in other programming languages as well. For the process below, make sure that Python is installed on your system (Windows/Mac/Linux, etc.).

Install the below-mentioned Python libraries/packages (note that heapq and math ship with Python’s standard library and need no installation) -

  • nltk — This provides a set of diverse algorithms for Natural Language Processing.
  • numpy — This provides a multidimensional array object, as well as variations such as masks and matrices, which can be used for various math operations.
  • heapq — This built-in module implements all of the low-level heap operations as well as some high-level common uses for heaps.
  • math — This is Python’s built-in module, which is used for advanced mathematical tasks.

Steps to follow -

Step — 1

In this step, we use nltk’s sent_tokenize function to break the paragraph down into multiple sentences, and then remove the punctuation marks.
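This step can be sketched as follows. To keep the sketch dependency-free, the regex splitter below is a simplified stand-in for nltk’s sent_tokenize (which handles abbreviations and other edge cases far better); the sample text and function names are my own.

```python
import re

def split_sentences(text):
    # Simplified stand-in for nltk.tokenize.sent_tokenize:
    # split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def strip_punctuation(sentence):
    # Keep only word characters and whitespace.
    return re.sub(r"[^\w\s]", "", sentence)

text = "Text summarization shortens documents. It keeps only the key points!"
sentences = split_sentences(text)
cleaned = [strip_punctuation(s) for s in sentences]
```

In real code, `from nltk.tokenize import sent_tokenize` (after `nltk.download("punkt")`) would replace `split_sentences`.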

Step — 2

In this step, we identify the sentences that contain words from the title of the document. Each sentence is then given a score: the ratio of the number of words it has in common with the title to the total number of words in the document. A sentence scores highest when it shares the most words with the title.

This feature helps in cases where the document’s title indicates the content of the document.
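A minimal sketch of this title-similarity score, following the ratio described above; the function and sample sentences are illustrative assumptions, not the article’s original code.

```python
def title_similarity_scores(title, sentences):
    # Score each sentence by the number of words it shares with the
    # title, divided by the total word count of the document.
    title_words = set(title.lower().split())
    total_words = sum(len(s.split()) for s in sentences)
    return [len(title_words & set(s.lower().split())) / total_words
            for s in sentences]

title = "Text Summarization"
sentences = ["text summarization shortens documents",
             "readers save time"]
scores = title_similarity_scores(title, sentences)
```

Here the first sentence shares two words with the title, so it outranks the second, which shares none.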

Step — 3

In this step, we calculate the term weight, or term frequency. The number of times a term occurs in a document is called its term frequency; here it is computed as the ratio of the count of a particular word to the total number of words in the document.
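The term-frequency calculation above can be sketched with the standard library’s Counter; the function name and sample input are mine.

```python
from collections import Counter

def term_frequencies(sentences):
    # tf(term) = occurrences of the term / total words in the document
    words = [w for s in sentences for w in s.lower().split()]
    total = len(words)
    return {term: count / total for term, count in Counter(words).items()}

tf = term_frequencies(["the cat sat", "the dog ran"])
```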

Step — 4

Here, we identify the parts of speech in each sentence and tag each word accordingly. The sentences are then ranked by the number of nouns they contain.
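In practice this step would use nltk.pos_tag; to keep the sketch self-contained and runnable without nltk’s tagger data, the crude heuristic tagger below stands in for it and is only an assumption for illustration.

```python
def pos_tag_heuristic(words):
    # Crude stand-in for nltk.pos_tag, used only to keep this sketch
    # self-contained: tags capitalised or plural-looking words as
    # nouns ("NN") and everything else as "XX". Real code should use
    # nltk.pos_tag, which is far more accurate.
    return [(w, "NN" if w[:1].isupper() or w.endswith("s") else "XX")
            for w in words]

def noun_counts(sentences):
    # Ranking feature: the number of noun-tagged words per sentence.
    return [sum(1 for _, tag in pos_tag_heuristic(s.split())
                if tag.startswith("NN"))
            for s in sentences]

counts = noun_counts(["Text summarization shortens documents",
                      "Readers save time"])
```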

Step — 5

This is the step in which we generate a feature matrix using the numpy library. The matrix is built from the features calculated in the steps above: the rank of a sentence based on its similarity to the document’s title, its rank based on the sum of its term weights, and its rank based on the number of nouns it contains.
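The feature matrix might be assembled like this; the numeric values are placeholders standing in for the per-sentence scores computed in steps 2–4 for a two-sentence document.

```python
import numpy as np

# Placeholder per-sentence feature values for two sentences,
# standing in for the scores computed in the earlier steps.
title_scores = [0.29, 0.00]   # similarity with the document title
tf_sums      = [0.57, 0.43]   # sum of term weights per sentence
noun_counts  = [3, 1]         # nouns per sentence

# One row per sentence, one column per feature.
feature_matrix = np.array([title_scores, tf_sums, noun_counts]).T
```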

Step — 6

In this step, we create a dictionary mapping each sentence to its row in the feature matrix from step 5, calculate the sum of the features for each sentence, and sort the sentences by that total.
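A sketch of this scoring-and-sorting step, under the assumption that the feature matrix has one row per sentence; the example sentences and feature values are made up.

```python
import numpy as np

def rank_sentences(sentences, feature_matrix):
    # Map each sentence to the sum of its feature row, then sort the
    # sentences by that total score, highest first.
    totals = feature_matrix.sum(axis=1)
    scores = dict(zip(sentences, totals))
    return sorted(sentences, key=scores.get, reverse=True)

sentences = ["A short aside.", "The key finding."]
features = np.array([[0.00, 0.43, 1.0],
                     [0.29, 0.57, 3.0]])
ranked = rank_sentences(sentences, features)
```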

Step — 7

This is the final step, in which we pick the top 30% of the highest-ranked sentences and combine them into a summary. We use the heapq package here. The percentage of sentences picked can be increased or decreased as per the requirement.
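The selection step can be sketched with heapq.nlargest; the sentence scores below are illustrative placeholders.

```python
import heapq
import math

def top_sentences(sentence_scores, fraction=0.3):
    # Keep the highest-scoring fraction of sentences (at least one),
    # selected with heapq.nlargest.
    n = max(1, math.ceil(len(sentence_scores) * fraction))
    return heapq.nlargest(n, sentence_scores, key=sentence_scores.get)

scores = {"Sentence A.": 3.86, "Sentence B.": 1.43,
          "Sentence C.": 2.51, "Sentence D.": 0.40}
summary = " ".join(top_sentences(scores))
```

With four sentences and a 0.3 fraction, the two highest scorers are kept; adjusting `fraction` lengthens or shortens the summary.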

Conclusion

The extractive summarization process used here is just one example of how we can improve our efficiency and productivity by reducing the time we spend reading lengthy articles, documents, books, etc. There are other features on the basis of which sentences can be ranked. There are also ready-made algorithms that perform the same task as the steps above and return a summary; the difference is that with those algorithms we don’t know the factors used to create the summary, whereas here we have control over the factors and can increase or decrease their number.

What’s next?

To explore text summarization further, do read my article on abstractive summarization as well, which briefly describes the process using HuggingFace Transformers.
