Simple Text summarization using NLTK

Divakar P M · Published in Analytics Vidhya · May 9, 2020

Introduction

Text summarization is the process of shortening long pieces of text while preserving the key information and overall meaning, creating a subset (a summary) that represents the most important or relevant information in the original text.

An example of automatic text summarization

A typical text summarization problem is news article summarization, which attempts to automatically produce an abstract from a given article. Sometimes we want to generate a summary from a single source article, while at other times the summary is drawn from multiple source articles (for example, a cluster of articles on the same topic); this is called multi-document summarization. A related application is news aggregation: imagine a system that automatically pulls together news articles on a given topic from the web and concisely presents the latest news as a summary. An example of this is the Inshorts app, which summarizes news articles in 60 words.

Why Text summarization is important

The important uses of text summarization are:

· Get the maximum information from unstructured textual data in minimum time.

· Enhance the readability of documents.

· Eliminate redundant, insignificant text and provide only the required information.

· Accelerate the process of researching for information.

Two different approaches to Text Summarization

Extraction-based summarization: Here, content is extracted from the original data, but the extracted content is not modified in any way. In simple words, we identify the important sentences or key phrases from the original text and extract only those.

In machine learning, extractive summarization usually involves weighing the essential sections of sentences and using the results to generate summaries.

Example:

Before Summarization

John and Joseph took a taxi to attend the night party in the city. While in the party, John collapsed and was rushed to the hospital.

After Summarization

John and Joseph attend party. John rushed hospital.

Abstraction-based summarization: Here, the summary can contain text that is different from the original, in contrast to extraction-based summarization, which uses only sentences already present in the source. Advanced deep learning techniques are used to generate the new summary.

Example:

Before Summarization

John and Joseph took a taxi to attend the night party in the city. While in the party, John collapsed and was rushed to the hospital.

After Summarization

John was hospitalized after attending the party.

In this article, we will use extraction-based summarization, picking the sentences with the highest importance scores to form the summary using the NLTK toolkit.

Text Summarization of a Wikipedia article

Let’s create a text summarizer for the Wikipedia article at https://en.wikipedia.org/wiki/Machine_learning, which will give a summary of machine learning.

Prerequisites: Python 3, NLTK toolkit

Steps involved to create the text summary

1) Data collection from Wikipedia using web scraping (using the urllib library)

2) Parsing the URL content of the data (using the BeautifulSoup library)

3) Data clean-up: removing special characters, numeric values, stop words and punctuation

4) Tokenization: creation of tokens (word tokens and sentence tokens)

5) Calculate the word frequency for each word

6) Calculate the weighted frequency for each sentence

7) Creation of the summary by choosing the top 30% weighted sentences

1) Data collection from Wikipedia using web scraping (using the urllib library)

Fetch the data from the Wikipedia page using the urllib library, which connects to the page and retrieves the HTML.

We’ll use the urlopen function from the urllib.request utility to open the web page. Then, we’ll use the read function to read the scraped data object.
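
A minimal sketch of this step (the variable name raw_html is an assumption for illustration):

```python
import urllib.request

# Open the Wikipedia page and read the raw HTML bytes.
url = 'https://en.wikipedia.org/wiki/Machine_learning'
raw_html = urllib.request.urlopen(url).read()
```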

2) Parsing the URL content of the data

This raw text consists of HTML tags and needs to be parsed using the BeautifulSoup library to get the content within the <p> (paragraph) tags.

The find_all function returns all the <p> elements present in the HTML, and .text enables us to select only the text found within the <p> elements.
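
A sketch of the parsing step, assuming raw_html holds the bytes fetched above:

```python
from bs4 import BeautifulSoup

# Parse the HTML and keep only the text inside the <p> tags.
soup = BeautifulSoup(raw_html, 'html.parser')
article_content = ' '.join(p.text for p in soup.find_all('p'))
```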

3) Tokenization & Data clean up

Import the stop words from the NLTK toolkit and punctuation from the string library.

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead.
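
A sketch of the imports, including the one-time NLTK resource downloads if they are not already available:

```python
import nltk
from nltk.corpus import stopwords
from string import punctuation

nltk.download('stopwords')  # stop word lists
nltk.download('punkt')      # tokenizer models used by the word/sentence tokenizers

stop_words = set(stopwords.words('english'))
```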

Creating the frequency table

Word-tokenize the entire text, then create a dictionary whose keys are the words and whose values are the number of times each word appears.

Then divide each word’s count by the frequency of the most frequently occurring word, as shown below:
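
One way this frequency table can be built (a sketch; variable names such as word_frequencies are assumptions, reusing article_content, stop_words and punctuation from the earlier sketches):

```python
from nltk.tokenize import word_tokenize

# Count how often each non-stop-word, non-punctuation token appears.
word_frequencies = {}
for word in word_tokenize(article_content.lower()):
    if word not in stop_words and word not in punctuation:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

# Normalize each count by the frequency of the most common word.
max_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency
```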

Tokenizing the article into sentences

To split the article_content into a list of sentences, we’ll use the sent_tokenize method from the nltk library.
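
A minimal sketch of the sentence split:

```python
from nltk.tokenize import sent_tokenize

# Break the article text into individual sentences.
sentence_list = sent_tokenize(article_content)
```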

4) Finding the weighted frequencies of the sentences

To evaluate the score for every sentence in the text, we’ll be analysing the frequency of occurrence of each term. In this case, we’ll be scoring each sentence by its words; that is, adding the frequency of each important word found in the sentence.
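
A sketch of the scoring loop, reusing the normalized word_frequencies and sentence_list from the sketches above:

```python
sentence_scores = {}
for sentence in sentence_list:
    for word in word_tokenize(sentence.lower()):
        if word in word_frequencies:
            # Add the word's normalized frequency to the sentence score.
            sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]
```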

5) Creation of summary

Using the nlargest function (from Python’s heapq module), get the top 30% weighted sentences, then join them to get the final summarized text.
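
A sketch of the final selection step:

```python
from heapq import nlargest

# Keep roughly the top 30% highest-scoring sentences and join them.
select_length = int(len(sentence_list) * 0.3)
summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
```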

Check the text length before and after text summarization.
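
A minimal check (this prints raw string lengths; exactly how the figures below were counted isn’t specified in the article):

```python
print('Length before summarization:', len(article_content))
print('Length after summarization:', len(summary))
```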

The word length of the text was 40679 before and 15472 after summarization.

Conclusion

This article explains a simple method to summarize a Wikipedia article; my suggestion is to scrape any other article and check how the summarizer handles it. And please don’t forget to give claps. Thanks for reading.

Reference: https://en.wikipedia.org/wiki/Automatic_summarization#Applications

Github: https://github.com/DivakarPM/NLP/blob/master/Text_Summarization/Text_Summarization.ipynb
