Text Summarization

Mohit Sharma
Incedge & Co.
Sep 15, 2018

I’m back with another article. Today, I’ll show you how to do text summarization.

After reading this article, you’ll learn:

  • What is text summarization
  • How to extract data from the website
  • How to clean the data
  • How to build the histogram
  • How to calculate the sentence score
  • How to extract only the topmost sentences for a short summary

Before moving forward, I’d highly recommend getting some familiarity with the libraries introduced below.

What is Text Summarization?

Text summarization is the process of shortening a text document, in order to create a summary of the major points of the original document.

The main idea of summarization is to find a subset of data which contains the “information” of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include the summarization of documents, image collections, and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.

There are two general approaches to automatic summarization: extraction and abstraction. For more refer to Wikipedia.
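To make the distinction concrete, here is a toy sketch (my own illustration, not from the original article) of what the extractive approach we build below boils down to: score the sentences that already exist and keep the best ones verbatim, rather than generating new text as an abstractive system would.

# Toy illustration of extractive summarization (hypothetical sentences and scores).
toy_sentences = [
    "Neural networks are inspired by the brain.",
    "They are used for many tasks.",
    "Deep learning stacks many layers of neurons.",
]
toy_scores = {toy_sentences[0]: 0.9, toy_sentences[1]: 0.2, toy_sentences[2]: 0.7}
# Extraction: keep the two highest-scoring sentences as-is.
print(" ".join(sorted(toy_sentences, key=toy_scores.get, reverse=True)[:2]))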

How to extract data from the website?

Step 1: Importing the libraries/packages

  • Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
  • Urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs

urllib.error containing the exceptions raised by urllib.request

urllib.parse for parsing URLs

urllib.robotparser for parsing robots.txt files

  • re provides regular expression matching operations similar to those found in Perl.
  • nltk is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
  • heapq provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

Check that the stopwords and punkt resources are downloaded and up to date:

nltk.download('stopwords')
nltk.download('punkt')
Figure 1

Step 2: Extract the data

I’ve taken the Artificial Neural Network Wikipedia page for this example. You can take any article depending on your needs.

page = urllib.request.urlopen("https://en.wikipedia.org/wiki/Artificial_neural_network").read()
soup = bs.BeautifulSoup(page,'lxml')
print(page) #print the page
Figure 2

You can see we extracted the content, but it does not look good. We use the BeautifulSoup library to parse the document and extract the text in a readable manner. I also use prettify() to make the markup easier to read.

print(soup.prettify())
Figure 3

Note: I get all the content of the page because most of the text in Wikipedia articles is written under <p> tags, but this may vary from website to website. For instance, some websites put their content under <div> tags (see the sketch after Figure 4).

text = ""
for paragraph in soup.find_all('p'):
text += paragraph.text
print(text)
Figure 4
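For a site that keeps its text under <div> tags instead, a similar loop works; note that 'article-body' below is a made-up class name, so you would inspect the target page and substitute whatever it actually uses:

# Hypothetical alternative for a site whose content lives in <div> tags.
text = ""
for block in soup.find_all('div', class_='article-body'):   # class name is a placeholder
    text += block.get_text()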

How to clean the data

Step 3: Data Cleaning

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Line 1: Remove all the references in the text, which are denoted by [1], [2], etc. (see the text output above).

Line 2: Replace all runs of whitespace with a single space.

Line 3: Convert the text to lower case.

Lines 4, 5, 6: Remove punctuation, digits, and the extra spaces that result.

Line 7: Break the full text into sentences using sent_tokenize(). Note that the sentences are taken from the original text so they stay readable; clean_text is only used later to build the word histogram.

text = re.sub(r'\[[0-9]*\]',' ',text)            
text = re.sub(r'\s+',' ',text)
clean_text = text.lower()
clean_text = re.sub(r'\W',' ',clean_text)
clean_text = re.sub(r'\d',' ',clean_text)
clean_text = re.sub(r'\s+',' ',clean_text)
sentences = nltk.sent_tokenize(text)
stop_words = nltk.corpus.stopwords.words('english')
print(sentences)
Figure 5
stop_words  # the list of English stopwords
Figure 6
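If you want to see what the first two cleaning lines do in isolation, here is a tiny self-contained example on a made-up string (my addition, not from the article):

import re

sample = "Neural networks [1] are widely used [23] in practice."
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)      # drop the [n] references
print(re.sub(r'\s+', ' ', no_refs))               # collapse the leftover spaces
# Neural networks are widely used in practice.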

How to build the histogram

Step 4: Build the histogram

Line 1: Create an empty dictionary, word2count.

Line 2: Use a for loop with word_tokenize() to break clean_text into words.

Line 3: If the word is not in stop_words, check whether it is already in word2count.keys(); if it isn’t, set word2count[word] = 1, otherwise increment it with word2count[word] += 1.

Line 4: Convert the counts into a weighted histogram by dividing each count by the maximum count (in the output you can see weights rather than counts, for example 'artificial': 0.3620689).

word2count = {}   #line 1
for word in nltk.word_tokenize(clean_text):   #line 2
    if word not in stop_words:   #line 3
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
max_count = max(word2count.values())
for key in word2count.keys():   #line 4
    word2count[key] = word2count[key]/max_count
Figure 7
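If you’d like to check the weights without relying on the screenshot, you can sort the dictionary yourself; this little loop is my own addition for inspection only:

# Print the ten highest-weighted words from the histogram.
for word, weight in sorted(word2count.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(word, round(weight, 4))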

How to calculate the sentence score

Step 5: Calculating the Sentence score

Line 1: Create an empty dictionary, sent2score.

Line 2: Loop over sentences (the list we created in step 3).

Line 3: Convert each sentence to lower case and tokenize it into words.

Line 4: Check whether the word appears in word2count.keys().

Line 5: Only score sentences shorter than 30 words; you can adjust this threshold depending on your needs.

Line 6: If the sentence is not yet in sent2score.keys(), set sent2score[sentence] = word2count[word]; otherwise add the word’s weight with sent2score[sentence] += word2count[word].

# Calculate the sentence scores
sent2score = {}   #line 1
for sentence in sentences:   #line 2
    for word in nltk.word_tokenize(sentence.lower()):   #line 3
        if word in word2count.keys():   #line 4
            if len(sentence.split(' ')) < 30:   #line 5
                if sentence not in sent2score.keys():   #line 6
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]

See the sentence score

Figure 8
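Before picking the top seven, you can sanity-check the scoring by looking at the single best sentence with plain max(); heapq.nlargest in the next step is just the n-best generalization of this (again, my own addition):

# The single highest-scoring sentence; heapq.nlargest(1, ...) would give the same result.
best = max(sent2score, key=sent2score.get)
print(round(sent2score[best], 3), best)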

How to extract only the topmost sentences for a short summary

Step 6: Find out the best sentences

I’ve used heapq to find the seven best sentences from the Wikipedia (ANN) article.

best_sentences = heapq.nlargest(7, sent2score, key=sent2score.get)
for sentence in best_sentences:
    print(sentence, '\n')
Best Seven Sentences of Artificial Neural Network
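If you prefer the summary as one block of text rather than separately printed sentences, you can join the selected sentences; nlargest returns them in score order, so here I re-sort them by their position in the original article (my own tweak, not part of the original code):

# Join the top sentences into a single summary string, in document order.
summary = ' '.join(sorted(best_sentences, key=sentences.index))
print(summary)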

That’s it for today. The source code can be found on GitHub. I’m happy to hear any questions or feedback.

Hope you liked this article! Don’t forget to like it and share it with others.

Thank You

Go Subscribe THEMENYOUWANTTOBE

Show Some Love ❤
