Text Summarization

Mohit Sharma
Incedge & Co.
Sep 15, 2018

I’m back with another article. Today, I’ll show you how to do text summarization.

After reading this article, you’ll learn:

  • What is text summarization
  • How to extract data from the website
  • How to clean the data
  • How to build the histogram
  • How to calculate the sentence score
  • How to extract only the topmost sentences for a short summary

Before moving forward, I’d highly recommend getting some familiarity with the libraries introduced below.

What is Text Summarization?

Text summarization is the process of shortening a text document, in order to create a summary of the major points of the original document.

The main idea of summarization is to find a subset of data which contains the “information” of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include the summarization of documents, image collections, and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.

There are two general approaches to automatic summarization: extraction and abstraction. For more refer to Wikipedia.
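To make the distinction concrete, here is a toy sketch (my own illustration, not from the original article) of what the extractive approach we build below boils down to: score the sentences that already exist and keep the best ones verbatim, rather than generating new text as an abstractive system would.

# Toy illustration of extractive summarization (hypothetical sentences and scores).
toy_sentences = [
    "Neural networks are inspired by the brain.",
    "They are used for many tasks.",
    "Deep learning stacks many layers of neurons.",
]
toy_scores = {toy_sentences[0]: 0.9, toy_sentences[1]: 0.2, toy_sentences[2]: 0.7}
# Extraction: keep the two highest-scoring sentences as-is.
print(" ".join(sorted(toy_sentences, key=toy_scores.get, reverse=True)[:2]))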

How to extract data from the website?

Step 1: Importing the libraries/packages

  • Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
  • Urllib is a package that collects several modules for working with URLs:

urllib.request for opening and reading URLs

urllib.error containing the exceptions raised by urllib.request

urllib.parse for parsing URLs

urllib.robotparser for parsing robots.txt files

  • re provides regular expression matching operations similar to those found in Perl.
  • nltk is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
  • heapq provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

Check that the stopwords and punkt resources are downloaded and up to date:

nltk.download('stopwords')
nltk.download('punkt')
Figure 1

Step 2: Extract the data

I’ve taken the Artificial Neural Network Wikipedia page for this example. You can take any article depending on your needs.

page = urllib.request.urlopen("https://en.wikipedia.org/wiki/Artificial_neural_network").read()
soup = bs.BeautifulSoup(page,'lxml')
print(page) #print the page
Figure 2

You can see we extracted the content, but it does not look good. We use the BeautifulSoup library to parse the document and extract the text in a readable manner. I also use prettify() to make the markup easier to read.

print(soup.prettify())
Figure 3

Note: I get all the content of the page because most of the text in Wikipedia articles is written under <p> tags, but this may vary from website to website. For instance, some websites put their content under <div> tags (see the sketch after Figure 4).

text = ""
for paragraph in soup.find_all('p'):
text += paragraph.text
print(text)
Figure 4
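For a site that keeps its text under <div> tags instead, a similar loop works; note that 'article-body' below is a made-up class name, so you would inspect the target page and substitute whatever it actually uses:

# Hypothetical alternative for a site whose content lives in <div> tags.
text = ""
for block in soup.find_all('div', class_='article-body'):   # class name is a placeholder
    text += block.get_text()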

How to clean the data

Step 3: Data Cleaning

Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Line 1: Remove all the references in the text, which are denoted by [1], [2], etc. (see the text output above).

Line 2: Replace all runs of whitespace with a single space.

Line 3: Convert the text to lower case.

Lines 4, 5, 6: Remove punctuation, digits, and the extra spaces that result.

Line 7: Break the full text into sentences using sent_tokenize(). Note that the sentences are taken from the original text so they stay readable; clean_text is only used later to build the word histogram.

text = re.sub(r'\[[0-9]*\]',' ',text)            
text = re.sub(r'\s+',' ',text)
clean_text = text.lower()
clean_text = re.sub(r'\W',' ',clean_text)
clean_text = re.sub(r'\d',' ',clean_text)
clean_text = re.sub(r'\s+',' ',clean_text)
sentences = nltk.sent_tokenize(text)
stop_words = nltk.corpus.stopwords.words('english')
print(sentences)
Figure 5
stop_words  # the list of English stopwords
Figure 6
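If you want to see what the first two cleaning lines do in isolation, here is a tiny self-contained example on a made-up string (my addition, not from the article):

import re

sample = "Neural networks [1] are widely used [23] in practice."
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)      # drop the [n] references
print(re.sub(r'\s+', ' ', no_refs))               # collapse the leftover spaces
# Neural networks are widely used in practice.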

How to build the histogram

Step 4: Build the histogram

Line 1: Create an empty dictionary, word2count.

Line 2: Use a for loop with word_tokenize() to break clean_text into words.

Line 3: If the word is not in stop_words, check whether it is already in word2count.keys(); if it isn’t, set word2count[word] = 1, otherwise increment it with word2count[word] += 1.

Line 4: Convert the counts into a weighted histogram by dividing each count by the maximum count (in the output you can see weights rather than counts, for example 'artificial': 0.3620689).

word2count = {}   #line 1
for word in nltk.word_tokenize(clean_text):   #line 2
    if word not in stop_words:   #line 3
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1
max_count = max(word2count.values())
for key in word2count.keys():   #line 4
    word2count[key] = word2count[key]/max_count
Figure 7
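If you’d like to check the weights without relying on the screenshot, you can sort the dictionary yourself; this little loop is my own addition for inspection only:

# Print the ten highest-weighted words from the histogram.
for word, weight in sorted(word2count.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(word, round(weight, 4))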

How to calculate the sentence score

Step 5: Calculating the Sentence score

Line 1: Create an empty dictionary, sent2score.

Line 2: Loop over sentences (the list we created in step 3).

Line 3: Convert each sentence to lower case and tokenize it into words.

Line 4: Check whether the word appears in word2count.keys().

Line 5: Only score sentences shorter than 30 words; you can adjust this threshold depending on your needs.

Line 6: If the sentence is not yet in sent2score.keys(), set sent2score[sentence] = word2count[word]; otherwise add the word’s weight with sent2score[sentence] += word2count[word].

# Calculate the sentence scores
sent2score = {}   #line 1
for sentence in sentences:   #line 2
    for word in nltk.word_tokenize(sentence.lower()):   #line 3
        if word in word2count.keys():   #line 4
            if len(sentence.split(' ')) < 30:   #line 5
                if sentence not in sent2score.keys():   #line 6
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]

See the sentence score

Figure 8
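Before picking the top seven, you can sanity-check the scoring by looking at the single best sentence with plain max(); heapq.nlargest in the next step is just the n-best generalization of this (again, my own addition):

# The single highest-scoring sentence; heapq.nlargest(1, ...) would give the same result.
best = max(sent2score, key=sent2score.get)
print(round(sent2score[best], 3), best)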

How to extract only the topmost sentences for a short summary

Step 6: Find out the best sentences

I’ve used heapq to find the seven best sentences from the Wikipedia (ANN) article.

best_sentences = heapq.nlargest(7, sent2score, key=sent2score.get)
for sentence in best_sentences:
    print(sentence, '\n')
Best Seven Sentences of Artificial Neural Network
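If you prefer the summary as one block of text rather than separately printed sentences, you can join the selected sentences; nlargest returns them in score order, so here I re-sort them by their position in the original article (my own tweak, not part of the original code):

# Join the top sentences into a single summary string, in document order.
summary = ' '.join(sorted(best_sentences, key=sentences.index))
print(summary)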

That’s it for today. The source code can be found on GitHub. I’m happy to hear any questions or feedback.

Hope you liked this article! Don’t forget to like it and share it with others.

Thank You

Go Subscribe THEMENYOUWANTTOBE

Show Some Love ❤
