Automatic Extractive Text Summarization using TF-IDF
In recent years, the amount of information has grown rapidly along with the development of social media. With this growing volume of information, it takes more effort and time to read an entire text document and understand its contents. One possible solution to this problem is to read a summary of the document. A summary not only retains the essence of the document, but also saves a lot of time and effort. An effective summary is concise and fluent while preserving the key information and overall meaning of the document.
What is extractive text summarization and how does it work?
There are two major text summarization approaches: abstractive and extractive summarization. Abstractive summarization selects words on the basis of semantic understanding and may even include words that do not appear in the original text. On the other hand, extractive summarization extracts the most important and meaningful sentences from the text document and forms a summary from them.
Extractive summarization works as follows:
Input document -> Finding the most important words in the document -> Scoring each sentence on the basis of those words -> Choosing the most important sentences on the basis of the scores obtained.
Extractive Text Summarization Model Pipeline
As shown in the pipeline above, this approach considers only nouns and verbs when computing the sentence score, as this gives greater accuracy and improves the speed of the algorithm.
What is the TFIDF approach?
TFIDF, short for term frequency–inverse document frequency, is a numeric measure used to score the importance of a word in a document based on how often it appears in that document and in a given collection of documents. The intuition behind this measure is: if a word appears frequently in a document, it should be important and we should give that word a high score. But if a word appears in too many other documents, it is probably not a unique identifier, so we should assign a lower score to that word.
Formula for calculating tf and idf:
- TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)
- IDF(w) = log_e(Total number of documents / Number of documents with term w in it)
Hence tfidf for a word can be calculated as:
TFIDF(w) = TF(w) * IDF(w)
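To make the formulas concrete, here is a tiny worked example in Python with made-up counts (the word, the counts and the collection size are purely illustrative):

import math

# "women" appears 5 times in a 100-word document
tf = 5 / 100
# 10 documents in the collection, 2 of them contain "women"
idf = math.log(10 / 2)   # natural log, matching log_e in the formula above
print(tf * idf)          # TFIDF ≈ 0.08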
Sample input document
Women education is a catch all term which refers to the state of primary, secondary, tertiary and health education in girls and women. There are 65 Million girls out of school across the globe; majority of them are in the developing and underdeveloped countries. All the countries of the world, especially the developing and underdeveloped countries must take necessary steps to improve their condition of female education; as women can play a vital role in the nation’s development. If we consider society as tree, then men are like its strong main stem which supports the tree to face the elements and women are like its roots; most important of them all. The stronger the roots are the bigger and stronger the tree will be spreading its branches; sheltering and protecting the needy. Women are the soul of a society; a society can well be judged by the way its women are treated. An educated man goes out to make the society better, while an educated woman; whether she goes out or stays at home, makes the house and its occupants better. Women play many roles in a society- mother, wife, sister, care taker, nurse etc. They are more compassionate towards the needs of others and have a better understanding of social structure. An educated mother will make sure that her children are educated, and will weigh the education of a girl child, same as boys. History is replete with evidences, that the societies in which women were treated equally to men and were educated; prospered and grew socially as well as economically. It will be a mistake to leave women behind in our goal of sustainable development, and it could only be achieved if both the genders are allowed equal opportunities in education and other areas. Education makes women more confident and ambitious; they become more aware of their rights and can raise their voice against exploitation and violence. A society cannot at all progress if its women weep silently. They have to have the weapon of education to carve out a progressive path for their own as well as their families.
Retention rate given by the user: 56%
Summary obtained:
Women education is a catch all term which refers to the state of primary, secondary, tertiary and health education in girls and women. All the countries of the world, especially the developing and underdeveloped countries must take necessary steps to improve their condition of female education; as women can play a vital role in the nation’s development. The stronger the roots are the bigger and stronger the tree will be spreading its branches; sheltering and protecting the needy. Women are the soul of a society; a society can well be judged by the way its women are treated. An educated man goes out to make the society better, while an educated woman; whether she goes out or stays at home, makes the house and its occupants better. Women play many roles in a society- mother, wife, sister, care taker, nurse etc. An educated mother will make sure that her children are educated, and will weigh the education of a girl child, same as boys. A society cannot at all progress if its women weep silently.
Can’t wait to start coding…..
Let’s get started….
Python code for Automatic Extractive Text Summarization using TFIDF
Step 1- Importing necessary libraries and initializing WordNetLemmatizer
The most important library for working with text in Python is NLTK, which stands for Natural Language Toolkit. Its tokenize package contains methods such as sent_tokenize and word_tokenize, which split the text into sentences and words respectively. The stem package of NLTK provides lemmatization via WordNetLemmatizer. The corpus package provides stopwords, a list of English stop words that need to be removed during the preprocessing step.
import nltk
import os
import re
import math
import operator
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
Stopwords = set(stopwords.words('english'))
wordlemmatizer = WordNetLemmatizer()
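If NLTK is being used for the first time on a machine, the corpora and models used in this article (the Punkt tokenizers, the stop word list, WordNet for the lemmatizer, and the POS tagger used later) may need a one-time download; a minimal sketch, assuming the nltk import above:

nltk.download('punkt')                        # sentence and word tokenizers
nltk.download('stopwords')                    # English stop word list
nltk.download('wordnet')                      # data for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger used in Step 4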
Step 2- Text preprocessing
Text pre-processing is the most crucial step in getting consistent and good analytical results. The pre-processing steps applied in this algorithm include removing special characters, digits, one-letter words and stop words from the text.
file = 'input.txt'
file = open(file , 'r')
text = file.read()
tokenized_sentence = sent_tokenize(text)
text = remove_special_characters(str(text))
text = re.sub(r'\d+', '', text)
tokenized_words_with_stopwords = word_tokenize(text)
tokenized_words = [word for word in tokenized_words_with_stopwords if word not in Stopwords]
tokenized_words = [word for word in tokenized_words if len(word) > 1]
tokenized_words = [word.lower() for word in tokenized_words]
The first step is reading the text from a file; we store the contents of the file in the variable text. After reading the contents, the remove_special_characters function removes special characters from the text. It is also important to remove digits from the document, which can be done using a regular expression. After eliminating special characters and digits, the individual words can be tokenized, and one-letter words and stop words can be removed. To avoid any ambiguity in case, we lowercase all the tokenized words. The remove_special_characters function is as follows:
def remove_special_characters(text):
    regex = r'[^a-zA-Z0-9\s]'
    text = re.sub(regex, '', text)
    return text
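As a quick illustration of what this preprocessing produces, here is a hedged mini example on a made-up sentence, assuming the imports and the remove_special_characters function above are in scope:

sample = "Education is important for 65 Million girls!"
sample = remove_special_characters(sample)    # drops the '!'
sample = re.sub(r'\d+', '', sample)           # drops the digits
words = [w.lower() for w in word_tokenize(sample)
         if w not in Stopwords and len(w) > 1]
print(words)   # ['education', 'important', 'million', 'girls']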
Step-3 Calculating the frequency of each word in the document
While working with text it becomes important to calculate the frequency of words, to find the most common or least common words based on the requirement of the algorithm.
def freq(words):
    words = [word.lower() for word in words]
    dict_freq = {}
    words_unique = []
    for word in words:
        if word not in words_unique:
            words_unique.append(word)
    for word in words_unique:
        dict_freq[word] = words.count(word)
    return dict_freq
Here we take the list of words as input and collect the unique words in the words_unique list. After finding the unique words, the frequency of each word can be found using the count function.
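For reference, the Python standard library's collections.Counter builds the same frequency dictionary more compactly; a minimal sketch (freq_with_counter is just an illustrative name, not part of the article's code):

from collections import Counter

def freq_with_counter(words):
    # lowercase the words, then count occurrences of each unique word
    return dict(Counter(word.lower() for word in words))

print(freq_with_counter(["Women", "education", "women"]))   # {'women': 2, 'education': 1}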
Step-4 Calculating sentence score
As the score given to each sentence decides the importance of the sentence, it becomes extremely important to choose the correct algorithm to find the score. In this approach, we will be using TFIDF score of each word to calculate the total sentence score.
def sentence_importance(sentence, dict_freq, sentences):
    sentence_score = 0
    sentence = remove_special_characters(str(sentence))
    sentence = re.sub(r'\d+', '', sentence)
    pos_tagged_sentence = pos_tagging(sentence)
    for word in pos_tagged_sentence:
        if word.lower() not in Stopwords and len(word) > 1:
            word = word.lower()
            word = wordlemmatizer.lemmatize(word)
            sentence_score = sentence_score + word_tfidf(dict_freq, word, sentences, sentence)
    return sentence_score
The score of each sentence is calculated by the sentence_importance function. It first POS-tags the words in the sentence using the pos_tagging function, which returns only the noun- and verb-tagged words. Each returned word is then sent to the word_tfidf function, which calculates that word's TFIDF score in the document; the sentence score is the sum of these word scores.
Let’s explore all the functions one by one..
1. POS tagging function
def pos_tagging(text):
    pos_tag = nltk.pos_tag(text.split())
    pos_tagged_noun_verb = []
    for word, tag in pos_tag:
        if tag in ("NN", "NNP", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
            pos_tagged_noun_verb.append(word)
    return pos_tagged_noun_verb
This function uses nltk library to pos tag all the words in the text and returns only the nouns and verbs from the text.
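Here is a small hedged example of what the tagger returns and what pos_tagging keeps; the exact tags depend on NLTK's pretrained tagger, so treat the outputs in the comments as approximate:

print(nltk.pos_tag("Education makes women more confident".split()))
# e.g. [('Education', 'NN'), ('makes', 'VBZ'), ('women', 'NNS'), ('more', 'RBR'), ('confident', 'JJ')]
print(pos_tagging("Education makes women more confident"))
# e.g. ['Education', 'makes', 'women']  (only the noun- and verb-tagged words survive)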
2. Word tfidf function
def word_tfidf(dict_freq, word, sentences, sentence):
    tf = tf_score(word, sentence)
    idf = idf_score(len(sentences), word, sentences)
    tf_idf = tf_idf_score(tf, idf)
    return tf_idf
The above function calls tf_score, idf_score and tf_idf_score. tf_score calculates the tf score, idf_score calculates the idf score, and tf_idf_score combines them into the final TFIDF score, which is returned.
3. tf score function
def tf_score(word, sentence):
    word_frequency_in_sentence = 0
    words_in_sentence = sentence.split()
    for word_in_sentence in words_in_sentence:
        if word == word_in_sentence:
            word_frequency_in_sentence = word_frequency_in_sentence + 1
    # divide by the number of words in the sentence
    tf = word_frequency_in_sentence / len(words_in_sentence)
    return tf
This function calculates the tf score of a word: the number of times the word appears in the sentence divided by the total number of words in the sentence.
4. idf score function
def idf_score(no_of_sentences, word, sentences):
    no_of_sentence_containing_word = 0
    for sentence in sentences:
        sentence = remove_special_characters(str(sentence))
        sentence = re.sub(r'\d+', '', sentence)
        sentence = sentence.split()
        sentence = [word for word in sentence if word.lower() not in Stopwords and len(word) > 1]
        sentence = [word.lower() for word in sentence]
        sentence = [wordlemmatizer.lemmatize(word) for word in sentence]
        if word in sentence:
            no_of_sentence_containing_word = no_of_sentence_containing_word + 1
    idf = math.log10(no_of_sentences / no_of_sentence_containing_word)
    return idf
This function finds the idf score of the word by dividing the total number of sentences by the number of sentences containing the word and then taking the log10 of that value. Here each sentence is treated as a document in the collection.
5. tfidf score function
def tf_idf_score(tf, idf):
    return tf * idf
This function returns the tfidf score, by simply multiplying the tf and idf values.
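As a quick sanity check with made-up numbers, the three functions above compose like this (the counts are illustrative; only math is needed to run it):

import math

tf = 1 / 3                  # the word appears once in a 3-word sentence
idf = math.log10(10 / 4)    # 10 sentences in total, 4 of them contain the word
print(tf * idf)             # ≈ 0.13, the value tf_idf_score(tf, idf) would return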
Step 5 Finding most important sentences
To find the most important sentences, take the individual sentences from tokenized sentences and compute the sentence score. After calculating the scores, the top sentences based on the retention rate provided by the user are included in the summary.
word_freq = freq(tokenized_words)   # word frequencies from Step 3
c = 1
sentence_with_importance = {}
for sent in tokenized_sentence:
    sentenceimp = sentence_importance(sent, word_freq, tokenized_sentence)
    sentence_with_importance[c] = sentenceimp
    c = c + 1
sentence_with_importance = sorted(sentence_with_importance.items(), key=operator.itemgetter(1), reverse=True)
cnt = 0
summary = []
sentence_no = []
for word_prob in sentence_with_importance:
    if cnt < no_of_sentences:
        sentence_no.append(word_prob[0])
        cnt = cnt + 1
    else:
        break
sentence_no.sort()
cnt = 1
for sentence in tokenized_sentence:
    if cnt in sentence_no:
        summary.append(sentence)
    cnt = cnt + 1
summary = " ".join(summary)
print("\n")
print("Summary:")
print(summary)
outF = open('summary.txt', "w")
outF.write(summary)
outF.close()
Here the number of sentences is calculated as follows:
input_user = int(input('Percentage of information to retain(in percent):'))
no_of_sentences = int((input_user * len(tokenized_sentence))/100)
The number of sentences to include in the summary is derived from the retention rate provided by the user, as the small example below shows.
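For example, if the input document tokenizes into 18 sentences and the user asks to retain 56%, the summary will contain 10 sentences (both numbers here are illustrative, not taken from the sample document above):

input_user = 56
no_of_sentences = int((input_user * 18) / 100)   # 18 = number of tokenized sentences
print(no_of_sentences)                           # 10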
The complete code is given as follows:
import nltk
import os
import re
import math
import operator
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# one-time downloads of the NLTK data used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Stopwords = set(stopwords.words('english'))
wordlemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def lemmatize_words(words):
    lemmatized_words = []
    for word in words:
        lemmatized_words.append(wordlemmatizer.lemmatize(word))
    return lemmatized_words

def stem_words(words):
    stemmed_words = []
    for word in words:
        stemmed_words.append(stemmer.stem(word))
    return stemmed_words

def remove_special_characters(text):
    regex = r'[^a-zA-Z0-9\s]'
    text = re.sub(regex, '', text)
    return text

def freq(words):
    words = [word.lower() for word in words]
    dict_freq = {}
    words_unique = []
    for word in words:
        if word not in words_unique:
            words_unique.append(word)
    for word in words_unique:
        dict_freq[word] = words.count(word)
    return dict_freq

def pos_tagging(text):
    pos_tag = nltk.pos_tag(text.split())
    pos_tagged_noun_verb = []
    for word, tag in pos_tag:
        if tag in ("NN", "NNP", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"):
            pos_tagged_noun_verb.append(word)
    return pos_tagged_noun_verb

def tf_score(word, sentence):
    word_frequency_in_sentence = 0
    words_in_sentence = sentence.split()
    for word_in_sentence in words_in_sentence:
        if word == word_in_sentence:
            word_frequency_in_sentence = word_frequency_in_sentence + 1
    tf = word_frequency_in_sentence / len(words_in_sentence)
    return tf

def idf_score(no_of_sentences, word, sentences):
    no_of_sentence_containing_word = 0
    for sentence in sentences:
        sentence = remove_special_characters(str(sentence))
        sentence = re.sub(r'\d+', '', sentence)
        sentence = sentence.split()
        sentence = [word for word in sentence if word.lower() not in Stopwords and len(word) > 1]
        sentence = [word.lower() for word in sentence]
        sentence = [wordlemmatizer.lemmatize(word) for word in sentence]
        if word in sentence:
            no_of_sentence_containing_word = no_of_sentence_containing_word + 1
    idf = math.log10(no_of_sentences / no_of_sentence_containing_word)
    return idf

def tf_idf_score(tf, idf):
    return tf * idf

def word_tfidf(dict_freq, word, sentences, sentence):
    tf = tf_score(word, sentence)
    idf = idf_score(len(sentences), word, sentences)
    tf_idf = tf_idf_score(tf, idf)
    return tf_idf

def sentence_importance(sentence, dict_freq, sentences):
    sentence_score = 0
    sentence = remove_special_characters(str(sentence))
    sentence = re.sub(r'\d+', '', sentence)
    pos_tagged_sentence = pos_tagging(sentence)
    for word in pos_tagged_sentence:
        if word.lower() not in Stopwords and len(word) > 1:
            word = word.lower()
            word = wordlemmatizer.lemmatize(word)
            sentence_score = sentence_score + word_tfidf(dict_freq, word, sentences, sentence)
    return sentence_score

file = 'input.txt'
file = open(file, 'r')
text = file.read()
tokenized_sentence = sent_tokenize(text)
text = remove_special_characters(str(text))
text = re.sub(r'\d+', '', text)
tokenized_words_with_stopwords = word_tokenize(text)
tokenized_words = [word for word in tokenized_words_with_stopwords if word not in Stopwords]
tokenized_words = [word for word in tokenized_words if len(word) > 1]
tokenized_words = [word.lower() for word in tokenized_words]
tokenized_words = lemmatize_words(tokenized_words)
word_freq = freq(tokenized_words)
input_user = int(input('Percentage of information to retain(in percent):'))
no_of_sentences = int((input_user * len(tokenized_sentence)) / 100)
print(no_of_sentences)
c = 1
sentence_with_importance = {}
for sent in tokenized_sentence:
    sentenceimp = sentence_importance(sent, word_freq, tokenized_sentence)
    sentence_with_importance[c] = sentenceimp
    c = c + 1
sentence_with_importance = sorted(sentence_with_importance.items(), key=operator.itemgetter(1), reverse=True)
cnt = 0
summary = []
sentence_no = []
for word_prob in sentence_with_importance:
    if cnt < no_of_sentences:
        sentence_no.append(word_prob[0])
        cnt = cnt + 1
    else:
        break
sentence_no.sort()
cnt = 1
for sentence in tokenized_sentence:
    if cnt in sentence_no:
        summary.append(sentence)
    cnt = cnt + 1
summary = " ".join(summary)
print("\n")
print("Summary:")
print(summary)
outF = open('summary.txt', "w")
outF.write(summary)
outF.close()
Here input.txt is the file containing the document to be summarized and summary.txt stores the summary of the input file.
If this article helped you, please like and share with others. Feel free to write suggestions as well in the comments below!
Thank You