Text Summarization
4 min read · Apr 29, 2020
Using TF-IDF based Sentence Scoring
Brief
Sentence scoring using tf-idf is one of the extractive approaches for text summarization. TF-IDF stands for Term Frequency — Inverse Document Frequency. It is the product of two statistics.
- Term Frequency (TF) : It is the number of times the word occurs in the document.
- Inverse Document Frequency (IDF) : It is the measure of how much information the word provides, i.e., if it’s common or rare across all documents.
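For example (with made-up numbers purely for illustration): if a word appears 3 times in a 20-word sentence, its tf is 3/20 = 0.15; if it appears in 2 of 10 sentences, its idf is log10(10/2) ≈ 0.7, so its tf-idf is about 0.15 × 0.7 ≈ 0.1.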
Steps
- Convert text to sentences : Convert a single text into a list of sentences.
- Pre-process text : Clean the sentences by removing unnecessary words, stopwords, punctuation, etc.
- Create term frequency (tf) matrix : It shows the frequency of words in each sentence. We will use the relative frequency to represent tf instead of the raw count. It is calculated as t / T where,
t = Number of times the term appears in the document
T = Total number of terms in the document
- Create idf matrix : It shows the importance of words in each sentence with respect to the whole document. It is calculated as log(D / d) where,
D = Total number of documents
d = Number of documents with term t in it
(The implementation below uses a base-10 logarithm; the choice of base only scales every score by the same factor.)
- Calculate sentence tf-idf : It is the product of tf and idf for each word in the sentence and shows the importance of each word within the sentence.
- Calculate sentence scores : The score of a sentence is the average tf-idf value of its words. It is calculated as T / n where,
T = Total tf-idf of words in the sentence
n = Number of distinct words in the sentence
- Determine threshold : The threshold is the average of the sentence scores. It is calculated as S / s where,
S = Total sum of the sentence scores
s = Number of sentences
- Generate summary : Generate the summary by extracting the sentences whose scores are greater than or equal to the threshold value.
Implementation
Requirement
Python, NLTK library, Math library
Code
1. Convert text to sentences
from nltk.tokenize import sent_tokenize

text = "Your text goes here."
sentences = sent_tokenize(text)
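As a quick check, the tokenizer splits raw text into a list of sentences. The sample text below is made up for illustration, and NLTK's punkt model must be downloaded once with nltk.download('punkt'):

sample_text = "Peter is a good boy. He lives in London. He likes playing football."
print(sent_tokenize(sample_text))
# ['Peter is a good boy.', 'He lives in London.', 'He likes playing football.']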
2. Pre-process text
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def text_preprocessing(sentences):
    """
    Pre-process text to remove unnecessary words.
"""
print('Preprocessing text')
stop_words = set(stopwords.words('english'))
clean_words = []
for sent in sentences:
words = word_tokenize(sent)
words = [ps.stem(word.lower()) for word in words if word.isalnum()]
clean_words += [word for word in words if word not in stop_words]
return clean_words
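Calling the function on one of the sample sentences returns lowercased, stemmed tokens with punctuation and stopwords removed (the output shown is illustrative; nltk.download('stopwords') is required once):

print(text_preprocessing(["Peter is a good boy. He lives in London."]))
# ['peter', 'good', 'boy', 'live', 'london']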
3. Create term frequency (tf) matrix
def create_tf_matrix(sentences: list) -> dict:
"""
Here document refers to a sentence.
TF(t) = (Number of times the term t appears in a document) / (Total number of terms in the document)
"""
    print('Creating tf matrix.')

    tf_matrix = {}

    for sentence in sentences:
        tf_table = {}

        words_count = len(word_tokenize(sentence))
        clean_words = text_preprocessing([sentence])

        # Determining frequency of words in the sentence
        word_freq = {}
        for word in clean_words:
            word_freq[word] = (word_freq[word] + 1) if word in word_freq else 1

        # Calculating tf of the words in the sentence
        for word, count in word_freq.items():
            tf_table[word] = count / words_count

        tf_matrix[sentence[:15]] = tf_table

    return tf_matrix
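For the three sample sentences from step 1, the matrix is keyed by the first 15 characters of each sentence, and the values are the relative frequencies of the cleaned words (numbers rounded; shown only as a sketch):

sentences = sent_tokenize(sample_text)  # the three sample sentences from step 1
tf_matrix = create_tf_matrix(sentences)
# {'Peter is a good': {'peter': 0.167, 'good': 0.167, 'boy': 0.167},
#  'He lives in Lon': {'live': 0.2, 'london': 0.2}, ...}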
4. Create idf matrix
import math

def create_idf_matrix(sentences: list) -> dict:
"""
    IDF(t) = log_10(Total number of documents / Number of documents with term t in it)
"""
print('Creating idf matrix.')
idf_matrix = {}
documents_count = len(sentences)
sentence_word_table = {}
# Getting words in the sentence
for sentence in sentences:
clean_words = text_preprocessing([sentence])
sentence_word_table[sentence[:15]] = clean_words
# Determining word count table with the count of sentences which contains the word.
word_in_docs = {}
for sent, words in sentence_word_table.items():
for word in words:
word_in_docs[word] = (word_in_docs[word] + 1) if word in word_in_docs else 1
# Determining idf of the words in the sentence.
for sent, words in sentence_word_table.items():
idf_table = {}
for word in words:
idf_table[word] = math.log10(documents_count / float(word_in_docs[word]))
idf_matrix[sent] = idf_table
return idf_matrix
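Continuing with the same three sample sentences, every remaining word happens to occur in exactly one sentence, so each gets an idf of log10(3/1) ≈ 0.477 (values rounded, shown as a sketch):

idf_matrix = create_idf_matrix(sentences)
# {'Peter is a good': {'peter': 0.477, 'good': 0.477, 'boy': 0.477},
#  'He lives in Lon': {'live': 0.477, 'london': 0.477}, ...}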
5. Calculate sentence tf-idf
def create_tf_idf_matrix(tf_matrix, idf_matrix) -> dict:
    """
    Create a tf-idf matrix: for each word, multiply its tf value by its idf value.
"""
print('Calculating tf-idf of sentences.')
tf_idf_matrix = {}
for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):
tf_idf_table = {}
for (word1, value1), (word2, value2) in zip(f_table1.items(), f_table2.items()):
tf_idf_table[word1] = float(value1 * value2)
tf_idf_matrix[sent1] = tf_idf_table
return tf_idf_matrix
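The zip pairing works because both matrices are built from the same sentence list in the same order, so their keys line up. Continuing the running example (values rounded):

tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
# {'Peter is a good': {'peter': 0.08, 'good': 0.08, 'boy': 0.08},
#  'He lives in Lon': {'live': 0.095, 'london': 0.095}, ...}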
6. Calculate sentence scores
def create_sentence_score_table(tf_idf_matrix) -> dict:
    """
    Determine the score of each sentence as the average tf-idf value of its words.
"""
print('Creating sentence score table.')
sentence_value = {}
for sent, f_table in tf_idf_matrix.items():
total_score_per_sentence = 0
count_words_in_sentence = len(f_table)
for word, score in f_table.items():
total_score_per_sentence += score
sentence_value[sent] = total_score_per_sentence / count_words_in_sentence
return sentence_value
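Each sentence now gets a single score: the mean tf-idf of its words. For the running example (rounded):

sentence_value = create_sentence_score_table(tf_idf_matrix)
# {'Peter is a good': 0.08, 'He lives in Lon': 0.095, 'He likes playin': 0.095}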
7. Determine threshold
def find_average_score(sentence_value):
    """
    Calculate the average sentence score from the sentence score table.
    """
    print('Finding average score')
    total = 0
    for val in sentence_value:
        total += sentence_value[val]
    average = total / len(sentence_value)
    return average
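For the running example the threshold is simply the mean of the three scores, roughly (0.08 + 0.095 + 0.095) / 3 ≈ 0.09:

threshold = find_average_score(sentence_value)
# ≈ 0.09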
8. Generate summary
def generate_summary(sentences, sentence_value, threshold):
    """
    Generate a summary from the sentences whose score is greater than or equal to the threshold.
"""
print('Generating summary')
sentence_count = 0
summary = ''
for sentence in sentences:
if sentence[:15] in sentence_value and sentence_value[sentence[:15]] >= threshold:
summary += sentence + " "
sentence_count += 1
return summary
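Putting the steps together, a minimal end-to-end driver might look like the sketch below. It assumes all of the functions above are defined in the same script, and the sample text is only for illustration:

import nltk

nltk.download('punkt')      # sentence tokenizer model
nltk.download('stopwords')  # stopword list

text = "Peter is a good boy. He lives in London. He likes playing football."

sentences = sent_tokenize(text)
tf_matrix = create_tf_matrix(sentences)
idf_matrix = create_idf_matrix(sentences)
tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
sentence_value = create_sentence_score_table(tf_idf_matrix)
threshold = find_average_score(sentence_value)
summary = generate_summary(sentences, sentence_value, threshold)
print(summary)
# Only the sentences scoring at or above the threshold are kept, e.g.
# "He lives in London. He likes playing football."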
Full Code Implementation
Get the full code here.