Text Summarization

Ashin Shakya
4 min read · Apr 29, 2020

--

Using TF-IDF based Sentence Scoring

Brief

Sentence scoring using tf-idf is one of the extractive approaches to text summarization. TF-IDF stands for Term Frequency-Inverse Document Frequency. It is the product of two statistics.

  1. Term Frequency (TF) : It is the number of times the word occurs in the document.
  2. Inverse Document Frequency (IDF) : It is a measure of how much information the word provides, i.e., whether it is common or rare across all documents.
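
To make the product concrete, here is a minimal sketch on a made-up toy corpus. The documents, function names, and the use of relative frequency for TF are assumptions chosen for illustration (relative frequency is what the steps further down use); it is not the article's implementation.

```python
import math

# Hypothetical toy corpus: each "document" is a list of already-cleaned words.
documents = [
    ["cat", "sat", "mat"],
    ["cat", "chased", "mouse"],
    ["dog", "barked"],
]

def term_frequency(term, document):
    """Relative frequency of `term` within one document."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    """log_e(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    """Product of the two statistics for one term in one document."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

common = tf_idf("cat", documents[0], documents)  # "cat" appears in 2 of 3 documents
rare = tf_idf("mat", documents[0], documents)    # "mat" appears in 1 of 3 documents
print(common, rare)
```

Both words occur once in the first document, so their TF is the same; the rarer word ends up with the higher tf-idf because its IDF is larger.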

Steps

  1. Convert text to sentences : Split the input text into a list of sentences.
  2. Pre-process text : Clean the sentences by removing unnecessary words, stopwords, punctuation, etc.
  3. Create term frequency (tf) matrix : It shows the frequency of words in each sentence. Here each sentence is treated as a document. We will calculate relative frequency to represent the tf instead of using actual frequency. It is calculated as t / T where,
    t = Number of times the term appears in the document
    T = Total number of terms in the document
  4. Create idf matrix : It shows how rare each word is across the sentences of the whole text. It is calculated as log_e(D/d) where,
    D = Total number of documents
    d = Number of documents with term t in it
  5. Calculate sentence tf-idf : It is the product of tf and idf for each word in the sentence and shows the importance of each word in the sentence.
  6. Calculate sentence scores : The score of a sentence is the average of the tf-idf values of its words. It is calculated as T / n where,
    T = Total tf-idf of words in the sentence
    n = Number of distinct words in the sentence
  7. Determine threshold : Threshold is the average value of the scores of the sentences. It is calculated as S / s where,
    S = Total sum of scores of sentences
    s = Number of sentences
  8. Generate summary : Generate a summary by extracting the sentences whose scores are greater than or equal to the threshold value.
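
As a quick illustration of steps 6 to 8, the sketch below starts from a hand-written tf-idf table. The sentence keys and numbers are made up for the example, and the selection uses `>=`, which is what `generate_summary` in the implementation section does.

```python
# Hypothetical tf-idf values per sentence (keys and numbers are made up).
tf_idf_by_sentence = {
    "s1": {"summari": 0.30, "text": 0.10},
    "s2": {"weather": 0.05, "nice": 0.05},
    "s3": {"tf": 0.20, "idf": 0.20, "score": 0.26},
}

# Step 6: sentence score = total tf-idf / number of distinct words.
scores = {
    key: sum(values.values()) / len(values)
    for key, values in tf_idf_by_sentence.items()
}

# Step 7: threshold = sum of sentence scores / number of sentences.
threshold = sum(scores.values()) / len(scores)

# Step 8: keep sentences at or above the threshold, in their original order.
selected = [key for key, score in scores.items() if score >= threshold]
print(scores, threshold, selected)
```

With these numbers the scores come out to roughly 0.20, 0.05, and 0.22, the threshold to roughly 0.157, so "s1" and "s3" are kept and "s2" is dropped.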

Implementation

Requirements
Python, the NLTK library, and the built-in math module

Code

1. Convert text to sentences

from nltk.tokenize import sent_tokenize

# sent_tokenize needs NLTK's Punkt tokenizer data (e.g. nltk.download('punkt')).
text = "Your text goes here."
sentences = sent_tokenize(text)

2. Pre-process text

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def text_preprocessing(sentences):
    """
    Pre-process text to remove unnecessary words.
    """
    print('Preprocessing text')
    stop_words = set(stopwords.words('english'))
    clean_words = []
    for sent in sentences:
        words = word_tokenize(sent)
        # Keep alphanumeric tokens only, lowercased.
        words = [word.lower() for word in words if word.isalnum()]
        # Remove stopwords before stemming: a stemmed form such as
        # "thi" (from "this") would no longer match the stopword list.
        clean_words += [ps.stem(word) for word in words if word not in stop_words]
    return clean_words

3. Create term frequency (tf) matrix

def create_tf_matrix(sentences: list) -> dict:
    """
    Here document refers to a sentence.
    TF(t) = (Number of times the term t appears in a document) / (Total number of terms in the document)
    """
    print('Creating tf matrix.')
    tf_matrix = {}
    for sentence in sentences:
        tf_table = {}
        words_count = len(word_tokenize(sentence))
        clean_words = text_preprocessing([sentence])
        # Determining frequency of words in the sentence
        word_freq = {}
        for word in clean_words:
            word_freq[word] = (word_freq[word] + 1) if word in word_freq else 1
        # Calculating tf of the words in the sentence
        for word, count in word_freq.items():
            tf_table[word] = count / words_count
        # The first 15 characters of the sentence serve as its key.
        tf_matrix[sentence[:15]] = tf_table
    return tf_matrix
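
One thing to watch in the function above: each sentence is keyed by its first 15 characters, so two sentences that start the same way share a single entry and the later one overwrites the earlier. The sketch below illustrates the collision with two made-up sentences and shows one possible workaround, keying by sentence position. The workaround is a suggestion of this note, not part of the original code.

```python
# Two hypothetical sentences that share their first 15 characters.
sentences = [
    "The quick brown fox jumps over the dog.",
    "The quick brown bear sleeps all winter.",
]

# Keying by the first 15 characters, as create_tf_matrix does:
# both sentences map to "The quick brown", so only one entry survives.
by_prefix = {sentence[:15]: sentence for sentence in sentences}

# Keying by position keeps every sentence distinct.
by_index = {index: sentence for index, sentence in enumerate(sentences)}

print(len(by_prefix), len(by_index))
```

If the later steps were switched to positional keys, every function that currently uses `sentence[:15]` would need the same change so the tables still line up.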

4. Create idf matrix

import math

def create_idf_matrix(sentences: list) -> dict:
    """
    IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
    """
    print('Creating idf matrix.')

    idf_matrix = {}

    documents_count = len(sentences)
    sentence_word_table = {}

    # Getting words in the sentence
    for sentence in sentences:
        clean_words = text_preprocessing([sentence])
        sentence_word_table[sentence[:15]] = clean_words

    # Counting, for each word, the number of sentences that contain it.
    # set(words) ensures a word repeated within one sentence is counted once.
    word_in_docs = {}
    for sent, words in sentence_word_table.items():
        for word in set(words):
            word_in_docs[word] = (word_in_docs[word] + 1) if word in word_in_docs else 1

    # Determining idf of the words in the sentence.
    for sent, words in sentence_word_table.items():
        idf_table = {}
        for word in words:
            # Natural log, matching the log_e in the formula above.
            idf_table[word] = math.log(documents_count / float(word_in_docs[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix

5. Calculate sentence tf-idf

def create_tf_idf_matrix(tf_matrix, idf_matrix) -> dict:
    """
    Create a tf-idf matrix by multiplying the tf and idf of each individual word.
    """
    print('Calculating tf-idf of sentences.')

    tf_idf_matrix = {}

    for sent, tf_table in tf_matrix.items():
        # Both matrices are built from the same sentences, so they share keys.
        idf_table = idf_matrix[sent]
        tf_idf_table = {}

        # Look up each word's idf by key instead of relying on dict ordering.
        for word, tf_value in tf_table.items():
            tf_idf_table[word] = float(tf_value * idf_table[word])

        tf_idf_matrix[sent] = tf_idf_table

    return tf_idf_matrix

6. Calculate sentence scores

def create_sentence_score_table(tf_idf_matrix) -> dict:
    """
    Score each sentence as the average tf-idf value of its words.
    """
    print('Creating sentence score table.')

    sentence_value = {}

    for sent, f_table in tf_idf_matrix.items():
        count_words_in_sentence = len(f_table)
        # Skip sentences with no words left after pre-processing,
        # which would otherwise cause a division by zero.
        if count_words_in_sentence == 0:
            continue

        total_score_per_sentence = 0
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentence_value[sent] = total_score_per_sentence / count_words_in_sentence

    return sentence_value

7. Determine threshold

def find_average_score(sentence_value):
    """
    Calculate the average sentence score from the sentence score table.
    """
    print('Finding average score')

    # Named total rather than sum to avoid shadowing the built-in.
    total = 0
    for val in sentence_value:
        total += sentence_value[val]

    average = total / len(sentence_value)

    return average

8. Generate summary

def generate_summary(sentences, sentence_value, threshold):
    """
    Build the summary from sentences whose score is at or above the threshold.
    """
    print('Generating summary')

    summary = ''

    for sentence in sentences:
        # Sentences are looked up by the same 15-character key used earlier.
        if sentence[:15] in sentence_value and sentence_value[sentence[:15]] >= threshold:
            summary += sentence + " "

    return summary
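
Putting it together: the sentences from `sent_tokenize` go into `create_tf_matrix` and `create_idf_matrix`, their outputs go into `create_tf_idf_matrix`, then `create_sentence_score_table`, `find_average_score`, and finally `generate_summary`. For readers who want to try the flow without installing NLTK, here is a compact, dependency-free sketch of the same idea. It is a simplification and not the code above: the regex sentence splitter, the tiny stopword set, the lack of stemming, positional sentence keys, and dividing tf by the cleaned word count are all assumptions made for this example.

```python
import math
import re

# Assumption: a tiny illustrative stopword list, not NLTK's.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "on", "the", "to", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_words(sentence):
    """Lowercase alphanumeric tokens with stopwords removed (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def summarize(text):
    """Return the sentences whose mean tf-idf is at or above the average."""
    sentences = split_sentences(text)
    words_per_sentence = [clean_words(s) for s in sentences]

    # Number of sentences containing each word (needed for idf).
    sentences_with_word = {}
    for words in words_per_sentence:
        for word in set(words):
            sentences_with_word[word] = sentences_with_word.get(word, 0) + 1

    # Score each sentence as the mean tf-idf of its distinct words.
    scores = {}
    for index, words in enumerate(words_per_sentence):
        if not words:
            continue  # nothing left to score after cleaning
        distinct = set(words)
        total = 0.0
        for word in distinct:
            tf = words.count(word) / len(words)
            idf = math.log(len(sentences) / sentences_with_word[word])
            total += tf * idf
        scores[index] = total / len(distinct)

    if not scores:
        return ""

    # Threshold is the average sentence score.
    threshold = sum(scores.values()) / len(scores)
    return " ".join(
        sentence
        for index, sentence in enumerate(sentences)
        if index in scores and scores[index] >= threshold
    )

print(summarize("Cats purr. Cats sleep a lot. Dogs bark loudly at night."))
```

Because the score is an average over the distinct words of a sentence, short sentences made of rare words tend to score well, which is worth keeping in mind when judging the summaries this approach produces.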

Full Code Implementation

Get the full code here.
