Text Summarization
4 min read · Apr 29, 2020
Using TF-IDF based Sentence Scoring
Brief
Sentence scoring using tf-idf is one of the extractive approaches for text summarization. TF-IDF stands for Term Frequency — Inverse Document Frequency. It is the product of two statistics.
- Term Frequency (TF) : It is the number of times the word occurs in the document.
- Inverse Document Frequency (IDF) : It is the measure of how much information the word provides, i.e., if it’s common or rare across all documents.
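For example (with made-up numbers purely for illustration): if a word appears 3 times in a 20-word sentence, its tf is 3/20 = 0.15; if it appears in 2 of 10 sentences, its idf is log10(10/2) ≈ 0.7, so its tf-idf is about 0.15 × 0.7 ≈ 0.1.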
Steps
- Convert text to sentences : Convert a single text into a list of sentences.
- Pre-process text : Clean the sentences by removing unnecessary words, stopwords, punctuation, etc.
- Create term frequency (tf) matrix : It shows the frequency of words in each sentence. We will use the relative frequency to represent tf instead of the raw count. It is calculated as t / T where,
t = Number of times the term appears in the document
T = Total number of terms in the document
- Create idf matrix : It shows the importance of words in each sentence with respect to the whole document. It is calculated as log(D / d) where,
D = Total number of documents
d = Number of documents with term t in it
(The implementation below uses a base-10 logarithm; the choice of base only scales every score by the same factor.)
- Calculate sentence tf-idf : It is the product of tf and idf for each word in the sentence and shows the importance of each word within the sentence.
- Calculate sentence scores : The score of a sentence is the average tf-idf value of its words. It is calculated as T / n where,
T = Total tf-idf of words in the sentence
n = Number of distinct words in the sentence
- Determine threshold : The threshold is the average of the sentence scores. It is calculated as S / s where,
S = Total sum of the sentence scores
s = Number of sentences
- Generate summary : Generate the summary by extracting the sentences whose scores are greater than or equal to the threshold value.
Implementation
Requirement
Python, NLTK library, Math library
Code
1. Convert text to sentences
from nltk.tokenize import sent_tokenize

text = "Your text goes here."
sentences = sent_tokenize(text)
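As a quick check, the tokenizer splits raw text into a list of sentences. The sample text below is made up for illustration, and NLTK's punkt model must be downloaded once with nltk.download('punkt'):

sample_text = "Peter is a good boy. He lives in London. He likes playing football."
print(sent_tokenize(sample_text))
# ['Peter is a good boy.', 'He lives in London.', 'He likes playing football.']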
2. Pre-process text
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def text_preprocessing(sentences):
    """
    Pre-process text to remove unnecessary words.
"""
print('Preprocessing text')
stop_words = set(stopwords.words('english'))
clean_words = []
for sent in sentences:
words = word_tokenize(sent)
words = [ps.stem(word.lower()) for word in words if word.isalnum()]
clean_words += [word for word in words if word not in stop_words]
return clean_words
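Calling the function on one of the sample sentences returns lowercased, stemmed tokens with punctuation and stopwords removed (the output shown is illustrative; nltk.download('stopwords') is required once):

print(text_preprocessing(["Peter is a good boy. He lives in London."]))
# ['peter', 'good', 'boy', 'live', 'london']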
3. Create term frequency (tf) matrix
def create_tf_matrix(sentences: list) -> dict:
"""
Here document refers to a sentence.
TF(t) = (Number of times the term t appears in a document) / (Total number of terms in the document)
"""
    print('Creating tf matrix.')

    tf_matrix = {}

    for sentence in sentences:
        tf_table = {}

        words_count = len(word_tokenize(sentence))
        clean_words = text_preprocessing([sentence])

        # Determining frequency of words in the sentence
        word_freq = {}
        for word in clean_words:
            word_freq[word] = (word_freq[word] + 1) if word in word_freq else 1

        # Calculating tf of the words in the sentence
        for word, count in word_freq.items():
            tf_table[word] = count / words_count

        tf_matrix[sentence[:15]] = tf_table

    return tf_matrix
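For the three sample sentences from step 1, the matrix is keyed by the first 15 characters of each sentence, and the values are the relative frequencies of the cleaned words (numbers rounded; shown only as a sketch):

sentences = sent_tokenize(sample_text)  # the three sample sentences from step 1
tf_matrix = create_tf_matrix(sentences)
# {'Peter is a good': {'peter': 0.167, 'good': 0.167, 'boy': 0.167},
#  'He lives in Lon': {'live': 0.2, 'london': 0.2}, ...}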
4. Create idf matrix
import math

def create_idf_matrix(sentences: list) -> dict:
"""
    IDF(t) = log_10(Total number of documents / Number of documents with term t in it)
"""
print('Creating idf matrix.')
idf_matrix = {}
documents_count = len(sentences)
sentence_word_table = {}
# Getting words in the sentence
for sentence in sentences:
clean_words = text_preprocessing([sentence])
sentence_word_table[sentence[:15]] = clean_words
# Determining word count table with the count of sentences which contains the word.
word_in_docs = {}
for sent, words in sentence_word_table.items():
for word in words:
word_in_docs[word] = (word_in_docs[word] + 1) if word in word_in_docs else 1
# Determining idf of the words in the sentence.
for sent, words in sentence_word_table.items():
idf_table = {}
for word in words:
idf_table[word] = math.log10(documents_count / float(word_in_docs[word]))
idf_matrix[sent] = idf_table
return idf_matrix
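Continuing with the same three sample sentences, every remaining word happens to occur in exactly one sentence, so each gets an idf of log10(3/1) ≈ 0.477 (values rounded, shown as a sketch):

idf_matrix = create_idf_matrix(sentences)
# {'Peter is a good': {'peter': 0.477, 'good': 0.477, 'boy': 0.477},
#  'He lives in Lon': {'live': 0.477, 'london': 0.477}, ...}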
5. Calculate sentence tf-idf
def create_tf_idf_matrix(tf_matrix, idf_matrix) -> dict:
    """
    Create a tf-idf matrix: for each word, multiply its tf value by its idf value.
"""
print('Calculating tf-idf of sentences.')
tf_idf_matrix = {}
for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):
tf_idf_table = {}
for (word1, value1), (word2, value2) in zip(f_table1.items(), f_table2.items()):
tf_idf_table[word1] = float(value1 * value2)
tf_idf_matrix[sent1] = tf_idf_table
return tf_idf_matrix
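The zip pairing works because both matrices are built from the same sentence list in the same order, so their keys line up. Continuing the running example (values rounded):

tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
# {'Peter is a good': {'peter': 0.08, 'good': 0.08, 'boy': 0.08},
#  'He lives in Lon': {'live': 0.095, 'london': 0.095}, ...}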
6. Calculate sentence scores
def create_sentence_score_table(tf_idf_matrix) -> dict:
    """
    Determine the score of each sentence as the average tf-idf value of its words.
"""
print('Creating sentence score table.')
sentence_value = {}
for sent, f_table in tf_idf_matrix.items():
total_score_per_sentence = 0
count_words_in_sentence = len(f_table)
for word, score in f_table.items():
total_score_per_sentence += score
sentence_value[sent] = total_score_per_sentence / count_words_in_sentence
return sentence_value
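Each sentence now gets a single score: the mean tf-idf of its words. For the running example (rounded):

sentence_value = create_sentence_score_table(tf_idf_matrix)
# {'Peter is a good': 0.08, 'He lives in Lon': 0.095, 'He likes playin': 0.095}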
7. Determine threshold
def find_average_score(sentence_value):
    """
    Calculate the average sentence score from the sentence score table.
    """
    print('Finding average score')
    total = 0
    for val in sentence_value:
        total += sentence_value[val]
    average = total / len(sentence_value)
    return average
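For the running example the threshold is simply the mean of the three scores, roughly (0.08 + 0.095 + 0.095) / 3 ≈ 0.09:

threshold = find_average_score(sentence_value)
# ≈ 0.09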
8. Generate summary
def generate_summary(sentences, sentence_value, threshold):
    """
    Generate a summary from the sentences whose score is greater than or equal to the threshold.
"""
print('Generating summary')
sentence_count = 0
summary = ''
for sentence in sentences:
if sentence[:15] in sentence_value and sentence_value[sentence[:15]] >= threshold:
summary += sentence + " "
sentence_count += 1
return summary
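Putting the steps together, a minimal end-to-end driver might look like the sketch below. It assumes all of the functions above are defined in the same script, and the sample text is only for illustration:

import nltk

nltk.download('punkt')      # sentence tokenizer model
nltk.download('stopwords')  # stopword list

text = "Peter is a good boy. He lives in London. He likes playing football."

sentences = sent_tokenize(text)
tf_matrix = create_tf_matrix(sentences)
idf_matrix = create_idf_matrix(sentences)
tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
sentence_value = create_sentence_score_table(tf_idf_matrix)
threshold = find_average_score(sentence_value)
summary = generate_summary(sentences, sentence_value, threshold)
print(summary)
# Only the sentences scoring at or above the threshold are kept, e.g.
# "He lives in London. He likes playing football."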
Full Code Implementation
Get the full code here.