Summarize Documents using Tf-Idf

Alexander Crosson
5 min read · Jun 14, 2016

In this post I’m going to explain how to use Python and a natural language processing (NLP) technique known as Term Frequency–Inverse Document Frequency (tf-idf) to summarize documents. I’ll be using sklearn along with nltk to accomplish this task.

The techniques outlined in this post were largely taken from this paper.

Our goal is to take a given document, whether that’s a blog post, news article, random website, or any other body of text, and extract the few sentences that best summarize it.

Before going through the code, we first need to understand how tf-idf works. The Term Frequency is a count of how many times a word occurs in a given document (the same counts you get from a bag-of-words model). The Document Frequency is the number of documents in a corpus that contain a given word, and the Inverse Document Frequency is, roughly, the inverse of that count (usually log-scaled). tf-idf weights words according to how important they are: words that appear frequently across many documents get a lower weighting, while infrequent ones get a higher weighting. Below is an explanation from Wikipedia.

The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

tf-idf is used in a number of NLP applications such as text mining, search ranking and summarization. It’s also very intuitive, but note that if you implement your own version you need to take the natural log of the inverse document frequencies to dampen them. You can read about why this is necessary in this stack overflow post.
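To make the weighting concrete, here is a minimal from-scratch sketch (not the sklearn implementation used later in this post), using the simple log-scaled idf variant on toy, tokenized documents:

import math

def tf_idf(term, doc, corpus):
    # Toy tf-idf for one tokenized doc within a corpus of tokenized docs
    tf = doc.count(term)                              # term frequency
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # log-scaled inverse
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("cat", corpus[0], corpus))  # rarer word -> positive weight
print(tf_idf("the", corpus[0], corpus))  # appears everywhere -> 0.0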

Now that we’ve discussed the basic mechanics behind tf-idf, let’s talk about the general process we’ll follow to summarize a document. Below is an outline of the steps needed:

  1. Preprocess the document
  2. Import a corpus used for training
  3. Create a count vector
  4. Build a tf-idf matrix
  5. Score each sentence
  6. Summarize using top ranking sentences

Preprocessing the document

There are a number of preprocessing techniques that can be applied to a document, the most obvious being the removal of non-alphanumeric characters, stop words, and any unnecessary punctuation. When doing NLP, we often find ourselves working with tokenized sentences. Sentences are usually split (tokenized) wherever there is a period, so any extra periods within a sentence, for example those used in acronyms (e.g. U.S.A.), will incorrectly split the sentence. It’s therefore good practice to explore the types of documents you’ll be working with to identify these nuances so that they can be properly addressed. Below is an example of how to remove a few of these “dangerous” punctuation marks.

import re

# Keep only letters, spaces, periods and hyphens, then strip the leftovers
document = re.sub('[^A-Za-z .-]+', ' ', document)
document = document.replace('-', '')
document = document.replace('…', '')
document = document.replace('Mr.', 'Mr').replace('Mrs.', 'Mrs')
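The later steps assume the cleaned document has been split into sentences, and each sentence into words. One way to do that with NLTK’s tokenizers (a sketch, not part of the original snippet) looks like this:

from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the 'punkt' tokenizer models: nltk.download('punkt')
sentences = [word_tokenize(s) for s in sent_tokenize(document)]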

Gathering a corpus for training

There are many places to find a corpus that can be used to fit the TfidfTransformer (NLTK, for example, ships with several ready-to-use corpora).

Just find one that matches your requirements, and choose one within the context of the documents you’re trying to summarize. You probably won’t get a good summary of a blog post if you fit the model using a chat log corpus.
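As a minimal sketch, assuming the NLTK Brown corpus as the training data (my choice here; any corpus of plain-text documents will do), train_data could be built like this:

import nltk
from nltk.corpus import brown

nltk.download('brown')  # one-time download of the corpus

# One training "document" per file in the corpus, joined back into raw text
train_data = [' '.join(brown.words(fileid)) for fileid in brown.fileids()]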

Creating a count vector

To create a count vector (a.k.a. bag of words) we’ll use sklearn’s CountVectorizer and fit it on the training corpus; the same vectorizer is later used to transform the new document we’re looking to summarize.

from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training corpus, then build the count matrix
count_vect = CountVectorizer()
count_vect = count_vect.fit(train_data)
freq_term_matrix = count_vect.transform(train_data)

Building the tf-idf matrix

Creating the tf-idf matrix is as simple as passing the freq_term_matrix we defined above into TfidfTransformer’s fit method.

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

Once we have the tf-idf transformer fitted, we can take the original document, vectorize it, and transform it into a tf-idf matrix.

doc_freq_term = count_vect.transform([doc])
doc_tfidf_matrix = tfidf.transform(doc_freq_term)

Note that the output of doc_tfidf_matrix will be a matrix with a single row, because we have only passed in one document.
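As a quick sanity check (assuming the variable names above), the matrix has one row and one column per term in the learned vocabulary:

print(doc_tfidf_matrix.shape)  # (1, len(count_vect.vocabulary_))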

Scoring each sentence

To rank the sentences, we need to score each one using the tf-idf values calculated above. Rather than simply taking the sum of all the values for a given sentence, we’ll be using some additional techniques outlined in this paper. These include:

  1. Only summing tf-idf values where the underlying word is a noun. This total is then divided by the sum of all the document’s tf-idf values.
  2. Add an additional value to a given sentence if it has any words that are included in the title of the document. This value is equal to the count of all words in a sentence found in the title divided by the total number of words in the title. This “heading similarity score” is then multiplied by an arbitrary constant (0.1) and added to the tf-idf value.
  3. Apply a position weighting. Each sentence gets a weight between 0 and 1, spaced evenly according to its position in the document. For example, if there are 10 sentences in a document, sentence nine’s “position weighting” would be 0.9. This weighting is then multiplied by the value calculated in point 2.

Filtering for nouns is simple enough. You can tag each word using NLTK’s Part of Speech tagger or build a custom n-gram tagger. I covered both of these in my previous post. After you’ve tagged the sentences, it’s as simple as looking up the index value (bag-of-words mapping) for each word in a sentence and finding the tf-idf score in doc_tfidf_matrix.

# feature_names can come from count_vect.get_feature_names();
# doc_matrix is the document's single row of tf-idf scores
tfidf_sent = [[doc_matrix[feature_names.index(w.lower())]
               for w in sent if w.lower() in feature_names]
              for sent in sentences]
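For the noun filter itself, a minimal sketch using NLTK’s part-of-speech tagger (one of the two options mentioned above) might look like this:

import nltk

nltk.download('averaged_perceptron_tagger')  # one-time download of the tagger model

def nouns_only(sent):
    # Keep only tokens whose Penn Treebank tag starts with "NN" (nouns)
    return [word for word, tag in nltk.pos_tag(sent) if tag.startswith('NN')]

noun_sents = [nouns_only(sent) for sent in sentences]

You would then run the tf-idf lookup above over noun_sents instead of the raw sentences.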

Calculating the heading similarity score is fairly straightforward. Here is a sketch of what it might look like, assuming the title has also been tokenized into a list of words:
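# title and sentences are lists of word tokens; tfidf_sent holds the
# per-word tf-idf scores computed above (noun-filtered if you applied step 1)
def heading_score(title, sent):
    # Fraction of title words found in the sentence, scaled by 0.1 (step 2)
    title_words = [w.lower() for w in title]
    matches = sum(1 for w in sent if w.lower() in title_words)
    return 0.1 * matches / len(title_words)

doc_total = sum(sum(scores) for scores in tfidf_sent)   # step 1 denominator
sent_values = [sum(scores) / doc_total + heading_score(title, sent)
               for scores, sent in zip(tfidf_sent, sentences)]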

And here’s how to calculate the position weights:

ranked_sents = [sent * ((i + 1) / len(sent_values))
                for i, sent in enumerate(sent_values)]

After applying the above, we can finally sort our sentences by score in descending order and choose the top 3 (or 4, 5, 6 …). And boom, we have a summary based on the most important sentences found in the document.
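A minimal sketch of that last step, keeping the chosen sentences in their original document order so the summary reads naturally:

# Indices of the three highest-scoring sentences
top_idx = sorted(range(len(ranked_sents)),
                 key=lambda i: ranked_sents[i], reverse=True)[:3]

# Re-sort the chosen indices so the summary follows the document's order
summary = ' '.join(' '.join(sentences[i]) for i in sorted(top_idx))
print(summary)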

I hope the descriptions above were enough to get you started using tf-idf and doing basic document summarization. All of the code for this project can be found here.

