Building a text summarizer in Python with NLTK and scikit-learn’s TfidfVectorizer class

An easy tutorial with a downloadable working example.

Lucía Llavero Company
Saturdays.AI
6 min read · May 19, 2020


Photo by creativeart on Freepik.

If you would like to test the code below, please download the files facil_study_desktop.py and ai.txt to your machine. Do not forget to check README.md for setup instructions.

First of all, we will briefly introduce two topics: Human Language Technologies (HLT) and the automatic generation of extractive summaries.

Human Language Technologies (HLT) are the technologies that deal with human language. They were born as a branch of Artificial Intelligence (AI) to give machines the ability to process textual information and interact with their environment through human language (Natural Language Processing).

Nowadays, Human Language Technologies are present in systems we use in our daily lives, such as search engines, translators and chatbots. These technologies are also used by companies to tailor their advertising messages to our profiles.

The automatic generation of extractive summaries is something we can accomplish with the aid of Natural Language Processing (NLP): it lets us write software that condenses a document into a short summary containing its most important information.

Next, you are going to learn how to code a desktop app that summarizes a given text. This tutorial is based on Akash Panchal’s great tutorial: https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3 , but this time we are using the popular Python machine learning library scikit-learn and, specifically, its very useful TfidfVectorizer class to save time and ensure solid results. If you want to dig deeper into the mathematical basis of TF-IDF, please check the URL above.

This article was written in cooperation with Miguel Ángel García Cumbreras, PhD in Computer Science and professor of Computer Languages and Systems at the University of Jaén.

What does TF-IDF mean and what is this method for?

TF-IDF stands for Term Frequency–Inverse Document Frequency, and it is a measure used to evaluate how important a word is to a document in a collection or corpus. This numerical statistic is widely used in information retrieval and text mining.

  • TF (Term Frequency): estimates how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one, so the term frequency is often divided by the document length.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
  • IDF (Inverse Document Frequency): measures how important a term is. While computing TF, all terms are considered equally important; however, terms like “is”, “of” or “that” may appear many times in a document yet carry little information. Thus, we need to weigh down these frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., TF) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of them. The inverse document frequency (i.e., IDF) is then calculated as log(10,000,000 / 1,000) = 4, using a base-10 logarithm. Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
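As a quick sanity check, here is the same arithmetic in Python, using a base-10 logarithm as in the example above:

import math

tf = 3 / 100                          # term frequency of "cat"
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency = 4.0
print(tf * idf)                       # TF-IDF weight: 0.12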

Please go to http://www.tfidf.com/ for further information about the TF-IDF method.

Let’s start coding:

1. Getting TF-IDF results for a given text

Firstly, we will tokenize the text into sentences to get the documents array, which means that we use the sent_tokenize function from the NLTK natural language processing library to divide the text into sentences. We load the text from the ai.txt file. From now on, we will refer to these sentences as documents.

documents = nltk.sent_tokenize(text)
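For context, here is a minimal sketch of how the text can be loaded and tokenized (it assumes NLTK is installed and downloads the punkt sentence tokenizer data):

import nltk

nltk.download('punkt')  # sentence tokenizer models (one-time download)

# Load the text to summarize from the ai.txt file.
with open('ai.txt', encoding='utf-8') as file:
    text = file.read()

# Split the text into sentences; each sentence is one "document".
documents = nltk.sent_tokenize(text)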

Then, the TfidfVectorizer class works out the TF-IDF results from documents, operating on lemmatized tokens. Since different word forms can share the same lemma, we pass a custom get_lemmatized_tokens function as the tokenizer, which reduces each word to its lemma to ease the text analysis. As previously mentioned, it is very important for the TF-IDF method to know which words appear the most in the text, so if the text contains both study and studying, by taking only the lemma the program will treat them as the same word.

We will also ignore English stop words (note that you can specify another language between the quotes), so terms that do not add any relevant information, like “the” or “at”, are not taken into account.

tfidf_results = TfidfVectorizer(tokenizer=get_lemmatized_tokens, stop_words=stopwords.words('english')).fit_transform(documents)
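The get_lemmatized_tokens function is defined in the accompanying code; here is a minimal sketch of what such a tokenizer can look like, using NLTK’s WordNetLemmatizer (the actual implementation may differ):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('wordnet')    # lemmatizer data (one-time download)
nltk.download('stopwords')  # stop word lists (one-time download)

lemmatizer = WordNetLemmatizer()

def get_lemmatized_tokens(text):
    # Split a document into word tokens and reduce each token to its lemma.
    # Note: without part-of-speech tags the WordNet lemmatizer treats every
    # token as a noun, so some verb forms may not be conflated; a stemmer or
    # POS tagging would give stronger conflation.
    return [lemmatizer.lemmatize(token.lower()) for token in nltk.word_tokenize(text)]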

tfidf_results is a sparse matrix; in numerical analysis, a sparse matrix is a special type of matrix in which most of the elements are zeros. The rows of the matrix correspond to the documents and the columns contain the TF-IDF values of the words.

2. How to access the tfidf_results sparse matrix elements?

To check the number of rows and columns, we will use the shape attribute of the sparse matrix: tfidf_results.shape.

When printed, the tfidf_results sparse matrix shows, for each document, tuples containing the document index and the index of each word whose TF-IDF value is above 0, together with that word’s TF-IDF value. To work with it more easily, we will convert each row into a regular array using the .toarray() method. This method returns a two-dimensional array whose single row contains the TF-IDF values of the document, so we need to take its first position: tfidf_results[i, :].toarray()[0].

Finally, to obtain the TF-IDF value of a particular word, we will do: tfidf_results[i, :].toarray()[0][j].
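Putting these pieces together, a small illustration (the indices i and j are arbitrary here):

# Number of documents (rows) and vocabulary terms (columns).
print(tfidf_results.shape)

i, j = 0, 0  # first document, first term, just for illustration

# .toarray() on a single row returns a 1 x n_terms two-dimensional array,
# so [0] unwraps it into a plain one-dimensional array of TF-IDF values.
row = tfidf_results[i, :].toarray()[0]

print(row[j])  # TF-IDF value of term j in document i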

3. We need a threshold

The threshold is the TF-IDF value that a document needs to reach in order to be included in the summary (it decides which documents will be in the summary). For that, we need a TF-IDF value per document, but so far we only have the TF-IDF values of the words within each document, so we take the average of those values as the TF-IDF value of the document.

Then, the threshold will be the average of the TF-IDF values of all documents; hence, we will create the function below.
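Here is a minimal sketch of get_threshold following the description above (the original implementation may differ, for example in whether zero entries count towards the average):

def get_threshold(tfidf_results):
    # Score each document as the average TF-IDF of its nonzero entries,
    # i.e. of the distinct words that actually occur in it.
    averages = [tfidf_results[i, :].data.mean()
                for i in range(tfidf_results.shape[0])]
    # The threshold is the mean of all document scores.
    return sum(averages) / len(averages)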

4. Finally getting our summary

Lastly, to get the summary we set the following condition: if a document’s average TF-IDF value meets or exceeds the threshold, the document is added to the summary.

To give the user the chance to choose the summary length, we will create a handicap, a variable that takes values from zero to one and scales the final threshold up or down.

Example:

If handicap were 0.75, a document with a TF-IDF value greater than or equal to 75% of what get_threshold(tfidf_results) returns would also be added to the summary and, as a consequence, the summary text will be longer.
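Here is a minimal sketch of how this condition can be applied; the function name get_summary and its signature are assumptions of this sketch:

def get_summary(documents, tfidf_results, handicap=1.0):
    # Keep every document whose average TF-IDF value reaches the
    # handicap-scaled threshold; a lower handicap keeps more documents.
    threshold = get_threshold(tfidf_results)
    summary = []
    for i, document in enumerate(documents):
        average = tfidf_results[i, :].data.mean()
        if average >= threshold * handicap:
            summary.append(document)
    return ' '.join(summary)

print(get_summary(documents, tfidf_results, handicap=0.85))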

5. Summary Example

Here’s a summarization example for the same text file from the beginning of this tutorial, ai.txt. The original text has 338 words and this summary only 222, with a handicap value of 0.85.

This time, for the same text, we get a summary of 138 words when handicap is 1. As you can see, and as mentioned before, higher handicap values return shorter summaries.

That’s all! I hope you find it useful. If you enjoyed it and still want more, you can find the Flask-based web version of this project at: github.com/LuciaLlavero/facil_study .

About Saturdays.AI

Saturdays.AI is an impact-focused organization on a mission to empower diverse individuals to learn Artificial Intelligence in a collaborative and project-based way, beyond the conventional path.


My name is Lucía Llavero Company and I’m a Spanish high-school software developer.