BOW + TF-IDF in Python for unsupervised learning task

Eleonora Fontana
Published in Betacom
Sep 14, 2020

Introduction

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. One of the most interesting NLP tasks is computing the similarity in meaning between texts, i.e. determining how close two texts are in terms of both lexical and semantic similarity.

The aim of this article is to solve an unsupervised machine learning problem of text similarity in Python. The model that we will define is based on two methods: the bag-of-words and the tf-idf.

The first two sections are about the bag of words and tf-idf methods respectively and can be skipped if you already know how they work. We will explain the fundamentals of the two methods and how they can be applied to a set of sentences.

In the third section the cosine similarity will be introduced since we will use it to determine if two texts are close or not.

In the fourth section, we will introduce a text similarity problem which will then be solved in the last section, combining the two models described before.

BOW

The bag-of-words (BOW) model is a method used in NLP and Information Retrieval (IR). In this model, each text is represented as a bag containing all its words, regardless of grammar and word order.

It is commonly used in document classification, where the frequency of each word in the text is one of the features used for training.

Before digging into an example, let’s state some definitions:

  • document: text record;
  • corpus: collection of documents;
  • term: preprocessed word.

The BOW representation includes two things:

  1. a vocabulary of known words,
  2. a measure of the presence of known words.

Suppose we have the following documents:

  • Beth likes apples.
  • Sam does not like apples.

The vocabulary is then composed of all the unique words in our corpus: Beth, likes, apples, Sam, does, not, like.

At the time of vocabulary creation, we can also preprocess the data depending on what we need for our problem. Some of the most common preprocessing techniques are the following.

Lower casing. Each word is transformed into lower case:

  • beth likes apples.
  • sam does not like apples.

Removing punctuation. All punctuation is removed from the sentences:

  • beth likes apples
  • sam does not like apples

Stemming and lemmatization. Inflected words are reduced to their root form and grouped together so they can be analysed as a single item:

  • beth like apple
  • sam do not like apple

Removing stop words. Common words such as “the”, “a”, “be”, “do” are removed from the texts:

  • beth like apple
  • sam not like apple

Suppose we applied all the preprocessing filters listed above. The vocabulary will then be: beth, like, apple, sam, not.
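Later in this article we will use gensim for this kind of preprocessing. As a preview, here is a minimal sketch based on gensim’s preprocess_string; note that its default stemmer and stop-word list may produce slightly different tokens than the hand-worked example above (for instance “appl” instead of “apple”):

from gensim.parsing.preprocessing import preprocess_string

docs = ["Beth likes apples.", "Sam does not like apples."]

# preprocess_string lowercases, strips punctuation, removes stop words and stems
for doc in docs:
    print(preprocess_string(doc))
# e.g. ['beth', 'like', 'appl'] and ['sam', 'like', 'appl'];
# the exact output depends on gensim's default filters and stop-word list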

The last step is transforming documents into vectors.

Each vector will have 5 entries since the vocabulary is made up of 5 words. The entries represent the number of occurrences of the corresponding word in the sentence. In our example we will have:

  • “Beth likes apples” → [1, 1, 1, 0, 0]
  • “Sam does not like apples” → [0, 1, 1, 1, 1]
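With gensim, which we will use later on, the same vectorization can be sketched as follows. Note that doc2bow returns sparse (token_id, count) pairs rather than the dense vectors shown above, and that the token ids are assigned by the Dictionary object:

from gensim import corpora

# the two preprocessed documents from the example above
docs = [["beth", "like", "apple"],
        ["sam", "not", "like", "apple"]]

dictionary = corpora.Dictionary(docs)                   # vocabulary of known words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]  # measure of their presence

print(dictionary.token2id)  # e.g. {'apple': 0, 'beth': 1, 'like': 2, 'not': 3, 'sam': 4}
print(bow_corpus)           # sparse (token_id, count) pairs for each document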

Tf-idf

Tf-idf stands for term frequency-inverse document frequency and is a method to measure the importance of a term with respect to a document or a collection of documents.

Since you may not be familiar with it, we would like to explain some terminology.

  • Raw term frequency: count of terms in a given document.
  • Term frequency (tf): normalized raw term frequency.
  • Document frequency (df): number of documents in a corpus that contain a given term.
  • Inverse document frequency (idf): weight that upweights terms that are less frequent in a corpus. It is the logarithmically scaled inverse fraction of the documents that contain the word.

The tf-idf is defined as the product of term frequency and inverse document frequency.

The purpose of this definition is to give more weight to terms that are frequent within a document but rare across the corpus.
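In formulas, a common formulation is the following, where count(t, d) is the raw term frequency, |d| is the number of terms in document d, N is the number of documents in the corpus D, and df(t) is the document frequency of t (implementations differ in the exact normalization and smoothing):

tf(t, d) = count(t, d) / |d|
idf(t, D) = log(N / df(t))
tf-idf(t, d, D) = tf(t, d) · idf(t, D)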

Suppose we have a corpus D made up of the two sentences from the example in the previous section:

  • Beth likes apples.
  • Sam does not like apples.

We will refer to them as d₁ and d₂.

Using the preprocessed documents d₁ = (beth, like, apple) and d₂ = (sam, not, like, apple), the term frequency of the word “apple” is then

tf(“apple”, d₁) = 1/3,  tf(“apple”, d₂) = 1/4.

The inverse document frequency is

idf(“apple”, D) = log(2/2) = 0,

since “apple” occurs in both of the two documents. The tf-idf of the word “apple” in the corpus D is then

tf-idf(“apple”, d, D) = tf(“apple”, d) · idf(“apple”, D) = 0 for both documents,

which reflects the fact that a term appearing in every document carries no discriminative weight.

We will use the TfidfModel provided by the gensim package. For full documentation please check this page.
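Here is a minimal sketch of how it can be applied to the toy corpus above, assuming the dictionary and bow_corpus built in the BOW section. With gensim’s default settings, terms such as “like” and “apple” that occur in every document get an idf of 0 and are dropped from the output:

from gensim.models import TfidfModel

tfidf = TfidfModel(bow_corpus)  # train the model on the toy BOW corpus

for doc in tfidf[bow_corpus]:
    print(doc)
# only 'beth', 'sam' and 'not' receive non-zero weights, since 'like' and 'apple'
# appear in every document of the toy corpus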

Cosine similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space.

In our case, the inner product space is the one defined using the BOW and tf-idf models in which each vector represents a document.

The cosine similarity of two vectors A and B is defined as cos(θ), where θ is the angle between them. Using the Euclidean dot product formula, it can be written as

cos(θ) = (A · B) / (‖A‖ ‖B‖).

For the two BOW vectors of the previous example, [1, 1, 1, 0, 0] and [0, 1, 1, 1, 1], the dot product is 2 and the norms are √3 and 2, so the cosine similarity is 2 / (2√3) ≈ 0.58.

Obviously this toy value does not give us significant information, since the vocabulary contains only five words and there are just two sentences in the corpus. We just wanted to give you an example of how the cosine similarity is computed.
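As a quick check, here is a minimal NumPy sketch (NumPy is not used elsewhere in this article, so this is purely illustrative) that computes the same value for the two BOW vectors:

import numpy as np

# BOW vectors for "Beth likes apples" and "Sam does not like apples"
a = np.array([1, 1, 1, 0, 0])
b = np.array([0, 1, 1, 1, 1])

# cosine similarity = dot product divided by the product of the norms
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # ≈ 0.577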

Problem description

Do you love movies but always forget their titles? Well, that won’t be a problem anymore!

In this example we will perform document similarity between movie plots and a given query.

We will use the Wikipedia Movie Plots Dataset, which is available at this page and consists of ~35,000 movies.

Our solution

Let’s start by installing the latest version of gensim and importing all the packages we need.

!pip install --upgrade gensim

import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents

We can now load the dataset and store the plots into the corpus variable.

In order to avoid RAM saturation, we will only use movies with release year ≥ 2000.

df = pd.read_csv('wiki_movie_plots_deduped.csv', sep=',', usecols=['Release Year', 'Title', 'Plot'])
df = df[df['Release Year'] >= 2000]
text_corpus = df['Plot'].values

The next step is to pre-process the documents. To do so, we will use gensim.parsing.preprocessing.preprocess_documents which will apply the following filters to all the documents:

  • strip_tags: returns a unicode string without tags,
  • strip_punctuation: replaces punctuation characters with spaces,
  • strip_multiple_whitespaces: removes repeating whitespace characters,
  • strip_numeric: removes digits,
  • remove_stopwords: removes stop words,
  • strip_short: removes words shorter than 3 characters,
  • stem_text: transforms the document into lowercase and stems it.

Once the corpus has been preprocessed, we can create the dictionary and convert the corpus into vectors using the bag of words method.

processed_corpus = preprocess_documents(text_corpus)
dictionary = gensim.corpora.Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

We will now build the tf-idf model using the TfidfModel class provided by the gensim library.

As you can see in the code below, we set the smartirs parameter to ‘npu’. SMART (System for the Mechanical Analysis and Retrieval of Text) is an information retrieval system whose notation provides a mnemonic scheme for denoting tf-idf weighting variants in the vector space model.

  • The first letter refers to the term frequency weighting and setting it to “n” means the raw term frequency will be used.
  • The second letter refers to the document frequency weighting and setting it to “p” means the probabilistic idf will be used.
  • The third and last letter refers to the document normalization and setting it to “u” means the pivoted unique normalization will be used.

tfidf = gensim.models.TfidfModel(bow_corpus, smartirs='npu')

The next step is to transform the whole corpus via our model and index it, in preparation for similarity queries.

index = gensim.similarities.MatrixSimilarity(tfidf[bow_corpus])

Let’s now write the plot of the movie we would like to get from the similarity query. Suppose new_doc is the string containing the movie plot. The code to find the 10 most similar movies to it is the following:

new_doc = gensim.parsing.preprocessing.preprocess_string(new_doc)
new_vec = dictionary.doc2bow(new_doc)
vec_bow_tfidf = tfidf[new_vec]
sims = index[vec_bow_tfidf]

# print the 10 most similar movies with their similarity scores
for s in sorted(enumerate(sims), key=lambda item: -item[1])[:10]:
    print(f"{df['Title'].iloc[s[0]]} : {str(s[1])}")

We will go through some examples.

  1. Title: Inception.
    Plot summary: “Infiltrate in minds and extract information through a shared dream world. Different levels of dreams”.
    Results:
    0.1846682 | Dream
    0.1838959 | Inception
    0.1837432 | Let’s Dance
    0.1502161 | Swapner Din
    0.1386628 | Dancing Queen
    0.1363156 | Aalukkoru Aasai
    0.1237058 | The Good Night
    0.1127778 | Darwin
    0.1091054 | Shab
    0.1089082 | Days of Our Own
  2. Title: Wreck-It Ralph.
    Plot summary: “In the arcade at night the videogame character leave their games. The protagonist is a girl from a candy racing game who glitches”.
    Results:
    0.2111345 | Wreck-It Ralph
    0.1654256 | Sipaayi
    0.1640759 | Candy
    0.1554006 | They Came Together
    0.1461911 | Kami-sama no Iu Toori
    0.1247127 | Confession
    0.1138492 | Inferno
    0.1115706 | How to Make a Monster
    0.1115618 | Spring Breakers
    0.1105373 | Molly’s Game
  3. Title: Kill Bill.
    Plot summary: “Blonde goes to Japan and kills many people”.
    Results:
    0.2402752 | A Boy and His Samurai
    0.2356895 | Exam
    0.1869075 | The Tokyo Trial
    0.1634998 | Blonde and Blonder
    0.1517287 | Inazuma Eleven: Saikyō Gundan Ōgre Shūrai
    0.1480807 | Ōoku
    0.1480807 | Ōoku: Emonnosuke Tsunayoshi Hen
    0.1419084 | Vexille
    0.1328358 | Coffin Baby
    0.1308760 | Marrying the Mafia IV
  4. Title: Kill Bill.
    Plot summary: “Blonde bride goes to Japan and kills many people”.
    Results:
    0.2525054 | Kill Bill Volume 1
    0.2525054 | Kill Bill Volume 2
    0.2208529 | Bride Wars
    0.2072306 | A Boy and His Samurai
    0.2032755 | Exam
    0.1612024 | The Tokyo Trial
    0.1486499 | April Bride
    0.1410139 | Blonde and Blonder
    0.1313869 | Cake: A Wedding Story
    0.1308617 | Inazuma Eleven: Saikyō Gundan Ōgre Shūrai
  5. Title: Star Wars.
    Plot summary: “knight lightsaber starship”.
    Results:
    0.1144212 | Summer in February
    0.1006727 | Like Mike
    0.0861193 | Franklin and the Green Knight
    0.0820747 | Step Up: All In
    0.0805561 | Magical Girl Lyrical Nanoha The Movie 2nd A’s
    0.0783476 | The Amati Girls
    0.0766701 | Season of the Witch
    0.0720331 | King Arthur
    0.0676626 | Star Wars: Episode I — The Phantom Menace 3D
    0.0584484 | Star Wars: The Force Unleashed
  6. Title: Star Wars.
    Plot summary: “Jedi knight lightsaber starship”.
    Results:
    0.2096226 | Star Wars: The Clone Wars
    0.1964447 | Star Wars: Episode I — The Phantom Menace 3D
    0.1736593 | Star Wars: Episode II — Attack of the Clones
    0.1713511 | Star Wars: The Force Unleashed
    0.1227892 | Star Wars: Episode III — Revenge of the Sith
    0.0986515 | Star Wars: The Last Jedi
    0.0979152 | Summer in February
    0.0861499 | Like Mike
    0.0736961 | Franklin and the Green Knight
    0.0702349 | Step Up: All In

Conclusions

The results obtained are not good in terms of cosine similarity. The scores are indeed very low and never greater than 0.3.

In the first two examples, even if the cosine similarity is not high, the movies we were looking for are in the top two positions.

In example 3 we tried to get “Kill Bill” using the query “Blonde goes to Japan and kills many people”. The movie was not in the top 10 matches, so we modified the query by adding the word “bride”. The new query (example 4) gave us very good results: the first matches were “Kill Bill Volume 1” and “Kill Bill Volume 2”.

Something similar happened with “Star Wars” in the last two examples. Without the word “Jedi”, the first “Star Wars” movie retrieved was in position 9, whereas adding “Jedi” to the query placed Star Wars movies in all of the top six matches.

Obviously we can do better! We will see how to improve this model in the next article.
