Word Embeddings using BOW and Tf-Idf, with an example


arihant jain
6 min read · Nov 14, 2018

Before we start, have a look at the below examples.

  1. You open Google and search for a news article on the ongoing Champions Trophy and get hundreds of search results about it.
  2. Nate Silver analyzed millions of tweets and correctly predicted the results of 49 out of 50 states in the 2008 U.S. Presidential Election.
  3. You type an English sentence into Google Translate and get an equivalent Chinese translation.

Humans can deal with text quite intuitively, but given that millions of documents are generated every single day, we cannot have humans performing the above three tasks. That would be neither scalable nor effective.

So, how do we make today’s computers perform clustering, classification, etc. on text data, given that they are generally inefficient at handling and processing raw strings or text?

Sure, a computer can match two strings and tell you whether they are the same or not. But how do we make a computer tell us about football or Ronaldo when we search for Messi? How do we make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to these questions lies in creating a representation for words that captures their meanings, their semantic relationships, and the different types of contexts they are used in.

And all of this is implemented using Word Embeddings: numerical representations of text that computers can handle.

You might now be asking: what are Word Embeddings, and why do we need them?

In very simplistic terms, Word Embeddings are texts converted into numbers, and there may be different numerical representations of the same text. As it turns out, many Machine Learning algorithms and almost all Deep Learning architectures are incapable of processing strings or plain text in their raw form. They require numbers as inputs to perform any sort of job, be it classification, regression, etc. And with the huge amount of data that is present in text format, it is imperative to extract knowledge out of it and build applications. Some real-world applications are sentiment analysis of reviews by Amazon, and document or news classification and clustering by Google.

Now let’s get started with these techniques to get better insights.

1. Bag of Words

Bag of Words (BOW) is an algorithm that counts how many times a word appears in a document. Those word counts allow us to compare documents and gauge their similarities for applications like search, document classification, and topic modeling.

BOW is a method for preparing text for input to a deep-learning net.

Now let’s take an example to understand BOW better.

I have taken text from ‘A Tale of Two Cities’ and applied the BOW technique. For this, I have used CountVectorizer, which converts a collection of texts into a matrix of token counts. It takes each word from the text, excluding stop words (for example: what, was, the), since there is no sense in counting stop words. After counting the words, it forms a sparse matrix. What is a sparse matrix? A sparse matrix is a matrix which has very few non-zero elements. Now let’s see what the matrix of word counts looks like by making a data frame of it with the help of pandas.

In the data frame, each row represents one of the texts in ‘data’ (the list of input strings in the code), each column represents a unique word across those strings, and the values are the word counts. For example, row 1 holds the string (data[0] in Python) “It was the best of times”. Its unique non-stop words are ‘best’ and ‘times’, and each gets the value ‘1’ because it occurs only once in this particular string; if a word occurred twice, the count shown would be ‘2’. Please feel free to experiment with the code, because there is no better way to learn than writing the code yourself.

2. Tf-Idf

Tf-Idf is shorthand for term frequency-inverse document frequency. So, two things: term frequency and inverse document frequency.
Term frequency (TF) is basically the output of the BOW model. For a specific document, it measures how important a word is by looking at how frequently the word appears in that document: if a word appears many times, the word must be important. For example, if our document is “I am a cat lover. I have a cat named Steve. I feed a cat outside my room regularly,” we see that the words with the highest frequency are I, a, and cat. This agrees with our intuition that high term frequency = higher importance.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF is used to calculate the weight of rare words across all documents: words that occur rarely in the corpus get a high IDF score. However, it is known that certain terms, such as “I” and “a”, may appear many times yet have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones.

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
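To make the two formulas concrete, here is a small from-scratch computation (a sketch, using the natural log as the IDF formula above indicates; the two example documents are mine):

```python
# Compute TF and IDF directly from the definitions above.
import math

docs = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
]

def tf(term, doc):
    # (number of times term appears in doc) / (total terms in doc)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log_e(total number of docs / number of docs containing term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

# 'times' appears in both documents, so IDF('times') = log(2/2) = 0.
# 'best' appears in only one,  so IDF('best')  = log(2/1) ≈ 0.693.
print(tf("best", docs[0]) * idf("best", docs))
```

Note how a word appearing in every document gets an IDF of zero, which is exactly the weighing-down of common terms described above.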

Now let’s take an example to understand Tf-Idf better.

I have done all the same steps as in BOW, but this time with TfidfVectorizer, which calculates the Tf-Idf value of every word present in the documents. I have then created a data frame from its output, where every row represents a text given in ‘data’, every column represents a unique word across all the texts, and the values are the Tf-Idf scores of those words. In the above example, the word ‘best’ occurs in only one text, which is why its IDF value is high compared to the other values; this shows that Tf-Idf gives importance to rare words in the documents.

This is the first part of my blog. In my next blog, I will tell you about the Word2Vec technique, how to build a Recommendation System using these techniques together with cosine similarity, and which technique works best for a Recommendation System.