A Beginner’s Guide to Converting Text Data to Numeric Data (Part 1)
Let’s understand why we need to convert text data to numeric data.
- Computers only understand numbers
- Once we convert our text into a vector, we can leverage the beauty of Linear Algebra.
We are going to look at 5 ways of achieving this task. In this part, we’ll cover the first two:
1. Bag Of Words (BOW)
2. Term Frequency - Inverse Document Frequency (TF-IDF)
In the next part, we’ll look at the remaining three:
3. Word2Vec (W2V)
4. Average W2V
5. Average TF-IDF
Before we proceed, let’s understand two terms that will be used frequently in this post.
- Document := a file or piece of text. In terms of a dataset, each record or data point can be considered a document.
- Corpus := a set of documents. In terms of a dataset, all the data points taken together (the whole dataset) can be considered the corpus.
Now we are good to go!!
Bag Of Words (BOW)
In this approach, corresponding to each document, we’ll create a vector of dimension d = number of unique words in the corpus.
Each cell of the vector corresponds to a unique word, and the value in that cell is the frequency, i.e. the number of times that word appears in that document.
Let’s understand this using an example:
d₁ := “I am fine”
d₂ := “I am hungry . I am sick”
d₃ := “Food is fine”
U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}
number of unique words = n(U) = d = 7
So, we’ll convert each document into a vector of 7 dimensions.
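With the vocabulary ordered as above, the resulting vectors are d₁ = (1, 1, 1, 0, 0, 0, 0), d₂ = (2, 2, 0, 1, 1, 0, 0), and d₃ = (0, 0, 1, 0, 0, 1, 1). Here is a minimal sketch reproducing this with scikit-learn (an assumption on my part; any counting code would do). Note that CountVectorizer lowercases and sorts the vocabulary alphabetically, so the column order differs from our hand-drawn set:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am fine",
        "I am hungry . I am sick",
        "Food is fine"]

# The token pattern is widened so one-letter tokens like "i" are kept;
# CountVectorizer lowercases by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(docs)   # sparse matrix of shape (3, 7)

print(vectorizer.get_feature_names_out())
# ['am' 'fine' 'food' 'hungry' 'i' 'is' 'sick']
print(bow.toarray())
# [[1 1 0 0 1 0 0]
#  [2 0 0 1 2 0 1]
#  [0 1 1 0 0 1 0]]
```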
There you go! You have successfully learnt the first method.
Before we move on to the next method, let’s quickly look at some of the advantages and limitations of this method.
Advantages:
- Easy to code and understand.
- Can be used to implement a baseline model.
- Can be used when corpus is small.
Limitations:
- Each vector would be a sparse vector (mostly zeros), since a single document contains only a small fraction of the corpus vocabulary.
- It doesn’t take semantic meaning of the words into consideration.
Let me show you an example which supports the last point.
Consider two documents:
d₁ := “Food is tasty”
d₂ := “Food is not tasty”
Before applying BOW on them, common sense tells us that these two documents are complete opposites of each other. Intuitively, their vectors should be far apart, ideally at the maximum distance possible.
Now let’s apply BOW.
With the vocabulary {food, is, tasty, not}, the Binary BOW vectors are d₁ = (1, 1, 1, 0) and d₂ = (1, 1, 1, 1). Calculating the Euclidean distance between the two documents results in:
√((1−1)² + (1−1)² + (1−1)² + (0−1)²) = 1
However, the maximum possible distance between two documents (considering Binary BOW in 4 dimensions) is √4 = 2, attained by vectors that differ in every coordinate. So BOW places these two opposite documents almost as close together as two documents can be.
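Here is a quick check of those numbers (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# Binary BOW vectors over the vocabulary {food, is, tasty, not}
d1 = np.array([1, 1, 1, 0])   # "Food is tasty"
d2 = np.array([1, 1, 1, 1])   # "Food is not tasty"

print(np.linalg.norm(d1 - d2))                    # 1.0 -> BOW sees them as very close
print(np.linalg.norm(np.ones(4) - np.zeros(4)))   # 2.0 -> max distance in 4-d binary space
```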
Now, let’s see some ways to improve the performance of BOW:
- Removal of stop words
- Converting text to entirely lowercase (or uppercase)
- Stemming
- Lemmatizing
NOTE: These text preprocessing methods are not just limited to BOW, but can be applied before any of the text-to-numeric conversion methods.
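To make those four steps concrete, here is a minimal preprocessing sketch using NLTK (an assumption on my part; any NLP library with stop-word lists, stemmers, and lemmatizers would work just as well):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (uncomment on first run):
# nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "The cats were running towards the houses"

tokens = [w.lower() for w in text.split()]            # lowercase everything
tokens = [w for w in tokens if w not in stop_words]   # remove stop words

# Stemming chops suffixes by rule (e.g. houses -> hous),
# lemmatizing maps to dictionary forms (e.g. houses -> house).
print([stemmer.stem(w) for w in tokens])
print([lemmatizer.lemmatize(w) for w in tokens])
```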
Term Frequency - Inverse Document Frequency (TF-IDF)
First of all, what does the name of this method mean? Let’s break it down.
- Term Frequency := the ratio of the number of times a word appears in a document to the total number of words in that document:
TF(w, d) = (number of occurrences of w in d) / (total number of words in d)
- Inverse Document Frequency := the log of the ratio of the number of documents in the corpus to the number of documents in which the word occurs:
IDF(w, D𝒸) = log(N / n_w), where N is the number of documents in the corpus D𝒸 and n_w is the number of documents containing w
NOTE: In a common variant, 1 is added in the denominator, i.e. IDF(w, D𝒸) = log(N / (1 + n_w)), just to avoid division by zero for a word that appears in no document.
Unlike BOW, where each cell holds the raw frequency of a word, here we’ll place the TF × IDF value of the word in the vector.
But what does TF-IDF value signify?
The Term Frequency (TF) value shows how frequent a word is within a document, and the IDF value shows how infrequent or rare the word is across the other documents of the corpus.
The higher the IDF value, the rarer the word is across documents.
Let’s understand this with the help of the same example:
d₁ := “I am fine”
d₂ := “I am hungry . I am sick”
d₃ := “Food is fine”
We have already seen that
U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}
number of unique words = n(U) = d = 7
Step 1: Start with empty d-dimensional vectors
Step 2: Calculate the term frequency of each word in each of the documents
For example, in the first document,
TF(“I”,d₁) = TF(“am”,d₁)= TF(“fine”,d₁) =1/3
and in the second document,
TF(“I”, d₂) = TF(“am”, d₂) = 2/6
TF(“hungry”, d₂) = TF(“sick”, d₂) = 1/6
and so on…
Step 3: Calculate the IDF value of each unique word in the corpus
IDF(“I”, D𝒸) = IDF(“am”, D𝒸) = IDF(“fine”, D𝒸) = log(3/2), since each of these words occurs in 2 of the 3 documents
IDF(“hungry”, D𝒸) = IDF(“sick”, D𝒸) = IDF(“food”, D𝒸) = IDF(“is”, D𝒸) = log(3/1) = log 3, since each of these words occurs in only 1 document
Step 4: Multiply each TF value in the vector by the corresponding IDF value of the word. For example, TF-IDF(“hungry”, d₂) = (1/6) × log 3, while TF-IDF(“am”, d₂) = (2/6) × log(3/2).
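Putting the four steps together, here is a minimal sketch that implements exactly the TF and IDF formulas above (no smoothing). Library implementations such as scikit-learn’s TfidfVectorizer use a smoothed IDF and L2 normalization by default, so their numbers will differ slightly:

```python
import math

docs = ["I am fine",
        "I am hungry . I am sick",
        "Food is fine"]

# Tokenize: lowercase and drop the standalone "." so that d2 has 6 words.
tokenized = [[w for w in d.lower().split() if w.isalpha()] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(tokenized)

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word):
    n_w = sum(1 for doc in tokenized if word in doc)   # docs containing the word
    return math.log(N / n_w)                           # unsmoothed, as defined above

for doc, tokens in zip(docs, tokenized):
    vector = [round(tf(w, tokens) * idf(w), 3) for w in vocab]
    print(doc, "->", dict(zip(vocab, vector)))
```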
Congrats ! You have successfully learnt the second method.
Before moving forward with the third method, let’s quickly see the advantages and limitations of this method.
Advantages:
- Easy to code and implement
- Gives more importance to rarer words and suppresses the contribution of frequent words
- Used in Information Retrieval, for example by search engines such as Google. We can clearly see from our example that the frequent terms (which have little value in distinguishing one document from another) have lower TF-IDF values, while the less frequent terms have higher TF-IDF values. Given a query, a search engine can convert your text into keywords, find the documents that contain those keywords, sort those documents in decreasing order of their TF-IDF scores, and display them. (Obviously, real search engines use far more efficient and sophisticated ranking methods.) A toy ranking sketch follows.
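As a rough illustration of that ranking idea (a minimal sketch assuming scikit-learn and NumPy; the documents and query here are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Food is fine", "Food is tasty", "I am hungry"]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
doc_matrix = vectorizer.fit_transform(docs).toarray()
vocab = {w: i for i, w in enumerate(vectorizer.get_feature_names_out())}

query = ["food", "tasty"]
cols = [vocab[w] for w in query if w in vocab]   # ignore unseen query words

# Score each document by summing the TF-IDF values of the query words,
# then display the documents from highest to lowest score.
scores = doc_matrix[:, cols].sum(axis=1)
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```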
Limitations:
- Since it is based on the BOW approach, it would also generate sparse vectors.
- It also doesn’t take semantic meaning of words into consideration.
- It assumes that count of words provides some sort of similarity measure.
- Slightly hard to understand at first.
Well, that’s all for now! I’ll be back with the next part very soon!