A Beginner’s Guide to Converting Text Data to Numeric Data (Part 1)

Anuj Shrivastav
Published in Analytics Vidhya · 6 min read · Feb 29, 2020

Let’s first understand why we need to convert text data to numeric data.

  • Computers only understand numbers.
  • Once we convert our text into a vector, we can leverage the beauty of Linear Algebra.

We are going to look at 5 ways of achieving this task:

  1. Bag Of Words (BOW)
  2. Term Frequency - Inverse Document Frequency (TF-IDF)

In the next part, we’ll look at three more ways:

3. Word2Vec (W2V)

4. Average W2V

5. Average TF-IDF

Before we proceed, let’s understand two terms that will be used frequently in this post.

  • Document - a piece of text data, e.g. a file containing text. In terms of a dataset, each record or data point can be considered a document.
  • Corpus - a set of documents. In terms of a dataset, the whole collection of data points (the entire dataset) can be considered a corpus.

Now we are good to go !!

Bag Of Words (BOW)


In this approach, for each document we’ll create a vector of dimension d = number of unique words in the corpus.

Each cell of the vector corresponds to a unique word, and the value in that cell is the frequency (the number of times) that word appears in that document.

Let’s understand this using an example:

d₁ := “I am fine”

d₂ := “I am hungry . I am sick”

d₃ := “Food is fine”

U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}

number of unique words = n(U) = d = 7

So, we’ll convert each document into a vector of 7 dimensions.

The resulting BOW vectors (one row per document, one column per unique word):

Word:    I   am   fine   hungry   sick   food   is
d₁       1    1     1       0       0      0     0
d₂       2    2     0       1       1      0     0
d₃       0    0     1       0       0      1     1
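If you want to reproduce this with a library, here is a minimal sketch using scikit-learn’s CountVectorizer (this assumes scikit-learn is installed; it lowercases the text and orders the vocabulary alphabetically, so the columns appear in a different order than above, and the token pattern is relaxed so that one-letter words like “I” are kept):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I am fine",
    "I am hungry . I am sick",
    "Food is fine",
]

# Relax the token pattern so one-letter words like "I" are kept;
# CountVectorizer lowercases the text by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(corpus)      # sparse matrix of shape (3, 7)

print(vectorizer.get_feature_names_out())
# ['am' 'fine' 'food' 'hungry' 'i' 'is' 'sick']
print(bow.toarray())
# [[1 1 0 0 1 0 0]
#  [2 0 0 1 2 0 1]
#  [0 1 1 0 0 1 0]]
```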

There you go ! You have successfully learnt the first method.

Before we move on to the next method, let’s quickly look at some of the advantages and limitations of this method.

Advantages:

  • Easy to code and understand.
  • Can be used to implement a baseline model.
  • Can be used when corpus is small.

Limitations:

  • Each vector would be sparse and high-dimensional (its dimension equals the vocabulary size).
  • It doesn’t take the semantic meaning of the words into consideration.

Let me show you an example which supports the last point.

Consider two documents:-

d₁ := “Food is tasty”

d₂ := “Food is not tasty”

Before applying BOW to them, common sense tells us that these two documents are complete opposites of each other. Intuitively, they should be far away from each other, at the maximum distance possible.

Now let’s apply BOW.

Using Binary BOW over the vocabulary {food, is, tasty, not}, the documents become d₁ = [1, 1, 1, 0] and d₂ = [1, 1, 1, 1]. Calculating the Euclidean distance between the two documents would result in:

||d₁ − d₂|| = √(0² + 0² + 0² + 1²) = 1

However, the maximum distance possible between two documents (considering Binary BOW in these 4 dimensions) is:

√(1² + 1² + 1² + 1²) = √4 = 2

So BOW ends up placing two documents with opposite meanings quite close to each other, nowhere near the maximum possible distance.
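You can verify these numbers with a quick NumPy sketch (the binary vectors are the ones written out above):

```python
import numpy as np

# Binary BOW vectors over the vocabulary {food, is, tasty, not}
d1 = np.array([1, 1, 1, 0])   # "Food is tasty"
d2 = np.array([1, 1, 1, 1])   # "Food is not tasty"

print(np.linalg.norm(d1 - d2))                    # 1.0 -> the actual distance
# Maximum possible distance: two binary vectors that differ in every cell
print(np.linalg.norm(np.ones(4) - np.zeros(4)))   # 2.0
```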

Now, let’s see some ways to improve the performance of BOW:

  • Removal of stop words
  • Converting text to entirely lowercase (or uppercase)
  • Stemming
  • Lemmatizing

NOTE: These text preprocessing methods are not just limited to BOW, but can be applied before any of the text-to-numeric conversion methods.
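As a rough sketch of what these preprocessing steps can look like in code (this assumes NLTK is installed and its stopwords/wordnet resources are downloaded; the preprocess function and the sample sentence are just for illustration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources used below
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase and split on whitespace (a deliberately simple tokenizer)
    tokens = text.lower().split()
    # Remove stop words and non-alphabetic tokens
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stemming chops words down to a crude root ("tasty" -> "tasti")
    stems = [stemmer.stem(t) for t in tokens]
    # Lemmatizing maps words to a dictionary form ("documents" -> "document")
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas

print(preprocess("The documents were tasty and delicious"))
```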

Term Frequency - Inverse Document Frequency (TF-IDF)

First of all, what does the name of this method mean? Let’s break it down.

  • Term Frequency := the ratio of the number of times a word appears in a document to the total number of words in that document, i.e. TF(w, d) = (number of occurrences of w in d) / (total number of words in d).
  • Inverse Document Frequency := the log of the ratio of the number of documents in the corpus to the number of documents in which the word occurs, i.e. IDF(w, D𝒸) = log( N / (1 + n(w)) ), where N is the number of documents in the corpus and n(w) is the number of documents containing w.

NOTE: 1 is added in the denominator just to avoid division by zero.

Unlike BOW, where each cell holds the raw frequency of a word, here we’ll place the TF × IDF value of the word in each cell of the vector.

But what does TF-IDF value signify?

The Term Frequency (TF) value shows how frequent a word is within a document, while the IDF value shows how rare the word is across the documents of the corpus.

The higher the IDF value, the rarer the word across the corpus.

Let’s understand this with the help of the same example:

d₁ := “I am fine”

d₂ := “I am hungry . I am sick”

d₃ := “Food is fine”

We have already seen that

U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}

number of unique words = n(U) = d = 7

Step 1: Start with an empty d-dimensional vector for each document.

Step 2: Calculate the term frequency of each word in each of the documents.

For example, in the first document,

TF(“I”, d₁) = TF(“am”, d₁) = TF(“fine”, d₁) = 1/3

and in the second document,

TF(“I”, d₂) = TF(“am”, d₂) = 2/6

TF(“hungry”, d₂) = TF(“sick”, d₂) = 1/6

and so on…

Step 3: Calculate the IDF value of each unique word in the corpus.

IDF(“I”, D𝒸) = IDF(“am”, D𝒸) = IDF(“fine”, D𝒸) = log(3 / (1 + 2)) = log(1) = 0

IDF(“hungry”, D𝒸) = IDF(“sick”, D𝒸) = IDF(“food”, D𝒸) = IDF(“is”, D𝒸) = log(3 / (1 + 1)) = log(1.5)

Step 4: Multiply each TF value in the vector by the corresponding IDF value of that word.
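Here is a minimal Python sketch of these four steps on the corpus above, using the IDF formula with the +1 in the denominator as described earlier (note that library implementations such as scikit-learn’s TfidfVectorizer use slightly different smoothing and normalization, so their numbers won’t match these exactly):

```python
import math
from collections import Counter

corpus = [
    "I am fine",
    "I am hungry . I am sick",
    "Food is fine",
]

# Tokenize: lowercase and keep only alphabetic tokens (drops the ".")
docs = [[tok.lower() for tok in doc.split() if tok.isalpha()] for doc in corpus]

# Vocabulary = unique words in the corpus (d = 7 here)
vocab = sorted(set(word for doc in docs for word in doc))

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
N = len(docs)

# IDF with the +1 in the denominator, as in the note above
idf = {w: math.log(N / (1 + df[w])) for w in vocab}

# TF-IDF vector for each document
vectors = []
for doc in docs:
    counts = Counter(doc)
    tf = {w: counts[w] / len(doc) for w in vocab}   # term frequency
    vectors.append([tf[w] * idf[w] for w in vocab])

print(vocab)
for v in vectors:
    print([round(x, 3) for x in v])
```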

Congrats! You have successfully learnt the second method.

Before we move on to the third method (coming in the next part), let’s quickly see the advantages and limitations of this method.

Advantages:

  • Easy to code and implement.
  • Gives more importance to rarer words and suppresses the contribution of frequent words.
  • Used in Information Retrieval, for example by search engines such as Google. We can clearly see from our example that the frequent terms (which have no importance in distinguishing one document from another) have lower TF-IDF values, while the less frequent terms have higher TF-IDF values. Given a query, Google essentially converts your text into keywords, finds those keywords in its index, sorts the documents where the keywords appear in decreasing order of their TF-IDF values, and displays those documents. (Obviously, Google uses a far more efficient and faster way of performing this task.)

Limitations:

  • Since it is based on the BOW approach, it also generates sparse vectors.
  • It also doesn’t take the semantic meaning of words into consideration.
  • It assumes that word counts provide some sort of similarity measure.
  • Slightly harder to understand at first.

Well, that’s all for now! I’ll be back with the next part very soon!
