A Beginner’s Guide to Converting Text Data to Numeric Data (Part 1)
Let’s understand why we need to convert text data to numeric data.
- Computers only understand numbers
- Once we convert our text into a vector, we can leverage the beauty of Linear Algebra.
We are going to look at 5 ways of achieving this task. In this part, we’ll cover the first two:
1. Bag Of Words (BOW)
2. Term Frequency - Inverse Document Frequency (TF-IDF)
In the next part, we’ll look at the remaining three:
3. Word2Vec (W2V)
4. Average W2V
5. Average TF-IDF
Before we proceed, let’s understand two terms that will be used frequently in this post.
- Document := a file or piece of text. In terms of a dataset, each record or data point can be considered a document.
- Corpus := a set of documents. In terms of a dataset, all the data points taken together (the whole dataset) can be considered the corpus.
Now we are good to go!!
Bag Of Words (BOW)
In this approach, corresponding to each document, we’ll create a vector of dimension d = number of unique words in the corpus.
Each cell of the vector corresponds to a unique word, and the value in that cell is the frequency, i.e. the number of times that word appears in that document.
Let’s understand this using an example:
d₁ := “I am fine”
d₂ := “I am hungry . I am sick”
d₃ := “Food is fine”
U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}
number of unique words = n(U) = d = 7
So, we’ll convert each document into a vector of 7 dimensions.
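With the vocabulary ordered as above, the resulting vectors are d₁ = (1, 1, 1, 0, 0, 0, 0), d₂ = (2, 2, 0, 1, 1, 0, 0), and d₃ = (0, 0, 1, 0, 0, 1, 1). Here is a minimal sketch reproducing this with scikit-learn (an assumption on my part; any counting code would do). Note that CountVectorizer lowercases and sorts the vocabulary alphabetically, so the column order differs from our hand-drawn set:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am fine",
        "I am hungry . I am sick",
        "Food is fine"]

# The token pattern is widened so one-letter tokens like "i" are kept;
# CountVectorizer lowercases by default.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(docs)   # sparse matrix of shape (3, 7)

print(vectorizer.get_feature_names_out())
# ['am' 'fine' 'food' 'hungry' 'i' 'is' 'sick']
print(bow.toarray())
# [[1 1 0 0 1 0 0]
#  [2 0 0 1 2 0 1]
#  [0 1 1 0 0 1 0]]
```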
There you go! You have successfully learnt the first method.
Before we move on to the next method, let’s quickly look at some of the advantages and limitations of this method.
Advantages:
- Easy to code and understand.
- Can be used to implement a baseline model.
- Can be used when corpus is small.
Limitations:
- Each vector would be a sparse vector (mostly zeros), since a single document contains only a small fraction of the corpus vocabulary.
- It doesn’t take semantic meaning of the words into consideration.
Let me show you an example which supports the last point.
Consider two documents:
d₁ := “Food is tasty”
d₂ := “Food is not tasty”
Before applying BOW on them, common sense tells us that these two documents are complete opposites of each other. Intuitively, their vectors should be far apart, ideally at the maximum distance possible.
Now let’s apply BOW.
With the vocabulary {food, is, tasty, not}, the Binary BOW vectors are d₁ = (1, 1, 1, 0) and d₂ = (1, 1, 1, 1). Calculating the Euclidean distance between the two documents results in:
√((1−1)² + (1−1)² + (1−1)² + (0−1)²) = 1
However, the maximum possible distance between two documents (considering Binary BOW in 4 dimensions) is √4 = 2, attained by vectors that differ in every coordinate. So BOW places these two opposite documents almost as close together as two documents can be.
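Here is a quick check of those numbers (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

# Binary BOW vectors over the vocabulary {food, is, tasty, not}
d1 = np.array([1, 1, 1, 0])   # "Food is tasty"
d2 = np.array([1, 1, 1, 1])   # "Food is not tasty"

print(np.linalg.norm(d1 - d2))                    # 1.0 -> BOW sees them as very close
print(np.linalg.norm(np.ones(4) - np.zeros(4)))   # 2.0 -> max distance in 4-d binary space
```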
Now, let’s see some ways to improve the performance of BOW:
- Removal of stop words
- Converting text to entirely lowercase (or uppercase)
- Stemming
- Lemmatizing
NOTE: These text preprocessing methods are not just limited to BOW, but can be applied before any of the text-to-numeric conversion methods.
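To make those four steps concrete, here is a minimal preprocessing sketch using NLTK (an assumption on my part; any NLP library with stop-word lists, stemmers, and lemmatizers would work just as well):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (uncomment on first run):
# nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "The cats were running towards the houses"

tokens = [w.lower() for w in text.split()]            # lowercase everything
tokens = [w for w in tokens if w not in stop_words]   # remove stop words

# Stemming chops suffixes by rule (e.g. houses -> hous),
# lemmatizing maps to dictionary forms (e.g. houses -> house).
print([stemmer.stem(w) for w in tokens])
print([lemmatizer.lemmatize(w) for w in tokens])
```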
Term Frequency - Inverse Document Frequency (TF-IDF)
First of all, what does the name of this method mean? Let’s break it down.
- Term Frequency := the ratio of the number of times a word appears in a document to the total number of words in that document:
TF(w, d) = (number of occurrences of w in d) / (total number of words in d)
- Inverse Document Frequency := the log of the ratio of the number of documents in the corpus to the number of documents in which the word occurs:
IDF(w, D𝒸) = log(N / n_w), where N is the number of documents in the corpus D𝒸 and n_w is the number of documents containing w
NOTE: In a common variant, 1 is added in the denominator, i.e. IDF(w, D𝒸) = log(N / (1 + n_w)), just to avoid division by zero for a word that appears in no document.
Unlike BOW, where each cell holds the raw frequency of a word, here we’ll place the TF × IDF value of the word in the vector.
But what does TF-IDF value signify?
The Term Frequency (TF) value shows how frequent a word is within a document, and the IDF value shows how infrequent or rare the word is across the other documents of the corpus.
The higher the IDF value, the rarer the word is across documents.
Let’s understand this with the help of the same example:
d₁ := “I am fine”
d₂ := “I am hungry . I am sick”
d₃ := “Food is fine”
We have already seen that
U := unique words in the corpus = {I, am, fine, hungry, sick, food, is}
number of unique words = n(U) = d = 7
Step 1: Start with empty d-dimensional vectors
Step 2: Calculate the term frequency of each word in each of the documents
For example, in the first document,
TF(“I”,d₁) = TF(“am”,d₁)= TF(“fine”,d₁) =1/3
and in the second document,
TF(“I”, d₂) = TF(“am”, d₂) = 2/6
TF(“hungry”, d₂) = TF(“sick”, d₂) = 1/6
and so on…
Step 3: Calculate the IDF value of each unique word in the corpus
IDF(“I”, D𝒸) = IDF(“am”, D𝒸) = IDF(“fine”, D𝒸) = log(3/2), since each of these words occurs in 2 of the 3 documents
IDF(“hungry”, D𝒸) = IDF(“sick”, D𝒸) = IDF(“food”, D𝒸) = IDF(“is”, D𝒸) = log(3/1) = log 3, since each of these words occurs in only 1 document
Step 4: Multiply each TF value in the vector by the corresponding IDF value of the word. For example, TF-IDF(“hungry”, d₂) = (1/6) × log 3, while TF-IDF(“am”, d₂) = (2/6) × log(3/2).
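Putting the four steps together, here is a minimal sketch that implements exactly the TF and IDF formulas above (no smoothing). Library implementations such as scikit-learn’s TfidfVectorizer use a smoothed IDF and L2 normalization by default, so their numbers will differ slightly:

```python
import math

docs = ["I am fine",
        "I am hungry . I am sick",
        "Food is fine"]

# Tokenize: lowercase and drop the standalone "." so that d2 has 6 words.
tokenized = [[w for w in d.lower().split() if w.isalpha()] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(tokenized)

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word):
    n_w = sum(1 for doc in tokenized if word in doc)   # docs containing the word
    return math.log(N / n_w)                           # unsmoothed, as defined above

for doc, tokens in zip(docs, tokenized):
    vector = [round(tf(w, tokens) * idf(w), 3) for w in vocab]
    print(doc, "->", dict(zip(vocab, vector)))
```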
Congrats ! You have successfully learnt the second method.
Before moving forward with the third method, let’s quickly see the advantages and limitations of this method.
Advantages:
- Easy to code and implement
- Gives more importance to rarer words and suppresses the contribution of frequent words
- Used in Information Retrieval, for example by search engines such as Google. We can clearly see from our example that the frequent terms (which have little value in distinguishing one document from another) have lower TF-IDF values, while the less frequent terms have higher TF-IDF values. Given a query, a search engine can convert your text into keywords, find the documents that contain those keywords, sort those documents in decreasing order of their TF-IDF scores, and display them. (Obviously, real search engines use far more efficient and sophisticated ranking methods.) A toy ranking sketch follows.
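As a rough illustration of that ranking idea (a minimal sketch assuming scikit-learn and NumPy; the documents and query here are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Food is fine", "Food is tasty", "I am hungry"]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
doc_matrix = vectorizer.fit_transform(docs).toarray()
vocab = {w: i for i, w in enumerate(vectorizer.get_feature_names_out())}

query = ["food", "tasty"]
cols = [vocab[w] for w in query if w in vocab]   # ignore unseen query words

# Score each document by summing the TF-IDF values of the query words,
# then display the documents from highest to lowest score.
scores = doc_matrix[:, cols].sum(axis=1)
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```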
Limitations:
- Since it is based on the BOW approach, it would also generate sparse vectors.
- It also doesn’t take semantic meaning of words into consideration.
- It assumes that count of words provides some sort of similarity measure.
- Slightly hard to understand at first.
Well, that’s all for now! I’ll be back with the next part very soon!