Word Embedding : Text Analysis : NLP : Part-1

Jaimin Mungalpara · Published in Analytics Vidhya · 6 min read · Feb 24, 2021

This article introduces the word embedding techniques you need to understand before tackling almost any NLP problem.

Introduction

We are all surrounded by data of different kinds; it can come in the form of text, images, speech or other formats. Here we will focus on text data, which reaches us as newspapers, emails, books, or even product reviews. To understand this text we have to take a deep dive into it, and for that we need to process it into an easily understandable form. Nowadays, Natural Language Processing (NLP) makes it much easier to understand such text and to build applications like language translation, text classification, text summarization, and so on.

All of the NLP applications mentioned above require the text to be converted into numeric data that a machine can learn from. The technique for converting text into numbers is called word embedding.

Word Embedding

Word embedding converts text data into numeric vectors that can capture the semantic and syntactic context of a word. The similarity between words can also be measured with these numerical representations. Some of the important embedding techniques are:

  1. One hot encoding ( BOW )
  2. TF-IDF
  3. Word2Vec
  4. GloVe
  5. FastText

This tutorial is split into two parts: Part 1 covers one hot encoding (Bag of Words) and TF-IDF, and Part 2 will cover Word2Vec, GloVe and FastText.

One Hot Encoding ( BOW / Count Vectorizer )

One hot encoding, as the name suggests, converts the text data into a 0/1 representation. It simply counts the words in a corpus and converts the data into a binary form, which is why it is also called binary embedding. The final representation of the text is a matrix, and each row of that matrix is the vector representation of one document. The steps for converting the data are listed below.

  • Tokenize the text into words
  • Convert the text to lowercase
  • Preprocess the data by removing punctuation and stop words
  • Create the frequency distribution of words using a count vectorizer

Consider a corpus X with Y documents. First we extract the N unique words in the corpus; the result is a matrix of dimension Y x N.
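Before reaching for a library, here is a minimal sketch of these steps in plain Python; the tiny stop-word list is only an assumption chosen for this illustration:

docs = ['He is a boy and he is from India.',
        'The boy is Playing The Cricket.',
        'The Cricket is the most popular game in India.']
stop_words = {'he', 'is', 'a', 'and', 'from', 'the', 'most', 'in'}  # illustrative subset only
def tokenize(doc):
    # lowercase, drop punctuation, split into words, remove stop words
    words = doc.lower().replace('.', '').split()
    return [w for w in words if w not in stop_words]
tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})          # the N unique words
matrix = [[doc.count(w) for w in vocabulary] for doc in tokenized]  # Y x N count matrix
print(vocabulary)
print(matrix)

In practice, the scikit-learn CountVectorizer used below performs all of these steps for us.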

Code Representation

In this code we use CountVectorizer from the sklearn library: it tokenizes the text documents, builds the bag-of-words vocabulary, and converts the text into one hot encoded (count) data.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import pandas as pd
# nltk.download('stopwords') may be required the first time the NLTK stop words are used
stop_words = set(stopwords.words('english'))
text = ['He is a boy and he is from India.',
        'The boy is Playing The Cricket.',
        'The Cricket is the most popular game in India.']
# Build the bag-of-words (count) matrix; stop words are removed during tokenization
vectorizer = CountVectorizer(stop_words=list(stop_words))
sentence_vectors = vectorizer.fit_transform(text)
feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn
dense = sentence_vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
Output of df

We can see that the unique word list has length 6 after removing the stop words, so our output matrix is 3x6. We can also see, for each row (= sentence), which words from the unique list are present, marked with one hot encoded values. In the first sentence the unique words are boy and india, so those appear in the first row with the value 1. In short, the Bag of Words model extracts features from text by converting it into a matrix of word occurrences per document.
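For reference, with the NLTK stop-word list the vocabulary and count matrix should come out roughly like this (CountVectorizer sorts the vocabulary alphabetically):

         boy  cricket  game  india  playing  popular
Doc 1     1      0       0     1       0        0
Doc 2     1      1       0     0       1        0
Doc 3     0      1       1     1       0        1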

Problems with Count Vectorizer

Here we can see that the generated matrix is a sparse matrix. It also cannot capture semantic information: every word used in a sentence gets the value 1, so we cannot pick out the meaningful words of a sentence. Moreover, the matrix has a high dimension (one column per unique word in the corpus), so it is computationally expensive. To solve these issues we will look at another method called TF-IDF, but first here is a quick look at the sparsity of our toy matrix.
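This sketch reuses sentence_vectors from the snippet above; on this three-sentence corpus the matrix is tiny, but on a real corpus the number of columns grows with the vocabulary:

n_docs, n_terms = sentence_vectors.shape
print(f'{n_docs} documents x {n_terms} terms = {n_docs * n_terms} cells, '
      f'only {sentence_vectors.nnz} of them non-zero')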

TF-IDF

TF-IDF is an abbreviation of Term Frequency - Inverse Document Frequency.

Term Frequency measures how often a term occurs in a document, normalized by the document length. As the formula below shows, it is calculated by dividing the number of times the term occurs in a document by the total number of terms in that document.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

In this way we capture the importance of a particular term within a document. However, stop words like 'a', 'an', 'the', etc. occur very frequently, and because of this the importance of the actual content words is obscured. To deal with such words we can use built-in Python libraries to remove stop words. Even after removing stop words, though, a word that appears in many or all sentences still looks important under TF alone. To solve this we use the next term, Inverse Document Frequency.

Inverse Document Frequency is calculated with the formula below: we divide the total number of documents by the number of documents containing the particular word and take the logarithm of that value. So if a word is used in many documents its IDF moves towards 0, and if it is rare its IDF grows large.

IDF(t) = log_e(Total number of documents / (Number of documents with term t in it + 1)) (the +1 is added to avoid a zero denominator)

Finally, to get the TF-IDF weight of a word we multiply TF and IDF. This gives meaningful values for the important words in a sentence: a word that is used in virtually every sentence ends up with a weight close to 0. With this method we therefore get some semantic information about a word in a sentence.

TF-IDF(t) = TF(t) * IDF(t)

For Example, Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
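As a quick sanity check, here is that arithmetic as a minimal Python sketch (note that this example uses a base-10 logarithm, while the IDF formula above was written with the natural log; different libraries pick different bases):

import math
tf = 3 / 100                          # 'cat' appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)  # 10 million documents, 1,000 of them contain 'cat'
print(tf, idf, tf * idf)              # 0.03 4.0 0.12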

Code Representation

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import pandas as pd
stop_words = set(stopwords.words('english'))
text = ['He is a boy and he is from India.',
        'The boy is Playing The Cricket.',
        'The Cricket is the most popular game in India.']
# TfidfVectorizer tokenizes, counts and applies the TF-IDF weighting in one step
vectorizer = TfidfVectorizer(stop_words=list(stop_words))
vectors = vectorizer.fit_transform(text)
feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
Output of df

We can now see the difference between a plain CountVectorizer and TF-IDF. For example, in the second sentence the word playing gets the value 0.68, which is higher than the other words in that sentence, meaning it carries relatively important information there. With CountVectorizer, playing was given the value 1, the same as every other word in the sentence, so we had no such information.
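Where does the 0.68 come from? scikit-learn's defaults differ slightly from the textbook formula above: TfidfVectorizer uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes every row. Here is a minimal sketch of that calculation for the second sentence (assuming the stop-word removal from the code above, which leaves boy, cricket and playing):

import math
n = 3                                                  # number of documents
idf = lambda df: math.log((1 + n) / (1 + df)) + 1      # scikit-learn's smoothed IDF
weights = {'boy': idf(2), 'cricket': idf(2), 'playing': idf(1)}   # raw TF of each word is 1
norm = math.sqrt(sum(w ** 2 for w in weights.values()))           # L2 norm of the row
print({term: round(w / norm, 2) for term, w in weights.items()})
# {'boy': 0.52, 'cricket': 0.52, 'playing': 0.68}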

scikit-learn also provides TfidfTransformer. It computes the same weighting as TfidfVectorizer; the only difference is that TfidfTransformer is applied to the count matrix produced by a CountVectorizer instead of to the raw text. Below is the code implementation of this approach.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.corpus import stopwords
import pandas as pd
stop_words = set(stopwords.words('english'))
text = ['He is a boy and he is from India.',
        'The boy is Playing The Cricket.',
        'The Cricket is the most popular game in India.']
# First build the raw count matrix, then convert the counts into TF-IDF weights
countVectorizer = CountVectorizer(stop_words=list(stop_words))
wordCount = countVectorizer.fit_transform(text)
tfIdfTransformer = TfidfTransformer()
newTfIdf = tfIdfTransformer.fit_transform(wordCount)
feature_names = countVectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn
dense = newTfIdf.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
Same result as TfidfVectorizer
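As a quick check (reusing the variables from the two snippets above), the two approaches should produce numerically identical matrices:

import numpy as np
# 'vectors' comes from the TfidfVectorizer snippet, 'newTfIdf' from CountVectorizer + TfidfTransformer
print(np.allclose(vectors.todense(), newTfIdf.todense()))   # True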

Summary

In this blog we discussed one hot encoding and TF-IDF for word embedding. In Part 2 we will discuss further embedding methods: Word2Vec, GloVe and FastText. I hope I covered all the required information; suggestions are always welcome.

References

  1. http://www.tfidf.com/
  2. https://medium.com/sfu-cspmp/nlp-word-embedding-techniques-for-text-analysis-ec4e91bb886f
