Word2Vec, GLOVE, FastText and Baseline Word Embeddings step by step

Akash Deep
Published in Analytics Vidhya
9 min read · Aug 22, 2020

In our previous discussion we understood the basics of tokenizers step by step. If you have not gone through my previous post, I highly recommend having a look at it, because to understand embeddings we first need to understand tokenizers, and this post is a continuation of that one. I am providing the link to my post on tokenizers below, where I explained the concepts step by step with a simple example.

There are other ways to represent text, such as CountVectorizer and TF-IDF, but in both of them the context of the words is not preserved, which results in very low accuracy, and the right choice depends on the scenario. CountVectorizer and TF-IDF are out of scope for this discussion. Coming to embeddings, let us first try to understand what a word embedding really means. There are more than 171,476 words in the English language, and each word has its own meanings. If we wanted to represent 171,476 or more words in dimensions based on the meaning of each word, it would result in more than 3–4 lakh (300,000–400,000) dimensions, because, as we discussed a short while ago, every word has different meanings, and there is also a high chance that the meaning of a word changes based on its context. To understand context-based meaning better, let us look at the example below.

Example. Sentence 1: An apple a day keeps the doctor away. Sentence 2: The stock price of Apple is falling due to the COVID-19 pandemic.

In the above example the meaning of "apple" changes depending on the two different contexts. If we look at the contextual meanings of different words in different sentences, there are more than 100 billion words on the internet. To capture the real meanings of words at that scale, several models have been developed: Word2Vec was developed by Google, GloVe by the Stanford NLP group, and fastText by Facebook. The pre-trained Word2Vec model contains word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset, and the situation is similar for GloVe and fastText. Here the embedding is the set of dimensions in which all the words are placed based on their meanings and, most importantly, based on their different contexts. I am repeating it: based on their different contexts.

In this post we will try to understand the intuition behind Word2Vec, GloVe and fastText, along with a basic implementation of Word2Vec in Python using the gensim library. The programmatic implementation of GloVe and fastText we will look at in another post.

We will try to understand the basic intuition behind Word2Vec, GloVe and fastText one by one, starting with Word2Vec.

Word2Vec: The main idea is that you train a model on the context of each word, so that similar words get similar numerical representations. Just like a normal feed-forward, densely connected neural network (NN), where you have a set of independent variables and a target dependent variable you are trying to predict, you first break your sentence into words (tokenize) and create a number of pairs of words, depending on the window size. One such combination could be a pair of words like ('cat', 'purr'), where 'cat' is the independent variable (X) and 'purr' is the target dependent variable (Y) we are aiming to predict.

We feed 'cat' into the NN through an embedding layer initialized with random weights, and pass it through a softmax layer with the ultimate aim of predicting 'purr'. An optimization method such as SGD minimizes the loss function, which amounts to maximizing the probability P(target word | context word), i.e. the likelihood of predicting the target word given the context word. If we do this for enough epochs, the weights in the embedding layer eventually represent the vocabulary of word vectors, which are the "coordinates" of the words in this geometric vector space.
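
To make the window-based pairing concrete, below is a minimal sketch; the sentence, the window size and the helper name make_pairs are just for illustration and are not part of any library.

# Sketch: generate (word, context) training pairs from a tokenized sentence.
def make_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # look at the neighbours inside the window on both sides of the centre word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))  # (X, Y) = (input word, word to predict)
    return pairs

print(make_pairs(["the", "cat", "purrs", "softly"], window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'purrs'), ('purrs', 'cat'), ('purrs', 'softly'), ('softly', 'purrs')]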

GLOVE: GloVe works similarly to Word2Vec. While Word2Vec, as described above, is a "predictive" model that predicts the context given a word, GloVe learns by constructing a co-occurrence matrix (words x context) that basically counts how frequently a word appears in a given context. Since this is a gigantic matrix, we factorize it to obtain a lower-dimensional representation. There are a lot of details that go into GloVe, but that is the rough idea.
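
As a toy illustration of the counting step, the sketch below builds a tiny co-occurrence dictionary with a window of one word on each side; the corpus here is made up, and real GloVe then learns lower-dimensional word vectors from counts like these.

from collections import defaultdict

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
cooc = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        # count every neighbour within one position of the current word
        for j in range(max(0, i - 1), min(len(sentence), i + 2)):
            if j != i:
                cooc[(word, sentence[j])] += 1

print(dict(cooc))
# {('the', 'cat'): 1, ('cat', 'the'): 1, ('cat', 'sat'): 1, ('sat', 'cat'): 1, ...}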

FastText: FastText is quite different from the above two embeddings. While Word2Vec and GloVe treat each word as the smallest unit to train on, FastText uses character n-grams as the smallest unit. For example, the word vector for "apple" could be broken down into separate character n-gram vectors such as "ap", "app", "ple". The biggest benefit of FastText is that it generates better word embeddings for rare words, or even for words not seen during training, because the character n-gram vectors are shared with other words. This is something that Word2Vec and GloVe cannot achieve.
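
The sketch below shows what the character n-grams of "apple" look like; real fastText adds "<" and ">" boundary markers and uses several n-gram lengths (typically 3 to 6), so treat this as an illustration of the idea rather than the exact scheme.

def char_ngrams(word, n=3):
    padded = "<" + word + ">"   # boundary markers as used by fastText
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']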

Baseline: The baseline is an approach that does not use any of these three embeddings; the tokenized words are passed directly into the Keras embedding layer. For the three embedding types above, we instead pass our dataset through these pre-trained embeddings, and their output is then passed on to the Keras embedding layer. The implementation of the Keras embedding layer is not in the scope of this tutorial and we will see it in a later post, but we need to understand how the flow works.

We have learnt the basics of Word2Vec, GloVe and FastText and come to the conclusion that all three are word embeddings that can be used depending on the use case; we can simply try these three pre-trained embeddings on our use case and pick whichever gives the best accuracy. Now, step by step, we will see the implementation of Word2Vec programmatically.

As I mentioned above, we will be using the gensim library in Python to build a Word2Vec embedding. This can be done by executing the code below. I am using Google Colab for executing all the code in my posts.

from gensim.models import Word2Vec

Now we will take one very simple paragraph to which we will apply word embeddings. I am taking a small paragraph in this post so that it is easy to understand; if we understand how to use embeddings on a small paragraph, we can obviously repeat the same steps on huge datasets.

We will take “paragraph = Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby union or rugby league); and Gaelic football. These various forms of football share to varying extent common origins and are known as football codes.”

We can see that the above paragraph has many stopwords and special characters, so we need to remove these first. We remove them because we already know they will not add any information to our corpus. To achieve this we do not need to worry too much: the "NLTK" package in Python can remove stopwords, and the "re" (regular expression) package can remove special characters. Please refer to the snippet below for details.

Code to import NLTK, RE and paragraph initialization
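
A minimal version of this step could look like the code below; the nltk.download calls are only needed if the punkt tokenizer models and the stopword list are not already available in your environment.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')       # sentence/word tokenizer models
nltk.download('stopwords')   # English stopword list

paragraph = ("Football is a family of team sports that involve, to varying degrees, "
             "kicking a ball to score a goal. Unqualified, the word football normally "
             "means the form of football that is the most popular where the word is used. "
             "Sports commonly called football include association football (known as "
             "soccer in some countries); gridiron football (specifically American football "
             "or Canadian football); Australian rules football; rugby football (either "
             "rugby union or rugby league); and Gaelic football. These various forms of "
             "football share to varying extent common origins and are known as football codes.")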

Now we will remove all the special characters from our paragraph using the code below, and we will store the cleaned paragraph in the text variable.
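
One possible cleaning step with re is sketched below; the exact pattern is an assumption, the idea being to drop everything except letters, digits, whitespace and full stops so that the text can still be split into sentences later.

# keep letters, digits, whitespace and full stops; drop everything else
text = re.sub(r"[^A-Za-z0-9.\s]", "", paragraph)
text = re.sub(r"\s+", " ", text).strip()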

After applying the text cleaning we will look at the length of the paragraph before and after cleaning. Note that after cleaning we stored the result in the text variable. We can clearly see that the character length was 598 earlier and reduced to 593 after cleaning.

Length of the paragraph before and after cleaning
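
The before and after check can be as simple as the snippet below; the exact numbers depend on the cleaning pattern used, so they may differ slightly from the 598 and 593 mentioned above.

print("Length before cleaning:", len(paragraph))
print("Length after cleaning:", len(text))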

Now we will split the text into sentences and store them in a list using the code below. We can clearly see that the "sent_tokenize" method has converted the cleaned text into 4 sentences and stored them in a list; basically we get a list of sentences as output. "sent_tokenize" uses the full stop "." as a marker to segment the text into sentences.

Text to sentence conversion
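
The sentence split can be done with NLTK's sent_tokenize, for example:

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
print(len(sentences))   # 4 sentences for this paragraph
print(sentences)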

Now we will convert this list of sentences into a list of words using the code below. Each sentence gets converted into a list of words, and these lists are stored in one outer list.

List of sentences to list of words
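
Converting each sentence into a list of words can be done with word_tokenize applied per sentence, for example:

from nltk.tokenize import word_tokenize

words = [word_tokenize(sentence) for sentence in sentences]
print(words[0][:10])   # first few tokens of the first sentence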

Now that we have the list of words, we will remove all the stopwords, such as "is", "am", "are" and many more, using the snippet of code below. If we compare the output of the previous snippet with the one below, we can clearly see the difference: stopwords like "is", "a" and many more have been removed from the sentences.

stopwords removal
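
A straightforward way to drop the stopwords is sketched below; lower-casing inside the check is an assumption so that capitalised stopwords such as "The" are removed as well.

stop_words = set(stopwords.words('english'))
words = [[word for word in sentence if word.lower() not in stop_words]
         for sentence in words]
print(words[0])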

Now we are good to go to apply the Word2Vec embedding to the prepared words. Word2Vec is the class we already imported from the gensim library. We will pass the pre-processed words to the Word2Vec class and specify some attributes while doing so. Some of the important attributes are listed below.

  1. min_count: It specifies the minimum number of times a word must occur in the corpus to be kept. Words that appear fewer times than min_count are ignored and dropped from the vocabulary; for example, with min_count=2, any word that appears only once is discarded.
  2. size: This is also one of the most important attributes to keep in mind. size specifies the dimensionality of the vector space we want for each word, i.e. how many numbers represent a word, which relates to the dimensions we discussed earlier. By default its value is 100, but we can change it; size=10 means each word will be assigned a vector of 10 values. We will see this in detail in the snippet below.

In the snippet below we create a model object from the Word2Vec class. We assign min_count=1 because our dataset is very small (it has just a few words), and we specify size=10 so that vectors of 10 dimensions are assigned to every word passed to the Word2Vec class. We then use the "wv" attribute on the created model object and pass any word from our list of words, as shown below, to check the number of dimensions, i.e. 10 in our case.

Word2vec training
We looked up the word "rules" from our list of words and, as expected, we got a vector of size 10.
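
A sketch of the training and lookup steps, assuming gensim 3.x where the parameter is called size (gensim 4.x renames it to vector_size):

# min_count=1 keeps every word because the corpus is tiny;
# size=10 gives 10-dimensional vectors (vector_size in gensim 4.x).
model = Word2Vec(words, min_count=1, size=10)

vector = model.wv['rules']   # 'rules' appears in "Australian rules football"
print(vector.shape)          # (10,)
print(vector)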

In this post we successfully applied a Word2Vec word embedding to our small dataset. If we have understood these concepts, I am sure we will be able to apply them to larger datasets. For more practice with word embeddings, I suggest taking any large dataset from the UCI Machine Learning Repository and applying the same concepts to it. In the next blog we will try to understand the Keras embedding layer and much more. If anyone has any doubts related to the topics discussed in this post, feel free to comment below; I will be very happy to resolve them.

Stay Tuned
