Day 24 of 100DaysofML

Charan Soneji · Published in 100DaysofMLcode · Jul 10, 2020

Word Embeddings. So I thought of taking a small topic and elaborating on it. I haven’t been able to blog a lot lately because of work pressure, but I’m still going to try my best to put out whatever I can. So here goes.

What are Word Embeddings?
So NLP is an extremely vast subject with a lot of concepts and features, and word embeddings are one of them. Now, there are other representations such as the TF-IDF matrix which can be very useful, so why do we need a completely new model at all? Here’s the thing: a TF-IDF matrix doesn’t actually understand the composition of a sentence; it just gives us scores based on word counts, with no sense of word order or meaning. Word embeddings work quite differently.
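To make that concrete, here’s a tiny scikit-learn sketch (the sentences are just made up for illustration): two sentences with opposite meanings but the same words end up with identical TF-IDF vectors.

```python
# TF-IDF treats a sentence as a bag of words, so reordering the words
# changes nothing about its representation.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the dog bites the man", "the man bites the dog"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())
print(tfidf[0])
print(tfidf[1])  # identical to tfidf[0] -- the word order (and the meaning) is lost
```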

There’s another thing we need to understand here: every language has a vast number of words, and each word could be described with a vast number of features. But we can’t keep defining new features for every word, because the representation (and the model weights) would grow too large to store or to feed into a model such as an LSTM. This is why we need a compact, fixed-size representation whose features can be reused across the whole vocabulary.

Basically, a word embedding maps each word to a vector of fixed dimension, where the real-valued entries are learned from the positions and contexts in which that word is used, so that words used in similar contexts end up with similar vectors. This may sound a little confusing at first, but it gets easier with an example.

Have a look at the classic example: queen − girl + boy = king. What does this mean? Let’s say you have 4 different words and each of them is represented by a vector. When we do this vector arithmetic, the result lands closest to the vector for king. (Keep in mind, there isn’t much of a way to interpret the individual numerical values in these vectors, because they are all learned by ML models.) But a rough way to visualize and understand the example is this: from queen, we subtract girl in order to remove the female/gender attribute, and we add boy because we want to add that attribute to the vector.
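If you want to try this yourself, gensim ships pretrained Word2Vec vectors. A minimal sketch, assuming gensim is installed (the Google News model is a large download, roughly 1.6 GB):

```python
import gensim.downloader as api

# Pretrained 300-dimensional Word2Vec vectors trained on Google News
wv = api.load("word2vec-google-news-300")

# queen - girl + boy: add/subtract the word vectors and look for the
# nearest words; "king" should show up at or near the top.
print(wv.most_similar(positive=["queen", "boy"], negative=["girl"], topn=3))
```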

Essentially, we have two main types of Word Embeddings:
1. Pretrained embeddings
- Word2Vec
- GloVe
2. Trained embedding layer

The main difference between the two is that with a pretrained model, the vectors are trained in advance, before we use them in our NLP model or task; with a trained embedding layer, on the other hand, the embeddings are learned alongside the rest of the model, during the same training run.
I’m going to be discussing Word2Vec in today’s blog and will probably discuss GloVe in tomorrow’s.
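Here’s a rough Keras sketch of the two options (the vocabulary size, the 300-dimensional vectors and the random “pretrained” matrix are just placeholders, not real values):

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10000, 300

# Option 2: an embedding layer initialised randomly and trained
# together with the rest of the model.
trained_embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

# Option 1: pretrained vectors (e.g. Word2Vec or GloVe) loaded into the
# layer and frozen, so they are fixed before the NLP task is trained.
pretrained_vectors = np.random.rand(vocab_size, embed_dim)  # stand-in for real vectors
pretrained_embedding = tf.keras.layers.Embedding(
    vocab_size,
    embed_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained_vectors),
    trainable=False,
)
```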

So Word2Vec also has 2 types.
1. CBOW: Continuous Bag of Words model
2. SkipGram model

There isn’t much of a difference between the two; each one just works in the exact opposite direction of the other. The CBOW model takes a specific set of context words as input and predicts the target word, whereas the SkipGram model does the exact opposite and predicts the context words from the target word.
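A quick plain-Python sketch of how the two models frame the same sentence as training data (the sentence and window size are made up for illustration):

```python
sentence = "the quick brown fox jumps".split()
window = 1  # number of context words on each side of the target

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))      # CBOW: context words -> predict target
    for c in context:
        skipgram_pairs.append((target, c))    # SkipGram: target -> predict each context word

print(cbow_pairs[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```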

The model for the implementation consists of an embedding layer, which is something I will discuss in more detail in another blog, but essentially the embedding layer holds (initially random) weights for your words, which then either get trained or stay fixed during the training phase. The embedding layer is connected to an average layer, which averages the context word vectors across this 300-dimensional embedding, and finally we have a softmax activation over the vocabulary to predict the target word. That’s the high-level view of the CBOW model.
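A rough Keras sketch of that stack (embedding → average → softmax); the vocabulary size, 300-dimensional embeddings and window size are assumptions for illustration:

```python
import tensorflow as tf

vocab_size, embed_dim, window = 10000, 300, 2
context_len = 2 * window  # context words on both sides of the target

context_in = tf.keras.Input(shape=(context_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(context_in)   # (batch, context_len, 300)
x = tf.keras.layers.GlobalAveragePooling1D()(x)                    # average the context vectors
out = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)   # score every word as the target

cbow = tf.keras.Model(context_in, out)
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```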

The SkipGram model works in a slightly different way. Since it predicts the context words from the target word, it pairs the target word with each of its context words; the two word vectors are combined in a merge (dot-product) layer and then passed to a sigmoid activation, since the model only has to decide whether the pair really occurred together rather than make a multiclass prediction.
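And a matching Keras sketch of that SkipGram setup, where a (target, context) pair is scored with a dot product followed by a sigmoid (again, the sizes are assumptions):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300

target_in = tf.keras.Input(shape=(1,), dtype="int32")
context_in = tf.keras.Input(shape=(1,), dtype="int32")

target_vec = tf.keras.layers.Embedding(vocab_size, embed_dim)(target_in)    # (batch, 1, 300)
context_vec = tf.keras.layers.Embedding(vocab_size, embed_dim)(context_in)  # (batch, 1, 300)

merged = tf.keras.layers.Dot(axes=-1)([target_vec, context_vec])  # dot product of the two vectors
merged = tf.keras.layers.Flatten()(merged)
out = tf.keras.layers.Activation("sigmoid")(merged)  # probability the pair really occurred together

skipgram = tf.keras.Model([target_in, context_in], out)
skipgram.compile(optimizer="adam", loss="binary_crossentropy")
```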

Try and look at the picture below just to get an overall understanding of what these two stand for.

[Image: Example for comparison between CBOW and SkipGram]

One of the videos which I enjoyed watching to understand this concept was:

That’s it for today. Keep Learning.

Cheers.
