NLP Zero to One: Dense Representations, Word2Vec (Part 5/30)

Word Embeddings and Semantic Representations

Kowshik chilamkurthy
Mar 2 · 5 min read


In this blog, we will see how to tackle the problems of sparse representations using dimensionality reduction techniques and, more importantly, deep learning. Using these techniques, we will extract powerful word representations called embeddings: short, dense vectors. Unlike TF-IDF or BoW vectors, these vectors are typically 50–300 dimensions long. Dense vectors tend to work better than sparse vectors across NLP problems, as the order/structure of words plays a major role, so words with similar meanings end up with similar representations.
For example, “boat” and “ship” look completely unrelated in sparse vector representations, but embeddings succeed in capturing the similarity between these words. The two most popular open-source embedding models are Word2Vec and GloVe. The word2vec methods are fast and efficient to train, and reference code and pretrained embeddings are easily available online.
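To make the “boat”/“ship” intuition concrete, here is a minimal sketch using made-up 5-dimensional vectors (real embeddings are 50–300 dimensions): related words score a higher cosine similarity than unrelated ones.

```python
import numpy as np

# Hypothetical embeddings for illustration only -- real vectors come from
# a trained Word2Vec or GloVe model.
boat = np.array([0.8, 0.1, 0.7, 0.3, 0.5])
ship = np.array([0.7, 0.2, 0.6, 0.4, 0.5])
sky = np.array([-0.4, 0.9, -0.2, 0.1, -0.6])

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, negative for opposed."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Dense vectors let us measure semantic similarity directly:
print(cosine(boat, ship) > cosine(boat, sky))  # → True
```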


Since we are about to learn how deep learning can be used to create these embeddings, it is best to start with the neural architectures and the intuition behind them. Two architectures were proposed, and every NLP practitioner should be familiar with both: the continuous bag-of-words (CBOW) model and the skip-gram model.

CBOW (Continuous Bag-of-Words)..


The neighbouring words give us the context of the target word (which is “sad” in the above example). So the context is simply a window of c words to the left and right of the target.
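As a sketch (the sentence and function name are illustrative), extracting the window of c words around each target might look like:

```python
def context_windows(tokens, c=2):
    """Return (context, target) pairs: c words on each side of every target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - c):i] + tokens[i + 1:i + 1 + c]
        pairs.append((context, target))
    return pairs

sent = "i am feeling very sad today".split()
# For the target "sad" (index 4) with c=2, the context is the words around it:
print(context_windows(sent, c=2)[4])
# → (['feeling', 'very', 'today'], 'sad')
```

Note that near the edges of the sentence the window is simply truncated, so early and late targets get fewer context words.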

Classification Problem Setting..

Picture depicting that the training data is collected in a rolling-window manner

Once X (the input/context words) and y (the output/target word) are created from the corpus as described, the immediate task is to design a model that performs classification for us, predicting a target word from its context words.
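A minimal sketch of building X and y from a toy corpus (the corpus and variable names are illustrative), assuming words are mapped to integer vocabulary ids so that predicting the target becomes multi-class classification over the vocabulary:

```python
corpus = "the ship sailed the sea the boat sailed the lake".split()

# Map each unique word to an integer id.
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

c = 2  # window size
X, y = [], []
for i, target in enumerate(corpus):
    context = corpus[max(0, i - c):i] + corpus[i + 1:i + 1 + c]
    X.append([word2id[w] for w in context])  # context word ids (input)
    y.append(word2id[target])                # target word id (class label)

print(len(X), len(y))  # one training example per corpus position
```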

Explanation of classification setting

Neural Architecture..

Input layer depiction:

Picture depicting the input layer: for a word Wi, there is a corresponding K-length embedding vector

Once we understand the working of the input layer, the rest of the architecture is very simple. All the context words are fed into the embedding/input layer, and the vectors corresponding to the context words are averaged. This averaging is handled by a lambda (average) layer. The averaged vector is then fed into a softmax layer, which predicts the target word from the entire vocabulary of the corpus.
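A minimal numpy sketch of that forward pass — embedding lookup, averaging (the lambda layer), then a softmax over the vocabulary. All sizes and weights here are illustrative, not a trained model:

```python
import numpy as np

V, K = 10, 4  # vocabulary size, embedding length (toy values)
rng = np.random.default_rng(0)
E = rng.normal(size=(V, K))  # embedding layer (trainable parameters)
W = rng.normal(size=(K, V))  # softmax layer weights (trainable parameters)

def cbow_forward(context_ids):
    """Predict a distribution over the vocabulary from context word ids."""
    avg = E[context_ids].mean(axis=0)      # lambda/average layer
    logits = avg @ W                       # project back to vocabulary size
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward([1, 3, 5, 7])
print(probs.shape)  # → (10,) -- one probability per vocabulary word
```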

Depiction of whole CBOW neural architecture

CBOW Training..

The training process can be described by answering just 2 questions:
1. What are the trainable/learnable parameters?
2. What is the loss function?

What are trainable/learnable parameters?
The embedding layer is randomly initialised, and all the numbers in the embedding layer are trainable parameters. So the embedding layer gets better and better as more data is fed into the model.
What is loss function?

The loss is the negative log of the conditional probability of the target word given the context words. We match the predicted word with the actual target word, compute the loss using categorical cross-entropy, and perform backpropagation with each epoch, updating the embedding layer in the process.
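As a sketch, categorical cross-entropy for a single example is just the negative log of the probability the softmax assigned to the true target (the probabilities below are illustrative):

```python
import numpy as np

def cross_entropy(probs, target_id):
    """Categorical cross-entropy for one example: -log p(target | context)."""
    return float(-np.log(probs[target_id]))

probs = np.array([0.1, 0.7, 0.2])      # softmax output over a toy 3-word vocab
loss_good = cross_entropy(probs, 1)    # model put high probability on the target
loss_bad = cross_entropy(probs, 2)     # target received low probability

print(loss_good < loss_bad)  # → True: confident correct predictions cost less
```

Minimising this loss pushes the softmax probability of the true target toward 1, and the gradients flow back through the average layer into the embedding table itself.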



Previous: NLP Zero to One: Deep Learning Training Procedure (Part 4/30)
Next: NLP Zero to One: Count based Embeddings, GloVe (Part 6/30)

Nerd For Tech

From Confusion to Clarification


NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit
