We have already seen how to create sparse vector representations like TF-IDF and BoW for a document, with dimensions corresponding to the words in the corpus vocabulary (which makes these vectors huge). Please skim through my earlier blog in this series, NLP Zero to One: Sparse Document Representations (Part 2/30), for an understanding of sparse representations. (*sparse means the vector representation contains mostly zeros)
As discussed in the previous blog, there are some serious drawbacks to these long sparse vector representations like TF-IDF and BoW:
1. Large memory footprint and expensive computation, because the vectors are long.
2. Significant information loss, as the order of words in the documents is ignored.
3. Hard to model, as the number of model parameters to train scales with the input vector length, which is huge.
In this blog, we will see how to tackle these problems using dimensionality reduction techniques and, most importantly, deep learning. Using these techniques, we will extract powerful word representations called embeddings (dense, short vectors). Unlike TF-IDF or BoW vectors, their length is typically in the range of 50–300. These dense vectors tend to work better than sparse vectors in most NLP problems because the order/structure of words plays a major role, so words with similar meanings get similar representations.
For example: “boat” and “ship” are two unrelated things in sparse vector representations, but embeddings succeed in capturing the similarity between these words. The two most popular open-source embedding models are Word2Vec and GloVe. The word2vec methods are fast, efficient to train, and easily available online, with code and pretrained embeddings.
In this section, we will understand how to use deep learning to create word embeddings. These embeddings are so powerful that the vector representation of “queen” is very close to v(king) − v(man) + v(woman). Such representations are powerful in capturing semantic and syntactic relationships.
Before we learn how deep learning can be used to create these embeddings, it is best to start with the neural architectures and the intuition behind them. Two popular architectures were proposed, and every NLP practitioner should be familiar with both: the continuous bag-of-words (CBOW) model and the skip-gram model.
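The analogy arithmetic can be illustrated with a toy example. Note that the vectors below are hand-picked for illustration, not learned embeddings; real Word2Vec/GloVe vectors are 50–300 dimensional:

```python
import numpy as np

# Hand-picked toy 3-d vectors, arranged so the analogy arithmetic works.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),   # unrelated distractor word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# v(king) - v(man) + v(woman)
target = vecs["king"] - vecs["man"] + vecs["woman"]

# nearest word by cosine similarity, excluding the query words themselves
best = max(
    (w for w in vecs if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, vecs[w]),
)
print(best)  # queen
```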
CBOW (Continuous bag-of-words) ..
Words used in similar ways/contexts end up with similar representations. For example, synonyms like “sad” and “unhappy” are used in similar contexts. But how do we define context?
The neighbouring words give us the context of the target word (which is “sad” in the above example). So here context is simply a window of c words to the left and right of the target.
Classification Problem Setting..
In line with the intuition described, we will try to predict the current target word (“sad”) based on the context words (surrounding words). The number of surrounding words to consider for the prediction is called the context window. For the above example, if the context window equals 2, then one training pair will be ([“would”, “be”, “memory”, “to”], [“sad”]). If you observe closely, this neural architecture is unsupervised: all we need to give it is a huge corpus (the set of all documents), nothing more. It can create X (input: the surrounding words) and y (the target word) in a rolling manner, as shown in the diagram below, and construct dense word embeddings from the corpus.
Once X (the input/context words) and y (the output/target words) are created from the corpus as described, the immediate task is to design a model that performs classification for us, where we try to predict a target word from the context words.
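A minimal sketch of this rolling construction of (context, target) pairs; the sentence here is made up for illustration:

```python
def cbow_pairs(tokens, window=2):
    """Each word becomes the target once, with up to `window`
    words on each side of it as the context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "we would be sad to lose memory".split()
for context, target in cbow_pairs(tokens):
    print(context, "->", target)
```

Near the edges of the sentence the context simply shrinks, as there are fewer neighbours available.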
There are 3 important aspects of this neural architecture: the input layer, the lambda/averaging layer, and the dense softmax layer. The most important, and also the most confusing, component is the input layer. The input layer is often called the embedding layer.
Let’s say we have a vocabulary of N words and we plan to get dense vectors of size K. The input layer maps each context word, through an embedding matrix, to a dense vector representation of dimension K. The embedding matrix is therefore an NxK matrix, where each word has its own K-sized vector.
Input layer depiction:
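In code, the embedding-layer lookup amounts to selecting one K-sized row of the NxK matrix. The sizes and the word-to-index mapping below are toy values for illustration:

```python
import numpy as np

N, K = 10, 4                 # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(N, K))  # the N x K embedding matrix, randomly initialised

word_to_id = {"sad": 3}      # hypothetical word -> row-index mapping
vec = E[word_to_id["sad"]]   # lookup = picking one K-sized row
print(vec.shape)  # (4,)
```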
Once we understand the working of the input layer, the rest of the architecture is very simple. All the context words are fed into the embedding/input layer, and the vectors corresponding to the context words are averaged. This step is handled by the lambda layer, or averaging layer. The averaged vector is then fed into a softmax layer, which predicts the target word from the entire vocabulary of the corpus.
Now that we understand the intuition and the neural architecture, we will try to get some insights into the training mechanism and how the weights are updated. I urge readers to skim through my earlier blog (NLP Theory and Code: Deep Learning Training Procedure (Part 4/40)) to understand how training is carried out in neural networks.
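Putting the three pieces together, here is a minimal numpy sketch of the CBOW forward pass, with toy sizes and random (untrained) weights:

```python
import numpy as np

N, K = 8, 4                     # toy vocabulary and embedding sizes
rng = np.random.default_rng(42)
E = rng.normal(size=(N, K))     # embedding / input layer
W = rng.normal(size=(K, N))     # dense layer mapping the K-vector to N scores

def cbow_forward(context_ids):
    h = E[context_ids].mean(axis=0)        # lambda / averaging layer
    scores = h @ W                         # dense layer
    exp = np.exp(scores - scores.max())    # softmax over the whole vocabulary
    return exp / exp.sum()

probs = cbow_forward([1, 2, 4, 5])   # probabilities for every vocabulary word
print(probs.shape)  # (8,)
```

The word with the highest probability is the model’s prediction for the target word.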
Training process can be described by just answering 2 questions:
1. What are the trainable/learnable parameters?
2. What is the loss function?
What are the trainable/learnable parameters?
The embedding layer is randomly initialised, and all the numbers in the embedding layer are trainable parameters. So the embedding layer gets better and better as more data is fed through the model.
What is the loss function?
The negative log of the conditional probability of the target word given the context words. We match the predicted word against the actual target word, compute the loss using categorical cross-entropy, and perform backpropagation with each epoch to update the embedding layer in the process.
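For a single training pair, this loss reduces to the negative log-probability the model assigned to the true target word. A sketch with a made-up softmax output:

```python
import numpy as np

def cross_entropy(probs, target_id):
    # negative log-probability assigned to the true target word
    return float(-np.log(probs[target_id]))

probs = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output over 3 words
print(round(cross_entropy(probs, 1), 4))  # 0.3567
```

The more probability mass the model puts on the correct word, the smaller the loss, which is exactly what backpropagation pushes the embedding layer toward.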
We will discuss the other popular Word2Vec architecture, skip-gram, in the next blog in this series. We will also discuss GloVe, another popular pre-trained embedding.