Word Embeddings

Arnab · Published in Analytics Vidhya · May 28, 2021

This article aims to provide intuition for word embeddings using information from some of the most famous papers in the field.


Why do we need word vectors?

The most obvious choice we have is to represent each word in a document as a one-hot encoded vector. Let’s think about some problems with this.

  1. Vocabulary size: Each word would be represented by a vector as long as your vocabulary, which could be in the order of millions.
Example output for one_hot_encode(“The town was near the city”) — a sketch of such a function follows this list.

2. Semantics: Apart from the issue of scale, we lose the similarity between words that should be close together. In other words, with this representation all words are orthogonal to each other: in a high-dimensional space, they all point in different directions.

Ideally, in the example given above ‘town’ and ‘city’ should have somewhat similar vector representations. But in this case, their cosine similarity is 0 (they are at an angle of 90 degrees from each other).
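
To make both points concrete, here is a minimal sketch (the one_hot_encode helper is my own, just for illustration):

import numpy as np

def one_hot_encode(sentence):
    # Map each word in the sentence to a one-hot vector as long as the vocabulary
    words = sentence.lower().split()
    vocab = sorted(set(words))
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {}
    for word in vocab:
        vec = np.zeros(len(vocab))
        vec[index[word]] = 1.0
        vectors[word] = vec
    return vectors

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vectors = one_hot_encode("The town was near the city")

# Distinct one-hot vectors are orthogonal, so the similarity is 0
print(cosine_similarity(vectors["town"], vectors["city"]))  # 0.0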

Thus, what we would ideally like is to represent each word by a vector of a size far lower than the size of the vocabulary. Additionally, we would like these vectors to carry some ‘meaning’ of the word by themselves. Interestingly, in practice, adding or subtracting two or more of these word vectors produces intriguing results. For instance, vector(Germany) + vector(Capital) might lead to vector(Berlin).

How do we build these word vectors?

Although there are several methods out there, one of the key concepts to understand amongst all of them is context & center words.

The intuition (known as distributional semantics) is that the words close to a particular word in a document convey meaning for that word. For example, take the sentence ‘The teacher scolds the student fairly often’. For the center word ‘student’ and a window of three words on each side, we would have the words ‘teacher’, ‘scolds’, ‘the’, ‘fairly’, and ‘often’ in its context (a small sketch of this is given below). Now, if we have a big enough corpus, the idea is that we can come up with a representation for each word by looking at all the words that appear in its context.
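
As a quick illustration, here is a minimal sketch of extracting such a context window (the function name and the window size of 3 are my own, for this example only):

def context_words(sentence, center_index, window=3):
    # Words within `window` positions on each side of the center word
    words = sentence.split()
    left = words[max(0, center_index - window):center_index]
    right = words[center_index + 1:center_index + 1 + window]
    return left + right

print(context_words("The teacher scolds the student fairly often", 4))
# ['teacher', 'scolds', 'the', 'fairly', 'often']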

There are two ways to go about this: we can either simply count how often each word occurs in the context of every other word and derive a representation from those counts, or we can train a neural network to find suitable vectors by predicting the context words given a center word. In this article, I will cover the latter.

Word2vec:

Originally, the idea was proposed by Tomas Mikolov in 2013 in the paper titled ‘Efficient Estimation of Word Representations in Vector Space’, which was further improved in ‘Distributed Representations of Words and Phrases and their Compositionality’. Two methods were introduced in the first paper.

  1. Continuous bag of words (CBOW): The center word is predicted given the context words.
  2. Skip-Gram: The context words are predicted given a center word.
The models as described in the paper.
Skip-gram Illustration

As the illustration shows, with the skip-gram model we need to model P(context | center). But our original goal was to find word vectors, right? This is where the build-up to the objective function begins.

Likelihood function
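
The figure originally shown here gave the likelihood; written out (with window size m), the standard skip-gram likelihood is:

L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t ; \theta)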

The parameter θ in this case refers to our word vectors. We want to maximize the likelihood of our model predicting the context words given each center word. The inner product runs over the window around a particular center word, and the outer product runs over all words in the corpus treated as center words (T total words).

For both the skip-gram and the CBOW models, we use two vectors to represent each word: one for when the word appears in the context and one for when it is the center word. We call these vectors u and v, stored in the matrices U and V. To turn this into a distribution, we take the dot product of a context vector with the center vector, exponentiate it to make it positive, and normalize by the corresponding terms for every word in the vocabulary. Our goal is to find the vectors U and V that maximize the probability of the correct words appearing in the context of a given center word. Concretely, v_center is obtained by multiplying our matrix V of shape (N x |V|), where N is the embedding dimension and |V| is the size of the vocabulary, with the one-hot encoded input vector. That result is then multiplied by ALL of the context vectors in the matrix U of shape (|V| x N). We end up with a ‘score’ for each word in the vocabulary, and these scores are passed through the softmax function. In a way, this is like multi-class classification, where we predict which ‘word’ is in the context given the center word.
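
In symbols, with u_o the context (‘outside’) vector for word o and v_c the center vector for word c, this probability is the usual softmax over dot products:

P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{|V|} \exp(u_w^{\top} v_c)}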

Final objective
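
The figure originally shown here gave the objective; written out, it is the average negative log-likelihood:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t ; \theta)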

To make it easier to take derivatives, we stick a log in front of the probabilities. Additionally, we add a minus sign so that we minimize the function instead of maximizing it. We also add the factor 1/T to make things invariant to the size of the corpus.

To train this model, we need to update the vectors using gradient descent.

Weight updates using gradient descent
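
The update rule illustrated here is plain gradient descent with learning rate α:

\theta^{new} = \theta^{old} - \alpha \, \nabla_{\theta} J(\theta)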

Let’s take a concrete example and walk through each step of the model’s forward pass.

Defining function for softmax
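
The code originally shown here defined the softmax; a minimal version might look like this (subtracting the max for numerical stability is my addition):

import numpy as np

def softmax(scores):
    # Exponentiate and normalize so the scores form a probability distribution
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)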

Initially, the text is one-hot encoded and one of the words is passed to the network. In this case, ‘ease’ is passed as the center word (the 2nd word when the vocabulary is sorted), and from the sentence we can see that the context words for ‘ease’ (with a window of size 2) are ‘with’ and ‘town’.

We randomly initialize the center-word matrix V and the context-word matrix U.

To be precise, each column of V contains the embedding for a word when it is the center word, and each row of U contains the embedding for a word when it appears in the context. For example, this is the initialization for the matrix U.

# U matrix as a dict with vocab as the key
{'dangerous': array([ 1.2398147 , -0.87909626,  0.38028566]),
'ease': array([0.08540021, 0.18544306, 0.07096524]),
'man': array([ 0.14981179, -1.40251029, 1.02452393]),
'passed': array([0.10055589, 1.09979612, 1.63213385]),
'the': array([-0.22917658, 1.32505777, -0.74246023]),
'through': array([0.21855831, 1.05448857, 0.71948246]),
'town': array([-0.17041045, -1.60960955, -0.56273196]),
'with': array([-1.52561797, 0.00701094, -0.46619808])}

After we pass one word to the network for the forward pass, we receive softmax probabilities predicting each context word.

# Predicted probability distribution
{'dangerous': 0.0006688799877958782,
'ease': 0.024971817920074583,
'man': 0.004299169023935694,
'passed': 0.9438920599330834,
'the': 0.004081941089464888,
'through': 0.0027767995527184113,
'town': 0.009752776322407003,
'with': 0.009556556170520179}

With our random normal initialization, the network predicts ‘passed’ as the most probable context word for the center word ‘ease’, with a probability of 0.94. But the true distribution for the center word ‘ease’ has 1 for ‘with’ and ‘town’ and 0 everywhere else.

# True distribution
{'dangerous': 0,
'ease': 0,
'man': 0,
'passed': 0,
'the': 0,
'through': 0,
'town': 1,
'with': 1
}
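
Putting the pieces together, here is a rough sketch of the forward pass and loss for this example, assuming V is an N x |vocab| numpy array, U is the dict shown above, and reusing the softmax (and numpy import) from earlier:

vocab = sorted(U.keys())                    # ['dangerous', 'ease', ..., 'with']
U_matrix = np.stack([U[w] for w in vocab])  # shape (|vocab|, N)

x = np.zeros(len(vocab))
x[vocab.index('ease')] = 1.0                # one-hot input for the center word

v_center = V @ x                            # pick out the center embedding, shape (N,)
scores = U_matrix @ v_center                # one score per word in the vocabulary
probs = softmax(scores)                     # predicted context distribution

target = np.zeros(len(vocab))
target[vocab.index('town')] = 1.0
target[vocab.index('with')] = 1.0
loss = -np.sum(target * np.log(probs))      # compare against the true context words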

So we compare the two distributions and keep training our word vectors U and V until the output finally looks something like the true distribution. I will skip the details of the backward pass, since with the automatic differentiation available in current frameworks you rarely implement it by hand, but if you’re interested you can check out how it works here.

There’s another problem with this method: on every update we have to go over the entire vocabulary, which as discussed before isn’t computationally great. In terms of complexity, it is on the order of O(V), where V is the size of your vocabulary, for every single training example.

We solve that problem with something known as negative sampling, which was introduced in the second paper mentioned above. Instead of using the softmax function, we use a sigmoid and turn it into a binary classification problem (two classes: positive and negative).

We sample words that aren’t in the context and use them as negative examples. In our running example, for the center word ‘ease’ we would sample some other words like ‘the’ and treat them as negatives. We sample K of these negatives, so instead of updating over the entire vocabulary we get a complexity of roughly O(K), as sketched below.
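
Here is a rough numpy sketch of the negative-sampling loss for one true (center, context) pair with K sampled negative words; the variable names are mine, but the sigmoid form follows the second paper:

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, u_context, u_negatives):
    # v_center: (N,), u_context: (N,), u_negatives: (K, N)
    positive = np.log(sigmoid(u_context @ v_center))              # pull the true pair together
    negatives = np.sum(np.log(sigmoid(-u_negatives @ v_center)))  # push sampled words apart
    return -(positive + negatives)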

Results

If you want to train these on your own corpus, it’s fairly straightforward with deep learning frameworks. The original word2vec model was trained on Google News data. In later years, people have trained their models on Wikipedia datasets and shown better results. This is probably because a single Wikipedia document contains many words related to one topic, whereas in news data we get more of a variety of words. Nevertheless, I will use the original word vectors in this section.

We look at the trained word vectors for labels like king, queen, cat, dog & animal. Then we take a rank-2 approximation of the vectors using SVD. (You can learn more about SVD from my previous article here.)

We can see that even with the rank-2 approximation, words like ‘animal’, ‘dog’, and ‘cat’ are clustered together, while ‘king’ and ‘queen’ sit close to each other. Even though projecting the vectors (originally in 300 dimensions) into 2 dimensions loses a lot of information, it lets us visualize roughly how similar some of the vectors are.
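
As a sketch of how such a rank-2 projection could be produced (the word list and plotting details here are my own, and w is assumed to map each word to its 300-dimensional vector):

import numpy as np
import matplotlib.pyplot as plt

words = ['king', 'queen', 'cat', 'dog', 'animal']
X = np.stack([w[word] for word in words])   # shape (5, 300)

# Keep only the top two singular directions
U_svd, S, Vt = np.linalg.svd(X, full_matrices=False)
coords = U_svd[:, :2] * S[:2]               # 2-D coordinates for each word

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()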

We can use the full vectors to work out some word analogies and similarities.

Some of the similarities worked out were as follows:

#Analogy
similarity(w['king']-w['man']+ w['woman'],w['queen']) ---> 0.73
#Dissimilar words
similarity(w['king'], w['car']) ---> 0.03
#Similar words
similarity(w['man'], w['woman']) ---> 0.76
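
For reference, here is one way these numbers could be reproduced with gensim and the pretrained Google News vectors (the file name is illustrative and exact scores may vary):

import numpy as np
from gensim.models import KeyedVectors

w = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def similarity(a, b):
    # Cosine similarity between two raw vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(w['king'] - w['man'] + w['woman'], w['queen']))  # analogy
print(similarity(w['king'], w['car']))                            # dissimilar words
print(similarity(w['man'], w['woman']))                           # similar words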
