Word2Vec — CBOW & Skip-gram: Algorithmic Optimizations

Lalit Kishore Vyas · Published in Analytics Vidhya · Jan 25, 2020

Easily digestible + small & crisp:

What Word2Vec does is: given a word, it returns a vector, such that semantically similar words are mapped to similar vectors.

Let’s take the example below: if “cat” is the focus word, the words surrounding it are the context words. The context words are very useful in understanding the focus word, and vice versa.
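As a minimal sketch of that idea (the sentence and window size here are illustrative choices, not taken from the original figure), this is how focus/context pairs can be extracted from a sentence with a window of two words on each side:

```python
# Minimal sketch: extract (focus, context) pairs from a toy sentence.
# The sentence and window size are illustrative, not fixed by Word2Vec.
sentence = "the quick brown cat jumps over the lazy dog".split()
window = 2  # number of context words taken on each side of the focus word

pairs = []
for i, focus in enumerate(sentence):
    # Context = words within `window` positions of the focus word.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    pairs.append((focus, context))

print(pairs[3])  # ('cat', ['quick', 'brown', 'jumps', 'over'])
```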

There are two Word2Vec algorithms: CBOW (Continuous Bag of Words) & Skip-gram.

Word2Vec: CBOW

Let’s suppose we have a vocabulary V of size v and we use one-hot encoding to represent each word. So, given any word, we can build a binary vector of v dimensions (as we have v total words in the vocabulary).
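As a small sketch (the toy vocabulary below is hypothetical), one-hot encoding over a vocabulary of size v looks like this:

```python
import numpy as np

# Toy vocabulary of size v; in practice v can be tens of thousands of words.
vocab = ["cat", "dog", "jumps", "lazy", "over", "quick"]
v = len(vocab)
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the v-dimensional binary vector for `word`."""
    vec = np.zeros(v)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0. 0. 0. 0.]
```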

Let’s suppose we have k one-hot encoded context word vectors. The core idea of CBOW is: given the context words, can we predict the focus word?

The k context words (each of dimension v, the vocabulary size) are connected to a hidden layer of N dimensions as shown below. Our goal is then to predict the focus word vector of dimension v.

To train Word2Vec CBOW, we create combinations of focus words and context words and train this neural network. At the end of training, we have all of these weights (shown above in yellow), so for every word we have a vector in the N*V weight matrix.
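Below is a minimal numpy sketch of the CBOW forward pass described above; the matrix names (W_in, W_out), the averaging of the context projections, and the toy sizes are assumptions for illustration, not the exact reference implementation.

```python
import numpy as np

v, N, k = 6, 4, 4          # vocabulary size, hidden size, number of context words (toy values)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(v, N))    # input->hidden weights (v x N): one row per word
W_out = rng.normal(scale=0.1, size=(N, v))   # hidden->output weights (N x v)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_forward(context_indices):
    """Predict a distribution over the focus word, given k context word indices."""
    # Average the context word projections to get the N-dimensional hidden vector.
    h = W_in[context_indices].mean(axis=0)    # shape (N,)
    return softmax(h @ W_out)                 # shape (v,): P(focus word | context)

probs = cbow_forward([1, 2, 4, 5])   # indices of the k context words
print(probs.argmax())                # index of the predicted focus word
# After training, row i of W_in is the learned vector for word i.
```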

Word2Vec: Skip-gram

In CBOW we were trying to predict the focus word given the context words, whereas in Skip-gram we do exactly the opposite: we try to predict the context words given the focus word.

So, we have the focus word of dimension v (the vocabulary size), which is connected to a hidden layer of size N, which in turn is connected to softmax outputs (each of dimension v) to predict the context words. In the output layer we therefore have k softmaxes (one per context word), which can be thought of as k multiclass classifiers.
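A matching sketch for Skip-gram (the W_in/W_out names and toy sizes are again hypothetical): the focus word’s row of W_in gives the hidden vector, and a v-way softmax is evaluated for each of the k context positions.

```python
import numpy as np

v, N, k = 6, 4, 4          # vocabulary size, hidden size, number of context positions (toy values)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(v, N))    # input->hidden weights (v x N)
W_out = rng.normal(scale=0.1, size=(N, v))   # hidden->output weights (N x v)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def skipgram_forward(focus_index):
    """Predict k distributions over context words, given a single focus word index."""
    h = W_in[focus_index]          # the focus word's N-dimensional hidden vector
    probs = softmax(h @ W_out)     # one v-way softmax over the vocabulary ...
    return [probs] * k             # ... used for each of the k context positions at training time

for position, p in enumerate(skipgram_forward(3)):
    print(position, p.argmax())    # most likely context word at each position
```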

In both, the number of parameters to train is similar, but in CBOW we had one softmax to train whereas in Skip-gram we have k softmaxes. Hence Skip-gram takes more time than CBOW and is computationally more expensive.

Comparison between CBOW & Skip-gram:

1. CBOW is comparatively faster to train than Skip-gram (as CBOW has to train only one softmax).

2. CBOW is better for frequently occurring words (because a word that occurs more often has more training examples).

3. Skip-gram is slower but works better than CBOW on smaller amounts of data.

4. Skip-gram works better than CBOW for less frequently occurring words.

5. CBOW is a simpler problem than Skip-gram (because in CBOW we just need to predict the one focus word given many context words).

For both Skip-gram & CBOW, as the number of context words increases, the N-dimensional vector representations get better (typically N is around 200–300).

In the case of Skip-gram, we use the weight matrix shown below to get the vectors of the context words.

In both, we have one major problem: there are far too many parameters to learn, (k+1)·(N·V). With N = 200, k = 5 & V = 10k, we have nearly 12 million parameters to train, and training could take forever, so we need to optimize it somehow.
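Plugging the numbers above into (k+1)·(N·V) confirms the rough count (a quick back-of-the-envelope check, assuming each of the k branches carries its own N·V weight matrix plus one more, as in the count above):

```python
N, k, V = 200, 5, 10_000        # hidden size, context words, vocabulary size
params = (k + 1) * (N * V)      # k weight matrices for the softmax branches + 1 more N*V matrix
print(params)                   # 12000000 -> roughly 12 million parameters
```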

Algorithmic Optimization to train Word2Vec:

As we have millions of weights to learn, training is extremely time-consuming. One way to deal with this is hierarchical softmax; the other is negative sampling.

Hierarchical Softmax:

In both CBOW & Skip-gram it is the softmax that takes most of the time to train, so the hard part is optimizing the softmax. The core idea: can we make the v-way softmax cheaper? The softmax is doing v-class classification, so can we modify it to make it more efficient? Let’s suppose our vocabulary has 8 words. We can place these words as the leaf nodes of a binary tree: as we split the tree into two parts, half of the words end up on one side and half on the other.

So, instead of a v-class classifier, we first ask which half of the binary tree the word falls in, i.e., is it among the first four words or the last four. Suppose it is in the first four; we then split those four again and ask whether the word lies in the first two or the last two, and so on. Instead of eight activation units, we have used only 3 binary decisions to find the word. In general, if we have v activations in the softmax, we need only about log₂(v) binary decisions to compute the focus word’s probability.
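Here is a minimal sketch of that idea for the 8-word vocabulary above (the balanced tree, the sigmoid decision at each internal node, and the toy vectors are assumptions for illustration; real implementations typically use a Huffman tree built from word frequencies):

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

v, N = 8, 4                                          # 8-word vocabulary, hidden size (toy values)
rng = np.random.default_rng(0)
node_vecs = rng.normal(scale=0.1, size=(v - 1, N))   # one vector per internal node of the binary tree
h = rng.normal(scale=0.1, size=N)                    # hidden vector for the current training example

def hierarchical_softmax_prob(word_index):
    """P(word) as a product of log2(v) binary (left/right) decisions down a balanced tree."""
    prob, node, lo, hi = 1.0, 0, 0, v
    while hi - lo > 1:
        mid = (lo + hi) // 2
        go_left = word_index < mid             # which half of the remaining words holds our word?
        p_left = sigmoid(node_vecs[node] @ h)  # binary classifier at this internal node
        prob *= p_left if go_left else (1.0 - p_left)
        node = 2 * node + (1 if go_left else 2)      # descend to the left/right child node
        lo, hi = (lo, mid) if go_left else (mid, hi)
    return prob

print(sum(hierarchical_softmax_prob(i) for i in range(v)))  # ~1.0: still a valid distribution
print(math.log2(v))                                         # 3 binary decisions instead of an 8-way softmax
```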

Negative Sampling:

It is a statistics/probability-based technique: instead of updating the weights for every word in the vocabulary on each iteration, we update only a small sample of words. The sample is defined as follows (a sketch of the sampling step is given after the list):

  1. always keep the target (positive) words, and
  2. amongst the non-target words, only a small sample gets updated, chosen with a probability that depends on each word's frequency, so that very frequent words do not dominate the sample.
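As a sketch of that sampling step (the 3/4 exponent is the value used for the noise distribution in the original Word2Vec paper; the toy frequencies and word list are hypothetical):

```python
import numpy as np

# Toy unigram counts (hypothetical); in practice these come from the training corpus.
counts = {"the": 5000, "cat": 120, "jumps": 40, "quick": 60, "lazy": 30}
words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

# Noise distribution from the original Word2Vec paper: unigram frequency ** 3/4,
# which damps the dominance of very frequent words like "the".
probs = freqs ** 0.75
probs /= probs.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, replace=True, p=probs)  # 5 negative samples per positive pair
print(dict(zip(words, probs.round(3))))
print(negatives)
```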
