DeepLearning series: Natural Language Processing and Word Embeddings

Michele Cavaioni
Machine Learning bites
Mar 15, 2018 · 9 min read

In the previous blog on Sequence Models and language model generation, we used a dictionary of words, and each word was represented by a one-hot vector. For example, if the word “woman” was word number 5678 in the dictionary, its one-hot representation was a vector of zeros with a 1 in the 5678th position.

The most significant weakness of this representation is that each word is treated as a thing by itself: there is nothing that helps the model generalize across words or support analogical reasoning.

For example, if we ask the machine “man is to woman as boy is to ___?”, it won’t be able to deduce the analogy and quickly answer “girl”. For the machine, “girl” is just an index, with no correlation whatsoever to anything else. This is undoubtedly a huge drawback, because in natural language processing making analogies to predict a word is crucial.

Word embeddings create a featurized representation of every word, which allows correlations among words. Each word becomes a vector of a certain dimension, where each dimension represents some feature.

(I will refer to 300 features throughout the rest of this blog, which is a common choice.) Similar words will then have similar values for these features.

As for the features, think of characteristics you could attribute to a word: for example age, size, gender, height. The algorithms that learn word embeddings examine very large text corpora, i.e., big datasets (of around 1–100 billion words) of unlabeled text.

If you have a small training dataset, you can take pre-trained embeddings and use transfer learning to carry them over to the new task. Finally, you can fine-tune the word embeddings using the small dataset you have at hand.

In a way, word embedding is similar to the face encoding we saw in a previous blog (link here), where a face was “translated” into a vector of numbers.

Of course, a 300-feature vector is a non-intuitive representation of a word. Luckily, there is an algorithm (the t-SNE algorithm) that maps this 300-dimensional vector in a non-linear way to a 2-dimensional space, allowing us to see clusters of words related to each other. Such as:
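As a rough illustration, here is a minimal scikit-learn sketch of such a projection (the embedding matrix E and the word list below are random placeholders, not real learned embeddings):

```python
# Minimal sketch: project 300-dimensional word vectors to 2-D with t-SNE.
# E and words are stand-ins; in practice E would be a learned embedding matrix.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

E = np.random.randn(1000, 300)                 # placeholder: one 300-d vector per word
words = [f"word_{i}" for i in range(1000)]     # placeholder vocabulary

coords = TSNE(n_components=2, perplexity=30).fit_transform(E)

plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i in np.random.choice(len(words), 20, replace=False):   # label a few points
    plt.annotate(words[i], coords[i])
plt.show()
```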

The most commonly used similarity function for analogy reasoning is the “cosine similarity”, which measures the cosine of the angle between two vectors u and v:

sim(u, v) = uᵀv / (‖u‖₂ ‖v‖₂)

where u and v are the embedding vectors of the two words being compared.
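As a quick sketch, cosine similarity is straightforward to compute with NumPy (the vectors below are toy examples, not real 300-dimensional embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors just to show the call; real embeddings would be 300-dimensional
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.5, 1.8, 3.2])
print(cosine_similarity(u, v))   # close to 1.0 when the vectors point in similar directions
```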

Learning word embeddings:

When we implement an algorithm to learn word embeddings, what we end up learning is an embedding matrix.

For a 300-feature embedding and a 10,000-word vocabulary, the matrix E will be (300 x 10,000). To learn the matrix E, we initialize it randomly and then use gradient descent to learn all of its parameters.

Each word will be represented by its embedding vector “e”, such as:

For example, if “woman” is the 5678th word in the dictionary, its embedding vector is the product of the embedding matrix and the one-hot representation of the word: e_5678 = E · o_5678.

In practice, we don’t actually perform this multiplication, which is very inefficient since the one-hot vector is mostly zeros; instead, we use a specialized lookup function that simply reads out the corresponding column of the matrix E.
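Here is a minimal NumPy sketch of the two equivalent ways of getting the embedding vector (the matrix values and the word index are made up for illustration):

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size)     # embedding matrix (300 x 10,000), placeholder values

idx = 5678                                      # position of "woman" in the dictionary
o = np.zeros(vocab_size)
o[idx] = 1.0                                    # one-hot vector for "woman"

e_slow = E @ o                                  # multiplication: wasteful, the vector is mostly zeros
e_fast = E[:, idx]                              # lookup: just read column 5678 of E

assert np.allclose(e_slow, e_fast)              # both give the same 300-dimensional embedding
```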

Let’s now look in more detail at the different algorithms we can implement to learn the parameters of the matrix E.

Neural language model

For each training sentence, we construct the one-hot vector for each word and multiply it by the matrix E to get the embedding vector “e”.

Now we feed all these 300-dimensional vectors, representing each word within the sentence, to a neural network, which feeds into a softmax. The softmax classifies among all the 10,000 (dictionary size) possible outputs.

The parameters of the algorithm are the matrix E, the weights and biases of the network’s layer, and those of the softmax. We then use backpropagation with gradient descent to maximize the likelihood of the training set, i.e., the probability of the next word given the input words.

If our goal is to learn the word embeddings, we can feed in a wider context around the target word. If instead the purpose is to build a language model, we use the previous few words (e.g., four) before the target word. See below:
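As a rough sketch (not the exact network from the course; the hidden-layer size and the four-word window are my own illustrative choices), such a fixed-window neural language model could look like this in Keras:

```python
import tensorflow as tf

vocab_size, n_features, window = 10_000, 300, 4

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window,)),                          # four previous word indices
    tf.keras.layers.Embedding(vocab_size, n_features),        # plays the role of the matrix E
    tf.keras.layers.Flatten(),                                # concatenate the four embedding vectors
    tf.keras.layers.Dense(128, activation="relu"),            # hidden layer (size is a guess)
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # probability of each possible next word
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(X_windows, y_next_word, ...)  # hypothetical arrays of word indices and next-word labels
```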

Word2Vec algorithm

This is a simpler and computationally more efficient algorithm for learning word embeddings than the neural language model. Word2Vec is a group of models that represent each word as a vector, so that similar words end up close to each other. One of these models is the Skip-Gram model.

Skip-gram

The setup is quite simple. We have a “context”, which is a randomly picked word, and a “target”, which is a word picked within a defined window around the context.

This creates a supervised learning problem where, given the context word, you are asked to predict what is a randomly chosen word within a certain window.

We are going to train a simple neural network with a single hidden layer to perform this supervised learning task. However, we won’t use the network for the task it was trained on; instead, we use it to learn the weights of the hidden layer, which are the word embeddings we are after.

So, given a specific word in the middle of the sentence (the “context” word), we look at the words nearby and pick one at random (the “target” word). The network is going to tell us the probability for every word in the vocabulary of being the “target” word that we chose.

The output of the softmax is:

p(t | c) = e^(θ_tᵀ e_c) / Σ_j e^(θ_jᵀ e_c), where the sum in the denominator runs over all 10,000 words in the vocabulary.

θ_t is the parameter vector associated with output t (i.e., with the chance of that particular word “t” being the correct target).

e_c is the embedding vector of the context word.
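In NumPy terms, that softmax output could be sketched as follows (E and Theta are random placeholders standing in for learned parameters):

```python
import numpy as np

vocab_size, n_features = 10_000, 300
E = np.random.randn(n_features, vocab_size)       # embedding vectors e, one column per word (placeholder)
Theta = np.random.randn(n_features, vocab_size)   # softmax parameters theta_t, one column per word (placeholder)

c = 6257                                          # hypothetical index of the context word, e.g. "orange"
e_c = E[:, c]

logits = Theta.T @ e_c                            # theta_t . e_c for every word t
p = np.exp(logits) / np.sum(np.exp(logits))       # p(t | c): the sum runs over all 10,000 words

print(p.shape)                                    # (10000,) -- one probability per vocabulary word
```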

The network is, therefore, trained by giving it word pairs taken from the sentence with the context-target method described above and represented below:

As mentioned, the input to the network is the context word, or more precisely its one-hot representation. So the context word is a vector with 10,000 positions, since our vocabulary contains 10,000 words.

For training, the input is a one-hot vector representing the context word, and the training output is a one-hot vector representing the output word (the target word described above). When you evaluate the trained network on an input word, the output vector gives, for each vocabulary word, the probability that a randomly selected nearby word is that word.

If we use our “standard” representation of 300 features for each word, then the hidden layer of the network will be represented by a weight matrix of dimension (10,000 x 300). These are the weights we are interested in learning. We don’t care about the output of the network.

It’s sort of a trick: we train a network for doing something, but we actually don’t care about the output of this supervised learning problem. Instead, we use it to learn the parameters (the weights) within the network.

The intuition behind all this is that if two words often appear around each other, the model will output similar results for them. One way for the network to output similar predictions for two words is if their embedding vectors are similar.

So think of the network learning that when it sees the word “orange”, the word “juice” is likely “in the neighborhood”. One way for it to also predict that “juice” is close by when it sees “apple” is for the vectors of “apple” and “orange” to be similar.

Then, finally, the network learns that “orange” and “apple” are similar vectors and predicts “juice” when it sees either of them.

I should mention a couple of problems with the skip-gram model. First of all, computational speed: to evaluate the probability, the softmax has to carry out a sum over all the words in the vocabulary.

One solution is to use a “hierarchical softmax classifier”. Instead of classifying among all the words in the vocabulary (i.e., 10,000 words) in one go, a tree of classifiers first tells whether the target word is within the first 5,000 words, and so on down the tree.

Another issue with this model is how to choose the context word. If we pick it completely at random, we will mostly end up with the most frequent words (e.g., “the”, “a”, “of”, …). In this case, the network doesn’t learn much, as the training set doesn’t contain many “interesting” words to learn from.

Therefore, instead of a random selection, there are different heuristics that you can use to balance out “common” words with more “interesting” ones.

Negative sampling

This model is quite similar to the skip-gram, but it’s a more efficient learning algorithm. It is still set up as a supervised learning problem like before, but with a little tweak.

Given a pair of words like “orange” and “juice”, we are going to predict whether this is a context-target pair. We do so by generating one positive example and several (k) negative examples. So now our target is a binary label (1 if the pair of words is a positive example, 0 otherwise).

To generate a positive example, we pick a context word, look around a window and pick the correct target word to go with it. To generate a negative example, instead, we take the context word and then a word at random from the dictionary. For a small dataset, the number of negative examples k is within a range of 5 to 20, while 2 to 5 for a large dataset.
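Here is a rough sketch of how such a training set could be generated (the sentence, vocabulary, and sampling details are toy choices for illustration):

```python
import random

def make_examples(sentence, vocab, window=5, k=4):
    """Return (context, word, label) triples: one positive pair plus k random negatives per context word."""
    examples = []
    for i, context in enumerate(sentence):
        # positive example: a target word picked within the window around the context word
        neighbors = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        target = random.choice(neighbors)
        examples.append((context, target, 1))
        # k negative examples: random words from the dictionary, labelled 0
        for _ in range(k):
            examples.append((context, random.choice(vocab), 0))
    return examples

sentence = "i want a glass of orange juice to go along with my cereal".split()
vocab = sentence + ["king", "book", "the", "of"]      # toy stand-in for a 10,000-word vocabulary
print(make_examples(sentence, vocab)[:6])
```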

Therefore, we now have a logistic regression model: P(y = 1 | c, t) = σ(θ_tᵀ e_c).

Similar to before, the network is trained by inputting the one-hot representation of the context word, and the output this time corresponds to 10,000 possible logistic regression classification problems. Instead of training all 10,000 of them on every iteration, we only train (k + 1) of them (the k negative examples plus the one positive example).
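A small NumPy sketch of evaluating just those (k + 1) sigmoid units (all parameters and indices below are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, n_features, k = 10_000, 300, 4
E = np.random.randn(n_features, vocab_size)       # embedding vectors e_c (placeholder)
Theta = np.random.randn(n_features, vocab_size)   # one logistic-regression weight vector per word (placeholder)

c = 6257                                          # hypothetical context word index ("orange", say)
candidates = [4834] + list(np.random.randint(vocab_size, size=k))  # 1 positive target + k negatives

e_c = E[:, c]
preds = sigmoid(Theta[:, candidates].T @ e_c)     # only k+1 sigmoids, instead of a 10,000-way softmax
print(preds)                                      # probability that each pair is a real context-target pair
```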

Therefore, instead of having a 10,000 positions softmax, we have 10,000 binary classification problems, which are much less expensive, computation-wise.

Furthermore, the training is done on every iteration, on a small set of (k+1) binary classification problems.

When sampling the words for the negative examples, it is recommended to sample proportionally to the frequency of each word raised to the power of 3/4. This way we avoid sampling the most frequent words too often.
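A small sketch of that heuristic sampling distribution (the word frequencies are toy numbers):

```python
import numpy as np

words = ["the", "of", "orange", "juice", "king"]
freq  = np.array([60_000, 40_000, 300, 250, 120], dtype=float)   # toy corpus counts

p_frequency = freq / freq.sum()                  # raw frequency: heavily oversamples "the" and "of"
p_heuristic = freq**0.75 / (freq**0.75).sum()    # frequency raised to the power 3/4

for w, p_f, p_h in zip(words, p_frequency, p_heuristic):
    print(f"{w:>7}: raw {p_f:.3f} -> heuristic {p_h:.3f}")

# negative words are then drawn with np.random.choice(words, p=p_heuristic)
```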

GloVe Word Vectors algorithm:

In the previous models, we sampled pairs of words (context and target) from the text, driven by a window of a certain size. The GloVe algorithm makes this explicit right off the bat, by counting the number of times a word “i” appears in the context of a word “j”.
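A minimal sketch of building those co-occurrence counts X_ij from a tokenized corpus (the window size and the toy corpus are my own illustrative choices):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """X[i][j] = number of times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for pos, w_i in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[w_i][tokens[ctx_pos]] += 1
    return X

tokens = "i want a glass of orange juice and a glass of apple juice".split()
X = cooccurrence_counts(tokens)
print(X["orange"]["juice"], X["apple"]["juice"])
```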

Now we have seen how to learn word embeddings. In the following blog I will go over some interesting applications that use this feature.

This blog is based on Andrew Ng’s lectures at DeepLearning.ai
