[Natural Language Processing] Word2Vec

Areomoon · Published in moonote · Jun 13, 2019 · 5 min read

Word to Vector (Word2Vec)

The Word2Vec model learns vector representations of words, called "word embeddings", which represent each word as a vector in a continuous vector space.

Word Embedding

What is a word embedding? When doing text analysis, we want each word's representation to capture its meaning. The key idea is that the meaning of a target word can be represented by the words (the context) around it.

Let's go through the details with the following sentence:

We can represent each word as an n-dimensional vector. Let's use n = 6, the length of the whole sentence. From this sentence, we want to create a word vector for each unique word.

We want the values to be filled in such a way that the vector represents the word and its context, meaning, or semantics. One method is to create a co-occurrence matrix.

A co-occurrence matrix is a matrix that counts how many times each word appears next to every other word in the corpus.

Co-occurrence Matrix
Vector from co-occurrence Matrix

Now, by comparing the similarity of each word's vector, we can find words with similar meanings in the sentence. For example, the vectors of "love" and "like" are close in this case.
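
To make this concrete, here is a minimal sketch of building a co-occurrence matrix and comparing word vectors with cosine similarity. The toy sentence and window size are illustrative assumptions, not necessarily the exact ones used in the figures above.

```python
import numpy as np

# Toy corpus; the actual sentence from the figures is assumed here.
tokens = "i love nlp and i like dogs".split()
vocab = sorted(set(tokens))                      # 6 unique words
index = {w: i for i, w in enumerate(vocab)}

# Count words that appear within a window of 1 position of each other.
window = 1
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for pos, word in enumerate(tokens):
    neighbours = tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + 1 + window]
    for ctx in neighbours:
        M[index[word], index[ctx]] += 1

def cosine(u, v):
    """Cosine similarity between two word vectors (rows of the matrix)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# "love" and "like" share the context word "i", so their rows are similar.
print(cosine(M[index["love"]], M[index["like"]]))
```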

However, as the number of words in the sentence or article increases, the co-occurrence matrix becomes inefficient and suffers from sparsity. To deal with this dilemma, we can apply other word embedding methods.

Word embedding by Word2Vec model

There are two main kinds of word embedding methods:

1. Count-based methods that rely on matrix factorization (e.g. LSA, HAL)

2. Window-based methods (e.g. the Skip-gram and CBOW models)

Here we focus on the window-based methods, CBOW and Skip-gram.

1. CBOW:

CBOW predicts the current target word (the center word) from the source context words (the surrounding words). Consider the simple sentence "I love NLP". With a context window of size 2, this gives (context_window, target_word) pairs such as ([love, NLP], I), ([I, NLP], love), and ([I, love], NLP), as shown in Figure 1 and in the sketch below. The model then tries to predict the target_word from the context_window words.
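
Here is a minimal sketch of how those (context_window, target_word) pairs could be generated; the tokenization and window size are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBOW training."""
    for pos, target in enumerate(tokens):
        context = tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + 1 + window]
        yield context, target

for context, target in cbow_pairs("I love NLP".split()):
    print(context, "->", target)
# ['love', 'NLP'] -> I
# ['I', 'NLP'] -> love
# ['I', 'love'] -> NLP
```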

2. Skip-Gram:

In contrast to CBOW, the Skip-gram model predicts the context words around the center word, using the center word itself as input.

For each word in our corpus, we take it as input and predict the words around it (not all the words, only those within the chosen window), as sketched below.
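
The same idea in the opposite direction: a minimal sketch that generates (center_word, context_word) training pairs for Skip-gram, again with an illustrative window size.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word) pairs for Skip-gram training."""
    for pos, center in enumerate(tokens):
        for ctx in tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + 1 + window]:
            yield center, ctx

print(list(skipgram_pairs("I love NLP".split())))
# [('I', 'love'), ('I', 'NLP'), ('love', 'I'), ('love', 'NLP'), ('NLP', 'I'), ('NLP', 'love')]
```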

Objective Function of Word2Vec
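
As a reference point, the standard Skip-gram objective (following the CS224n notes in [3]) is to maximize, for each position t in a corpus of length T, the probability of the context words within a window of size m around the center word, which is equivalent to minimizing the negative log-likelihood:

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t ; \theta)
```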

But how do we calculate P(w_{t+j} | w_t ; θ)?

We can use two vectors per word to perform this calculation:
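
In the standard formulation, each word w has a center vector v_w (used when w is the center word) and a context vector u_w (used when w is a context word), and the probability of seeing context word o given center word c is a softmax over the vocabulary V:

```latex
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```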

Neural Network Model Structure — Word2Vec model

Now that we know the model concept of Word2Vec and its objective function, the next step is to train a model and generate the word embeddings. A shallow neural network with a single hidden layer can perform this task.

Notice that we're not going to use the whole neural network for the task we trained it on! Instead, the goal is just to learn the weights of the hidden layer; we'll see that these weights form the projection layer that gives us the "word vectors".

There is no activation function on the hidden-layer neurons, but the output neurons use a softmax classifier.

For our example, we'll say that we're learning word vectors with 300 features (a hyperparameter; this is what Google used in their published model). The hidden layer is then represented by a weight matrix with 10,000 rows (the number of words in our vocabulary) and 300 columns (one for every hidden neuron, the same as the number of features).

If you look at the hidden layer's linear neurons above, their weight matrix is actually the Word2Vec embedding we want! A minimal sketch of the whole structure follows.
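
To make the shapes concrete, here is a minimal numpy sketch of that architecture for the Skip-gram case. The 10,000-word vocabulary and 300 features follow the numbers above; the initialization and the single-pair forward pass are illustrative assumptions, not the exact published implementation.

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300

# Hidden ("projection") layer weights: one 300-dim row per vocabulary word.
# After training, these rows are exactly the word vectors we are after.
W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
# Output layer weights: map the 300-dim hidden state back to vocabulary scores.
W_out = np.random.randn(embedding_dim, vocab_size) * 0.01

def forward(center_word_id):
    """One Skip-gram forward pass: center word in, distribution over context words out."""
    h = W_in[center_word_id]              # no activation, just a lookup (linear neurons)
    scores = h @ W_out                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # softmax over the vocabulary
    return exp / exp.sum()

probs = forward(42)        # probability of each word appearing near word #42
word_vector = W_in[42]     # this row is the 300-dim embedding for word #42
```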

Reference

[1]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.

[2]: Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546.

[3]: Stanford CS224n (Winter 2019): Natural Language Processing with Deep Learning, http://web.stanford.edu/class/cs224n/
