NATURAL LANGUAGE PROCESSING

(PART 2 — Word2Vec)

--

Word2Vec

One of the biggest problems in natural language processing is representing the meanings of words in a way that machines can understand. If computers can attach meanings to the words people use, they come closer to understanding our language. Words must be represented numerically for a computer to work with them, and once they are numeric we can apply mathematical operations to them.

We can represent words as one-hot vectors.

If we create one-hot vectors, the relationships between words are lost: every pair of vectors is orthogonal, so no word looks any closer to one word than to another. We therefore need other techniques for word representation. The most widely used modern techniques for creating word vectors are Word2Vec and GloVe. In this post I explain Word2Vec; I will cover GloVe in the next post.
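
As a quick sketch (using a made-up three-word vocabulary), the dot product between the one-hot vectors of any two different words is always zero, so no similarity information survives:

import numpy as np

# hypothetical three-word vocabulary
vocab = ["king", "queen", "apple"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["king"])                            # [1. 0. 0.]
print(np.dot(one_hot["king"], one_hot["queen"]))  # 0.0: "king" looks no closer to "queen"
print(np.dot(one_hot["king"], one_hot["apple"]))  # 0.0: than it does to "apple"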

Word2Vec creates a model that tries to guess the words around a given middle word (or, conversely, the middle word from its surroundings). Word2Vec is a method for constructing such an embedding. It can be obtained using two methods (both involving neural networks): Skip-gram and Continuous Bag of Words (CBOW).

Skip-Gram

First, we choose our input word (the input word of this sentence is “motivation”) and select a window size of 2. Our model then tries to guess the 2 words to the right and the 2 words to the left of the input word. In other words, Skip-gram works as a predictor of the context words within a window of fixed size m around a given center word.
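
Here is a small sketch of how these (center word, context word) training pairs can be generated with a window size of 2; the sentence below is only an illustrative example:

sentence = "every morning she finds new motivation to keep training hard".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    # take the words at most `window` positions to the left and right
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print([p for p in pairs if p[0] == "motivation"])
# [('motivation', 'finds'), ('motivation', 'new'), ('motivation', 'to'), ('motivation', 'keep')]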

Hence, for a training text of T words, the Skip-gram model wants to maximize the following predictive accuracy, or likelihood.
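
In the usual notation, with θ standing for all model parameters and m for the window size, this likelihood is:

L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)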

For ease of computation, and because we prefer minimizing rather than maximizing functions in machine learning, we will now consider and minimize the average negative log probability:
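
In the same notation, the objective function J that we minimize is:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t ; \theta)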

Let's optimize the word vectors to reduce this loss. Each word has two vectors: one used when the word is in the middle (the center-word vector) and one used when the word is on the edge (the outside, or context, vector). We then use the softmax function to compute the probability of a context word o (outside word) given a center word c.
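
Writing v_c for the center vector of c and u_o for the outside vector of o, this softmax is:

P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}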

We calculate the dot product between the two vectors. If that value is high, the two words are close; if the dot product is small, the words are far apart.

Let’s show all model parameters as a single vector θ.
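
With d-dimensional vectors and a vocabulary V, one common way to write θ is as every center vector v and every outside vector u stacked into one long vector:

\theta = \left( v_{1}, \dots, v_{|V|}, u_{1}, \dots, u_{|V|} \right) \in \mathbb{R}^{2 d |V|}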

Optimization

Let's take the partial derivative of our negative log-likelihood formula. Our aim is to change the parameters in the direction that decreases the loss, which is exactly what gradient descent does.
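
Working the derivative through the softmax for a center vector v_c gives the familiar result below (the same kind of expression holds for the outside vectors), and the parameters are then updated by stepping against the gradient with a learning rate α:

\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x

\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)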

CBOW

We predict the middle word according to the words on the edge.

We can model this CBOW architecture as a deep learning classification model in which we take the context words as our input X and try to predict the target word Y; a tiny example of such an (X, Y) pair is sketched below. Let’s talk about the details of CBOW in another blog post…
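
For the same illustrative sentence used above with a window size of 2, one CBOW training example would pair the context with the target like this:

X = ["finds", "new", "to", "keep"]   # context words around the target (the input)
Y = "motivation"                     # the middle word we try to predict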

Word2Vec with Python Code

First, I import the necessary libraries: NumPy to do operations on vectors and matrices, and Word2Vec from Gensim, a popular open-source NLP library. I also import the matplotlib library and TSNE from scikit-learn to visualize the relationships between words.
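
A minimal sketch of these imports:

import numpy as np
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE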

Let's open the text file with the “open” command. Using 'utf8' will ensure correct reading of Turkish characters. I assign the text to the txt variable.
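
A sketch of this step; the file name corpus.txt is just a placeholder:

with open("corpus.txt", encoding="utf8") as f:   # hypothetical file name
    txt = f.read()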

Next, I split the text into sentences at every newline.
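
For example:

lines = txt.split("\n")   # one sentence per line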

The loop takes one sentence per iteration and splits it into words.
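
A sketch of the loop, building the list of tokenized sentences that Gensim expects:

sentences = []
for line in lines:
    words = line.split()   # split the sentence into words
    if words:              # skip empty lines
        sentences.append(words)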

Look at the words in the first three sentences.
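
For instance:

print(sentences[:3])   # the first three tokenized sentences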

The length of our word vectors is 50. With the window parameter, we take 4 words from the right and left of the middle word. We keep only words that occur at least 5 times in the corpus, and with “sg” we state that we will use the Skip-gram algorithm.
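
A sketch of the training call with these settings; note that in Gensim 4.x the dimension parameter is called vector_size (it was size in older 3.x versions):

model = Word2Vec(
    sentences,
    vector_size=50,   # length of the word vectors
    window=4,         # 4 words to the left and right of the center word
    min_count=5,      # ignore words that occur fewer than 5 times
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)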

Vector representation of the word “mountain”.
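
Something like the following; since the corpus is Turkish, the actual token is likely the Turkish word rather than “mountain”:

print(model.wv["mountain"])   # a 50-dimensional vector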

These are the words whose vectors are closest to the word “sports”.
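
Again as a sketch, with the same caveat about the Turkish token:

print(model.wv.most_similar("sports", topn=10))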

SKIP-GRAM VS CBOW

The Skip-gram model is designed to predict the context. For a center word, all 6 context words in the window are predicted, and the loss is calculated based on those 6 predictions. It works well with a small amount of training data and represents even rare words or phrases well.

CBOW learns to predict the center word from the context. For the same 6 context words, only one prediction is made, and the loss is calculated from that single prediction of the center word. It is several times faster to train than Skip-gram and has slightly better accuracy for frequent words.
