# Word2Vec — a baby step in Deep Learning but a giant leap towards Natural Language Processing

**Let’s demystify word2vec with reasoning, examples and mathematics**

**Note from the autho**r : I am extremely happy and overwhelmed that so many people read and found this article useful. I am working on a project related to **ChatBot**. Word2Vec was a part of that bigger project. I will upload that here on my publication (https://medium.com/explore-artificial-intelligence) along with the codes and many more interesting things on AI which I am exploring as well. I don’t want you to miss on any of these, so if you think it’s applicable for you, you might want to follow this publication. With love :)

## Introduction

Word2Vec model is used for learning vector representations of words called “word embeddings”. This is typically done as a preprocessing step, after which the learned vectors are fed into a discriminative model (typically an RNN) to generate predictions and perform all sort of interesting things.

## Why learn word embeddings

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of the individual raw pixel-intensities for image data, thus all the information is encoded in the data hence the relation between various entities in the system like (cats and dogs) can be established.

But, when it comes to natural language processing systems traditionally it treats words as discrete atomic symbols, and therefore ‘cat’ may be represented as Id537 and ‘dog’ as Id143.These encodings are arbitrary, and provide no useful information to the system regarding the relationships that may exist between the individual symbols. This means that the model can leverage very little of what it has learned about ‘cats’ when it is processing data about ‘dogs’ (such that they are both animals, four-legged, pets, etc.).

Representing words as unique, discrete ids furthermore leads to data sparsity, and usually means that we may need more data in order to successfully train statistical models. Using vector representations can overcome some of these obstacles.

Let’s take an example-

The traditional approach to NLP involved a lot of domain knowledge of linguistics itself. Understanding terms such as phonemes and morphemes were pretty standard as there are whole linguistic classes dedicated to their study. Let’s look at how traditional NLP would try to understand the following word.

Let’s say our goal is to gather some information about this word (characterize its sentiment, find its definition, etc). Using our domain knowledge of language, we can break up this word into 3 parts.

We understand that the prefix “un” indicates an opposing or opposite idea and we know that “ed” can specify the time period (past tense) of the word. By recognizing the meaning of the stem word “interest”, we can easily deduce the definition and sentiment of the whole word. Seems pretty simple right? However, when you consider all the different prefixes and suffixes in the English language, it would take a very skilled linguist to understand all the possible combinations and meanings.

Deep learning, at its most basic level, is all about representation learning. Here, we’re going to take the same approach by creating representations of words through large datasets.

## Word Vectors

Let’s represent each word as a d-dimensional vector. Let’s use d = 6. From this sentence, we want to create a word vector for each unique word.

Now let’s think about how to fill in the values. We want the values to be filled in such a way that the vector somehow represents the word and its context, meaning, or semantics. One method is to create a co-occurrence matrix.

A co-occurrence matrix is a matrix that contains the number of counts of each word appearing next to all the other words in the corpus (or training set). Let’s visualize this matrix.

Notice that through this simple matrix, we’re able to gain pretty useful insights. For example, notice that the words ‘love’ and ‘like’ both contain 1’s for their counts with nouns (NLP and dogs). They also have 1’s for the count with “I”, thus indicating that the words must be some sort of verb. With a larger dataset than just one sentence, you can imagine that this similarity will become more clear as ‘like’, ‘love’, and other synonyms will begin to have similar word vectors, because of the fact that they are used in similar contexts.

Now, although this a great starting point, we notice that the dimensionality of each word will increase linearly with the size of the corpus. If we had a million words (not really a lot in NLP standards), we’d have a million by million sized matrix which would be extremely sparse (lots of 0’s). Definitely not the best in terms of storage efficiency. There have been numerous advancements in finding the most optimal ways to represent these word vectors. The most famous of which is Word2Vec.

**Formal Treatment**

Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points (‘are embedded nearby each other’). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories:

- Count-based methods (e.g. Latent Semantic Analysis)
- Predictive methods (e.g. Neural Probabilistic Language Models)

The distinction is -

Count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word.

Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).

Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, these models are similar, except that CBOW predicts target words from source context words, while the skip-gram does the inverse and predicts source context-words from the target words.

We will focus on the skip-gram model in the rest of the discussion.

## The Mathematics at work

Neural Probabilistic language models are traditionally trained using the Maximum Likelihood principle to maximize the probability of the next word wt (for “target”) given the previous words h (for “history”) in terms of a softmax function

where ** score(wt, h)** computes the compatibility of the target word

**with the context**

*wt***(a dot product is commonly used).**

*h*We train this model by maximizing its log-likelihood on the training set. So, let’s maximize the below loss function-

This yields a properly normalized probabilistic model for language modeling.

This same argument can be also shown in a slightly different formulation which clearly shows the choice variable (or the parameter) which is changed to maximize this objective.

Our goal is to find word representations that are useful for predicting the surrounding words given a current word. In particular, we wish to maximize the average log probability across our entire corpus:

This equation essentially says that there is some probability ** p** of observing a particular word that’s within a window of size

**of the current word**

*c***. This probability is conditioned on the current word**

*wt***and some setting of a parameter**

*wt***(determined by our model). We wish to set these parameters**

*theta***so that this probability is maximized across our entire corpus.**

*theta*## Basic Parametrization: Softmax Model

The basic skip-gram model defines the probability p through the softmax function as we had seen earlier. If we consider wi to be a one-hot encoded vector with dimension N and theta to be a N * K matrix embedding matrix where we have N words in our vocabulary and our learned embeddings have dimension K, then we can define-

It is worth noting that after learning, the matrix ** theta** can be thought of as an

**.**

*embedding lookup matrix*In terms of architecture it is a simple three-layered neural network.

- Take a 3 layer neural network. (1 input layer + 1 hidden layer + 1 output layer)
- Feed it a word and train it to predict its neighbouring word.
- Remove the last (output layer) and keep the input and hidden layer.
- Now, input a word from within the vocabulary. The output given at the hidden layer is the ‘word embedding’ of the input word.

This parametrization has a major disadvantage that limits its usefulness in cases of very large corpuses. Specifically, we notice that in order to compute a single forward pass of our model, we must sum across the entire corpus vocabulary in order to evaluate the softmax function. This is prohibitively expensive on large datasets, so we look to alternate approximations of this model for the sake of computational efficiency.

**Improving Computational Efficiency**

For feature learning in word2vec we do not need a full probabilistic model. The CBOW and skip-gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target words (wt) from k imaginary (noise) words ~w, in the same context.

Mathematically, the objective (for each example) is to maximize

This objective is maximized when the model assigns high probabilities to the real words, and low probabilities to noise words. Technically, this is called Negative Sampling, The updates it proposes approximate the updates of the softmax function in the limit. But computationally it is especially appealing because computing the loss function now scales only with the number of noise words that we select (k) and not all the words in the vocabulary (V). This makes it much faster to train. Packages like Tensorflow uses a very similar loss function called noise-contrastive estimation (NCE) loss.

## Intuitive Feel of Skip-gram model

As an example, let’s consider the dataset-

`the quick brown fox jumped over the lazy dog`

We first form a dataset of words and the contexts in which they appear. For now, let’s stick to the vanilla definition and define ‘context’ as the window of words to the left and to the right of a target word. Using a window size of 1, we then have the dataset of (context, target) pairs.

`([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...`

Recall that skip-gram inverts contexts and targets, and tries to predict each context word from its target word, so the task becomes to predict ‘the’ and ‘brown’ from ‘quick’, ‘quick’ and ‘fox’ from ‘brown’, etc.

Therefore our dataset becomes of (input, output) pairs as -

`(quick, the), (quick, brown), (brown, quick), (brown, fox), ...`

The objective function is defined over the entire dataset, but we typically optimize this with with stochastic gradient descent (SGD) using one example at a time (or a ‘minibatch’ of batch_size examples, where typically 16 <= batch_size <= 512). So let’s look at one step of this process.

Let’s imagine at training step we observe the first training case above, where the goal is to predict ** the** from

**. We select**

*quick***number of noisy (contrastive) examples by drawing from some noise distribution, typically the unigram distribution (The unigram posits that each word occurrence is independent of all other word occurrences. i.e. we can think of the generation process as a sequence of dice rolls as an example.),**

*num_noise*

*P(w)*For simplicity let’s say num_noise=1 and we select ** sheep** as a noisy example. Next we compute the loss for this pair of observed and noisy examples, i.e. the objective at time step ‘t’ becomes -

The goal is to make an update to the embedding parameters ** theta** to maximize this objective function. We do this by deriving the gradient of the loss with respect to the embedding parameters

*theta*We then perform an update to the embeddings by taking a small step in the direction of the gradient. When this process is repeated over the entire training set, this has the effect of ‘moving’ the embedding vectors around for each word until the model is successful at discriminating real words from noise words.

We can visualize the learned vectors by projecting them down to 2 dimensions. When we inspect these visualizations it becomes apparent that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another.

## My Next Blog

An Introduction to Recurrent Neural Networks

## Sources

- Tensorflow implementation of Word2Vec
- Distributed Representations of Words and Phrases and their Compositionally — A research paper by Thomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean
- A practical guide to word2vec by Aneesh Joshi
- Natural Language Processing Review by Adit Deshpande
- Language Model by Rohan Verma