Word vector representations — Word2Vec

Natural language processing (NLP) is a way for computers to understand, analyze, and derive meaning from human language in a smart and useful manner. In the last few years, the field has been shifting gears from traditional statistical methods to neural network models. Deep neural networks have emerged as a one-stop solution for dozens of language processing tasks, achieving state-of-the-art results on specific problems such as speech recognition.
A basic and important building block for all language processing tasks is creating word representations that capture their meanings, semantic relationships, and the different contexts they are used in. One simple and straightforward way is to represent words as one-hot vectors. In vector space terms, this corresponds to a vector with a single 1 (at the position of the represented word) and zeros for all other words in the vocabulary.
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
One-hot encoding for some word, e.g. hotel
However, there are a few issues with this representation:
Data sparsity
Notice that this vector encoding holds meaningful data at only a single position per word. Considering we have a vocabulary of 1 million words (and that's the norm in the natural language processing domain), representing the whole vocabulary would take a 1 million x 1 million matrix filled mostly with zeros. That is far too inefficient in terms of storage and computation cost.
Localist representation
Secondly, with one-hot encoding we represent a word independently of any context, so it gives no inherent notion of the relationship between different words. For example, this representation fails to find any similarity between “Dell notebook battery size” and “Dell laptop battery capacity”.
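To make the problem concrete, here is a minimal sketch (using a tiny made-up vocabulary) showing that any two distinct one-hot vectors have a dot product of zero, so no similarity between words like notebook and laptop can be read off from them:

```python
import numpy as np

# A toy vocabulary, purely for illustration.
vocab = ["dell", "notebook", "laptop", "battery", "size", "capacity"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

# Distinct one-hot vectors never share a non-zero position,
# so their dot product (and cosine similarity) is always 0.
print(one_hot("notebook") @ one_hot("laptop"))  # 0.0
```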
Instead, we need an approach where the representations of words encode their meanings in such a way that we can read off similarity directly from the representations. We can achieve this by using a concept widely applied in the field of natural language processing — distributional similarity.
Distributional Similarity based representations
It simply derives from the fact that we can get a lot of value for representing the meaning of a word by looking at the contexts in which it appears.

For example, if I want to know what the word drive means, I am going to collect a whole lot of text containing the word drive and somehow use the words appearing around it (schedule, financing, Golf, recent, test, wheel, etc.) to represent the meaning of drive. To do so, we’ll build a vector for each word such that it is good at predicting the other words in its context. Each of those other words will, in turn, also be represented by vectors. This way, we can measure the similarity between any two words with just a dot product of their vectors.
In general, we need a model that can predict the probability of a context word given a target word. Any context window size will do, depending upon the size of the text corpus available. Our final goal is to keep adjusting the word representations until a loss function defined over the model is minimized. There is one model in particular that serves this purpose — Word2Vec.
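To make the notion of “context” concrete, here is a small sketch (the sentence, tokenization and window size are just illustrative choices) of how (target, context) training pairs can be generated from text with a fixed window:

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs within `window` words of each target."""
    for t, target in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                yield target, tokens[c]

sentence = "the recent test drive of the new golf was scheduled".split()
print([ctx for tgt, ctx in context_pairs(sentence) if tgt == "drive"])
# ['recent', 'test', 'of', 'the']
```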
Word2Vec
The emphasis of the Word2Vec model is on building a simple and scalable model that we can run over an exceedingly large corpus of text to build good word representations. There are two algorithms for this, one more popular than the other. As skip-gram is the one used in practice, we’ll focus mostly on that:
- Skip-grams
- Continuous Bag of Words (CBOW)
Skip-grams

The idea of the skip-gram model is to choose a target word and then predict the words in its context up to some window size. It does this by maximizing the probability of each word appearing in the context (within the specified window) given the target word. This can be represented mathematically by the following objective function
L(θ) = ∏_{t=1..T} ∏_{−k ≤ j ≤ k, j ≠ 0} P(w_{t+j} | w_t ; θ)
where k is the window size (number of context words before and after the target word), t is the index of the target word in the corpus, T is the length of the text, and θ (theta) is the set of vector representations of the words. In essence, this objective function, taking each word in the text as a target word, chooses the parameters so as to maximize the probability of the surrounding context words.
In practice, whenever we are maximizing probabilities it is almost always better to take the log, since then we deal with summations instead of products, which makes life a lot easier. So, after applying the log, our new objective function becomes
J(θ) = log L(θ) = ∑_{t=1..T} ∑_{−k ≤ j ≤ k, j ≠ 0} log P(w_{t+j} | w_t ; θ)
Just to fit into the realm of machine/deep learning, where the preference is to minimize rather than maximize, we take the negation of its average over the whole corpus, i.e. we minimize −J(θ)/T. Now the only thing left is to come up with a probability distribution for the context word given the target word. We can represent these probabilities mathematically as follows
P(c | t) = exp(u_c · v_t) / ∑_{w=1..V} exp(u_w · v_t)
where V (the upper limit of the sum) is the size of the vocabulary, c and t are the indices of the context and target words in the vocabulary respectively, u is the parameter matrix associated with context words, and v_t is the vector representation of the target word. Here, the dot product measures the degree of similarity between two word vectors, while exponentiation forces the result to be positive; the normalization (softmax) then converts these numbers into probabilities. So far so good, however there is one small issue lurking in the form of high computational cost. Notice that for each context-target pair we must sum across the entire vocabulary (the denominator) to compute the softmax probabilities. Clearly, for a very large vocabulary this is highly inefficient (a short sketch of the computation follows the list below). In practice, this approach is almost always combined with other techniques to make it efficient. Two such techniques are:
- Hierarchical softmax
- Negative sampling
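As a quick illustration of the cost, here is a minimal sketch of the full-softmax probability above (the embedding matrices and sizes are made up for illustration); note that the denominator requires a dot product with every word in the vocabulary for every single (context, target) pair:

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 10_000, 100          # toy sizes, for illustration only

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))  # context-word vectors u
V = rng.normal(scale=0.1, size=(VOCAB_SIZE, EMBED_DIM))  # target-word vectors v

def softmax_prob(context_idx, target_idx):
    """P(context | target) using the full softmax over the vocabulary."""
    scores = U @ V[target_idx]   # one dot product per vocabulary word: O(VOCAB_SIZE)
    scores -= scores.max()       # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_idx] / exp_scores.sum()

print(softmax_prob(context_idx=42, target_idx=7))
```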
In practice, negative sampling generally outperforms its counterpart, hierarchical softmax, so that is the one we’ll explore here.
Negative Sampling
With negative sampling, along with each (context, target) pair, we randomly select a small set of k words from our corpus that do not appear within the context of the given target word. We then train the model to predict a high probability for the one positive context word and low probabilities for the k negative words when fed the target word vector. Hence, at each iteration we only update the network for these k + 1 samples.
The research says that selecting 5–20 negative words works well for smaller datasets, while 2–5 words is good enough for larger datasets.
Hence, our modified objective function becomes
J = log σ(u_c · v_t) + ∑_{i=1..k} log σ(−u_{w_i} · v_t),  with the w_i drawn from a noise distribution P(w)
Notice that this objective (σ denotes the sigmoid function) is maximized when the probability of the true context word is high and the probabilities of the negative samples are low.
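A minimal sketch of this loss for a single training pair (the embedding matrices and the sampled negative indices below are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(U, V, target_idx, context_idx, negative_idxs):
    """Negation of the negative-sampling objective for one (target, context) pair."""
    v_t = V[target_idx]
    positive = np.log(sigmoid(U[context_idx] @ v_t))            # the true context word
    negatives = np.log(sigmoid(-U[negative_idxs] @ v_t)).sum()  # the k sampled negatives
    return -(positive + negatives)   # minimize this with gradient descent

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(10_000, 100))
V = rng.normal(scale=0.1, size=(10_000, 100))
print(negative_sampling_loss(U, V, target_idx=7, context_idx=42,
                             negative_idxs=[5, 99, 123, 4567, 8910]))
```

Only the k + 1 rows of U touched above (plus the single target row of V) receive gradient updates, which is what makes this so much cheaper than the full softmax.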
How to choose negative samples
One way is to sample according to the empirical frequency of words in our text corpus. The issue with this approach is that we end up over-representing words like ‘the’, ‘and’, ‘of’. At the other extreme, we could sample words uniformly at random with probability 1/VOCAB_SIZE, but that would not reflect the actual distribution of English words either. In practice, what is found to work best is to generate negative samples according to the following distribution
P(w_i) = f(w_i)^{3/4} / ∑_{j=1..V} f(w_j)^{3/4}
where V is the vocabulary size, f(w_i) is the observed frequency of a particular word, and P(w_i) is the probability of selecting the word w_i as a negative sample. Raising the frequency to the 3/4 power dampens the dominance of very frequent words while still sampling them more often than rare ones. Using this, we can generate negative samples good enough for learning word representations that reflect the contexts in which words are used. Later, these word embeddings can be used for a bunch of other natural language processing tasks such as named entity recognition, machine translation, sentiment analysis and language modelling.
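A minimal sketch of this sampling scheme (the word counts are made up for illustration):

```python
import numpy as np

# Hypothetical unigram counts collected from a corpus.
word_counts = {"the": 50_000, "and": 30_000, "drive": 800, "golf": 150, "battery": 120}
words = list(word_counts)
freqs = np.array([word_counts[w] for w in words], dtype=float)

# Raise frequencies to the 3/4 power, then renormalize into probabilities.
probs = freqs ** 0.75
probs /= probs.sum()
print(dict(zip(words, probs.round(3))))

# Draw k = 5 negative samples according to this distribution.
rng = np.random.default_rng(0)
print(rng.choice(words, size=5, p=probs))
```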
The simplest implementation of this model can be found over here.
Please let me know in the comments of any improvements or modifications this article could accommodate.

