Paper Summary: Distributed Representations of Words and Phrases and their Compositionality

Mike Plotz
5 min read · Nov 23, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/13.

Distributed Representations of Words and Phrases and their Compositionality (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

I started reading the fastText paper yesterday and noticed that it kept referring back to Mikolov’s work. Specifically the “Distributed Representations of Words and Phrases” paper, rather than the “Efficient Estimation of Word Representations” paper I summarized yesterday. So I went back and read that one, and it turns out that much of what we’ve come to know as word2vec is contained and explained there. Not to mention that some details (hierarchical softmax, specifics of the log-linear models) are fleshed out a bit more, in ways I found helpful. So here it is, from my perspective — it should be pretty short, since there’s considerable overlap and I’ll be focusing on the delta.

This paper is in many ways just a grab bag of disparate methods that the authors found improved their skip-gram model:

  • Sub-sample frequent words to improve speed (~2–10x faster) as well as accuracy on rarer words
  • Negative sampling as a simpler alternative to hierarchical softmax
  • Identifying phrases (e.g. “Air Canada” vs. “Air” and “Canada”) and treating them as single tokens
  • An interpretation of word vector addition (not really a method — they just needed a place to put an interesting idea, I guess)

First up, a review of the skip-gram model from the first word2vec paper, which I found clearer this time around, not least because they now write down the objective function to be maximized:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

This is the average log probability of the context words given the central word. Here T is the size of the corpus, the w_t are the words, and c is (half) the size of the training context. The probability p is the output of a (hierarchical) softmax over word similarities. Similarities isn’t quite right — maybe affinities is a better term. The idea is that each word w has an “input” and an “output” vector (v_w and v′_w), and the affinity of word A with word B is the dot product of A’s input vector with B’s output vector.
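To make that concrete, here’s a minimal NumPy sketch of the full-softmax version of this probability (the hierarchical softmax in the paper exists precisely to avoid the sum over the whole vocabulary). `V_in` and `V_out` are hypothetical matrices of input and output vectors, one row per vocabulary word:

```python
import numpy as np

def skipgram_log_prob(center_id, context_id, V_in, V_out):
    """log p(context | center) under a full softmax over the vocabulary."""
    # Affinity of the center word's "input" vector with every "output" vector.
    scores = V_out @ V_in[center_id]
    # Numerically stable log-softmax over all vocabulary words.
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    return scores[context_id] - log_z
```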

Negative sampling. This is a simplification of something called Noise Contrastive Estimation (NCE, Gutmann and Hyvärinen 2012), the idea being that you should be able to distinguish positive examples from negative examples using logistic regression (a.k.a. binary classification) rather than picking out the correct class from the entire vocabulary. Thus avoiding hierarchical softmax entirely. This is done by sampling k times from a “noise distribution” over the vocabulary and training the model to pick out the correct word (k was ~2–20, with larger k for smaller datasets). NCE additionally uses the numerical probabilities of the noise distribution for some statistical guarantees, but the simpler negative sampling approach turned out to be sufficient. One interesting point is that they used the ¾ power of the unigram distribution for negative samples, which was empirically better than other distributions they tried — the thing I’m curious about is how they thought to try this particular distribution and what the search process looked like. I’m reasonably sure I wouldn’t have been able to land on this result.
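Here’s a rough sketch of one negative sampling step, using the same hypothetical `V_in` / `V_out` matrices as above and an array of raw unigram counts for the noise distribution (the ¾ power is the paper’s choice; the shapes and names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id, V_in, V_out, unigram_counts, k=5):
    # Noise distribution: unigram counts raised to the 3/4 power, renormalized.
    noise = unigram_counts ** 0.75
    noise = noise / noise.sum()
    neg_ids = rng.choice(len(noise), size=k, p=noise)

    v_center = V_in[center_id]
    # One logistic term for the true context word, k terms for the noise words.
    positive = np.log(sigmoid(V_out[context_id] @ v_center))
    negative = np.log(sigmoid(-V_out[neg_ids] @ v_center)).sum()
    return -(positive + negative)  # minimize this, i.e. maximize the log likelihood
```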

Frequent word subsampling. This is one of those obvious-in-retrospect ideas, to me. So you see ten times as many “the”s as anything else — maybe you want to spend your time predicting more interesting words. Makes sense, but the implementation details are a bit more complicated and interesting, as usual. They do the subsampling by discarding training words with probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where t is a threshold (~1e-5) and f(w_i) is the frequency of word w_i. What’s going on here? It seems simple, but it’s actually a bit hard for me to visualize. One thing to notice is that the expression goes negative for frequencies rarer than t, which I’m interpreting as clamping the discard probability at 0 there. In other words, we never throw away rare words. We do throw away common words, though: 99% of f = 0.1 words are tossed, as are 90% of f = 0.001 words. The authors claim that this preserves frequency ranking — this seems true, but I’d have to do a bit of work to convince myself.
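A tiny sketch of the keep/discard decision, just to make those numbers concrete (the clamping at 0 for rare words is my reading, as noted above):

```python
import numpy as np

def keep_probability(freq, t=1e-5):
    # Discard probability is 1 - sqrt(t / f); clamping it at 0 for rare words
    # means we keep a word with probability min(1, sqrt(t / f)).
    return np.minimum(1.0, np.sqrt(t / freq))

keep_probability(0.1)    # ~0.01 -> 99% of very frequent words are tossed
keep_probability(0.001)  # ~0.1  -> 90% tossed
keep_probability(1e-6)   # 1.0   -> rarer than t, always kept
```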

Phrases. There’s a whole literature on this, apparently. The authors choose a quick and dirty scheme that seems to work pretty well: do a pass over the training data scoring bigrams by the ratio of the bigram count to the product of the unigram counts, with a tunable discounting coefficient (to keep from pairing up infrequent words that happen to land side by side). Bigrams that score above a threshold are combined into single tokens. Doesn’t this restrict us to two-word phrases, you ask? Yup. So we just run the thing again a few times — see the sketch below. Empirically this seems to work well. E.g. San Jose : San Jose Mercury News :: Cincinnati : Cincinnati Enquirer. And even better, Belgium : Brussels Airlines :: Greece : Aegean Airlines.
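Here’s a sketch of one merging pass under my reading of that scoring rule; the discount `delta` and the `threshold` are placeholder values (the paper runs a few passes and tunes the threshold):

```python
from collections import Counter

def merge_phrases(sentences, delta=5, threshold=1e-4):
    """One pass: score bigrams by (count(a,b) - delta) / (count(a) * count(b))
    and glue high-scoring pairs into a single "a_b" token."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                score = (bigrams[a, b] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + "_" + b)  # treat the bigram as one token
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        merged.append(out)
    return merged

# Run it again on its own output to pick up three- and four-word phrases.
```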

The authors finish with a few words about “additive compositionality.” The question is, how should one interpret the sum of two word vectors? Empirically we get results like vec(“French”) + vec(“actress”) ≈ vec(“Juliette Binoche”), but why? The authors argue that a word’s vector represents the distribution of the word’s context, and since these are log probabilities (by virtue of the objective function’s structure), adding vectors is like multiplying probabilities. So this acts like an AND function. As the authors put it, “… if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors will result in such a feature vector that is close to the vector of “Volga River”.” I’d actually put it a bit more strongly, since we’d get the same effect even if the three terms never appear together.
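If you want to poke at this yourself, gensim’s KeyedVectors exposes exactly this kind of vector arithmetic (the file path below is a placeholder for the pretrained GoogleNews vectors, which were trained with the phrase tokens described above):

```python
from gensim.models import KeyedVectors

# Placeholder path: the pretrained GoogleNews word2vec vectors, binary format.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Additive composition: nearest neighbors of vec("French") + vec("actress").
print(vectors.most_similar(positive=["French", "actress"], topn=5))
```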

Next up, fastText, which builds on some of the ideas here while also looking at the characters inside words to take advantage of morphology.

Gutmann and Hyvärinen 2012 “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics” http://www.jmlr.org/papers/v13/gutmann12a.html
