Paper Summary: Efficient Estimation of Word Representations in Vector Space (word2vec part 1)

Mike Plotz
Nov 19, 2018

--

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/12.

Efficient Estimation of Word Representations in Vector Space (2013) https://arxiv.org/abs/1301.3781 Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

This is the famous word2vec paper. The now-familiar idea is to represent words in a continuous vector space (here 20–300 dimensions) that preserves linear regularities such as differences in syntax and semantics, allowing fun tricks like computing analogies via vector addition and cosine similarity: king − man + woman = _____. This is actually not a new idea (Hinton’s “Distributed Representations” is from 1986!). The word2vec paper is notable for its implementation details and performance as much as for any new conceptual ideas.
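To make the vector-arithmetic trick concrete, here’s a minimal sketch (the `vectors` dict and the `analogy` helper are my own illustration, not anything from the paper): the answer to an analogy is just the vocabulary word whose vector is most cosine-similar to vec(king) − vec(man) + vec(woman).

```python
import numpy as np

def analogy(vectors, a, b, c, topn=1):
    """Given word vectors in a {word: np.ndarray} dict, return the word(s)
    closest (by cosine similarity) to vec(b) - vec(a) + vec(c), e.g.
    analogy(vectors, 'man', 'king', 'woman') should ideally give 'queen'."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        cos = vec @ target / np.linalg.norm(vec)
        scores.append((cos, word))
    return [w for _, w in sorted(scores, reverse=True)[:topn]]
```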

The method is an unsupervised one, in the sense that it relies only on natural language corpora and doesn’t require labeled data. This is possible due to the distributional hypothesis, the observation that words that show up in similar locations have similar meanings or senses. The paper explores a few approaches in turn with an eye towards minimizing computational complexity, starting with neural language models (NNLM and the recurrent RNNLM) and introducing the log-linear models continuous bag-of-words (CBOW) and skip-gram.

Before I get to the CBOW and skip-gram models I’ll mention a detail I wasn’t aware of before reading this paper: the size of the vocabulary can be quite large (datasets in this paper had V of 30K, 82K, and 1M), which, if implemented in the naive way, causes the output layer to become a bottleneck. So the authors use hierarchical softmax, representing the vocabulary as a Huffman binary tree (more common words being closer to the root), reducing the per-prediction output cost from O(V) to roughly O(log(V)). A little poking around reveals that there’s a newer approach called adaptive softmax, and there’s lots more on softmax approximations on Sebastian Ruder’s blog.
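As a sketch of just the tree-building half of this (my own toy code, not the word2vec implementation; during training each inner node of the tree gets its own vector, and a word’s probability is a product of sigmoid decisions along its root-to-leaf path), here’s how a Huffman code assigns short paths to frequent words:

```python
import heapq

def huffman_codes(freqs):
    """freqs maps word -> corpus count. Frequent words end up near the root,
    so their codes (root-to-leaf paths) are short and the expected number of
    binary decisions per prediction is well under log2(V)."""
    heap = [(count, i, [word]) for i, (word, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    codes = {word: "" for word in freqs}
    uid = len(heap)  # tie-breaker so tuples never compare the word lists
    while len(heap) > 1:
        c1, _, words1 = heapq.heappop(heap)
        c2, _, words2 = heapq.heappop(heap)
        for w in words1:
            codes[w] = "0" + codes[w]  # left branch
        for w in words2:
            codes[w] = "1" + codes[w]  # right branch
        heapq.heappush(heap, (c1 + c2, uid, words1 + words2))
        uid += 1
    return codes

print(huffman_codes({"the": 100, "cat": 10, "sat": 8, "zygote": 1}))
# -> {'the': '1', 'cat': '01', 'sat': '001', 'zygote': '000'}
```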

The log-linear models. The idea here is to save a bunch of compute by removing the large fully-connected layers of the NNLM (I’m skipping over some stuff, see the paper for details) and using a simpler log-linear model instead to learn word vectors. An NNLM can then be trained on top of the vector representations.
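Roughly, “log-linear” here means there’s no hidden nonlinearity at all: an embedding lookup, a projection, a linear output layer, and a softmax. A minimal CBOW-flavored sketch, with made-up sizes and my own variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 10_000, 300                    # vocabulary size, embedding dim (illustrative)
W_in = rng.normal(0, 0.01, (V, D))    # input word vectors (the ones we keep)
W_out = rng.normal(0, 0.01, (V, D))   # output-side weights

def log_linear_probs(context_ids):
    """Distribution over the vocabulary for the center word, CBOW-style:
    average the context vectors (the projection), then a linear layer plus
    softmax -- no hidden nonlinearity anywhere."""
    h = W_in[context_ids].mean(axis=0)   # (D,) averaged projection
    logits = W_out @ h                   # (V,) one score per vocab word
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

probs = log_linear_probs([12, 407, 8003, 15])  # made-up context word ids
```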

The first model is CBOW, which is trained by predicting a single word from the 8 words that surround it (4 before and 4 after). It’s called a bag-of-words model because the 8 context word vectors are averaged, which throws away information about word order. The other model is skip-gram, which is sort of like the reverse of CBOW; given a single central word it predicts the words within a range of the central word (the range is chosen randomly up to a certain value, say 5 — this variability effectively upweights words that are nearby). My intuition says that this makes more efficient use of proximity information, updating the central word’s vector representation with information from multiple words.
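Here’s a hedged sketch of how the two models slice a sentence into training examples (my own illustration; the window of 4 and the max skip-gram range of 5 follow the numbers above): CBOW averages the surrounding words to predict the center word, while skip-gram draws a random range R per position and emits (center, context) pairs, so nearby words end up in more pairs.

```python
import random

def cbow_examples(tokens, window=4):
    """Yield (context words, center word) pairs; the context vectors get
    averaged, which is what makes this a bag of words."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, center

def skipgram_examples(tokens, max_range=5):
    """Yield (center, context) pairs with a per-position random range R in
    [1, max_range]; sampling R like this effectively upweights close words."""
    for i, center in enumerate(tokens):
        r = random.randint(1, max_range)
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(cbow_examples(sentence))[:2])
print(list(skipgram_examples(sentence))[:5])
```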

[Update 2018/11/13: there’s a more detailed explanation of the log-linear model they used in the followup paper — summary coming today.]

For a word relationship task, the skip-gram approach does best overall, though CBOW does a bit better on syntactic (vs semantic) relationships. On sentence completion, training an RNNLM starting with skip-gram word vectors performs best. The paper also includes some details about training and parallelization, as well as showing examples of word pair relationships. (An example: given the relationship Einstein : scientist, we can retrieve Messi: midfielder; Mozart: violinist; Picasso: painter. These are apt, though to varying degrees.)

--