Paper Summary: Skip-Thought Vectors

Mike Plotz
4 min read · Nov 26, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/17, with better formatting.

Skip-Thought Vectors (2015) Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, Sanja Fidler

This paper is a bit of a step backwards in time compared to yesterday’s ELMo paper, but I was curious about how word vectors can be aggregated into larger structures, and the idea of a sentence representation is intriguing. Especially since one of the current limits in NLP seems to be understanding very long-term dependencies (e.g. across chapters) and large-scale structure, and the word → sentence jump is a small part of that. There aren’t too many surprises here if you’ve been following the NLP papers in this series, but there is a new approach to the out-of-vocabulary problem that I hadn’t seen before.

Composing words into sentences, or rather word representations into sentence representations, isn’t a new idea, and lots of approaches have been tried. Rather than take an opinionated stance on how to do this composition, this paper takes a step back and tries an approach (called Skip-Thoughts) that’s orthogonal to the specifics of how sentence representations are formed. The idea here is to apply a version of the word2vec skip-gram objective at the sentence level. So this is more about a new way of getting a training signal from unlabeled data than about composing sentences per se. To evaluate the results unambiguously, the sentence representations are used downstream as features in a simple linear model. It’s possible to do something more sophisticated, but this paper isn’t focused on results at the cost of clarity, and I for one appreciate that.

The authors use an encoder-decoder framework with a standard GRU encoder and a conditional GRU decoder — more on what this means in a moment. (GRU, by the way, is like LSTM but a bit simpler, and empirically it performs about as well. It’s possible to use LSTMs here too.) Following the skip-gram idea, we focus on a sentence tuple (s_{i-1}, s_i, s_{i+1}); the encoder encodes the middle sentence s_i and the decoder predicts the other two sentences — there are actually two decoders, one for s_{i-1} and one for s_{i+1}. The training objective maximizes the sum of the log probabilities of the words in the surrounding sentences, conditioned on the encoder’s representation of s_i.
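To make the setup concrete, here’s a minimal PyTorch-style sketch of the objective. This is my own toy reconstruction, not the authors’ code: it conditions each decoder by using the encoder’s final state as the decoder’s initial hidden state (a simplification of the paper’s per-gate conditioning, which we’ll get to next), and all the sizes and names are illustrative.

```python
# A toy sketch of the skip-thought objective, not the authors' code. One GRU
# encodes the middle sentence; two GRU decoders predict the previous and next
# sentences, each conditioned (here, via its initial hidden state) on the
# encoding. The loss is the negative log-likelihood of the surrounding words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipThoughts(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, s_mid, s_prev, s_next):
        _, h = self.encoder(self.emb(s_mid))  # h: (1, batch, hid_dim)
        loss = 0.0
        for dec, target in ((self.dec_prev, s_prev), (self.dec_next, s_next)):
            # Teacher forcing: feed target[:, :-1], predict target[:, 1:].
            dec_out, _ = dec(self.emb(target[:, :-1]), h)
            logits = self.out(dec_out)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target[:, 1:].reshape(-1))
        return loss

# Toy usage with random token ids. (The paper uses a 20k vocabulary, 620-d word
# embeddings, and a 2400-d encoder state; these numbers are just small enough
# to run quickly.)
model = SkipThoughts(vocab_size=1000, emb_dim=64, hid_dim=128)
rand_sent = lambda: torch.randint(0, 1000, (2, 8))
model(rand_sent(), rand_sent(), rand_sent()).backward()
```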

So what is this “conditional” GRU? The best way to talk about this might be to refer directly to the decoder equations in the paper:
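Transcribing them here (for the s_{i+1} decoder) so the next paragraph has something to point at; superscript d marks decoder-specific weights, ⊙ is element-wise multiplication, and h_i is the encoder’s representation of sentence i:

```latex
\begin{aligned}
r^t &= \sigma\left(W_r^d\, x^{t-1} + U_r^d\, h^{t-1} + C_r\, h_i\right) \\
z^t &= \sigma\left(W_z^d\, x^{t-1} + U_z^d\, h^{t-1} + C_z\, h_i\right) \\
\bar{h}^t &= \tanh\left(W^d\, x^{t-1} + U^d\, (r^t \odot h^{t-1}) + C\, h_i\right) \\
h_{i+1}^t &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t
\end{aligned}
```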

This is mostly just standard GRU fare: x^{t-1} is the previous input word embedding and h^{t-1} is the previous decoder hidden state. The interpretations of r, z, and h-bar are “reset gate”, “update gate”, and “candidate hidden state”, respectively. Confusingly there’s also an h_i, which is the hidden state of the encoder GRU (so watch your subscripts). This h_i comes in as a conditioning signal that, when transformed by C_r, C_z, and C, biases the decoder in ways that depend on the input sentence. (The best explanation of this kind of conditioning signal I’ve seen is in the Feature-wise transformations Distill paper.) Note that this is the only path for information from the encoder to make it to the output of the network. And as mentioned, the objective maximizes the sum of the log probabilities of the words in the previous and following sentences.

I alluded to a solution to the out-of-vocabulary problem earlier, an approach the authors call vocabulary expansion. The problem is that the training data doesn’t include rare words, and the authors also limited the vocabulary size to 20k for performance reasons. One thing to keep in mind is that all we need is for the encoder to do a reasonable job of representing these OOV words — we don’t actually need the decoder to predict OOV words. The vocabulary expansion approach is to take a much larger (~900k word) CBOW word2vec model and learn a linear mapping from the word2vec space to the encoder GRU’s embedding space. The results? The nearest neighbors to the OOV word “Tupac” are “2Pac”, “Cormega”, “Biggie”, “Gridlock’d”, “Nas”, “Cent”, and “Shakur”. Seems legit.
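Mechanically the mapping is about as simple as it sounds. Here’s a hedged numpy sketch of one way to fit it: ordinary least squares over the words the two vocabularies share (the function and variable names are mine, not the paper’s).

```python
# A sketch of vocabulary expansion: learn a linear map W from word2vec space to
# the encoder's embedding space using the words both models share, then project
# every word2vec embedding through W. Shapes and names are illustrative.
import numpy as np

def expand_vocabulary(w2v, rnn_emb):
    """w2v: dict word -> word2vec vector (large vocab, ~900k words, e.g. 300-d).
    rnn_emb: dict word -> encoder embedding (small vocab, ~20k words, e.g. 620-d).
    Returns a dict covering the full word2vec vocabulary in encoder space."""
    shared = [w for w in rnn_emb if w in w2v]
    X = np.stack([w2v[w] for w in shared])      # (n_shared, 300)
    Y = np.stack([rnn_emb[w] for w in shared])  # (n_shared, 620)
    # Un-regularized least squares: W minimizes ||X @ W - Y||^2.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return {w: v @ W for w, v in w2v.items()}
```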

The authors note that initializing with pretrained embeddings (and using hierarchical softmax or similar) was a possible alternative to vocabulary expansion, as was using character-level input.

The evaluation was done using simple linear models that treat Skip-Thought vectors as feature extractors. The tasks were semantic relatedness (based on a dataset (appropriately?) called SICK, where humans rated how similar sentences were to each other), paraphrase detection (which seems like a similar kind of relatedness task), and image-sentence ranking. This last deserves a mention, since I’ve always found this kind of cross-media task a bit mind-blowing (indeed, one of the motivations for going so deep into image and language model papers was to build up to the DeViSE paper, summary forthcoming). The idea here was to pick out the correct image caption by extracting image features (they used OxfordNet, a.k.a. VGG-19) and skip-thought sentence features and putting them all into a linear model. Another thing of note: they use the same negative sampling method as Mikolov 2013.
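For the relatedness and paraphrase tasks, the recipe is roughly: encode both sentences with the frozen model, build simple component-wise pair features (the paper uses the absolute difference and the product of the two vectors), and fit an off-the-shelf linear model on top. A hedged scikit-learn sketch, with my own placeholder names and a plain logistic regression standing in for the paper’s exact classifier:

```python
# Using the frozen skip-thought encoder as a feature extractor for sentence-pair
# tasks: combine a pair of vectors (u, v) into [|u - v|, u * v] and fit a simple
# linear classifier. encode() stands in for the trained encoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    # Component-wise absolute difference and product, concatenated.
    return np.concatenate([np.abs(u - v), u * v], axis=-1)

def evaluate_pairs(encode, sent_pairs, labels):
    X = np.stack([pair_features(encode(a), encode(b)) for a, b in sent_pairs])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X, labels)  # in practice, score on a held-out split
```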

Finally, they also test several classification tasks. Here the simple linear model doesn’t beat the baselines and performs worse than methods that learn task-specific sentence representations, but combining skip-thought and bag-of-words features is competitive.
