Paper Summary: Learned in Translation: Contextualized Word Vectors

Mike Plotz
5 min read · Nov 24, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/14, with better formatting.

Learned in Translation: Contextualized Word Vectors (2017), by Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher

McCann et al noticed that lots of folks were getting good results by pre-training CNN models on ImageNet and using the resulting network on other image tasks (something the cognoscenti like to call transfer learning), while everyone over in NLP land seemed perfectly happy with single-layer unsupervised word vectors. The authors suspected that there might be an analogous pre-training task that would train multi-layer LSTMs in a way that transfers usefully. So they fired up their latest and greatest word vector models and did a nearest neighbor search for vec(“ImageNet”) − vec(“CNN”) + vec(“LSTM”) and found “machine translation”. Just kidding, I assume they just thought about it real hard or something. Whatever their search process was, once they thought of machine translation (MT) they realized it had a pretty good chance of working, since the task requires some way of putting words in their contexts. Plus, large MT datasets are available.

The basic approach here is to pre-train a 2-layer bidirectional LSTM for attentional sequence-to-sequence (in this case English to German) translation, starting from pretrained GloVe word vectors. And then take the output of the sequence encoder, call it a CoVe (context vector), concatenate it with the GloVe vectors, and use that in a task-specific model (for classification or question answering). The rest is just details.
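To make the shapes concrete, here's a minimal PyTorch sketch of the encoder half (my reading, not the authors' code). The 300-d GloVe inputs and 300-d-per-direction hidden size are assumptions, chosen so the bidirectional output matches the 600-d CoVe dimensionality mentioned further down; the class name is mine.

```python
import torch
import torch.nn as nn

class MTLSTMEncoder(nn.Module):
    """Sketch of the MT-LSTM: a 2-layer bidirectional LSTM over GloVe vectors."""
    def __init__(self, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, glove_vectors):      # (batch, seq_len, 300)
        H, _ = self.lstm(glove_vectors)    # (batch, seq_len, 600)
        return H                           # feeds the decoder during MT training;
                                           # reused as CoVe afterwards
```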

Still here? Ok. But there are a lot of details. And I won’t be going into how GloVe works (that’s another paper that I might not get to — maybe try this explanation), nor will I go into why attention does its thing (though I’ll cover that next, on a simpler paper). So there’s not much in terms of intuition pumps from here on out and it’ll be terse. Don’t say I didn’t warn you.

So, the MT model. Start with a word sequence w (this is a departure from earlier papers I’ve summarized, which use w for a single word) and GloVe(w), a sequence of word vectors. Input these into a 2-layer bidirectional LSTM (bidirectional meaning actually two LSTMs, one looking at the forward sequence, one looking at the reversed sequence; the outputs are concatenated). This is the encoder, or MT-LSTM (which we’ll use later). Encoder output H is fed into the decoder, a 2-layer unidirectional LSTM with softmax attention over H. At timestep t the decoder generates its hidden state h_t^dec from the embedding z_{t−1} of the previous target word, the “context-adjusted” hidden state (this just means it uses the encoder activations H, weighted by attention, transformed, and squashed by tanh), and the previous hidden state h_{t−1}^dec. The decoder outputs the softmax of yet another transform of the context-adjusted hidden state. (Said I’d be terse, wasn’t kidding.)
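In case the prose version is too terse, here's how I read the attention part of one decoder step, as a hedged PyTorch sketch. The linear layers W1, W2, and W_out are illustrative stand-ins (W1 has to map the decoder's hidden size to 600 so the scores line up with H), and the shapes are my assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_attention_step(H, h_dec, W1, W2, W_out):
    # H: encoder outputs, (batch, src_len, 600); h_dec: decoder hidden state, (batch, dec_dim)
    scores = torch.bmm(H, W1(h_dec).unsqueeze(2))                  # (batch, src_len, 1)
    alpha = F.softmax(scores, dim=1)                               # attention weights over source positions
    context = torch.bmm(alpha.transpose(1, 2), H).squeeze(1)       # attention-weighted sum of H, (batch, 600)
    h_tilde = torch.tanh(W2(torch.cat([context, h_dec], dim=1)))   # "context-adjusted" hidden state
    return F.log_softmax(W_out(h_tilde), dim=1)                    # distribution over target words
```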

The output of the MT-LSTM (the encoder above) is the CoVe of the input w. In practice they actually used the concatenation [GloVe(w); CoVe(w)] for subsequent tasks.
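Concretely, building the task inputs might look like the snippet below (a sketch under my assumptions: the function name is mine, and treating the MT-LSTM as a frozen feature extractor is a choice the snippet makes, not something the post spells out).

```python
import torch

def glove_plus_cove(encoder, glove_vectors):
    # glove_vectors: (batch, seq_len, 300); encoder: the MT-LSTM sketched above
    with torch.no_grad():                            # use the MT-LSTM as a fixed feature extractor
        cove = encoder(glove_vectors)                # CoVe(w): (batch, seq_len, 600)
    return torch.cat([glove_vectors, cove], dim=2)   # [GloVe(w); CoVe(w)]: (batch, seq_len, 900)
```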

Using CoVe for classification. This is annoyingly complicated and involves many steps that are just stated as-is — the justification presumably being that this was what they had to do to get good results. They use something they call a biattentive classification network, whose biattention mechanism follows Seo 2017. “Biattentive” because there are two input sequences, each of which is used to attend to / condition on the other. The two inputs might be a document and a question about the document; for single-input classification the input is just used twice.

The [GloVe(w); CoVe(w)] concatenated vectors for each input are passed through a feedforward + ReLU network (exact architecture not specified) at the bottom, each of which is then encoded with another biLSTM and stacked along the time axis into matrices X and Y. The diamond-shaped biattention block computes the affinity matrix A = XY⊤, then a softmax over the columns: A_x = softmax(A); A_y = softmax(A⊤). The inputs are then weighted by these attention matrices: C_x = A_x⊤X and C_y = A_y⊤Y. Taking a moment to ponder what these mean… I think what this means is something like C_x is a representation of the first input when conditioned on the second, and C_y is yadda yadda mutatis mutandis. Anyway. These are integrated by stuffing the concatenation [X; X − C_y; X ⊙ C_y] (original, difference, and elementwise product) into yet another biLSTM, then pooled (with max and mean, and for some tasks also min and softmax pooling, which they describe), ditto for the other side, and finally combined in a maxout network.
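Here's the biattention block as I read it, sketched in PyTorch (illustrative only; the function name is mine, and the subsequent biLSTM integration, pooling, and maxout layers are elided).

```python
import torch
import torch.nn.functional as F

def biattention(X, Y):
    # X: (batch, len_x, d), Y: (batch, len_y, d) -- the two biLSTM-encoded sequences
    A = torch.bmm(X, Y.transpose(1, 2))          # affinity matrix, (batch, len_x, len_y)
    A_x = F.softmax(A, dim=1)                    # normalize over X's positions
    A_y = F.softmax(A.transpose(1, 2), dim=1)    # normalize over Y's positions
    C_x = torch.bmm(A_x.transpose(1, 2), X)      # X summarized for each position of Y, (batch, len_y, d)
    C_y = torch.bmm(A_y.transpose(1, 2), Y)      # Y summarized for each position of X, (batch, len_x, d)
    X_given_y = torch.cat([X, X - C_y, X * C_y], dim=2)  # goes into another biLSTM, then pooling
    Y_given_x = torch.cat([Y, Y - C_x, Y * C_x], dim=2)
    return X_given_y, Y_given_x
```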

When in doubt, concatenate. Or, we don’t know what’s going to be informative, so let’s use all the things. (I don’t actually know what to make of all of this. It’s not crazy, exactly, but you know it’s early days when we’re still building this kind of thing by hand. Isn’t this what compilers are for?)

For the question answering task they started the same way but used tanh instead of ReLU, and fed the result into a Dynamic Coattention Network (Xiong 2016). I haven’t looked into this, but it sounds neat.

CoVe helps on all the metrics, at least when pre-trained on medium to large datasets (small MT pre-training maybe hurts a little on some tasks? or maybe it’s just noise?). Even better is combining CoVe with character n-gram vectors (a similar idea to the previous paper, though they used Hashimoto 2016 instead). They also compare to skip-thought vectors, which I was planning on summarizing, noting that their model performs better and is more stable to train (because of the smaller dimensionality, 600 vs 4800, though I don’t have a clear intuition why this is so). The authors don’t talk about speed or compute, so I’m guessing this is all quite slow to train.

Hashimoto et al 2016 “A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks” https://arxiv.org/abs/1611.01587

Seo et al 2017 “Bidirectional Attention Flow for Machine Comprehension” https://arxiv.org/abs/1611.01603

Xiong et al 2016 “Dynamic Coattention Networks for Question Answering” https://arxiv.org/abs/1611.01604
