Paper Summary: Deep contextualized word representations (ELMo)

Mike Plotz
3 min read · Nov 26, 2018

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/17.

Deep contextualized word representations (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

This is the latest in a stream of refinements to the problem of representing words as vectors. Like CoVe, the goal is to use contextual representations drawn from the deeper layers of a network, rather than a single fixed vector per word type. Unlike CoVe, this paper doesn’t attempt to perform machine translation, instead relying only on unlabeled corpora, of which there is an unending supply. (Wasn’t the availability of MT datasets supposed to be one of CoVe’s selling points? hmmm….) The other fun thing in this paper, though it doesn’t go into much detail on this point, is the use of character-level convolutions to extract subword information, with benefits similar to fastText / Bojanowski 2016 (summary). The approach is called Embeddings from Language Models, or ELMo.

This paper builds on Peters 2017 (same first author, otherwise all different authors) and, as mentioned, McCann 2017, a.k.a. CoVe (summary). When direct comparisons are possible, ELMo outperforms CoVe. The idea is to train a bidirectional language model (biLM): a couple of layers of bidirectional LSTM topped with next- and previous-word predictors, trained to maximize the sum of the log likelihoods of each token in both directions. The forward and backward directions share weights for the token representations and the softmax, but not the internal LSTM weights. After training, all the layers are combined in a task-specific weighted sum to form the ELMo representation, sometimes with layer norm applied before the weighting. There’s also a detail about rescaling the whole ELMo vector with a task-specific scalar that I’m mostly skipping over; it’s covered in the appendices.
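Concretely, the combination is just a softmax-normalized weighted sum over the exposed biLM layers, times a task-specific scalar. Here’s a minimal sketch in PyTorch-flavored Python (the names and shapes are my own illustration, not the paper’s code):

```python
import torch
import torch.nn.functional as F

def elmo_combine(layer_outputs, layer_logits, gamma):
    """Collapse frozen biLM layer outputs into one ELMo vector per token.

    layer_outputs: list of L+1 tensors, each (seq_len, dim), one per exposed layer
                   (including the context-insensitive token layer)
    layer_logits:  learnable tensor of shape (L+1,), softmaxed into layer weights
    gamma:         learnable scalar that rescales the whole ELMo vector
    """
    s = F.softmax(layer_logits, dim=0)            # task-specific layer weights
    stacked = torch.stack(layer_outputs)          # (L+1, seq_len, dim)
    return gamma * (s.view(-1, 1, 1) * stacked).sum(dim=0)

# Hypothetical usage: 3 exposed layers, a 10-token sentence, 1024-dim representations.
layers = [torch.randn(10, 1024) for _ in range(3)]
logits = torch.zeros(3, requires_grad=True)       # trained jointly with the task model
gamma = torch.ones(1, requires_grad=True)
elmo = elmo_combine(layers, logits, gamma)        # shape (10, 1024)
```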

To use ELMo representations, the biLM weights are frozen and the representation is passed into the task RNN. For some tasks it helped to also include (a different weighting of) ELMo at the RNN output — this seems effective when there is an attention mechanism in the output layers.
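To make the wiring concrete, here’s a rough sketch for a generic task RNN (again PyTorch-flavored; the sizes and names are illustrative assumptions, not the paper’s setup):

```python
import torch
import torch.nn as nn

seq_len, emb_dim, elmo_dim = 10, 100, 1024
token_emb = torch.randn(1, seq_len, emb_dim)      # the task model's own token embeddings
elmo_repr = torch.randn(1, seq_len, elmo_dim)     # output of the frozen biLM + elmo_combine above

# Concatenate ELMo with the usual input embedding and feed the task RNN.
task_rnn = nn.LSTM(emb_dim + elmo_dim, 256, batch_first=True, bidirectional=True)
rnn_out, _ = task_rnn(torch.cat([token_emb, elmo_repr], dim=-1))

# For some tasks, a second, separately weighted ELMo vector is also concatenated
# to rnn_out before the output layers (the variant that pairs well with attention).
```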

The final model was based on Józefowicz 2016 and used 2 biLSTM layers (4096 hidden units, projected to input/output dimension 512), with a residual connection from the first layer to the second. The input tokens flow through 2048 character-level convolution filters (seems like a lot!), followed by 2 highway layers, and finally a projection down to 512 dimensions. (Highway layers I had to look up: they’re like residual blocks, but instead of x + layer(x) the output is a gated combination t * layer(x) + (1 - t) * x with a learned gate t; see Srivastava 2015.)
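Since the highway layer is the least familiar piece, here’s a minimal sketch of one (per Srivastava 2015), plus the rough char-CNN-to-512 wiring; the layer names, filter count, and projection setup below are illustrative, not the paper’s exact code:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = t * H(x) + (1 - t) * x with a learned gate t (cf. a residual block's x + H(x))."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)      # H(x): the nonlinear transform
        self.gate = nn.Linear(dim, dim)           # produces the gate t(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1 - t) * x                # gated mix of transformed and raw input

# Roughly: char-CNN output (2048 filters per token) -> 2 highway layers -> 512-dim projection.
char_cnn_out = torch.randn(32, 2048)              # a hypothetical batch of 32 token vectors
encoder = nn.Sequential(Highway(2048), Highway(2048), nn.Linear(2048, 512))
token_repr = encoder(char_cnn_out)                # (32, 512), fed into the 2-layer biLSTM
```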

So there are 3 exposed layers, counting the “context-insensitive type representation” (the moral equivalent of a word2vec or fastText vector), that can be combined into the ELMo representation. As with fastText, out-of-vocabulary words aren’t a problem, since the representations are built from subword (character-level) information. The biLM can also be fine-tuned on domain-specific data, which helps quite a bit for most tasks.

The approach was evaluated on a bunch of downstream tasks; there are lots of details in §4 of the paper and the appendices. Ablation analysis showed that learning the layer weights beats simply averaging the biLM layers, which in turn beats using only the last layer (as is common in previous architectures). There’s also some analysis of what kind of information lives in which layer. The first biLM layer turns out to be better for part-of-speech tagging, while the second layer is better for word sense disambiguation (e.g. “play” as in sports vs. “play” as in Shakespeare). The same pattern shows up in CoVe and other models. This layer-to-layer difference goes a long way towards explaining why task-specific weighting of the layers is useful.

Bojanowski et al 2016 “Enriching Word Vectors with Subword Information” https://arxiv.org/abs/1607.04606

Józefowicz et al 2016 “Exploring the Limits of Language Modeling” https://arxiv.org/abs/1602.02410

Peters et al 2017 “Semi-supervised sequence tagging with bidirectional language models” https://arxiv.org/abs/1705.00108

Srivastava et al 2015 “Training Very Deep Networks” https://arxiv.org/abs/1507.06228
