Summary: Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

Anthony Chen
3 min read · Oct 9, 2018


Authors: Sebastian J. Mielke and Jason Eisner

Papers in language modeling generally fall into one of two categories, modeling at either the word or the character level. Word-level language models (WL-LMs) suffer from a closed vocabulary. Character-level language models (CL-LMs) avoid that limitation, but they must remember a word's spelling as well as its meaning, and they must track the preceding context character by character. As a result, CL-LMs can struggle to capture long-range dependencies.

This paper tackles the issue by drawing on the linguistic concept of duality of patterning: the form (spelling) of a word is independent of its function (usage).

Model

Image taken from original paper.

The model diagram is attached above. The authors augment a WL-LM LSTM with a speller LSTM: given the embedding of an in-vocabulary word, the speller generates that word's spelling character by character. But how do we spell UNK tokens? We clearly cannot feed the UNK embedding into the speller, since that would produce the same spelling every time. Instead, the authors feed the word-LSTM's hidden state into the speller when spelling UNK tokens. This two-level design lets the WL-LM focus on capturing the semantics of words and context while offloading spelling to the speller.
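To make the two-level structure concrete, here is a minimal PyTorch sketch. This is not the authors' code; the class name, method names, and dimensions are my own. The point it illustrates is the conditioning: the speller sees a word's embedding for in-vocabulary words, but the word-LSTM's hidden state when the predicted word is UNK.

```python
import torch
import torch.nn as nn

class TwoLevelLM(nn.Module):
    """Illustrative sketch: a word-level LSTM plus a character-level speller LSTM."""

    def __init__(self, vocab_size, char_vocab_size, dim=512):
        super().__init__()
        # Word level: standard LSTM language model over a closed vocabulary (incl. UNK).
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.word_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.word_out = nn.Linear(dim, vocab_size)
        # Character level: speller LSTM conditioned on one vector per word.
        self.char_emb = nn.Embedding(char_vocab_size, dim)
        self.speller_lstm = nn.LSTM(2 * dim, dim, batch_first=True)
        self.char_out = nn.Linear(dim, char_vocab_size)

    def next_word_logits(self, word_ids, state=None):
        # Predict the next in-vocabulary word (possibly UNK).
        h, state = self.word_lstm(self.word_emb(word_ids), state)
        return self.word_out(h), h, state

    def spell_logits(self, cond_vec, char_ids):
        # cond_vec: the word embedding for in-vocabulary words, or the word-LSTM
        # hidden state at an UNK position, so each UNK can be spelled differently
        # depending on context.
        cond = cond_vec.unsqueeze(1).expand(-1, char_ids.size(1), -1)
        inp = torch.cat([self.char_emb(char_ids), cond], dim=-1)
        h, _ = self.speller_lstm(inp)
        return self.char_out(h)
```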

The loss is now computed over the next-word prediction, the spellings of in-vocabulary words, and the spellings of UNK tokens. One important detail: during training, the speller loss for an in-vocabulary word is not computed every time that word is generated, as this would bias the speller toward common words. Instead, a sampling strategy ensures that all vocabulary words have their spellings updated roughly the same number of times.
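As a rough illustration of that objective (again my own sketch, assuming the `TwoLevelLM` class above and a hypothetical `vocab_spellings` list mapping each word id to a tensor of character ids with start/end symbols), the spelling loss for in-vocabulary words is computed on a uniform sample of word types rather than on every token occurrence:

```python
import torch
import torch.nn.functional as F

def training_loss(model, word_ids, next_words, vocab_spellings, n_spell_samples=32):
    """Sketch of the combined objective; the UNK spelling term is omitted for brevity."""
    logits, hidden, _ = model.next_word_logits(word_ids)

    # (1) Next-word prediction over the closed vocabulary (UNK included).
    word_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                next_words.reshape(-1))

    # (2) Spelling loss for in-vocabulary words: sample word types uniformly so
    # every spelling is updated about equally often, instead of re-spelling a
    # word each time it appears (which would favor frequent words).
    sampled = torch.randint(0, len(vocab_spellings), (n_spell_samples,))
    spell_loss = 0.0
    for w in sampled.tolist():
        chars = vocab_spellings[w].unsqueeze(0)        # (1, T) char ids, with BOS/EOS
        cond = model.word_emb(torch.tensor([w]))       # condition on the word embedding
        char_logits = model.spell_logits(cond, chars[:, :-1])
        spell_loss = spell_loss + F.cross_entropy(
            char_logits.reshape(-1, char_logits.size(-1)), chars[:, 1:].reshape(-1))
    spell_loss = spell_loss / n_spell_samples

    # (3) UNK spellings would be handled analogously, but conditioning on `hidden`
    # at the UNK position rather than on a word embedding.
    return word_loss + spell_loss
```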

The model achieves strong results on several open-vocabulary language modeling tasks; see the paper for quantitative details.

Samples

These are taken from the paper's appendix. Bolded words were predicted as UNK, so their spellings were generated by the speller; all other words come from the vocabulary.

Samples generated from a model with a vocabulary size of 5,000 (sampling temperature 0.75).
Samples generated from a model with no vocabulary (sampling temperature 0.75).

Conclusion

Language modeling is one of my favorite areas of research in NLP; I believe language models are a great scaffold for trying out new techniques. This paper fits that theme and could potentially be extended to tasks like translation.
