Language Modeling Survey
Notes on Exploring the Limits of Language Modeling
TL;DR Use a large, regularized LSTM language model with projection layers and Softmax approximation using importance sampling, trained on a large dataset, to beat state-of-the-art LM results.
Overall impression: I am a fan of this paper. It was thorough, well-researched, and produced impressive results. It introduced a lot of new concepts to me, and did an above-average job of explaining them.
⁉️ Big Question
What is the most effective way for computers to reason about human language?
🏙 Background Summary
What work has been done before in this field to answer the big question? What are the limitations of that work? What, according to the authors, needs to be done next?
Since this is a survey paper, the background section is predictably detailed.
Language Modeling is the process of assigning distributions over sentences in human language in an effort to accurately encode meaning and complexity. In simpler words, it’s accurately predicting the probability of the next word given a sentence. It is a fundamental task of NLP, and powers many other areas such as speech recognition, machine translation, and text summarization.
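The chain-rule view of "predicting the probability of the next word" can be sketched with a toy bigram model (the corpus and counts here are invented purely for illustration, not from the paper):

```python
from collections import Counter

# Toy corpus; bigram counts stand in for a learned model.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p_next(word, prev):
    # P(word | prev), estimated from bigram counts.
    return bigrams[(prev, word)] / contexts[prev]

def sentence_prob(sentence):
    # Chain rule: P(w1..wn) = product over i of P(w_i | w_{i-1}).
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_next(word, prev)
    return prob

# "cat" follows "the" in 2 of the 3 occurrences of "the" as a context,
# so p_next("cat", "the") is 2/3.
```

Neural LMs replace the count table with a learned network, but the quantity being modeled is the same conditional distribution.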
Ironically, I’m having trouble language-modeling this sentence: “models that are able to assign a low probability to sentences that are grammatically correct but unlikely may help other tasks in fundamental language understanding”. Unsure what the authors are trying to say here. Does “unlikely” mean “unlikely to occur” or is there a missing word in there?
Traditional models rely heavily on n-gram features, but recently neural network approaches combined with n-gram features have been producing the most promising results. However, these models are often trained on smaller datasets than their n-gram counterparts. This produces misleading results and models that fall apart when applied to larger datasets.
Nit: is LM “Language Modeling” or “Language Models”? It’s defined as the former, but in the rest of this section “LMs” are mentioned…“language modelings”?
The authors mainly build upon the work of LSTM RNNs for language modeling, due to their ability to model long-term dependencies. They also emphasize that they will train their model on the One Billion Word benchmark, which is considerably larger than the “standard” PTB dataset; they make an analogy to ImageNet, which pushed computer vision forward by providing a much larger dataset to train models on.
Maybe I’m missing something here, but it’s surprising that researchers would go for the PTB dataset when larger ones are available. This doesn’t seem like something that should be actively called out. Is this just slow adjustment? Laziness? Or some other reason?
Secondarily, the authors recognize convolutional embedding models and softmax over large vocabularies as background for their work in this paper.
Convolutional embedding models offer character-level insights by passing 1-d convolutions over input sequences and max pooling to extract high-signal features.
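As a rough sketch of that idea (toy dimensions and random weights of my own choosing, not the paper's architecture):

```python
import random

random.seed(0)
CHAR_DIM = 4   # character embedding size (toy)
WIDTH = 3      # 1-d convolution filter width (toy)
FILTERS = 5    # number of filters = word embedding size (toy)

char_emb = {c: [random.gauss(0, 1) for _ in range((CHAR_DIM))]
            for c in "abcdefghijklmnopqrstuvwxyz"}
filters = [[[random.gauss(0, 1) for _ in range(CHAR_DIM)]
            for _ in range(WIDTH)] for _ in range(FILTERS)]

def char_cnn_embed(word):
    chars = [char_emb[c] for c in word]
    features = []
    for f in filters:
        # Slide the width-3 filter across the character sequence...
        acts = [sum(f[i][d] * chars[pos + i][d]
                    for i in range(WIDTH) for d in range(CHAR_DIM))
                for pos in range(len(chars) - WIDTH + 1)]
        # ...then max-pool over positions to keep the strongest signal.
        features.append(max(acts))
    return features

# Works for any word built from known characters, even unseen words.
vec = char_cnn_embed("mat")
```

Max pooling is what makes the output a fixed-size vector regardless of word length.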
One problem with dealing with large vocabularies is the computational bottleneck of assigning probabilities over million-word vocabularies. The authors examine various solutions to this problem in their work, focusing on importance sampling and Noise Contrastive Estimation.
❓ Specific question(s)
- What is the state of the art in language modeling using neural networks?
- Which existing language concepts can be combined and improved upon to produce better results?
The first problem the authors address is the large-scale softmax problem. They compare and contrast two approaches — Noise Contrastive Estimation and Importance Sampling (defined under “Words I Don’t Know”).
They boil the difference between these two approaches into the following: NCE defines a binary classification between “true” and “noise” words with log-loss, while IS defines a multiclass classification with cross-entropy loss.
The approach they use in their architecture is to use k noise samples and one data distribution sample (this part is from NCE) and train a multiclass loss over a multinomial random variable (this part is from IS).
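A minimal sketch of that combined loss, as I understand it: score the one true word against k sampled noise words and take a (k+1)-way cross-entropy with the true word as the target class (real implementations also correct the logits for the sampling distribution, which I omit here):

```python
import math

def sampled_softmax_loss(true_logit, noise_logits):
    # Multiclass cross-entropy over {true word} plus {k noise words},
    # with the true word as the target class.
    logits = [true_logit] + noise_logits
    log_z = math.log(sum(math.exp(l) for l in logits))
    return log_z - true_logit

# One true word scored against 4 noise words (made-up logits).
loss = sampled_softmax_loss(2.0, [0.5, -1.0, 0.3, 1.2])
```

The point is that the normalization runs over k + 1 terms instead of the full vocabulary.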
Perhaps necessary, but ignoring hierarchical softmax is certainly a cut corner; we have seen hierarchical softmax handle this problem well in the past, such as in the HM-RNN model.
With that out of the way, we dive into model architectures, outlined in the diagram above.
Column (b) represents the first proposed change to a basic RNN LM. The authors aim to further address the computational complexity of the softmax function by pre-computing word embeddings from the component character embeddings.
So this reduces the number of softmax parameters from scaling with the vocabulary size down to scaling with the character set size?
This approach initially caused the network to have trouble differentiating between words with similar spellings but different meanings; the authors added a per-word correction term, corr_w, to fix this.
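As I understand the CNN softmax, each word's logit is a dot product between the LSTM hidden state and the word's character-derived embedding, plus the per-word correction. A toy sketch (all names and numbers here are mine, not the paper's):

```python
import math

def cnn_softmax_probs(h, char_embeddings, corrections):
    # logit_w = <h, CNN(chars of w)> + corr_w, then a regular softmax.
    logits = [sum(hi * ei for hi, ei in zip(h, emb)) + corr
              for emb, corr in zip(char_embeddings, corrections)]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

h = [0.2, -0.1, 0.4]                         # LSTM hidden state (toy)
embs = [[1.0, 0.0, 0.5], [0.3, 0.8, -0.2]]   # char-CNN word embeddings
corr = [0.1, -0.1]                           # per-word correction terms
probs = cnn_softmax_probs(h, embs, corr)
```

Only corr_w is stored per word; the embeddings themselves come from the character CNN.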
This reminds me of adding biases on top of embeddings for movie ratings in collaborative filtering (see here); some users rate all movies high and some users rate all movies low; adding a term to differentiate between them is an effective way of reducing the noise.
A benefit of using this approach to word embeddings is the ability to score out-of-vocabulary words.
Confused here — aren’t the corr_w embeddings word-specific? An unknown word wouldn’t have an associated corr_w embedding, right? So wouldn’t the model still effectively be restricted to the training vocabulary?
Even with these improvements, the network is still slow in practice. Another possible approach is utilizing char-LSTMs, which make predictions character by character. These are much more efficient, but their performance is substantially worse. The authors propose combining word- and character-level models by feeding the word-level hidden state into a char-LSTM.
This model scales independently of vocabulary size, but unfortunately accuracy is still subpar. These are difficult problems to solve!
The authors tried many different variations of RNN LM architectures using the techniques previously outlined, including CNN Softmax and char-LSTM, models with and without dropout, and models with and without a projection layer.
They ran all tests using TensorFlow on the 1B Word Benchmark (~0.8B words with a vocabulary of ~800k). For word models, no preprocessing was done; for character models, begin- and end-of-word tokens were added to separate words.
Their hyperparameter choices were sane and, more importantly, consistent: Adagrad optimizer, learning rate of 0.2, batch size of 128, and gradient clipping at 1.0.
❤️ authors who include these details; it makes their results infinitely more replicable and trustworthy.
They used 8192 samples at each step to approximate the softmax function. This reduces the normalization from the full ~800k-word vocabulary to 8192 terms, roughly a 100x decrease.
These are the results the authors uncovered, with novel architectures below existing ones in each table.
The size of this table is a friendly reminder about the empirical state of deep learning research — there are a lot of great ideas, and until they’re all tried it’s impossible to say which will do best!
The previous state-of-the-art perplexity on the 1B Word Benchmark was 51.3. The best single model in this paper beat it with a perplexity of 30.0, and an ensemble of models from this paper scored 23.7!
It looks like the 23.7 perplexity was achieved by combining the authors’ models with the SNM-10 SKIP model. At the end of section 5.7, the authors write “Our results, on the contrary, suggest that N-grams are of limited benefit, and suggest that a carefully trained LSTM LM is the most competitive model.”. How do these two items reconcile with each other?
The authors’ takeaways are as follows:
- More parameters = better results; this makes sense, since more parameters allow more room to encode complexity.
- Dropout helps: all models, even small ones, can overfit on such an immense data set. Increasing dropout mitigates this. For example, there is a perplexity difference of 5+ between LSTM-8192–2048 with and without dropout.
- Importance sampling helps large softmaxes: the table below shows that IS models take significantly less time to converge on a good test perplexity.
This calls section 3.1 into question a bit for me and reveals my poor understanding: it seems like Importance Sampling and NCE are treated separately here, so why did I get the sense from section 3.1 that the authors were attempting to combine the two? In addition, it seems like IS clearly outperforms NCE.
- No-cost character embeddings: without any cost to accuracy, character embeddings can be precomputed and used to replace word embeddings. This provides the ability to process arbitrary words.
- The size-performance tradeoff is feasible: using the CNN softmax technique instead of the IS/NCE softmax, the authors achieved a perplexity of 35.8 with just 400M parameters, which is quite comparable to the LSTM-8192–2048 containing 8x as many parameters.
- Training is fast: the LSTM-2048–512 model beat the previous state of the art in just 2 hours!
What’s the point of reference here? (How long does it take to train the baseline models?)
- Ensembling is good: the best results were achieved with averages of several models, resulting in a world-beating 23.7 test perplexity.
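The averaging behind that last takeaway can be sketched as follows (the per-word probabilities below are invented for illustration; the paper's actual ensembles and weightings differ):

```python
import math

def ensemble_perplexity(per_model_word_probs):
    # Average each held-out word's probability across models, then
    # compute perplexity = exp(-mean log-probability) of the average.
    n_words = len(per_model_word_probs[0])
    n_models = len(per_model_word_probs)
    avg = [sum(m[i] for m in per_model_word_probs) / n_models
           for i in range(n_words)]
    return math.exp(-sum(math.log(p) for p in avg) / n_words)

# Two toy models scoring the same three held-out words.
model_a = [0.10, 0.40, 0.25]
model_b = [0.30, 0.20, 0.35]
ppl = ensemble_perplexity([model_a, model_b])
```

Averaging in probability space means a word only needs one confident model to avoid a catastrophic log-loss, which is part of why ensembles help.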
I would say these results emphatically answer the specific questions. The best results the authors achieved are quite eye-opening to me.
Side note: I’m unsure what tradition dictates, but do survey papers generally propose novel ideas, or is the hybrid approach in this paper less common? I really like it!
What do the authors think the results mean? Do you agree with them? Can you come up with any alternative way of interpreting them? Do the authors identify any weaknesses in their own study? Do you see any that the authors missed? (Don’t assume they’re infallible!) What do they propose to do as a next step? Do you agree with that?
The authors interpret these results to show that they’ve effectively beaten the state of the art on the 1B Word Benchmark using the following techniques:
- large, regularized LSTM LM
- projection layers
- Softmax approximation using importance sampling
The authors also explored other approaches, such as char-CNN, in an effort to compare and contrast them.
A key takeaway for the authors is the universal benefit of training on a larger dataset. They encourage further language modeling research to use larger corpora of text to achieve better results.
It might be too much for one paper, but I would have liked to at least see an attempt at applying these LM models to the tasks the authors mention in the introduction, such as speech recognition and text summarization. How much does an improved language model really affect these higher-level tasks?
This emphasis on dataset echoes what seems to me like the most pressing issue in deep learning right now: collecting data. For example, Andrew Ng’s $150M AI fund has stated that the majority of their funding will go towards data.
I am a big fan of this paper. I thought it was quite thorough and well-researched, and obviously it produced impressive results. It introduced a lot of new concepts to me, and did a reasonable job of explaining them.
⏩ Viability as a Project
As it stands, this paper would be most useful as an underlying model for the tasks built on top of LM, such as speech recognition, machine translation, and text summarization.
The authors released all of their training data and code, so there is a solid foundation to extend this research.
It seems like this paper could be useful for the text classification piece of the Personalized Medicine Kaggle challenge, but I’m unsure exactly how at this point.
For the most part, the abstract matches what is presented in the paper. It calls out corpora sizes and long-term language structure as the two main focuses of the paper, but I feel like the former was emphasized more.
🗣 What do other researchers say?
- Delip counters the argument that n-gram models are obsoleted by neural networks, maintaining that n-gram models are still the best option for smaller datasets.
📚 Other Resources
🤷 Words I don’t know
- perplexity: a metric to compare approaches for LM tasks; the perplexity of whatever you’re evaluating, on the data you’re evaluating it on, sort of tells you “this thing is right about as often as an x-sided die would be.” In the context of LM, it is the exponential of the negative average per-word log-probability on the holdout data set.
- parametric approach: a parametric model is a family of distributions that can be described using a finite number of parameters. A non-parametric model makes no such assumption; its effective number of parameters can grow with the data.
- highway network: a deep neural network with learned gates that let information flow unchanged along “highways” across layers instead of only through each layer’s transformation (?)
- logit: the inverse of the sigmoid function; in practice, a raw unnormalized score fed into a softmax
- Noise Contrastive Estimation: a proposed solution to the large-scale softmax problem which approximates the softmax with a binary classification task between data and noise
- Importance Sampling: a proposed solution to the large-scale softmax problem which approximates the softmax by sampling a small set of words from a proposal distribution instead of normalizing over the full vocabulary
- projection layer: a linear layer that projects the LSTM’s large hidden state down to a smaller dimension before the output/softmax, reducing the parameter count