Paper Summary: Universal Language Model Fine-tuning for Text Classification

Mike Plotz
Nov 19, 2018

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/18.

Universal Language Model Fine-tuning for Text Classification (2018), Jeremy Howard and Sebastian Ruder. https://arxiv.org/abs/1801.06146

Pre-training on ImageNet is standard for computer vision (CV) models, so why is there no standard way to do the same for NLP tasks? Several papers try to answer this question, some of which I’ve summarized in the last few days, but none is as universal as the CV equivalent. This paper promises such a universal solution — universal in the sense that we’d like a single architecture and training method, minimal hyperparameter tuning, minimal pre-processing requirements, and good results without boatloads of task- and domain-specific data. Do Howard and Ruder deliver? This certainly looks like a good start, but things are moving fast in NLP, so it’s anybody’s guess what standard practice will look like in a year or two.

The architecture and training method, ULMFiT (universal language model fine-tuning), builds on similar approaches (CoVe, ELMo) and training methods (Merity 2017). In CoVe and ELMo the encoder layers are frozen. ULMFiT instead describes a way to train all layers, and does so without overfitting or running into “catastrophic forgetting”, which has been more of a problem for NLP (vs CV) transfer learning, in part because NLP models tend to be relatively shallow. Other early approaches (Dai and Le 2015) require large in-domain datasets; ULMFiT is by comparison extremely data efficient for in-domain data.

ULMFiT starts with the AWD-LSTM architecture from Merity 2017, which has no attention or residual connections. The novelty is in how the network is trained — which happens in three phases: unsupervised pre-training on a large corpus (WikiText-103), “semi-supervised” language model fine-tuning on task-related data, and classifier fine-tuning on a possibly very small task-specific dataset. The first phase — unsupervised pre-training — is expensive and slow, but only has to happen once.

The second phase is target task LM fine-tuning. This can use a small dataset and be quite fast, since all that needs to happen here is for the model to adjust to a somewhat different distribution, not learn an entire LM from scratch. The keys to making this work are a couple of simple ideas: discriminative fine-tuning and slanted triangular learning rates. And these really are quite simple ideas — the difficulty, presumably, is in knowing which simple ideas to try and making them all work together.

Discriminative fine-tuning is as simple as using different learning rates for different layers — the authors found a good learning rate for the last layer empirically, then divided the learning rate by a factor of 2.6 for successive layers (why 2.6? who knows). The intuition here is that task-specific functionality tends to live in later layers, whereas early layers contain relatively unchanging meanings, part of speech info, etc.
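In PyTorch, discriminative fine-tuning amounts to per-layer parameter groups. Here's a minimal sketch (the 400-dim embeddings and 1150 hidden units echo the paper's AWD-LSTM setup; the vocabulary size, decoder, and choice of plain SGD are placeholders of my own):

```python
import torch
import torch.nn as nn

# Toy stand-in for the AWD-LSTM stack: embedding, three LSTM layers, decoder.
layers = nn.ModuleList([
    nn.Embedding(1000, 400),
    nn.LSTM(400, 1150, batch_first=True),
    nn.LSTM(1150, 1150, batch_first=True),
    nn.LSTM(1150, 400, batch_first=True),
    nn.Linear(400, 1000),
])

base_lr = 0.01  # found empirically for the last layer
# Each earlier layer's rate is the next layer's divided by 2.6.
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / 2.6 ** (len(layers) - 1 - i)}
    for i, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)
```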

Slanted triangular learning rates are an annealing schedule that looks like a single sawtooth: it quickly ramps up and then gradually decays.
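Concretely, the paper's schedule, with its published defaults (ramp up over the first 10% of updates; the peak rate is 32x the floor), looks like this:

```python
import math

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """ULMFiT's slanted triangular schedule: linear ramp-up for the first
    cut_frac of the T total updates, then linear decay back down so the
    final learning rate is lr_max / ratio."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio
```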

Why does this schedule work? I’m speculating here, but I think the idea is that when you first encounter out-of-distribution data (which task fine-tuning will look like at first), you can expect high prediction error, which in turn means large gradients. Large gradients together with a large learning rate will clobber the delicately tuned existing weights. The ramp-up avoids this whiplash by slowly introducing the model to the new distribution before going full steam ahead.

The third phase is classifier fine-tuning. Here the network changes a bit: the authors add a classification head (two fully connected layers with batch norm and dropout, with ReLU and softmax respectively). This is standard practice for CV. There’s also the addition of max- and mean-pooling that aggregates across timesteps, on the theory that sentiment and other classification tasks hinge on a few important words in the input.
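A sketch of that head in PyTorch (the paper concatenates the final hidden state with the max- and mean-pooled states, so-called concat pooling; the 50-unit hidden layer follows the paper, while the dropout rates and other sizes here are illustrative):

```python
import torch
import torch.nn as nn

class ConcatPoolClassifier(nn.Module):
    """Classifier head in the spirit of ULMFiT: the final hidden state
    concatenated with max- and mean-pooled hidden states, then two
    fully connected blocks with batch norm and dropout."""

    def __init__(self, hidden_dim=400, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.BatchNorm1d(3 * hidden_dim),
            nn.Dropout(0.2),
            nn.Linear(3 * hidden_dim, 50),
            nn.ReLU(),
            nn.BatchNorm1d(50),
            nn.Dropout(0.1),
            nn.Linear(50, n_classes),  # softmax is folded into the loss
        )

    def forward(self, hs):  # hs: (batch, time, hidden_dim) from the LSTM
        h_cat = torch.cat([hs[:, -1], hs.max(dim=1).values, hs.mean(dim=1)], dim=1)
        return self.head(h_cat)
```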

Like phase two, the classifier also uses discriminative learning rates and slanted triangular learning rates. On top of that, for an even gentler introduction to the new task, the authors add gradual unfreezing of layers, which is exactly what it sounds like: start with just the last layer unfrozen, then after each training epoch unfreeze another layer. (In case you’re wondering, yes, all of these training methods are helpful on their own and in concert.)
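Gradual unfreezing is easy to sketch. In the snippet below, layer_groups and train_one_epoch are hypothetical placeholders (the model's layers ordered bottom to top, and one epoch of fine-tuning), not a real API:

```python
def fit_with_gradual_unfreezing(layer_groups, train_one_epoch, n_epochs):
    """layer_groups: the model's layers from earliest (embedding) to
    last (classifier head); train_one_epoch: runs one epoch of training."""
    # Freeze everything to start.
    for group in layer_groups:
        for p in group.parameters():
            p.requires_grad = False
    for epoch in range(n_epochs):
        # Unfreeze one more group per epoch, starting from the top.
        n_unfrozen = min(epoch + 1, len(layer_groups))
        for group in layer_groups[-n_unfrozen:]:
            for p in group.parameters():
                p.requires_grad = True
        train_one_epoch()
```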

A couple more details: there’s an augmentation to basic BPTT where the recurrent model is initialized to the final state of the previous batch (which makes sense because the input data is arranged so that the text reads coherently from batch to batch). They combine this with variable-length BPTT as in Merity 2017. Also, they use an ensemble of a forward and backward LM, each of which is separately trained in the manner described above.
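Both BPTT tricks are easy to sketch. The length jitter below follows Merity 2017's recipe as I understand it, and the loop fragment shows the state carryover; get_batch, model, and corpus_len are placeholders:

```python
import numpy as np

def variable_bptt_len(bptt=70, p_full=0.95, jitter=5, min_len=5):
    """Variable-length BPTT roughly as in Merity et al. 2017: usually the
    base length, occasionally half of it, with Gaussian jitter around it."""
    base = bptt if np.random.random() < p_full else bptt / 2
    return max(min_len, int(np.random.normal(base, jitter)))

# Hypothetical training-loop fragment. The hidden state from one batch
# seeds the next (detached, so gradients don't flow between batches),
# which works because the corpus is laid out to read coherently from
# batch to batch.
hidden, i = None, 0
while i < corpus_len:
    seq_len = variable_bptt_len()
    x, y = get_batch(i, seq_len)
    output, hidden = model(x, hidden)  # reuse the previous final state
    hidden = tuple(h.detach() for h in hidden)
    i += seq_len
```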

So, results. ULMFiT handily beats CoVe and other contemporary approaches across several text classification datasets. There's lots of ablation analysis, and as you might expect pretty much everything they include is helpful when considered singly — that is, degrades performance when removed. They also look at “low-shot” learning, i.e. using as few labeled in-domain training examples as possible, with impressive results.

This was on the IMDb dataset. Here “semi-supervised” means that the model was fine-tuned on unlabeled in-domain data (50k examples for IMDb). The pre-training and fine-tuning are clearly quite helpful. They point out that training on just 100 labeled in-domain examples is as good as training from scratch on 10x–20x as much labeled data (50x–100x in the semi-supervised setup!).

So overall the results are quite impressive. But I'm left wondering whether this approach is really ready to be called universal, when it's not clear how to apply the method to tasks like entailment and question answering. It's also not clear to me what, if anything, they're doing about out-of-vocabulary words, or whether they're using subword information — maybe ELMo-like character convolutions would work here?

Dai and Le 2015 “Semi-supervised Sequence Learning” https://arxiv.org/abs/1511.01432

Merity et al. 2017 “Regularizing and Optimizing LSTM Language Models” https://arxiv.org/abs/1708.02182
