Universal Language Model Fine-tuning for Text Classification — ULMFiT

Crypto Kitten


Theory

Introduction

Natural Language Processing (NLP) needs no introduction in today’s world. It’s one of the most important fields of study and research, and has seen a phenomenal rise in interest in the last decade. The basics of NLP are widely known and easy to grasp. But things start to get tricky when the text data becomes huge and unstructured.

That’s where deep learning becomes so pivotal. DL has proven its usefulness in computer vision tasks like image detection, classification and segmentation, but NLP applications like text generation and classification have long been considered fit for traditional ML techniques.

And deep learning has certainly made a very positive impact in NLP. We will focus on the concept of transfer learning and how we can leverage it in NLP to build incredibly accurate models using the popular fastai library. I will introduce you to the ULMFiT framework as well in the process.

The Real Reason Why NLP is Hard

The process of reading and understanding language is far more complex than it seems at first glance. There are many things that go into truly understanding what a piece of text means in the real world. For example, what do you think the following piece of text means?

“Steph Curry was on fire last night. He totally destroyed the other team”

To a human it's probably quite obvious what this sentence means. We know Steph Curry is a basketball player; even if you don't, you can tell he plays on some kind of team, probably a sports team. When we see "on fire" and "destroyed" we know it means Steph Curry played really well last night and beat the other team.

Computers tend to take things a bit too literally. Viewing things literally like a computer, we would see “Steph Curry” and based on the capitalization assume it’s a person, place, or otherwise important thing which is great! But then we see that Steph Curry “was on fire”…. A computer might tell you that someone literally lit Steph Curry on fire yesterday! … yikes. After that, the computer might say that Mr. Curry has physically destroyed the other team…. they no longer exist according to this computer… great…

But not all is grim! Thanks to Machine Learning we can actually do some really clever things to quickly extract and understand information from natural language! Let’s see how we can do that in a few lines of code with a couple of simple Python libraries.

The Advantage of Transfer Learning

I praised deep learning in the introduction, and deservedly so. However, everything comes at a price, and deep learning is no different. The biggest challenge in deep learning is the massive data requirements for training the models. It is difficult to find datasets of such huge sizes, and it is way too costly to prepare such datasets. It’s simply not possible for most organizations to come up with them.

Another obstacle is the high cost of GPUs needed to run advanced deep learning algorithms.

Thankfully, we can use pre-trained state-of-the-art deep learning models and tweak them to work for us. This is known as transfer learning. It is not as resource intensive as training a deep learning model from scratch and produces decent results even on small amounts of training data.

Pre-trained Models in NLP

Pre-trained models help data scientists start off on a new problem by providing an existing framework they can leverage. You don’t always have to build a model from scratch, especially when someone else has already put in their hard work and effort! And these pre-trained models have proven to be truly effective and useful in the field of computer vision.

Their success is popularly attributed to the ImageNet dataset. It has over 14 million labeled images, more than 1 million of which also come with bounding boxes. This dataset was first published in 2009 and has since become one of the most sought-after image datasets ever. It led to several breakthroughs in deep learning research for computer vision, with transfer learning being one of them.

However, in NLP, transfer learning has not been as successful (as compared to computer vision, anyway). Of course we have pre-trained word embeddings like word2vec, GloVe, and fastText, but they are primarily used to initialize only the first layer of a neural network. The rest of the model still needs to be trained from scratch and it requires a huge number of examples to produce a good performance.

What do we really need in this case? Like the aforementioned computer vision models, we require a pre-trained model for NLP which can be fine-tuned and used on different text datasets. One of the contenders for pre-trained natural language models is Universal Language Model Fine-tuning for Text Classification, or ULMFiT.

How does it work? How widespread are its applications? How can we make it work in Python?

Overview of ULMFiT

Proposed by fast.ai’s Jeremy Howard and NUI Galway Insight Center’s Sebastian Ruder, ULMFiT is essentially a method to enable transfer learning for any NLP task and achieve great results. All this, without having to train models from scratch.

ULMFiT achieves state-of-the-art results using novel techniques like:

⦁ Discriminative fine-tuning

⦁ Slanted triangular learning rates

⦁ Gradual unfreezing

This method involves fine-tuning a pre-trained language model (LM), trained on the Wikitext 103 dataset, to a new dataset in such a manner that it does not forget what it previously learned.

Language modeling can be considered a counterpart of ImageNet for NLP. It captures general properties of a language and provides an enormous amount of data which can be fed to other downstream NLP tasks. That is why language modeling was chosen as the source task for ULMFiT.

Building Blocks of ULMFiT

AWD-LSTM

Here is an overview of the encoder + classifier model.

Above is a layer-by-layer view of ULMFiT in the fastai library. The model is composed of a pretrained encoder (MultiBatchRNNCore, module 0) and a custom head for the task at hand, in this case a binary classification head (PoolingLinearClassifier, module 1).

The architecture ULMFiT uses for its language modeling task is an AWD-LSTM, short for ASGD Weight-Dropped LSTM.

If you go back to the ULMFiT architecture shown earlier, you can link its parts, one by one, to the modules listed below. To keep things simple I will assume a vocabulary of V = ["I", "went", "to", "trouble", "of", "exceling", "for", "you", "guys", ":)"]. Don't be surprised to see the smiley ":)" as another token; any information you can represent is valuable, and tokenization is another important aspect of NLP.

Embedding Lookup Table with n dimensions

1. Encoder Dropout (EmbeddingDropout()): Zeroing embedding vectors randomly.

encoder_dp module

2. Input Dropout (RNNDropout()): Zeroing embedding lookup outputs randomly.

N.B. Here, the shape of the input fed into the LSTM is batch size x sequence length x embedding size at each forward pass.

Dropout is one of the de-facto ways for regularizing deep learning models and preventing complex co-adaptations of hidden units (weights), a.k.a overfitting.

The AWD-LSTM paper emphasizes that, before its contribution, applying naive classical dropout to an LSTM for language modeling was ineffective, since it disrupts the ability to retain long-term dependencies, while the dropout methods that did work required inefficient re-implementations of the cuDNN LSTM. Retaining long-term dependencies is a key point: after all, the main practical differentiator of an LSTM over a plain RNN is its capability to carry longer-term dependencies by learning how much information should flow through the sequence.

Let's try to imagine what happens if we use a naive dropout layer on the input sequences fed to an LSTM, and convince ourselves why it might not be effective in practice.

If you go to the PyTorch documentation, the plain dropout layer is defined as:

“During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. Each channel will be zeroed out independently on every forward call.”

What we actually want is to zero elements in such a way that the same parts of the hidden units are disabled throughout the sequence. After all, at each forward pass, applying dropout introduces a partially masked form of our model f(x), and the masked weights should be consistent throughout each input's forward pass.

As you can see above, the naive method does not guarantee that the same weights of the RNN layer are disabled (in this case input-to-hidden), whereas RNNDropout() keeps the mask consistent throughout the whole sequence. In computer vision it's trivial to randomly mask inputs because the forward pass is stateless, but in the case of recurrent networks we should carry the exact same mask along the sequence. I hope it's now a bit clearer why retaining longer-term dependencies with the naive method might be hard.
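To make the difference concrete, here is a minimal sketch (my own illustration, not the fastai source) contrasting naive dropout with an RNNDropout-style mask that is sampled once per sequence and broadcast over every timestep:

```python
import torch

def naive_dropout(x, p=0.4):
    # x: (batch, seq_len, emb). A fresh mask for every element,
    # so the dropped dimensions differ at every timestep.
    mask = torch.bernoulli(torch.full_like(x, 1 - p)) / (1 - p)
    return x * mask

def rnn_style_dropout(x, p=0.4):
    # One mask per sequence, shape (batch, 1, emb), broadcast over time:
    # the same embedding dimensions stay zeroed for the whole sequence.
    mask = torch.bernoulli(x.new_full((x.size(0), 1, x.size(2)), 1 - p)) / (1 - p)
    return x * mask

x = torch.randn(2, 5, 4)            # batch=2, seq_len=5, emb=4
print(naive_dropout(x)[0])          # dropped dims change along the sequence
print(rnn_style_dropout(x)[0])      # dropped dims are consistent along the sequence
```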

Rather than memorizing the functions let’s have this image here just in case :)

3. Weight Dropout (WeightDropout()): Apply dropout to the LSTM weights.

Previously, in Input Dropout, we've seen how dropout is applied to each input sequence independently but consistently throughout the sequence. In Weight Dropout, dropout is applied to the hidden-to-hidden weight matrix by zeroing out weights randomly. The weight masks are refreshed at each batch forward pass.

Weight Dropout is a generic dropout layer that wraps a PyTorch module and randomly zeroes a weight given the name of that parameter, so you can in theory use it on any custom layer with named attributes. Notice that weight_ih_l0 is not registered for weight dropout.

My logical assumption would be that, since input dropout is already present, adding another layer of dropout on the input-to-hidden weights would be unnecessary and would also make it harder to control the level of dropout we want over the input-to-hidden transformation. Also, being able to add randomness to every sample is much more powerful than doing it only once per batch forward pass.
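Here is a rough sketch of the weight-dropout idea, again my own simplification rather than fastai's WeightDropout: keep an untouched copy of the hidden-to-hidden weight and, once per batch forward pass, swap in a freshly masked version of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

# keep an untouched copy of the recurrent (hidden-to-hidden) weight
raw_weight_hh = lstm.weight_hh_l0.data.clone()

def forward_with_weight_drop(x, p=0.5):
    # sample a fresh mask for the hidden-to-hidden weight once per batch;
    # the same masked weight is then used at every timestep of the sequence
    with torch.no_grad():
        lstm.weight_hh_l0.data = F.dropout(raw_weight_hh, p=p, training=True)
    return lstm(x)

out, _ = forward_with_weight_drop(torch.randn(2, 7, 10))   # (batch, seq_len, emb)
# note: the real WeightDropout re-registers the masked tensor so gradients still
# flow back to the raw weight; this sketch only illustrates the masking itself
```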

4. Hidden Dropout (RNNDropout()): Zeroing outputs of the LSTM layers. This dropout is applied to every LSTM layer output except the last.

This dropout layer randomly zeroes out the raw outputs of the previous LSTM layers in a stacked LSTM module. It can be thought of as zeroing out the inputs to the upper stacked LSTM layers, e.g. in a 3-layer LSTM, the inputs to LSTM-2 and LSTM-3 would be randomly zeroed.

As with Input Dropout, to stay consistent throughout the sequence, the same positions are zeroed before the hidden features are fed to the next LSTM layer. The same logic we discussed for the original input sequence applies to the inputs of the upper LSTM layers.

5. Output Dropout (RNNDropout()): Zeroing the final sequence outputs from the encoder before feeding them to the decoder.

This module is not visible when we look at the full ULMFiT classification model, since this dropout is used before generating next-word predictions with the language model decoder. The language model decoder is nothing but a fully connected linear layer that transforms the encoder outputs into token predictions over our vocabulary. As we are dealing with a sequential input to the decoder's linear layer, keeping the dropout consistent along the sequence is again important.

Main Takeaway:

Apply dropout consistently throughout a single input sequence, and apply it anywhere you can: sample a new random mask per input sequence. Also apply dropout to the hidden-to-hidden weights of the LSTM, sampling a new random mask per batch forward pass.

Wrapping Everything Up

Now that all the dropout layers have been explained, we can put everything together and visualize how each of them takes place.

Let’s define a dummy batch with the following sequences :

batch = [“I love cats”, “I love dogs”]

The first step is to initialize an embedding matrix for each token and apply the encoder dropout.

Then we do a lookup for our batch and apply the input dropout.

The next step is to apply random dropout to the hidden-to-hidden weights of each LSTM layer. Here I am defining a 2-layer LSTM with 5 hidden units.

You may notice how the same tokens, like "love", have different input features after input dropout. Here Whh is the hidden-to-hidden weight, and it's shared throughout the full sequence forward pass.

Lastly, we take a look at hidden dropout, which is applied to the inputs of the upper LSTM layers in the stack.

If we look at output dropout, you will notice it's nothing but the same type of dropout applied to the final raw outputs of the encoder's stacked LSTM, noted as decoder_inp {1,2,3}.

Variable Length BPTT

Adding dropout everywhere isn't the only regularization ULMFiT adopts from AWD-LSTM; there is much more to see in this bag of tricks. Let's take a step back from classification and explore the language modeling part, since it's the first stage of the whole fine-tuning process. Even though ULMFiT already offers a language model pretrained on English Wikitext-103, we still need to fine-tune it on our custom dataset so it better adapts to the linguistic properties of our corpus. You would agree that the language used in movie reviews differs a lot from the language of legal documents, even though both are English.

In the original AWD-LSTM paper the authors emphasize how dataset usage in traditional language modeling is inefficient: with a fixed back-propagation window, the same words always contribute to the update in the same positions, with the same weighting of gradients flowing from the last word to the first. Randomizing the window length selected at each step is what is meant by variable length BPTT. This randomization acts as a regularization method and allows, in theory, 100% utilization of the dataset.

The way this is implemented in fastai, or at least the way the dataset is used efficiently, is as follows:

All the text is concatenated into one big array; you may think of it as a stream of text. It is then chunked into batch-size pieces, each sequence roughly BPTT tokens long. A special class called LanguageModelPreLoader takes care of all the utilization and optimization. It shuffles the indexes and adds randomness to the starting index of the selected sequence, which in theory lets every token serve as a starting position.
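As a back-of-the-envelope sketch of this idea (not the actual LanguageModelPreLoader code), one could chunk a single token stream like this, with a small random starting offset:

```python
import numpy as np

def lm_batches(token_ids, batch_size=4, bptt=5, max_offset=3, seed=0):
    """Yield (input, target) arrays of shape (batch_size, ~bptt) from one long
    token stream, starting at a small random offset so that, over many epochs,
    every token can appear at every position inside a window."""
    rng = np.random.default_rng(seed)
    offset = rng.integers(0, max_offset)              # randomize the starting index
    stream = np.asarray(token_ids[offset:])
    n = (len(stream) - 1) // batch_size               # tokens per row (leave room for targets)
    data = stream[:n * batch_size].reshape(batch_size, n)
    targets = stream[1:n * batch_size + 1].reshape(batch_size, n)
    for i in range(0, n, bptt):
        yield data[:, i:i + bptt], targets[:, i:i + bptt]

tokens = list(range(100))                             # pretend these are token ids
for x, y in lm_batches(tokens, batch_size=4, bptt=5):
    pass                                              # feed x, y to the language model
```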

MultiBatchEncoder

We've seen how the dataset is fully utilized in language modeling; now let's take a look at another smart way of processing a full sentence for the classification task. MultiBatchEncoder() is a wrapper over the encoder that allows processing a full sentence while still respecting the BPTT we've specified. Again, we can visualize this using spreadsheets.

We will have the following batch for simplicity:

batch = [“I love cats very much”, “I love dogs very much”], bptt = 3, output_dim=2

What we will do is slide over this batch along the sequence dimension in multiples of bptt and, at the end, concatenate the outputs from the encoder. This will give us batch x sequence length x output_dim (emb_size if tie_weights).

Top Down Visualization of How MultiBatchEncoder Works

This allows you to process a full sentence by repeatedly passing chunks of the sequence to the encoder (the stacked LSTM). You might ask: why not just pass the whole sequence instead of setting a BPTT? One reason is vanishing and exploding gradients caused by too many recurrent operations while updating the encoder. Another is GPU memory allocation for storing gradients: more recurrent operations mean a larger computational graph. Lastly, a smaller graph means fewer operations, faster training, and a cheaper update for a given parameter.
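Here is a toy sketch of that sliding behaviour, assuming a generic encoder that maps (batch, chunk_len) token ids to (batch, chunk_len, output_dim); the hidden state is carried across chunks but detached so gradients are truncated at each BPTT boundary:

```python
import torch
import torch.nn as nn

def encode_full_sentence(encoder, tokens, bptt=3):
    """tokens: (batch, seq_len) ids. Process the sequence in bptt-sized chunks
    and concatenate the encoder outputs along the time dimension."""
    outputs, hidden = [], None
    for i in range(0, tokens.size(1), bptt):
        chunk = tokens[:, i:i + bptt]
        out, hidden = encoder(chunk, hidden)          # (batch, chunk_len, output_dim)
        hidden = tuple(h.detach() for h in hidden)    # truncate gradients at the bptt boundary
        outputs.append(out)
    return torch.cat(outputs, dim=1)                  # (batch, seq_len, output_dim)

# a toy "encoder": embedding + LSTM, standing in for the ULMFiT stacked LSTM
class ToyEncoder(nn.Module):
    def __init__(self, vocab=10, emb=8, hid=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
    def forward(self, x, hidden=None):
        return self.lstm(self.emb(x), hidden)

batch = torch.randint(0, 10, (2, 5))                  # two 5-token sentences as ids
print(encode_full_sentence(ToyEncoder(), batch, bptt=3).shape)   # torch.Size([2, 5, 2])
```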

QRNN

Earlier we discussed how ULMFiT is constructed with stacked LSTMs and even bidirectional LSTMs (the only difference being the concatenation of outputs from both directions). Still, there is one more encoder layer you might consider instead of the LSTM: QRNN, the Quasi-Recurrent Neural Network, a work from some of the same authors as AWD-LSTM. The main motivation of the QRNN is to introduce convolutions that parallelize the operations, which is not possible with sequentially dependent LSTM layers. It is also stated that the QRNN not only allows up to 16x faster computation but also performs better than a stacked LSTM in various tasks such as language modeling, sentiment classification, and character-level neural machine translation.

CNNs have been used for sequence-related tasks and are known to be inherently good at extracting features such as n-grams. Say you have a 2D CNN layer with a kernel size of 3x300 [n_seq x emb_size]; this layer would allow you to extract features from 3 consecutive word tokens, in other words 3-gram features.
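As a quick generic PyTorch illustration (not QRNN code yet), a 1D convolution with kernel size 3 over the embeddings produces one feature per window of 3 consecutive tokens, i.e. 3-gram features:

```python
import torch
import torch.nn as nn

emb_size, n_filters = 300, 64
conv3 = nn.Conv1d(in_channels=emb_size, out_channels=n_filters, kernel_size=3)

x = torch.randn(2, 7, emb_size)        # (batch, seq_len, emb_size)
feats = conv3(x.transpose(1, 2))       # Conv1d expects (batch, emb_size, seq_len)
print(feats.shape)                     # torch.Size([2, 64, 5]): one feature per 3-gram window
```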

A view of how multiple filters are applied for 2-, 3-, and 4-gram features

But the way these convolutions operate does not allow them to utilize the full order information of the sequence. Even in state-of-the-art attention language models such as BERT, order is encoded separately so it can be consumed as additional information, and it's known to be highly valuable.

QRNN addresses some of the problems CNNs and RNNs have: convolutions being time-invariant and LSTMs being non-parallelizable. We can say that the QRNN combines the best of both worlds: the parallel nature of convolutions and the time dependencies of LSTMs.


LSTM, CNN and QRNN architectures compared

In the first set of operations in a QRNN layer we compute Z (the candidate output), F (the forget gate), and O (the output gate).

In language modeling tasks the convolution filters should only operate on previous tokens, since the task is to predict the next token. To do so, QRNN uses a technique called masked convolutions, introduced in PixelRNN from Google DeepMind. This allows building an autoregressive model that does not leak information from the future.

In the QRNN paper, masked convolutions are said to be implemented by padding the input on the left by kernel size minus one. Let's take a look at an example. Again we are starting off with a dummy sequence after embedding lookup: ["I", "love", "cats", "and", "dogs"].
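A small sketch of the left-padding trick, written as an illustration rather than the fastai QRNN implementation: pad by kernel_size - 1 on the left so the convolution at position t never sees tokens after t, and compute Z, F, O in one convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_size, hidden, k = 4, 6, 2
conv = nn.Conv1d(emb_size, 3 * hidden, kernel_size=k)   # one conv producing Z, F, O together

x = torch.randn(1, 5, emb_size)          # ["I", "love", "cats", "and", "dogs"] after lookup
x = x.transpose(1, 2)                    # (batch, emb_size, seq_len)
x = F.pad(x, (k - 1, 0))                 # pad on the left by kernel_size - 1: no look-ahead
gates = conv(x)                          # (batch, 3 * hidden, seq_len)
z, f, o = gates.chunk(3, dim=1)
z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
print(z.shape)                           # torch.Size([1, 6, 5]): one output per input token
```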

Wz, Wf and Wo used to generate the gate outputs

We extract the Z, F, and O gates by doing a padded sum-product with the transposed Wz, Wf, and Wo.

Then apply tanh, sigmoid and sigmoid activations respectively

The operations above give us the Z, F, and O gates. Now we will look at how fo-pooling takes place. This is the part where the time dependency comes in.

fo-pooling: c_t = f_t * c_{t-1} + (1 - f_t) * z_t, and h_t = o_t * c_t

There is also an option to use f-pooling, which discards the output gate (so h_t = c_t), by setting output_gate=False.

Although the recurrent parts of these functions must be calculated for each timestep in sequence, their simplicity and parallelism along feature dimensions means that, in practice, evaluating them over even long sequences requires a negligible amount of computation time.

Let's have a look at how c_t and h_t are computed at each step:

sequential c_t and h_t generation
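In code, a minimal sketch of fo-pooling over the gates from the previous snippet (z, f, o shaped (batch, hidden, seq_len)) could look like this; only the small loop below is inherently sequential:

```python
import torch

def fo_pooling(z, f, o):
    """c_t = f_t * c_{t-1} + (1 - f_t) * z_t ;  h_t = o_t * c_t
    z, f, o were computed in parallel; only this loop runs over time."""
    batch, hidden, seq_len = z.shape
    c = z.new_zeros(batch, hidden)
    hs = []
    for t in range(seq_len):
        c = f[:, :, t] * c + (1 - f[:, :, t]) * z[:, :, t]
        hs.append(o[:, :, t] * c)
    return torch.stack(hs, dim=2)        # (batch, hidden, seq_len)

# h = fo_pooling(z, f, o)  ->  (batch, hidden, seq_len)
```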

In order to capture larger n-grams, the kernel size can be increased when computing Z, F, and O. Similarly, to approximate more complex functions, multiple layers can be stacked, just like stacked LSTMs. Multiple convolution-plus-pooling layers will give more powerful models; you may look up the universal approximation theorem for a better intuition. Still, we don't want our model to overfit, so we need some sort of regularization within the model itself.

Earlier, while explaining AWD-LSTM, we saw how the dropout scheme matters for not disrupting long-term dependencies, and how important it is to keep dropout consistent throughout the sequence forward pass. Since the QRNN lacks recurrent weights, the authors suggest applying dropout during the calculation of the F gate, because, as you may notice, the F gate takes part in every timestep's calculation.

The dropout amount here is also referred to as "zoneout" in the original paper.

Here is the modified version of our sample illustration with dropout added.
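In code, this amounts to something like the following sketch (a plain Bernoulli mask with no rescaling, as the paper notes): dropping a channel sets f_t = 1 for it, so the cell state is copied through that timestep untouched.

```python
import torch

def f_gate_dropout(f, p=0.1, training=True):
    """Zoneout-style dropout on the forget gate: f = 1 - dropout(1 - f),
    using a plain Bernoulli mask (no rescaling). A dropped channel gets
    f_t = 1 and therefore carries c_{t-1} forward unchanged."""
    if not training or p == 0:
        return f
    mask = torch.bernoulli(f.new_full(f.shape, 1 - p))   # 1 = keep, 0 = drop
    return 1 - (1 - f) * mask
```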

Remember that other dropout methods that we’ve seen for AWD-LSTM are also applicable here and might be good for further regularizing your model. QRNN is just an optional replacement for LSTM layers.

In the fastai implementation, the matrices [Wz, Wf, Wo] are concatenated into one matrix for performance reasons, and WeightDropout() is applied to this combined matrix.

Sample FastAi implementation

Alright, enough theoretical concepts — let’s get our hands dirty by implementing ULMFiT on a dataset and see what the hype is all about.

Our objective here is to fine-tune a pre-trained model and use it for text classification on a new dataset. We will implement ULMFiT in this process. The interesting thing here is that this new data is quite small in size (<1000 labeled instances). A neural network model trained from scratch would overfit on such a small dataset. Hence, I would like to see whether ULMFiT does a great job at this task as promised in the paper.

Import Required Libraries:
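The snippets below assume the fastai v1 API this article was written against, plus pandas, sklearn, and nltk; treat them as a sketch of the workflow rather than the exact original notebook.

```python
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

from fastai.text import *   # fastai v1: TextLMDataBunch, TextClasDataBunch, learners, ...
```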

If you want to import a dataset manually from your local machine instead, you can follow the code below.
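For example, if your data lives in a local csv file (hypothetical path and column names), something like this would do:

```python
# hypothetical file: adjust the path and make sure you keep a text and a label column
local_df = pd.read_csv('data/my_dataset.csv')[['label', 'text']]
```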

Let’s create a dataframe consisting of the text documents and their corresponding labels (newsgroup names).
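Here, the labels and texts come from the 20 Newsgroups corpus in sklearn (an assumption consistent with the label names used below):

```python
# fetch the 20 Newsgroups training split and build a dataframe of documents + labels
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({'label': newsgroups.target, 'text': newsgroups.data})
print(df.shape)
```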

Shape of dataframe

We’ll convert this into a binary classification problem by selecting only 2 out of the 20 labels present in the dataset. We will select labels 1 and 10 which correspond to ‘comp.graphics’ and ‘rec.sport.hockey’, respectively.
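A sketch of that filtering step; re-encoding the two labels as 0/1 is my own choice for a clean binary setup:

```python
# keep only comp.graphics (1) and rec.sport.hockey (10)
df = df[df['label'].isin([1, 10])].reset_index(drop=True)
df['label'] = (df['label'] == 10).astype(int)   # 1 = rec.sport.hockey, 0 = comp.graphics
print(df['label'].value_counts())
```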

Quick look at the target distribution

The distribution looks pretty even. Accuracy would be a good evaluation metric to use in this case.

Data Preprocessing

It's always good practice to feed clean data to your models, especially when the data comes in the form of unstructured text. Let's clean our text by retaining only alphabetic characters and removing everything else.
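One simple way to do this (a sketch; tune the regex to your own needs):

```python
# keep only alphabetic characters, collapse everything else into spaces, lower-case
df['text'] = (df['text']
              .str.replace(r'[^a-zA-Z]+', ' ', regex=True)
              .str.lower()
              .str.strip())
```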

Now, we will get rid of the stopwords from our text data. If you have never used stopwords before, then you will have to download them from the nltk package as I’ve shown below:
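A minimal stopword-removal pass with nltk could look like this:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                      # one-time download
stop_words = set(stopwords.words('english'))

df['text'] = df['text'].apply(
    lambda doc: ' '.join(w for w in doc.split() if w not in stop_words))
```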

Now let’s split our cleaned dataset into training and validation sets in a 60:40 ratio.
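Using sklearn's train_test_split (stratifying on the label keeps the class balance in both splits; the random seed is arbitrary):

```python
# 60:40 train/validation split
df_trn, df_val = train_test_split(df[['label', 'text']], test_size=0.4,
                                  stratify=df['label'], random_state=12)
print(df_trn.shape, df_val.shape)
```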

Checking the shape of train and validation dataset.

Before proceeding further, we’ll need to prepare our data for the language model and for the classification model separately. The good news? This can be done quite easily using the fastai library:
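With fastai v1, the two databunches can be built roughly like this (the classifier data reuses the language model's vocabulary):

```python
# language model data: ignores the labels and learns from the raw text
data_lm = TextLMDataBunch.from_df(path='', train_df=df_trn, valid_df=df_val,
                                  text_cols='text')

# classifier data: same splits, but with labels and the LM vocabulary
data_clas = TextClasDataBunch.from_df(path='', train_df=df_trn, valid_df=df_val,
                                      text_cols='text', label_cols='label',
                                      vocab=data_lm.train_ds.vocab, bs=32)
```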

Fine-Tuning the Pre-Trained Model and Making Predictions:

We can use the data_lm object we created earlier to fine-tune a pre-trained language model. We can create a learner object, ‘learn’, that will directly create a model, download the pre-trained weights, and be ready for fine-tuning:
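Roughly, and assuming fastai v1's language_model_learner with the pretrained AWD-LSTM weights:

```python
# learner with the pretrained AWD-LSTM backbone
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.7)

# fine-tune the language model on our corpus and keep the encoder for later
learn.fit_one_cycle(1, 1e-2)
learn.save_encoder('ft_enc')
```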

The one-cycle policy with cyclic momentum allows the model to be trained at higher learning rates and converge faster, and it also provides some form of regularization. We won't go into the depths of how this works, as this article is about learning the implementation.
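The classification step then loads the fine-tuned encoder and trains with gradual unfreezing; the learning rates below are illustrative values, not tuned results:

```python
# classifier on top of the fine-tuned encoder
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.7)
learn.load_encoder('ft_enc')

# gradual unfreezing: head first, then progressively deeper layers
learn.fit_one_cycle(1, 1e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3 / 2., 5e-3))
learn.unfreeze()
learn.fit_one_cycle(1, slice(2e-3 / 100, 2e-3))

# predictions on the validation set
preds, targets = learn.get_preds(ds_type=DatasetType.Valid)
print((preds.argmax(dim=1) == targets).float().mean())
```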

Conclusion

Hence we can take any text data and classify it into the given labels using the ULMFiT model. However, when using your own data, be careful to keep only a text column and a label column. I've also shown how to import your own dataset and use it; once you start doing it, I assure you it's going to be fun.

Next Steps: Testing with New Datasets

Let's try this model out on some new datasets!

Airline Sentiment Dataset

This dataset includes customer tweets about airlines along with sentiment labels. Each tweet is labeled as neutral, positive, or negative. The dataset can be downloaded from here. The csv files were edited to contain only the label and tweet text columns.

We can see that the distribution is imbalanced, so we should probably not rely on accuracy alone as a performance metric. Let's also use precision, recall, and F1 score. These relate to the numbers of true positives, true negatives, false positives, and false negatives.
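If predictions are collected with get_preds as in the earlier snippet, the extra metrics can be computed with sklearn (macro averaging treats every class equally, which matters when the classes are imbalanced):

```python
from sklearn.metrics import precision_recall_fscore_support

y_pred = preds.argmax(dim=1).numpy()
y_true = targets.numpy()
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
print(precision, recall, f1)
```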

Sentiment 140 Twitter Dataset

This dataset includes a variety of tweets and their corresponding sentiment labels. Each tweet is again labeled as neutral, positive, or negative. The dataset can be downloaded from here. Only the sentiment label and tweet text columns were used.

Note: The label legend is 0 = negative, 2 = neutral, 4 = positive

This one seems relatively balanced, so accuracy will work as a metric. We can include the other previously used metrics for curiosity's sake as well.

Results for the Sentiment 140 Dataset

Accuracy is probably the best metric to look at here, given that the number of instances in each class is around the same. However, the accuracy does not tell a good story: it is a decrease from what we saw on the previous datasets, especially the newsgroups dataset. This suggests the sentiment model is not generalizing well. The performance may also be affected by having three classes instead of the two tested in the first dataset.
