Understanding building blocks of ULMFIT

Last week I had the time to tackle a Kaggle NLP competition: Quora Insincere Questions Classification. As the name suggests, the task is to tell sincere questions from insincere ones given the question text. In short, it’s a binary classification problem.

I recently completed Fast.ai Part 1 (2019). The library has evolved drastically over the last year. It is a lot easier and faster to use, and it has amazing documentation. Because the documentation is so good, I rarely hit a problem that requires me to post a question in the forums, though the forums are still great for learning things beyond the library itself. If you want to apply for Part 2 (2019) as a diversity fellow or sponsor here is the link.

If you don’t know it already, transfer learning in NLP has generated a lot of excitement over the past year, starting with ULMFIT, ELMo, GLoMo, OpenAI transformer, BERT and recently Transformer-XL, each further improving the language modeling capabilities of the state of the art. I am not going to be able to explain all of these exceptional works, only ULMFIT. For a general overview of all the above models and more, I encourage you to read this amazing blog post to get a deeper understanding of language modeling methodologies and advancements.

First, my aim is not to explain how or why ULMFIT came to be a state-of-the-art model, because there are many resources like this fast.ai blog post, the Part 1 (2019) course videos, and many other blog posts floating around the internet. Rather, I will dig deeper into the references of ULMFIT, and the references of its references, to get a better understanding of the fundamental building blocks that it adopts.

This might be a little overwhelming as a way of demystifying the underpinnings of this study, but sometimes being a little obsessed helps. I usually end up opening 87634172 tabs in my browser, reading the source code and spending quite some time bouncing back and forth between all these resources. I did the same when I got my hands dirty with ULMFIT for the Kaggle competition I introduced at the beginning of this post. Since I’ve already spent quite some time digging deeper, I thought it would be cool to share what I distilled from it.

The high-level idea of ULMFIT is to train a language model on a very large corpus like Wikitext-103 (103M tokens), then take this pretrained model’s encoder, combine it with a custom head model, e.g. for classification, and carefully do the good old fine-tuning in multiple stages using discriminative learning rates.

You should already be familiar with all of these, after all the aim here is to dig a little deeper.

if not: print("shame :)"); open(this).read() or open(this).read()
else: continue


Here is an overview of our encoder + classifier model.

(0): MultiBatchEncoder(
  (module): AWD_LSTM(
    (encoder): Embedding(60003, 300, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60003, 300, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(300, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 300, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
)
(1): PoolingLinearClassifier(
  (layers): Sequential(
    (0): BatchNorm1d(900, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): Dropout(p=0.4)
    (2): Linear(in_features=900, out_features=50, bias=True)
    (3): ReLU(inplace)
    (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Dropout(p=0.1)
    (6): Linear(in_features=50, out_features=2, bias=True)
  )
)

Above is the layer-by-layer view of ULMFIT in the fast.ai library. The model is composed of a pretrained encoder (MultiBatchEncoder, module 0) and a custom head, in this case for a binary classification task (PoolingLinearClassifier, module 1).

The architecture that ULMFIT uses for its language modeling task is an AWD-LSTM. The name is an abbreviation of ASGD Weight-Dropped LSTM.

Here is a hint about maybe the most important differentiator of this model in form of a meme:

AWD-LSTM literally has dropout at every possible layer, as long as it makes sense. The fast.ai docs explain all of them precisely (I also encourage you to check out the documentation):

If you go back and look at the ULMFIT architecture shown earlier, you can match the modules listed below to it one by one. To simplify everything I will assume a vocabulary of V = [“I”, “went”, “to”, “trouble”, “of”, “exceling”, “for”, “you”, “guys”, “:)” ]. Don’t be surprised to see the smiley :) as another token; any information you can represent is valuable. Tokenization is another important aspect of NLP which will not be covered here, but those who are particularly interested may visit here.

Embedding Lookup Table with n dimensions
  1. Encoder Dropout (EmbeddingDropout()): Zeroing embedding vectors randomly.
encoder_dp module
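To make the idea concrete, here is a minimal plain-Python sketch of encoder dropout over the toy vocabulary above. It is not the fast.ai implementation: the function name embedding_dropout, the embedding size of 4 and the rescaling of surviving rows by 1 / (1 - p) are assumptions made for illustration.

```python
import random

def embedding_dropout(weight, p, rng):
    """Zero whole rows (tokens) of an embedding matrix with probability p.

    Surviving rows are rescaled by 1 / (1 - p) (an assumption of this sketch)
    so that the expected value of each entry stays the same.
    """
    scale = 1.0 / (1.0 - p)
    dropped = []
    for row in weight:
        keep = rng.random() >= p          # one Bernoulli draw per token, not per element
        dropped.append([v * scale if keep else 0.0 for v in row])
    return dropped

rng = random.Random(0)
vocab = ["I", "went", "to", "trouble", "of", "exceling", "for", "you", "guys", ":)"]
emb = [[rng.gauss(0, 1) for _ in range(4)] for _ in vocab]   # 10 tokens x 4 dims
masked = embedding_dropout(emb, p=0.3, rng=rng)
for token, row in zip(vocab, masked):
    print(f"{token:>9}", [round(v, 2) for v in row])
```

A dropped token is zeroed everywhere it occurs in the batch, because the mask lives on the lookup table rather than on the looked-up activations.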

2. Input Dropout (RNNDropout()): Zeroing embedding lookup outputs randomly.

N.B. Here, the shape of the input fed into the LSTM at each forward pass is batch size x sequence length x embedding size.

Dropout is one of the de facto ways of regularizing deep learning models: it prevents complex co-adaptations of hidden units, which is a common cause of overfitting.[1,2]

The AWD-LSTM paper emphasizes that, before its contribution, applying naive classical dropout techniques to an LSTM for language modeling was ineffective because it disrupts the ability to retain long-term dependencies, while the dropout methods that might have worked required inefficient re-implementations of the cuDNN LSTM. Retaining long-term dependencies is a key point; after all, the main differentiator of an LSTM over a plain RNN in practice is its ability to carry longer-term dependencies by learning how much information to let flow through the sequence.[3]

Let’s try to imagine what happens if we use a naive dropout layer on the input sequences fed to an LSTM, and convince ourselves why it might not be effective in practice.

In the PyTorch documentation the plain dropout layer is defined as:

During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. Each channel will be zeroed out independently on every forward call.

What we actually want is to zero elements in such a way that the same features are disabled at every timestep of a sequence. After all, each time we apply dropout we are running a partially masked version of our model, f(x), and that mask should stay the same for every timestep of a given input’s forward pass.

# shape : (1, 3, 5) - (bs, seq_len, emb_len)
# taking a single sentence like "I love dogs" as an example
# case 1: naive dropout 
tensor([[ 1.7339, 0.2212, -2.5182, -0.0000, 2.2808],
[-0.6162, 0.4100, -0.6159, -1.3283, 1.8158],
[-2.5599, -1.5934, -2.1049, 0.8675, -0.6511]])
# case 2: special RNN dropout
tensor([[ 1.7339, 0.0000, -0.0000, -1.7596, 2.2808],
[-0.6162, 0.0000, -0.0000, -1.3283, 1.8158],
[-2.5599, -0.0000, -0.0000, 0.8675, -0.6511]])

As you can see above, the naive method does not guarantee that the same input features are disabled at every timestep, which means the input-to-hidden weights that are effectively switched off change from one timestep to the next. RNNDropout(), on the contrary, keeps the mask consistent throughout the whole sequence. In computer vision it’s fine to mask inputs randomly because the forward pass is stateless, but in recurrent networks we should carry the exact same mask through every step. I hope it’s now a bit clearer why retaining longer-term dependencies with the naive method might be hard.
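The difference between the two cases can be sketched in a few lines of plain Python. This is an illustration rather than the library code: rnn_dropout is a hypothetical name, the input is filled with ones so the mask is easy to read, and kept values are rescaled by 1 / (1 - p) as an assumption.

```python
import random

def rnn_dropout(batch, p, rng):
    """batch: bs x seq_len x emb nested lists.

    One mask of length emb is drawn per sequence and reused at every timestep,
    so the same feature columns are zeroed along the whole sequence.
    """
    scale = 1.0 / (1.0 - p)
    out = []
    for seq in batch:
        emb_len = len(seq[0])
        mask = [scale if rng.random() >= p else 0.0 for _ in range(emb_len)]
        out.append([[v * m for v, m in zip(step, mask)] for step in seq])
    return out

rng = random.Random(0)
batch = [[[1.0] * 5 for _ in range(3)] for _ in range(2)]   # (bs=2, seq_len=3, emb=5)
dropped = rnn_dropout(batch, p=0.4, rng=rng)
for seq in dropped:
    for step in seq:
        print([round(v, 2) for v in step])
    print()
```

Each sequence in the batch gets its own mask, but within a sequence the zeroed columns line up vertically, just like case 2 above.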

Rather than memorizing the functions let’s paste this image here just in case :)
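For reference, the standard LSTM update can be written as below. The symbol names are a notational choice made here, not taken from the figure: W are the input-to-hidden matrices, U the hidden-to-hidden matrices, and the circled dot is elementwise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In PyTorch the four W matrices are stored stacked in weight_ih_l0 and the four U matrices in weight_hh_l0.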

3. Weight Dropout (WeightDropout()): Apply dropout to LSTM weights.

Previously, in Input Dropout, we’ve seen how dropout is applied to each input sequence independently but consistently throughout the sequence. In Weight Dropout, dropout is applied to the hidden-to-hidden weight matrix by randomly zeroing its individual entries. The weight mask is refreshed at each batch-forward pass and stays fixed for all timesteps within that pass.

import torch.nn as nn

# PyTorch LSTM weights: (rnn.weight_ih_l0, rnn.weight_hh_l0)
# single-layer, unidirectional LSTM: input size 5, hidden size 10
rnn = nn.LSTM(5, 10, 1, bidirectional=False, batch_first=True)

Weight Dropout is a generic dropout layer which wraps a PyTorch module and randomly zeroes entries of the weights whose names you pass in. So you can theoretically use it for any custom layer, given the named attributes. In the call below, notice that weight_ih_l0 is not registered for weight dropout.

My assumption is that, since input dropout is already present, adding another layer of dropout on the input-to-hidden weights would be unnecessary and would also make it harder to control the level of dropout we want over the input-to-hidden transformation. Also, the ability to add randomness for every sample is much more powerful than doing it only once per batch-forward pass.

WeightDropout(rnn, weight_p=0.5, layer_names=['weight_hh_l0'])
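What happens to the hidden-to-hidden matrix can be sketched without any framework. The code below is a simplified illustration, not the WeightDropout source: weight_dropout and raw_w_hh are hypothetical names, and the 4 x 4 matrix stands in for weight_hh_l0.

```python
import random

def weight_dropout(w_hh, p, rng):
    """Return a masked copy of the hidden-to-hidden matrix.

    The raw weights are left untouched so a fresh mask can be drawn next time.
    Kept entries are rescaled by 1 / (1 - p), as standard dropout does.
    """
    scale = 1.0 / (1.0 - p)
    return [[w * scale if rng.random() >= p else 0.0 for w in row] for row in w_hh]

rng = random.Random(1)
raw_w_hh = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]

for step in range(2):                      # two batch-forward passes
    w = weight_dropout(raw_w_hh, p=0.5, rng=rng)
    # the same masked `w` would be used at every timestep of this batch
    print(f"batch {step}:")
    for row in w:
        print("  ", [round(v, 2) for v in row])
```

Because the mask is drawn once per forward pass, every timestep and every sequence in the batch sees the same thinned recurrent connection.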

4. Hidden Dropout (RNNDropout()): Zeroing outputs of LSTM layers. This dropout is applied except for the last LSTM layer output.

This dropout layer randomly zeroes out raw outputs from previous LSTM layers in a stacked LSTM module. It can be thought of as zeroing out the inputs of the stacked LSTM layers, e.g. in a 3-layer LSTM, the inputs to LSTM-2 and LSTM-3 would be randomly zeroed.

As with Input Dropout, the same features are zeroed at every timestep before the hidden features are fed to the next LSTM layer, so the mask stays consistent throughout the sequence. The same logic we discussed for the original sequence input applies to the inputs of the upper LSTM layers.

5. Output Dropout (RNNDropout()): Zeroing final sequence outputs from encoder before feeding it to decoder.

This module is not visible when we look at the full ULMFIT classification model, since this dropout is used before the language model decoder generates next-word predictions. The language model decoder is nothing but a fully connected linear layer which transforms the outputs of the encoder into token predictions over our vocabulary. As we are dealing with a sequential input to the decoder linear layer, keeping the dropout consistent along the sequence is important.

Main Takeaway:

  • Apply dropout consistently throughout a single input sequence, and apply it anywhere possible. Draw a fresh random mask per input sequence.
  • Also apply dropout to the hidden-to-hidden weights of the LSTM. Draw a fresh random mask per batch-forward pass.

Wrapping Everything Up

Now that all the dropout layers are explained, we can put them together and visualize where each one of them takes place.

Let’s define a dummy batch with the following sequences :

batch = [“I love cats”, “I love dogs”]

  • First step is to initialize an embedding matrix for each token and apply the encoder dropout
  • Then we will do a look up for our batch and apply the input dropout
  • Next step is to allow random dropout at hidden-to-hidden weights for each LSTM layer. Here I am defining a 2 layer LSTM with 5 hidden units

You may notice how the same tokens, like “love”, have different input features after input dropout. Here Whh is the hidden-to-hidden weight, and it is shared throughout the forward pass of the full sequence.

  • Lastly we take a look at hidden dropout which is the input for the upper layer LSTMs in the stack

If we look at output dropout, you will notice it’s nothing but the same type of dropout applied to the final raw outputs of the stacked LSTM encoder, which are noted as decoder_inp {1,2,3}.

Variable Length BPTT

Adding dropout everywhere isn’t the only regularization ULMFIT adopts from AWD-LSTM, and there is much more to see in this bag of tricks. Let’s take a step back from classification and explore the language modeling part, since it’s the first part of the whole fine-tuning process. Even though ULMFIT already offers a language model pretrained on English Wikitext-103, we still need to fine-tune it on our custom dataset in order to adapt better to the linguistic properties of our corpus. You would agree that the language used in movie reviews differs a lot from the language of legal documents, even though both are in English.

In the original AWD-LSTM paper the authors emphasize how inefficiently the dataset is used in traditional language modeling. The reason is that with a fixed back-propagation window, the same words always sit at the same positions in the window, so they always contribute to the update with the same share of the gradient flowing from the last word to the first. Randomizing the length of the window selected at each step is what is meant by variable length BPTT. This randomization acts as a regularization method and allows a theoretical 100% utilization of the dataset.
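A minimal sketch of how such a random window length could be drawn is given below. The function name sample_bptt is hypothetical, and the constants (probability 0.95 of using the full base length, standard deviation 5, minimum length 5, base length 70) follow the values reported for AWD-LSTM as best understood here, so they should be treated as assumptions.

```python
import random

def sample_bptt(base_bptt, rng, p_full=0.95, std=5.0, min_len=5):
    """Draw the window length for one training step.

    Most of the time the base length is used, occasionally half of it;
    the final length is then jittered with Gaussian noise.
    """
    base = base_bptt if rng.random() < p_full else base_bptt / 2
    return max(min_len, int(rng.gauss(base, std)))

rng = random.Random(0)
lengths = [sample_bptt(70, rng) for _ in range(10)]
print(lengths)
```

Since shorter windows would otherwise get the same step size as longer ones, the paper also rescales the learning rate in proportion to the sampled length; that part is left out of the sketch.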

The way this is implemented in fast.ai, or at least the way the dataset is used efficiently, is as follows:

All the text is concatenated into one big array; you may think of it as a stream of text. This stream is then split into batch-size many parallel sequences, and each of them is consumed in pieces that are BPTT tokens long. A special callback class called LanguageModelPreLoader takes care of all the bookkeeping and optimization. It shuffles the index of the texts and adds randomness to the starting index of the selected sequence, which in theory allows every token to be used as a starting point.
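The stream-and-chunk idea can be sketched as follows. This is a deliberately simplified stand-in for what LanguageModelPreLoader does, not its actual code: batchify, lm_batches and the offset parameter are hypothetical names, and shuffling of documents is omitted.

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` parallel rows of equal length."""
    n = len(tokens) // bs                 # tokens that do not fit are dropped
    return [tokens[i * n:(i + 1) * n] for i in range(bs)]

def lm_batches(rows, bptt, offset=0):
    """Yield (x, y) windows; y is x shifted by one token (next-word targets).

    `offset` moves the starting index, which changes where every window begins.
    """
    n = len(rows[0])
    for start in range(offset, n - 1, bptt):
        end = min(start + bptt, n - 1)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = "I love cats . I love dogs . cats love dogs .".split()
rows = batchify(stream, bs=2)
for x, y in lm_batches(rows, bptt=3, offset=1):
    print(x, "->", y)
```

Changing offset between epochs shifts all window boundaries, so a given token ends up at a different position inside its window each time.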


We’ve seen how the dataset is utilized in language modeling; now let’s take a look at another smart way of processing a full sentence for the classification task. MultiBatchEncoder() is a wrapper over the encoder which allows processing a full sentence while still respecting the BPTT that we’ve specified. Again we can visualize this using spreadsheets.

We will have the following batch for simplicity:

batch = [“I love cats very much”, “I love dogs very much”], bptt = 3, output_dim=2

What we will do is slide over this batch along the sequence dimension in steps of bptt and, at the end, concatenate the outputs from the encoder. This will give us a tensor of shape batch x sequence length x output_dim (emb_size if tie_weights).

Top Down Visualization of How MultiBatchEncoder Works

This allows you to process a full sentence by passing chunks of the sequence to the encoder (stacked LSTM) one after another. You might ask: why set a BPTT at all instead of just passing the whole sequence? One reason is vanishing and exploding gradients, caused by too many recurrent operations when updating the encoder. Another is the GPU memory needed for storing gradients; more recurrent operations mean a larger computational graph. Lastly, a smaller graph means fewer operations, so faster training and a cheaper update for a given parameter. For more on BPTT see this wonderful post. Be aware that what we’ve been referring to as BPTT is called truncated BPTT in that post.
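The sliding mechanism can be sketched with a stand-in encoder. Everything here is hypothetical and simplified: toy_encoder replaces the stacked LSTM with a running token count so that the carried state is easy to follow, and each output is a single number rather than an output_dim-sized vector.

```python
def toy_encoder(chunk, state):
    """Stand-in for the stacked LSTM: the output at t is how many tokens were seen so far.

    chunk: bs x chunk_len tokens; state: one number per sequence, carried across chunks.
    """
    outputs, new_state = [], []
    for seq, s in zip(chunk, state):
        outs = []
        for _tok in seq:
            s = s + 1
            outs.append(s)
        outputs.append(outs)
        new_state.append(s)
    return outputs, new_state

def multi_batch_encode(batch, bptt):
    """Feed the batch to the encoder in bptt-sized chunks and concatenate the outputs."""
    state = [0] * len(batch)
    full = [[] for _ in batch]
    seq_len = len(batch[0])
    for start in range(0, seq_len, bptt):
        chunk = [seq[start:start + bptt] for seq in batch]
        out, state = toy_encoder(chunk, state)   # state flows from chunk to chunk
        for acc, o in zip(full, out):
            acc.extend(o)
    return full

batch = ["I love cats very much".split(), "I love dogs very much".split()]
print(multi_batch_encode(batch, bptt=3))   # [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
```

Because the state is handed from one chunk to the next, the concatenated output is the same as if the whole sentence had been processed in one go; only the length of the graph that gradients flow through is limited.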


Earlier we’ve discussed how ULMFIT is constructed with stacked LSTMs and even bidirectional LSTMs (the only difference being a concatenated output from both directions). Still, there is one more encoder layer that you might consider instead of the LSTM: QRNN, the Quasi-Recurrent Neural Network, a work by some of the same authors as AWD-LSTM. The main motivation of QRNN is to introduce convolutions to parallelize operations, which is not possible with LSTM layers because each timestep depends on the previous one. The paper also states that QRNN is not only up to 16x faster but also performs better than stacked LSTMs on various tasks such as language modeling, sentiment classification, and character-level neural machine translation.[5]

CNNs have been used for sequence related tasks and it’s also known that they are inherently good at extracting features such as n-grams. Say you have a 2D CNN layer with a kernel size of 3x300 [n_seq x emb_size], this layer would allow you to extract features from 3 consecutive word tokens, or in other words 3-gram features.

A view of how multiple filters applied for 2–3–4 gram features

But the nature of how these convolutions operate doesn’t allow them to utilize the full order information of the sequence. Even in state-of-the-art attention-based language models such as BERT, order is encoded so that it can be consumed as additional information, and it’s known to be highly valuable.

QRNN addresses some of the problems CNNs and RNNs have: convolutions are time-invariant and LSTMs cannot be parallelized over timesteps. We can say that QRNN combines the best of both worlds: the parallel nature of convolutions and the time dependencies of LSTMs.

LSTM, CNN and QRNN architectures compared

The first set of operations in a QRNN layer outputs Z (candidate output), F (forget gate) and O (output gate).

In language modeling tasks convolution filters should only operate on previous tokens, since the task is to predict the next token. To do so, QRNN uses a technique called masked convolutions, which was introduced in Pixel RNN from Google DeepMind. This makes it possible to build an autoregressive model which doesn’t leak information from the future.[7]

In the QRNN paper, masked convolutions are said to be implemented by padding the input to the left by kernel size minus one. Let’s take a look at an example. Again we are starting off with a dummy sequence after embedding lookup [“I”, “love”, “cats”, “and”, “dogs”].

Wz, Wf and Wo are used to generate the gate outputs
We extract the Z, F, O gates by doing a padded sum product with the transposed Wz, Wf and Wo
Then apply tanh, sigmoid and sigmoid activations respectively
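The captions above can be turned into a small plain-Python sketch of a left-padded (causal) convolution. The names causal_conv and rand_w, the kernel size of 2 and the toy dimensions are assumptions made for illustration, not the layout used in fast.ai.

```python
import math
import random

def causal_conv(xs, W, act):
    """xs: seq_len x d inputs; W: k x d x h filter bank; returns seq_len x h.

    The input is left-padded with k - 1 zero vectors, so output t only sees x[t-k+1 .. t].
    """
    k, d, h = len(W), len(W[0]), len(W[0][0])
    padded = [[0.0] * d for _ in range(k - 1)] + xs
    out = []
    for t in range(len(xs)):
        window = padded[t:t + k]          # k consecutive inputs ending at x[t]
        row = []
        for u in range(h):
            s = sum(window[j][i] * W[j][i][u] for j in range(k) for i in range(d))
            row.append(act(s))
        out.append(row)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

rng = random.Random(0)
seq_len, d, h, k = 5, 3, 2, 2             # ["I", "love", "cats", "and", "dogs"] after lookup
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]

def rand_w():
    return [[[rng.gauss(0, 1) for _ in range(h)] for _ in range(d)] for _ in range(k)]

Wz, Wf, Wo = rand_w(), rand_w(), rand_w()
Z = causal_conv(xs, Wz, math.tanh)        # candidate output
F = causal_conv(xs, Wf, sigmoid)          # forget gate
O = causal_conv(xs, Wo, sigmoid)          # output gate
print(len(Z), len(Z[0]))
```

None of the three outputs at position t depends on tokens after t, and all positions can be computed independently of each other, which is where the parallelism comes from.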

The operations above give us the Z, F, O gates. Now we will look at how fo-pooling takes place. This is the part where time dependency comes in.

fo-pooling equation
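Written out, fo-pooling as defined in the QRNN paper is the following recurrence, where z_t, f_t and o_t are the rows of Z, F and O at timestep t and the circled dot is elementwise multiplication:

```latex
\begin{aligned}
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot z_t \\
h_t &= o_t \odot c_t
\end{aligned}
```

With f-pooling the output gate is dropped, so the hidden state is simply the first line applied to h directly.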

There is also an option to use f-pooling, which discards the output gate, by setting output_gate=False.

Although the recurrent parts of these functions must be calculated for each timestep in sequence, their simplicity and parallelism along feature dimensions means that, in practice, evaluating them over even long sequences requires a negligible amount of computation time.

Let’s have a look at how ct and ht are computed at each step:

sequential ct and ht generation
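The step-by-step computation can be sketched as a short loop. The function name fo_pool and the tiny hand-picked gate values are illustrative assumptions; the initial cell state is taken to be zero.

```python
def fo_pool(Z, F, O, c0=None):
    """Sequential fo-pooling over seq_len x h gate matrices.

    c_t = f_t * c_{t-1} + (1 - f_t) * z_t   and   h_t = o_t * c_t, all elementwise.
    """
    h_dim = len(Z[0])
    c = c0 if c0 is not None else [0.0] * h_dim
    cs, hs = [], []
    for z, f, o in zip(Z, F, O):
        c = [fi * ci + (1.0 - fi) * zi for fi, ci, zi in zip(f, c, z)]
        cs.append(c)
        hs.append([oi * ci for oi, ci in zip(o, c)])
    return cs, hs

Z = [[1.0], [0.0], [-1.0]]
F = [[0.5], [0.5], [0.0]]
O = [[1.0], [0.5], [1.0]]
cs, hs = fo_pool(Z, F, O)
print(cs)  # [[0.5], [0.25], [-1.0]]
print(hs)  # [[0.5], [0.125], [-1.0]]
```

The loop over timesteps is the only sequential part, and inside it there are no matrix multiplications, only elementwise operations over the feature dimension.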

In order to capture larger n-grams, the kernel size can be increased when computing Z, F, O. Similarly, to approximate more complex functions, multiple layers can be stacked just like stacked LSTMs. Stacking multiple convolution plus pooling layers generally gives more powerful models. You may check this for a better understanding of the universal approximation theorem. Yet we don’t want our model to overfit, so we still need some sort of regularization within the model itself.

Earlier, while explaining AWD-LSTM, we’ve seen how important the dropout scheme is for preserving long-term dependencies, and how important it is to have consistent dropout throughout the sequence forward pass. Because QRNN has no recurrent weights, the authors suggest applying dropout during the calculation of the F gate, since, as you may also notice, the F gate takes part in the calculation at every timestep.

The dropout applied here is referred to as “zoneout” in the original paper
Here is the modified version of our sample illustration with dropout added
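A minimal sketch of this zoneout variant, following the formula F = 1 - dropout(1 - F) from the QRNN paper, is shown below. The function name zoneout_forget_gate is hypothetical, and the sketch deliberately omits the usual 1 / (1 - p) rescaling of dropout, as the paper recommends for this case; it is not the fast.ai code.

```python
import random

def zoneout_forget_gate(F, p, rng):
    """Apply F = 1 - dropout(1 - F) without rescaling.

    Each selected entry of F is forced to 1, so that channel simply copies
    c[t-1] into c[t] at that timestep; all other entries are left unchanged.
    """
    return [[1.0 if rng.random() < p else f for f in row] for row in F]

rng = random.Random(0)
F = [[0.2, 0.7, 0.5], [0.9, 0.1, 0.4]]    # seq_len=2, hidden=3
F_zoned = zoneout_forget_gate(F, p=0.3, rng=rng)
print(F_zoned)
```

Setting a forget gate entry to 1 means the pooling step keeps the previous cell value for that channel, so instead of zeroing information the layer randomly "zones out" and carries its old state forward.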

Remember that other dropout methods that we’ve seen for AWD-LSTM are also applicable here and might be good for further regularizing your model. QRNN is just an optional replacement for LSTM layers.

AWD_LSTM(vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int, pad_token:int, bidir:bool=False, hidden_p:float=0.2, input_p:float=0.6, embed_p:float=0.1, weight_p:float=0.5, qrnn:bool=True)

In the fast.ai implementation, the matrices [Wz, Wf, Wo] are concatenated into one matrix for performance reasons. WeightDropout() is applied to this matrix.

End Notes

I hope that I’ve covered some of the important building blocks of ULMFIT and given you guys a better understanding, through simple visualizations, of how dropout is applied almost everywhere and how a QRNN layer works. For me, writing this blog post was a way to better understand how these state-of-the-art NLP models work. So I would suggest everyone read the article and the source code, write about it, and while doing so try implementing these ideas yourself. For me, using Microsoft Excel was easy and fast, since all of these are already implemented in fast.ai.

For more, stay tuned or just go explore the library and papers yourself. As I am about to publish this blog post, I’ve seen that a day ago fast.ai started supporting Transformer and Transformer XL, two attention-based models. I will try to benchmark all these options, AWD-LSTM, AWD-QRNN, Transformer and Transformer XL, on the Quora Insincere Questions Classification dataset and edit this post in the coming days.

Excel Work:

If you want to check out the spreadsheet work I’ve created a copy of it in Google sheets here. Since functions are not copied it’s static but might still be helpful for illustration purposes.


  1. Improving neural networks by preventing co-adaptation of feature detectors. G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov.
  2. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.
  3. Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  4. Regularizing and Optimizing LSTM Language Models. Stephen Merity, Nitish Shirish Keskar and Richard Socher.
  5. Quasi-Recurrent Neural Networks. James Bradbury, Stephen Merity, Caiming Xiong and Richard Socher.
  6. A Convolutional Neural Network for Aspect Sentiment Classification. Yongping Xing, Chuangbai Xiao, Yifei Wu and Ziming Ding.
  7. Pixel Recurrent Neural Networks. Aaron van den Oord, Nal Kalchbrenner and Koray Kavukcuoglu.
  8. Fast.ai documentation. https://docs.fast.ai/