Universal Language Model Fine Tuning (ULMFiT)

Dhruv Pamneja
8 min read · May 8, 2024


When training deep neural networks (DNNs) or other complex deep learning architectures for NLP tasks, a situation which usually occurs is the need for vast amounts of domain-specific data and computational resources, which hinders overall progress. A concept widely used to reduce training time and computational cost is Transfer Learning.

This concept follows the principle of a model taking knowledge from a pre-trained model, one which was trained on a relatively larger dataset.

This technique is widely used in cases where the training data for the current task is limited, where the target task is related to the pre-trained model's original task, and/or where computational resources are constrained. Transfer learning helps reduce training time, improve performance and leverage existing expertise.

There are two widely popular ways of performing transfer learning, which are:

Inductive Transfer

  • This is a fairly popular method of transfer learning. The core idea is to transfer knowledge between tasks different in nature, where the data distributions may also differ.
  • For example, take a target task of classifying airplanes and cars with images as the data. Now let us assume that for a source task, we use the knowledge from a pre-trained model which has learned to classify dogs and cats via images. The tasks are essentially different in nature, but the knowledge learned in the source task, such as edges, shapes, textures and contours, may be useful in the target task as well.
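The example above can be sketched in code: a "pretrained" feature extractor is kept frozen while only a small task-specific head is trained on the target data. This is a minimal pure-Python illustration, not from the ULMFiT paper; the function names and toy data are invented for the sketch.

```python
# Toy sketch of inductive transfer: the "pretrained" feature extractor is
# reused as-is (frozen), and only a small task-specific head is trained.

def pretrained_features(x):
    # Stand-in for frozen pretrained layers (edges/shapes in the vision
    # analogy): a fixed transformation of the input that is never updated.
    return [x[0] + x[1], x[0] - x[1]]

def train_head(data, lr=0.1, epochs=200):
    # Only the head weights w and bias b are learned via SGD on squared
    # error; the feature extractor above never changes.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            f = pretrained_features(x)
            pred = w[0] * f[0] + w[1] * f[1] + b
            err = pred - y
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
            b -= lr * err
    return w, b

# Tiny toy target task that is linearly separable in feature space.
data = [([1.0, 1.0], 1.0), ([0.0, 0.0], 0.0),
        ([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
w, b = train_head(data)
```

Because the extractor is frozen, only three scalars are fitted here, which mirrors why transfer learning needs so little target data.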

Transductive Transfer

  • Another approach is transductive transfer learning, which is relatively less common as it tends to be useful in specific use cases. Here, the fundamental idea is that the source and target tasks are the same in nature, but the distribution of the data differs.
  • Let us take the example where the source task is training a spam classification model on a large dataset of emails, and the model is then used to filter spam emails for a completely different user. Here, although the task is the same, the source and target data are heterogeneous. The model can leverage the knowledge of spam patterns it gained from the source data to identify similar patterns in the target data.

However, for many NLP tasks, inductive transfer via fine-tuning did not prove successful. Many real-world tasks in this field still required huge amounts of in-domain documents as training data, as the model was not able to leverage its previous knowledge.

Although quite effective in computer vision, these approaches faced limitations in NLP and language-based tasks. Traditional fine-tuning, where a pre-trained model's weights are frozen and only the final layers built on top of that knowledge base are adapted, often required large amounts of in-domain data for NLP tasks. It also limits how much of the knowledge gained by the pre-trained model can actually be used.

Owing to the above, research by Jeremy Howard and Sebastian Ruder in 2018 proposed a new method, namely Universal Language Model Fine-Tuning (ULMFiT), an effective transfer learning approach which could be applied to any task in the NLP domain. To read more about this work, visit the paper here.

ULMFiT overcomes the limitations of fine-tuning by allowing the model to leverage knowledge from a massive pre-trained language model, which in turn significantly improves performance on various NLP tasks even with limited data. Let us see how it has been implemented.

Architecture

As per the research, the architecture on which this model is built is the AWD-LSTM (Averaged Stochastic Gradient Descent Weight-Dropped Long Short Term Memory). It is a variant of the regular LSTM which employs additional regularisation techniques to prevent overfitting.

To understand it better, let us see the two core principles/concepts incorporated in this architecture:

  1. Averaged Weight Updates: This architecture uses a modified update rule (averaged stochastic gradient descent) that averages the weights over the course of multiple iterations, taking previous weights into account rather than relying only on the latest values. This helps smooth out the training process and reduce overfitting.
  2. Weight Dropout: In the usual dropout strategy, random activations within the network are "dropped", or made inactive. Here, instead of dropping activations, WeightDropout randomly sets a subset of weights in the hidden-to-hidden connections to zero during the training process. This forces the network to learn from different subsets of its connections, which in turn prevents over-reliance on any one of them.
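The two ideas above can be sketched on a plain weight vector rather than a full LSTM. This is a minimal illustrative sketch in pure Python; the function names are invented here and the real AWD-LSTM applies these to recurrent weight matrices inside training loops.

```python
import random

def average_weights(history):
    # Averaged updates: instead of keeping only the latest weights, use
    # the running average of the weights seen over recent iterations.
    n = len(history)
    return [sum(w[i] for w in history) / n for i in range(len(history[0]))]

def weight_dropout(weights, p, rng):
    # DropConnect-style regularisation: zero a random subset of *weights*
    # (not activations) during training.
    return [0.0 if rng.random() < p else w for w in weights]

rng = random.Random(0)
history = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # weights at 3 iterations
avg = average_weights(history)                  # [3.0, 4.0]
dropped = weight_dropout([1.0] * 8, p=0.5, rng=rng)
```

At inference time the dropout mask is removed and (in ASGD) the averaged weights are the ones actually used.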

To read up more about this, you can check out the research paper on LSTMs here.

Model

As we move into the ULMFiT model, let us understand the three core steps or sections of the proposed model, which work in succession and are as follows:

  • General domain LM pre-training
  • Target task LM fine-tuning
  • Target task classifier tuning

Let us look at the image of the proposed model to get an overview:


In the first section, general domain LM pre-training, the model is trained on a huge data corpus with the purpose of learning language modelling, which in turn helps with downstream NLP tasks.

This process is the most resource-intensive part of the model, as it requires a huge corpus of general domain language data in order to capture the general properties of the language and learn the fundamentals of how words interact within a broader context.
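Concretely, the language-modelling objective trains the model to predict each next word from the words before it. A simplified pure-Python sketch of how a corpus becomes (input, target) pairs for this objective (tokenisation is deliberately naive here):

```python
# A language model predicts the next token given the tokens before it.
# This shows only the data preparation for that objective.

def next_word_pairs(tokens):
    # Each prefix of the sequence predicts the token that follows it.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

corpus = "the cat sat on the mat".split()
pairs = next_word_pairs(corpus)
# e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...
```

Because the targets come from the text itself, no labels are needed, which is why pre-training can use arbitrarily large general-domain corpora.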

Although this process is costly, it only needs to be done once, as the result can act as a foundation for a wide range of tasks in the NLP domain. As of today, such pre-training has been done by many groups and the resulting models are available on open source platforms. Now that the model has a strong foundation in general language understanding from pre-training, let us move ahead and see how ULMFiT fine-tunes it for particular NLP tasks.

As we move ahead, we head into the target task LM fine-tuning section. No matter how diverse the general domain data above is, what eventually matters for the given task is its target data, which most likely comes from a different distribution. Therefore, this section fine-tunes the LM on the data of the target task. Given the pre-training from the previous section, this part converges faster, as it only has to adapt to the scope of the target data, and it allows training a robust LM even on smaller datasets. This is done with the help of two components, which are:

Discriminative fine-tuning

  • This is quite a clever technique which has been employed in the ULMFiT model. It builds on the observation that different layers of the pre-trained model capture different aspects of the language: lower layers capture basic features of the language, while higher layers capture complex relationships such as semantic meaning and the integration of different parts of speech.
  • This method fine-tunes each layer with its own specific learning rate, allowing the model to focus on adjusting the information most relevant to the target task at each layer, which improves performance.
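The ULMFiT paper suggests setting each lower layer's learning rate to the layer above it divided by 2.6. A small sketch of that schedule (the function name is illustrative):

```python
# Discriminative fine-tuning: one learning rate per layer, decaying by a
# constant factor going down the stack (2.6 in the ULMFiT paper).

def layer_learning_rates(base_lr, n_layers, factor=2.6):
    # Top (most task-specific) layer trains fastest; lower, more general
    # layers receive progressively smaller learning rates.
    return [base_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]

lrs = layer_learning_rates(base_lr=0.01, n_layers=4)
# lowest layer: 0.01 / 2.6**3, top layer: 0.01
```

In a framework such as PyTorch, these rates would be assigned via per-parameter-group options in the optimizer.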

Slanted triangular learning rates

  • Since it was preferred that the model quickly converge to a good starting point and then move towards refinement, a constant learning rate was not suitable for this task, and so this approach was applied instead.
  • This approach is a variant of triangular learning rates: it first increases the learning rate quickly to find a good region of the parameter space, and then decays it gradually over a period longer than the initial increase.
  • This allows the model to converge faster: the steep linear increase first finds a suitable region of the parameter space quickly, and the linear decline then gradually refines the parameters.
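The slanted triangular schedule from the ULMFiT paper can be written directly from its formula, with a short warm-up fraction `cut_frac` and a `ratio` between the maximum and minimum rates:

```python
import math

# Slanted triangular learning rate (STLR): short steep linear increase
# up to iteration cut = floor(T * cut_frac), then a long linear decay.

def stlr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    cut = math.floor(T * cut_frac)          # iteration where the peak occurs
    if t < cut:
        p = t / cut                          # fast linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # slow linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 100
schedule = [stlr(t, T) for t in range(T)]
# starts near lr_max / ratio, peaks at lr_max when t == cut
```

The defaults above (lr_max=0.01, cut_frac=0.1, ratio=32) follow the values reported in the paper.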

Finally, we move towards the target task classifier tuning. With the above sections we have trained the model to understand language, but here we try to fine-tune it for specific tasks such as sentiment analysis, topic classification, summarisation, etc.

We can interpret this in the following way :

  • The previous two sections gained knowledge for the model and made it into a “language expert”. It now has an understanding of the fundamentals of the language and has a strong foundation in general grammar and vocabulary.
  • Now, suppose we ask this expert to specialise in a particular field, such as movie reviews. To achieve that, the expert will have to take up additional courses or readings on movie reviews to understand such text better.

Similarly, here a new classifier layer is added on top of the pre-trained model, and it requires careful fine-tuning so as to avoid "catastrophic forgetting" (essentially, losing the knowledge gained during pre-training). For that, various techniques are used, such as:

Gradual unfreezing

  • Starting from the last layer, which contains the least generalised knowledge, layers are unfrozen one at a time and trained during that epoch. We gradually work down towards the earlier layers, ensuring that the model retains its core language understanding while adapting to the specific task.
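The unfreezing order above can be sketched as a per-epoch schedule of frozen flags. This is an illustrative sketch; in a real framework the flags would correspond to each layer group's `requires_grad` setting.

```python
# Gradual unfreezing: all layers start frozen; one more layer group is
# unfrozen per epoch, from the top (task-specific) down (general).

def gradual_unfreeze_schedule(n_layers, n_epochs):
    schedules = []
    for epoch in range(n_epochs):
        # Layers with index >= n_layers - (epoch + 1) are trainable.
        frozen = [i < n_layers - (epoch + 1) for i in range(n_layers)]
        schedules.append(frozen)
    return schedules

schedules = gradual_unfreeze_schedule(n_layers=4, n_epochs=4)
# epoch 0 trains only the last layer; by epoch 3 all layers train
```

Keeping the lower layers frozen early on is what protects the general language knowledge while the new head finds its footing.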

Concatenation pooling

  • As we know, not all words carry the same weightage in terms of meaning in a given document. Hence, to capture the most informative parts of the text, ULMFiT concatenates the final hidden state of the model with both the max-pooled representation (the strongest signal in each dimension across the hidden states) and the mean-pooled representation (the average information across all hidden states).

Thus, by strategically combining these techniques in a particular order, ULMFiT effectively fine-tunes the pre-trained model for various NLP tasks.

Limitations

Although the model has proved to generate remarkable results, there are still some limitations which need to be addressed before such an approach can be scaled practically, such as:

Computational cost

  • It is well known that the first section, i.e. the general domain LM pre-training, requires a large data corpus and significant computational resources. This does put a hindrance on scaling this approach for certain tasks.

Data quality limitations

  • As we know, data for many types of NLP problems can be unlabelled, incomplete or even faulty in its collection process. The potential for bias and the quality of the data are areas which can be looked into further.

Future Direction

As we can see in the research, the ULMFiT model excelled at text classification tasks compared to the previous SOTA models, and that too with minimal training data. This implies that the approach can be used for other tasks as well, such as:

Non-English languages

  • Since the amount of supervised data in many languages apart from English is quite scarce, this approach of fine-tuning pre-trained models can be very useful in this domain, despite the target language's corpus being significantly smaller than English's.

New NLP tasks

  • For tasks which are relatively new in nature, such that no SOTA models exist to provide a proper framework, the usage of a pre-trained model can help achieve great results without necessarily starting from scratch.

Limited labeled dataset

  • It is a fact that many tasks in the field of NLP suffer owing to the lack of labelled data. This is where ULMFiT can step in, leveraging unlabelled data to bridge the data gap.

Conclusion

Many such exciting avenues can be explored, as ULMFiT has the potential to majorly impact how we understand and approach many NLP tasks, especially those with limited data or newer, emerging challenges.

Credits

I would like to take the opportunity to thank Krish Naik for his deep learning series on his YouTube channel, which has allowed me to learn and present the above article. You can check out his YouTube channel here. Thanks for reading!
