Going the extra mile, lessons learnt from Kaggle on how to train better NLP models (Part II)

Popescu Daniel
Published in MantisNLP · 7 min read · Feb 23, 2022

In Part I, we presented the challenge and constructed a basic transformer model that we fine-tuned on the data. The score we got previously was 0.624. (As a reminder, a smaller score is better).

Now that we have set the basic version of our model, it’s time to improve it. We’ll go through some of the more popular techniques that are used in NLP competitions, add them to our model and explain each of them.

Using a different scheduler

For our basic model we have used a constant learning rate scheduler, which means that the learning rate we initialize in the beginning does not change. Most learning rate schedulers implement a form of learning rate decay — the initial learning rate gets gradually reduced in accordance with the scheduler.

Intuitively, this means that early in training, while the learning rate is still high, we quickly reach a reasonably good set of weights, and we then fine-tune them to higher accuracy using smaller learning rates. This approach also makes the model converge faster, a phenomenon described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

In our case, we use a cosine schedule with warmup. This means that during the warmup period the learning rate increases from 0 to a predefined value lr, then decreases from lr back to 0 following the cosine function.

While there are other, more complex schedulers, this is a good starting point, and cosine annealing is also the default strategy in PyTorch's OneCycleLR policy.

The warmup phase helps avoid instability in the last linear layer early in training. More information and justification can be found in the paper A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation.

Example of cosine schedule with warmup of 100 steps and lr=1

The Hugging Face transformers library provides a very simple way of using different schedulers, and so all we have to do is to replace the get_constant_schedule we had before with:
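A minimal sketch of that replacement, assuming the optimizer and the total step count come from the existing training setup (the 100-step warmup matches the example figure above):

```python
from transformers import get_cosine_schedule_with_warmup

# Warm up linearly from 0 to lr over the first steps, then decay
# from lr back to 0 along a cosine curve.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,                # assumed warmup length
    num_training_steps=num_train_steps,  # total optimizer steps over all epochs
)
```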

With this small adjustment, we get a new public test score of 0.610 (0.624 previously). Notebook here.

Differential learning rates

Differential learning rates are another technique for fast and efficient transfer learning that we can apply on top of this.

The idea was introduced by Jeremy Howard (fast.ai) in his course "Practical Deep Learning for Coders, Part 1, Version 2".

The intuition behind differential learning rates is that the initial layers of the pre-trained model are more likely to have learned general features, which we don't want to modify too much. So we set a smaller learning rate for those layers and increase it for the later layers.

In our case, we set 2e-5 for the first 3 layers, 5e-5 from layer 4 until layer 8, and 1e-4 from layer 8 until the end.

If we look at a sample of the model's named parameters, we can see that layer 4 starts at index 69.

So in our create_optimizer() function we add this bit of code:
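A sketch of what that could look like, assuming a Hugging Face RoBERTa model whose encoder layers appear as `encoder.layer.<n>` in `named_parameters()` (the exact grouping is our reading of the text, not the notebook's verbatim code):

```python
import torch

def create_optimizer(model):
    # Group parameters by depth: small learning rates for the early
    # (general) layers, larger ones for the later (task-specific) layers.
    # Layers are 0-indexed as in the Hugging Face implementation.
    low, mid, high = [], [], []
    for name, param in model.named_parameters():
        if "encoder.layer" in name:
            layer = int(name.split("encoder.layer.")[1].split(".")[0])
            bucket = low if layer < 4 else mid if layer < 8 else high
            bucket.append(param)
        elif "embeddings" in name:
            low.append(param)
        else:
            high.append(param)  # pooler and regression head
    return torch.optim.AdamW([
        {"params": low,  "lr": 2e-5},
        {"params": mid,  "lr": 5e-5},
        {"params": high, "lr": 1e-4},
    ])
```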

This gets us a new public test score of 0.558. Notebook here.

Pre-train the pre-trained model

Most pre-trained models are trained on the masked token prediction task. We can further pre-train the model on the same task using our training data. In theory this adjusts the language model to better understand the language of our data, so that when we fine-tune it for our downstream task the tokens and sentences aren't completely new to it. Read more about the importance of pre-training in the paper Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.

We further pre-train the model in a separate notebook. DataCollatorForLanguageModeling makes this quite simple: we just need to give it an mlm_probability value, and it automatically masks the text and prepares it for training. We can then take the output of that notebook and use it as input for our main notebook, essentially replacing the Hugging Face pre-trained model with our further pre-trained one.

Here’s the essential part of the code:
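A sketch of the pre-training notebook, assuming a `tokenized_dataset` built from the competition text with the same tokenizer (the 0.15 masking probability is the standard default, not necessarily the notebook's exact value):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# The collator randomly masks tokens and builds the labels for us,
# so the dataset only needs the tokenized text.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./roberta-further-pretrained",
    num_train_epochs=30,  # the best-performing value in our experiments
    per_device_train_batch_size=16,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,  # tokenized competition text (assumed)
)
trainer.train()
trainer.save_model("./roberta-further-pretrained")
```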

The number of epochs on which we pre-train is important. Here is a table that shows how our score is affected only by that factor:

The only thing we have to modify in our main notebook is to use the path to this new further-pre-trained model:
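For example (the checkpoint path is illustrative):

```python
from transformers import AutoModel

# Load our further-pre-trained checkpoint instead of the vanilla
# Hugging Face weights.
model = AutoModel.from_pretrained("./roberta-further-pretrained")
```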

Using the version trained for 30 epochs, we now have a new public best score of 0.527. Notebook here.

We will dive deeper into the pre-training aspect in a future blog, stay tuned for that. For now, let’s continue.

Ensembling with k-fold cross validation

Ensembling is a machine learning technique that combines several base models in order to produce a single optimal predictive model. One way of doing this whilst also creating a validation set to test our changes is cross validation.

We split the dataset into k folds and use each fold as the validation set one at a time, while the remaining k-1 folds make up the training set.

This technique offers a few benefits:

  • Helps in combating overfitting
  • The final mean of the validation scores can usually be a good indicator of your test score
  • Ensembling will usually improve the score

One major drawback is that, of course, the training time gets multiplied by approximately k-1 (we train k models, each on a (k-1)/k fraction of the data), so you usually want to leave this for when you've already experimented with other techniques. Since in our case the training data is pretty small, the additional training time is not a problem and still stays within the half-hour range.

You can view the notebook here.

We've changed the train() function to receive a val_loader parameter holding the validation data, which is now used to evaluate the model. The function also returns the last validation result, since at the end of training we calculate the mean performance across all folds.

We split the training set using the StratifiedKFold algorithm from sklearn. This helps us preserve the percentage of samples for each class.

Since in our case the targets are floats and not classes, we introduce a new ‘bins’ column. Each bin number represents a discrete interval of the targets. We get those discrete intervals with the help of the pandas.cut function. You can read more about “binning” in this “Data Binning with Pandas Cut or Qcut Method” article.
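A minimal sketch, assuming the training data lives in a `train_df` DataFrame with a `target` column, and that we use 10 bins and 5 folds (the exact numbers may differ in the notebook):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Discretize the continuous target into bins so that StratifiedKFold
# has discrete "classes" to stratify on.
train_df["bins"] = pd.cut(train_df["target"], bins=10, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_idx) in enumerate(skf.split(train_df, train_df["bins"])):
    train_df.loc[val_idx, "fold"] = fold  # assign each row to a fold
```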

We go through each fold and train a new model for each of them, saving the model’s weights.

For prediction we’ll then load each model, use it to predict the results of the test set and then average those results (the ensembling part).
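A sketch of that prediction loop, assuming 5 folds, the LitModel class from the notebook, a `test_loader` over the test set, and per-fold checkpoint files (the file names are illustrative):

```python
import numpy as np
import torch

# Load each fold's saved weights, predict on the test set, and
# average the k predictions: the ensembling step.
fold_preds = []
for fold in range(5):
    model = LitModel()
    model.load_state_dict(torch.load(f"model_fold_{fold}.pt"))
    model.eval()
    preds = []
    with torch.no_grad():
        for batch in test_loader:
            preds.append(model(**batch).cpu().numpy())
    fold_preds.append(np.concatenate(preds))

final_preds = np.mean(fold_preds, axis=0)
```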

Note that since each fold's training set is smaller than the full data we were training on before, we needed to increase the number of epochs per model, as one epoch was not enough for it to converge. With 3 epochs, we get a new score of 0.523. Notebook here.

We may still be overfitting, so let's address this with the next technique.

Adding model checkpoints

Since we have a validation set, we can see when our model starts overfitting and stop training so that our results are not affected by it.

A variation on early stopping is to save models at defined checkpoints. This way we don't stop training as soon as the model overfits; we let it train for all the epochs and keep the best checkpoint. Evaluating the model after every step can be time consuming, but we can follow an evaluation schedule that evaluates more often as the score improves.

This means that for scores above 0.50 we only evaluate once every 16 steps; between 0.48 and 0.50, once every 8 steps; between 0.465 and 0.48, once every 2 steps; and below 0.465, every step.
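As a sketch, the schedule can be expressed as a small helper; the commented loop below is hypothetical stand-in code for the notebook's training logic (evaluate() is not a real function here):

```python
def eval_interval(best_score):
    # Evaluate more often as the best validation score improves;
    # thresholds follow the schedule described above.
    if best_score > 0.50:
        return 16
    if best_score > 0.48:
        return 8
    if best_score > 0.465:
        return 2
    return 1

# Inside the training loop (sketch):
# if step % eval_interval(best_score) == 0:
#     val_score = evaluate(model, val_loader)
#     if val_score < best_score:  # smaller RMSE is better
#         best_score = val_score
#         torch.save(model.state_dict(), f"model_fold_{fold}.pt")
```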

Running this, we get the new public score of 0.488! Notebook here.

Adding an attention head

Attention is already present in our model in each of the 12 transformer layers in RoBERTa. But currently we’re simply applying a regressor layer on top of the last layer’s output.

The only modifications we need to make are in the LitModel class code. You can see the notebook here.

We declare the attention neural network as below:
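Something along these lines inside LitModel.__init__ (the 512-unit hidden projection is an assumption; only the 768 input size is fixed by roberta-base):

```python
import torch.nn as nn

# A small additive attention head that scores each token's hidden state.
self.attention = nn.Sequential(
    nn.Linear(768, 512),
    nn.Tanh(),
    nn.Linear(512, 1),
    nn.Softmax(dim=1),  # normalize the weights across the sequence
)
```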

The hidden state size is 768 for roberta-base. To condense the per-token hidden states into a single context vector, we compute attention weights with the module defined above.

Once we have the weights, we compute the context_vector as the weighted average of the last hidden states. Finally, we apply the simple regression layer to that context_vector.
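Putting it together, the forward pass looks roughly like this (`self.regressor` is our name for the regression layer, an assumption):

```python
# Inside LitModel.forward (sketch): condense all token states from the
# last layer into one context vector, then regress on it.
last_hidden = roberta_output.hidden_states[-1]            # (batch, seq, 768)
weights = self.attention(last_hidden)                     # (batch, seq, 1)
context_vector = torch.sum(weights * last_hidden, dim=1)  # (batch, 768)
prediction = self.regressor(context_vector)               # (batch, 1)
```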

So, while previously we made use only of the last layer's [CLS] output via `roberta_output.last_hidden_state[:,0,:]`, we now make use of all the token states from the last layer, `roberta_output.hidden_states[-1]`, applying a custom attention layer on them to condense the useful information and give better features to our regression layer.

Of course, that last hidden state is also a product of attention. We are simply combining the information and training our own attention weights on top of it.

The new public score after using this is 0.482. Notebook here.

Conclusions

This is how we improved the score with each technique that we implemented:

  • Baseline (Part I): 0.624
  • Cosine schedule with warmup: 0.610
  • Differential learning rates: 0.558
  • Further pre-training (30 epochs): 0.527
  • K-fold ensembling: 0.523
  • Model checkpoints: 0.488
  • Attention head: 0.482

In the end we improved our score from 0.624 to 0.482, a decrease of almost 23%. We got closer to the best scores on the leaderboard, which sit around 0.45.

In this blog every addition we made improved the score. In the next blog post we will focus on more techniques that are worth trying in general, even if in this case they did not improve the results.
