To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks

Krisha Mehta
Computers, Papers and Everything
5 min read · Jun 27, 2020

Since 2018, pretrained NLP models built on variants of the Masked Language Model (MLM) objective have gained a lot of popularity. This blog by Ankit Singh does an amazing job of explaining Bidirectional Encoder Representations from Transformers (BERT), where the concept of MLM was first introduced. Following BERT, a number of pretrained models have been developed. BERT-base was trained on 4 Cloud TPUs for 4 days, and BERT-large on 16 Cloud TPUs for 4 days. In 2019, this paper brought the training time down to 76 minutes. Yet it is undeniable that pretraining a model requires a lot of time and resources. Thus, our paper for today asks a critical question: does pretraining a model have any benefit on resource-rich tasks?

The future of NLP appears to be paved by pretraining a universal contextual representation on Wikipedia-like data at massive scale. Attempts along this path have pushed the frontier up to 10× the size of Wikipedia (Raffel et al., 2019).

However, Raffel et al. (2019) also show that such models are not necessarily always state of the art. To understand whether pretraining a model has benefits or not, the authors evaluate the performance of pretrained models against a model trained from scratch. They focus on the task of multi-class text classification for two main reasons:

(i) it is one of the most important problems in NLP, with applications spanning multiple domains.
(ii) large amounts of training data exist for many text classification tasks, or can be obtained relatively cheaply through crowd workers (Snow et al., 2008).

Datasets

Three sentiment classification datasets that range from 6 to 18 million examples are used for this comparative study:

  1. Yelp reviews
  2. Amazon sports reviews
  3. Amazon electronics reviews

Since the focus is on multi-class text classification, the goal of the models is to predict the rating on a five-point scale {1, 2, 3, 4, 5}. The dataset sizes and the rating distribution across the five classes are shown in the table below. The authors split each dataset into 90% for training and 10% for testing.

Dataset sizes and rating distributions. Source: From the paper
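
As a rough illustration of this setup, here is a minimal sketch of a 90/10 split for five-class rating prediction. The file name, the field names, and the use of a stratified scikit-learn split are assumptions made for the example, not details taken from the paper.

```python
# Minimal sketch: 90/10 split for 5-class rating prediction.
# "yelp_reviews.jsonl" and the "text"/"stars" fields are hypothetical.
import json
from sklearn.model_selection import train_test_split

texts, labels = [], []
with open("yelp_reviews.jsonl") as f:
    for line in f:
        review = json.loads(line)
        texts.append(review["text"])
        labels.append(int(review["stars"]) - 1)  # map ratings {1..5} to labels {0..4}

# 90% train / 10% test; stratification (an assumption) keeps the rating distribution similar
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0
)
```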

Models

The three models used in the study are described below:

  1. RoBERTa — A Transformer-based model pretrained with the MLM objective on a large corpus.
  2. LSTM — A bidirectional LSTM trained from scratch.
  3. LSTM + Pretrained Token Embedding — A bidirectional LSTM whose token embeddings are initialized with RoBERTa's pretrained token embeddings; the embeddings are kept frozen during training (see the sketch below).
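
To make the third variant concrete, here is a minimal PyTorch sketch of a BiLSTM classifier whose embedding layer can be initialized from a pretrained matrix (for example, RoBERTa's token embeddings) and frozen. The class name, layer sizes, and pooling choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM text classifier; embeddings may be pretrained and frozen."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers,
                 num_classes=5, pretrained_embeddings=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if pretrained_embeddings is not None:
            # Initialize from a pretrained matrix (e.g. RoBERTa token embeddings)
            # and freeze it, as in the paper's third model.
            self.embedding.weight.data.copy_(pretrained_embeddings)
            self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)          # (batch, seq_len, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)              # simple mean pooling over time
        return self.classifier(pooled)            # (batch, num_classes) logits

# Example: a 4-layer, 512-unit BiLSTM with randomly initialized embeddings
model = BiLSTMClassifier(vocab_size=50_000, embed_dim=512,
                         hidden_dim=512, num_layers=4)
```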

Experimental Setup

Comparison between the experimental setups of the two models. Source: From the paper

Results

The results in the paper are explained based on two parameters — data size and inference time.

Impact of Data Size

Model accuracy at varying training-data sizes (Figure 1 and Table 2). Source: From the paper

The authors trained the models on varying fractions of each dataset to compare their performance: 1%, 10%, 30%, 50%, 70% and 90% of the data were used.
The results of these experiments are shown in Figure 1 and Table 2.
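
Before looking at the findings, here is a minimal sketch of how such fractional training subsets could be drawn. Uniform random sampling without replacement is an assumption (the paper does not spell out the sampling procedure here), and the variables reuse the hypothetical split from the earlier sketch.

```python
import random

def sample_fraction(examples, fraction, seed=0):
    """Return a uniformly random subset containing `fraction` of the examples."""
    rng = random.Random(seed)
    k = int(len(examples) * fraction)
    return rng.sample(examples, k)

# Fractions of the training set used in the paper's data-size ablation
fractions = [0.01, 0.10, 0.30, 0.50, 0.70, 0.90]
training_examples = list(zip(train_texts, train_labels))  # from the split sketch above
subsets = {f: sample_fraction(training_examples, f) for f in fractions}
```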

  1. With an increase in the number of examples, the difference in accuracy between RoBERTa and LSTM decreases.

For example, when both models are trained with 1% of the Yelp dataset, the accuracy gap is around 9%. However, as the amount of training data increases to 90%, the gap drops to within 2%. The same behaviour is observed on both Amazon review datasets, with the initial gap starting at almost 5% for 1% of the training data and shrinking to within one point when most of the training data is used.

  2. Results show that an LSTM with pretrained RoBERTa token embeddings always outperforms one with randomly initialized token embeddings.

This suggests that the embeddings learned while pretraining RoBERTa may constitute an efficient way to transfer the knowledge learned by these large MLMs.

It is important to note that the accuracy gap between the models is within 2% on the Yelp dataset and less than 1% on the Amazon datasets. It is even more important to note that while RoBERTa-Large has 304M parameters, LSTM-4-512 + Large has only 25M. That is a difference of 279M parameters for a maximum accuracy gap of 1.71% on the Yelp dataset.

Inference Time

Inference time of the models on CPU and GPU (Table 3). Source: From the paper

On investigating the inference time of the three models on CPU and GPU, the authors find that the LSTM model is 20 times faster, even when compared to RoBERTa-Base, as shown in Table 3. The authors make another interesting observation:

Although using the RoBERTa pretrained token embeddings introduces 10 times more model parameters compared to the vanilla BiLSTM, the inference time increases by less than 25%. This is because most of the additional parameters come from a simple linear transformation.
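
To give a sense of how such latency comparisons could be measured, here is a minimal CPU timing sketch. It reuses the hypothetical BiLSTMClassifier from the earlier sketch; the batch shape, run count, and timing helper are illustrative assumptions, not the paper's benchmarking setup.

```python
import time
import torch

def time_inference(model, batch, n_runs=100):
    """Average forward-pass latency in milliseconds."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
        return (time.perf_counter() - start) / n_runs * 1000

# Hypothetical batch: 32 sequences of 128 token ids
batch = torch.randint(0, 50_000, (32, 128))
print(f"BiLSTM: {time_inference(model, batch):.1f} ms per batch on CPU")
```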

Conclusion

Our findings in this paper indicate that increasing the number of training examples for ‘standard’ models such as LSTM leads to performance gains that are within 1 percent of their massively pretrained counterparts.

The authors propose running experiments on other large-scale datasets to evaluate whether these findings hold true for other NLP tasks.

One way to interpret our results is that ‘simple’ models have a better regularization effect when trained on large amounts of data, as also evidenced in the concurrent work (Nakkiran and Sutskever, 2020). The other side of the argument in interpreting our results is that MLM-based pretraining still leads to improvements even as the data size scales into the millions. In fact, with a pretrained model and 2 million training examples, it is possible to outperform an LSTM model that is trained with 3× more examples.

While there is a trade-off between accuracy, the number of parameters, and the amount of data needed for training, this paper helps us make better design decisions based on the resources we have available.

References:

  1. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  2. Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast — but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics.
  3. Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. 2020. Deep double descent: Where bigger models and more data hurt. In ICLR 2020.
