‘Train Large, Then Compress’ — UC Berkeley BAIR Improves Large Transformer Model Training and Inference

Synced · Published in SyncedReview · 3 min read · Mar 9, 2020

In the current state of deep learning, improving model accuracy largely comes down to increasing model size, dataset size, or the number of training steps. These methods, however, require large and very expensive compute resources, so optimizing compute efficiency has become a key goal for researchers: how can higher accuracy be achieved with limited hardware and training time?

To address this issue, researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab at UC Berkeley explored the effect of Transformer model size on training and inference efficiency. Their new paper shows that under resource constraints, training and inference efficiency can be improved by significantly increasing the size of Transformer models and then heavily compressing them.

Under the usual presumption that models are trained to convergence, only small, fast-to-execute models are feasible in resource-constrained settings. The work shows that the most compute-efficient training scheme is instead to train very large models, stop them well short of convergence, and then heavily compress them to meet test-time constraints.

The researchers conducted several experiments and found that, for a fixed training time, deeper RoBERTa models (RoBERTa is an optimized BERT pretraining approach) reached lower perplexity than models with fewer layers, and wider RoBERTa models likewise reached lower perplexity than narrower ones.
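For readers unfamiliar with the metric: perplexity is the exponential of a language model's average per-token cross-entropy (the lower, the better). Below is a minimal PyTorch sketch of the computation; the tensors are toy stand-ins for real model outputs and are not taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean token-level cross-entropy).

    logits: (num_tokens, vocab_size) unnormalized scores from a language model
    targets: (num_tokens,) ground-truth token ids
    """
    loss = F.cross_entropy(logits, targets, reduction="mean")
    return math.exp(loss.item())

# Toy illustration with random scores over a 10-token vocabulary.
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))
print(perplexity(logits, targets))
```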

The researchers also evaluated the validation BLEU scores of models of different sizes when training an English-French Transformer machine translation model. BLEU is an automatic evaluation metric for machine translation (the higher, the better). For the same training time, deeper and wider models outperformed smaller ones. The researchers also found that increasing model width or depth sped up RoBERTa pretraining, and that wider models work better on machine translation tasks.
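BLEU measures n-gram overlap between system outputs and reference translations, scaled to 0-100. A minimal sketch using the sacrebleu library is shown below; the sentences are invented for illustration and do not reproduce the paper's evaluation setup.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and one stream of reference translations
# (one reference string per hypothesis).
hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
references = [["the cat sat on the mat", "a dog is in the garden"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100, higher is better
```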

Although training a larger model can deliver higher training efficiency, it also raises the computation and memory cost of inference, and in most practical applications the total cost of inference far exceeds the cost of training. The "Train Large, Then Compress" approach addresses this problem: the researchers used compression techniques such as quantization and pruning, both of which reduce inference latency and memory requirements.
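As a rough illustration of what these two compression techniques look like in practice, here is a minimal PyTorch sketch applying post-training dynamic quantization and unstructured magnitude pruning to a stand-in model. This is not the paper's code; the layer sizes and the 60% sparsity level are arbitrary choices for the example.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torch.quantization import quantize_dynamic

# Stand-in for a Transformer; the models in the paper are RoBERTa and a
# machine-translation Transformer.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Post-training dynamic quantization: weights of Linear layers are stored
# in int8 and dequantized on the fly, reducing memory and often latency.
quantized = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Magnitude pruning: zero out the 60% of weights with the smallest absolute
# value in each Linear layer, then make the sparsity permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")
```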

In the case of RoBERTa, the researchers first pretrained RoBERTa models of different sizes for the same amount of time, then fine-tuned them on a downstream text classification task and applied pruning or quantization for compression. They found that, for a given test-time budget, increasing model size and then applying heavy compression worked best.

The researchers describe their investigation as preliminary and limited to the field of natural language processing, and say their conclusions could be further explored in other fields in the future.

The paper Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers is on arXiv.

Author: Herin Zhao | Editor: Michael Sarazen

