RoBERTa: A Robustly Optimized BERT Pretraining Approach

Rohan Jagtap


BERT (Devlin et al.) is a pioneering language model pretrained with a denoising autoencoding objective, and it produces state-of-the-art results on many NLP tasks. However, the original BERT model still leaves room for improvement with respect to its pretraining objectives, the data it is trained on, the duration of training, and so on. Facebook AI Research (FAIR) identified these issues and proposed an ‘optimized’ and ‘robust’ version of BERT.

In this article, we’ll discuss RoBERTa: A Robustly Optimized BERT Pretraining Approach, proposed in Liu et al., which extends the original BERT model. This article assumes general familiarity with BERT’s architecture and its pretraining and fine-tuning objectives, which in turn requires a working understanding of the Transformer model (Vaswani et al.).

If I were to summarize the RoBERTa paper in one line:

It essentially includes fine-tuning the original BERT model along with data and



