RoBERTa: Robustly Optimized BERT-Pretraining Approach

Understanding Transformer-Based Self-Supervised Architectures

Rohan Jagtap
DataSeries



BERT (Devlin et al.) is a pioneering language model that is pretrained with a denoising autoencoding objective and achieves state-of-the-art results on many NLP tasks. However, there is still room for improvement in the original BERT model with respect to its pretraining objectives, the data it is trained on, the duration of training, and so on. Facebook AI Research (FAIR) identified these issues and proposed an ‘optimized’ and ‘robust’ version of BERT.
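To make “denoising autoencoding” concrete: BERT randomly corrupts a fraction of the input tokens and learns to reconstruct the originals. Below is a minimal PyTorch sketch of that masking step (the 15% masking rate and the 80/10/10 replacement split follow the BERT paper; the function itself is illustrative, not the authors’ code):

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Corrupt ~15% of the tokens so the model must reconstruct them,
    i.e. BERT's denoising autoencoding (masked-LM) objective."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Pick ~15% of positions to corrupt; the rest are ignored by the loss.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # -100 is ignored by PyTorch's cross-entropy loss

    # Of the chosen positions: 80% are replaced with the [MASK] token...
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # ...10% with a random token, and the remaining 10% are left unchanged.
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    return input_ids, labels

The model is then trained to predict the original tokens at the corrupted positions, which is what the pretraining loss is computed on.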

In this article we’ll discuss RoBERTa: A Robustly Optimized BERT Pretraining Approach, proposed in Liu et al., which is an extension of the original BERT model. The prerequisite for this article is a general awareness of BERT’s architecture and its pretraining and fine-tuning objectives, which in turn implies sufficient familiarity with the Transformer model (Vaswani et al.).

I have already covered Transformers in this article, and BERT in this article. Consider giving them a read if you’re interested.

RoBERTa

If I were to summarize the RoBERTa paper in one line:

It essentially keeps the original BERT architecture and carefully tunes the pretraining recipe, along with the data it is trained on and how long it is trained.
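As an aside, if you just want to experiment with the resulting model before reading further, the Hugging Face transformers library ships pretrained RoBERTa checkpoints. A minimal usage sketch (assuming transformers and torch are installed):

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

inputs = tokenizer("RoBERTa is a robustly optimized BERT.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings for each token: shape (1, sequence_length, 768) for roberta-base
print(outputs.last_hidden_state.shape)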

