Evolving with BERT: Introduction to RoBERTa

Aastha Singh
Published in Analytics Vidhya
8 min read · Jun 28, 2021

No matter how good it is, it can always get better, and that’s the exciting part.

In this article, I will discuss that "exciting part": how Facebook AI Research modified the training procedure of Google's existing BERT, proving to the world that there is always room to improve.

Let’s look at the development of a robustly optimized method for pretraining natural language processing (NLP) systems: RoBERTa.

Open Source BERT by Google

Bidirectional Encoder Representations from Transformers, or BERT, is a self-supervised method released by Google in 2018.

BERT is a tool/model that understands language better than any other model in history. It is freely available and incredibly versatile, since it can solve a large number of language-related tasks. You have probably used BERT without even knowing it!

If you have used Google Search, you have already used BERT.

Architecture:

Transformer model — a foundational concept for BERT

BERT is based on the Transformer model architecture.

Examining the model as if it were a single black box, a machine translation application would take a sentence in one language and translate it into a different language.

Transformer performing the task of machine translation [Source]
  • Basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the task.
The basic structure of a transformer [Source]
  • Since BERT’s goal is to generate a language representation model, it only needs the encoder part. Hence, BERT is essentially a trained stack of Transformer encoders.
A basic structure of encoder block [Source]

To understand further implementations of its mechanism, refer to my previous blog
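Since BERT is essentially a stack of Transformer encoders, a quick way to see this for yourself is to load the pretrained encoder and inspect its configuration. This is a minimal sketch assuming the Hugging Face transformers library is installed (it is not otherwise used in this post):

from transformers import BertModel

# Load the pre-trained BERT base encoder and inspect its structure
model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_hidden_layers)  # 12 stacked encoder blocks
print(model.config.hidden_size)        # 768-dimensional hidden states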

Training of BERT

During pretraining, BERT uses two objectives: masked language modeling and next sentence prediction.

  • Masked Language Modeling (MLM) randomly selects 15% of the input tokens; 80% of those selected tokens are replaced with a [MASK] token, and the model uses the surrounding tokens to predict the original (missing) words. A toy sketch follows this list.
  • Next Sentence Prediction (NSP) is a binary classification loss for predicting whether two segments follow each other in the original text or come from different documents, encouraging the model to capture relationships between sentences.
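Here is that toy sketch of BERT-style masking. It is purely illustrative (the function name, toy vocabulary, and simplified 80/10/10 handling are my own, not BERT's actual preprocessing code):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # Select ~15% of tokens; of those, replace 80% with [MASK],
    # 10% with a random token, and leave 10% unchanged.
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must predict this token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with the mask token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(sentence, vocab=sentence))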

Beginning of the Optimization of BERT: Introduction to RoBERTa

Room for improvement in BERT

Learning scope of improvements in BERT

BERT is significantly undertrained, and the following areas offer scope for modification.

1. Masking in BERT training:

The masking is done only once during data preprocessing, resulting in a single static mask. Hence, the same input masks were fed to the model on every single epoch.

2. Next Sentence Prediction:

  • The original input format used in BERT is SEGMENT-PAIR+NSP LOSS.
  • In this, each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens.
  • It was noticed that using individual sentences hurts performance on downstream tasks; the hypothesis is that the model is then unable to learn long-range dependencies. Hence, the authors could experiment with removing or adding the NSP loss to see its effect on the model's performance.

3. Text Encoding:

  • The original BERT implementation uses a character-level BPE vocabulary of size 30K.
  • BERT uses the WordPiece method, a language-modeling-based variant of Byte Pair Encoding.

4. Training batch size:

Originally, BERT is trained for 1M steps with a batch size of 256 sequences, which leaves room for improvement in perplexity on the Masked Language Modelling objective.

Altering the Training Procedure:

1. Replacing Static Masking with Dynamic Masking:

To avoid masking the same words in the same positions on every epoch, Facebook used dynamic masking: the training data was duplicated 10 times and a different mask was applied to each copy, so the model sees the same sentence but with different words masked each time. A toy sketch of the idea follows.
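A minimal sketch of that idea, assuming whitespace-tokenized text (the function name and printing are my own; the real implementation lives in fairseq's data pipeline):

import random

def dynamic_masking_copies(sentence, num_copies=10, mask_prob=0.15):
    # Duplicate the same sentence several times and draw a fresh random mask
    # for each copy, so the model rarely sees identical masked positions twice.
    tokens = sentence.split()
    for copy in range(num_copies):
        masked = ["[MASK]" if random.random() < mask_prob else t for t in tokens]
        print(f"copy {copy}: {' '.join(masked)}")

dynamic_masking_copies("the quick brown fox jumps over the lazy dog")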

2. Removing NSP:

TEST 1: Feeding the following alternate training formats.

2.1. Retain NSP Loss:

  • SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, sampled from a contiguous portion of one document or separate documents. The NSP loss is retained.

2.2. Remove NSP loss:

  • FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. The NSP loss is removed. A toy packing sketch follows this list.
  • DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so the batch size is dynamically increased in these cases to reach a total token count similar to FULL-SENTENCES. The NSP loss is removed.
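Here is that toy sketch of FULL-SENTENCES-style packing. It is a rough illustration under simplifying assumptions (whitespace tokenization, a single separator token, no special handling at document boundaries), not the actual fairseq data loading code:

def pack_full_sentences(documents, max_len=512, sep="</s>"):
    # Greedily fill each training input with contiguous sentences, allowed to
    # cross document boundaries, until adding the next sentence would exceed max_len.
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            tokens = sentence.split()
            if current and len(current) + len(tokens) + 1 > max_len:
                inputs.append(current)
                current = []
            current += tokens + [sep]
    if current:
        inputs.append(current)
    return inputs

docs = [["the cat sat on the mat .", "it purred ."], ["dogs bark loudly ."]]
print(pack_full_sentences(docs, max_len=12))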

Results —

  • Removing the NSP loss matches or slightly improves downstream task performance, in contrast to the original BERT with NSP loss.
  • The sequences from a single document (DOC-SENTENCES) perform slightly better than packing sequences from multiple documents (FULL-SENTENCES) as shown in Table 1.
Table 1: Comparison of performance of models with and without NSP loss (image is taken from the paper)

3. Training with Large Mini-Batches:

It was observed that training with large mini-batches improves both perplexity on the MLM objective and end-task accuracy.

Training for 1M steps with a batch size of 256 sequences has a computational cost equivalent to training for 31K steps with a batch size of 8K.

Large batches are also easier to parallelize via distributed data parallel training.
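On a single GPU, one common way to approximate such large batches is gradient accumulation. The sketch below is my own illustration (assuming a Hugging Face-style model whose forward pass returns an object with a .loss attribute), not how the paper's distributed training was actually run:

def train_with_accumulation(model, optimizer, data_loader, accumulation_steps=32):
    # Simulate a large effective batch (e.g. 256 sequences x 32 steps = 8,192 sequences)
    # by accumulating gradients over several small mini-batches before each update.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader):
        loss = model(**batch).loss / accumulation_steps  # scale so gradients average correctly
        loss.backward()                                  # gradients accumulate across mini-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()        # one optimizer update per effective large batch
            optimizer.zero_grad()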

4. Byte-Pair Encoding:

  • Here, Byte-Pair Encoding is applied over raw bytes instead of Unicode characters.
  • RoBERTa uses a byte-level BPE vocabulary of 50K subword units, which is larger than BERT's 30K vocabulary.
A quick example of Byte Pair Encoding (BPE) [Source]

Despite slightly degrading end-task performance in some cases, this method was chosen for encoding because it is a universal scheme that doesn't need any additional preprocessing or tokenization rules. A quick way to see byte-level BPE in action is sketched below.
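This is a minimal sketch assuming the Hugging Face transformers library is available; it simply shows how RoBERTa's byte-level BPE splits text into subword units (the example sentence is my own):

from transformers import RobertaTokenizer

# RoBERTa's tokenizer uses byte-level BPE with a ~50K subword vocabulary
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(tokenizer.vocab_size)  # 50265
print(tokenizer.tokenize("Byte-pair encoding handles rare words like anthropomorphize."))
# rare words get split into multiple subword pieces; 'Ġ' marks a leading space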

5. Increasing Training Data:

It was observed that training BERT on larger datasets greatly improves its performance. Hence, the training data was increased to 160GB of uncompressed text.

Google’s BERT after seeing the modifications! source

Facebook’s RoBERTa: An optimized method for pretraining self-supervised NLP systems

The issues discussed above were identified by Facebook AI Research (FAIR), and hence, they proposed an ‘optimized’ and ‘robust’ version of BERT.

RoBERTa really is robust across NLU tasks; the absolute geniuses at Facebook actually did it, and it's not clickbait!

RoBERTa is part of Facebook’s ongoing commitment to advancing the state-of-the-art in self-supervised systems that can be developed with less reliance on time- and resource-intensive data labeling.

Its pretrained representations already act as rich context for the rest of the input, which is one reason RoBERTa performs so well, even compared with models fine-tuned on tasks like SQuAD.

Why RoBERTa matters

The world appreciates Google for open-sourcing this natural language processing model. Through RoBERTa, we can see how that decision to open-source BERT has brought a drastic change to NLP.

The study demonstrates how a sufficiently pretrained model, when scaled to extremes in data and compute, leads to improvements even on simple tasks.

Improving on BERT's potential is hugely impacting the market and economic outlook.

  • BERT and RoBERTa are used to improve NLP tasks because they provide an embedding vector space that is rich in context.
  • Using RoBERTa to preprocess and represent text data is a major advancement for everyone from small product teams to big multinational companies, since much of their work involves incorporating data for analysis and extracting information.

This is significant because, as a result of these studies, experiments, and developments, we are getting closer to the larger challenge for NLP models, which is to achieve a human-level understanding of language!

How RoBERTa is different from BERT

The authors of RoBERTa suggest that BERT is largely undertrained and hence, they put forth the following improvements for the same.

  • More training data (16GB vs. 160GB).
  • Uses a dynamic masking pattern instead of a static one.
  • Replaces the next sentence prediction objective with FULL-SENTENCES inputs without NSP loss.
  • Trains on longer sequences.
Accuracy vs Number of training steps plot for different models. [Source]
Comparison of BERT, RoBERTa, DistilBERT, XLNet in terms of training strategies

Zero-Shot Learning with RoBERTa

Zero-Shot Learning:

It is a machine learning technique in which a pretrained model is used on a task without being fine-tuned for that particular task.

Explaining Zero-Shot Learning

You can use the following examples to implement any text sequence classification task without task-specific fine-tuning, simply by following the steps. The same approach is also used extensively for sequence regression tasks.

Load RoBERTa from torch.hub

Loading pre-trained RoBERTa Large model using Pytorch

import torch

# Download and load the pre-trained RoBERTa Large model from torch.hub (fairseq)
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout for evaluation

Roberta For Sequence Classification:

This is the RoBERTa model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks.

1. Download RoBERTa already fine-tuned for MNLI

This RoBERTa checkpoint has also been fine-tuned on MNLI (Multi-Genre Natural Language Inference), a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval() # disable dropout for evaluation

2. Encode a pair of sentences and make a prediction

tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

3. Encode another pair of sentences

tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is based on BERT.')
roberta.predict('mnli', tokens).argmax()  # 2: entailment
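Building on the MNLI examples above, here is a hedged sketch of how the same checkpoint can be used for zero-shot classification of arbitrary text: each candidate label is turned into an entailment hypothesis and scored. The helper function, hypothesis template, and labels below are my own illustration, not part of fairseq; as in the examples above, index 2 of the MNLI head is assumed to correspond to "entailment".

def zero_shot_classify(roberta, text, candidate_labels):
    # Score each label by the log-probability that the text entails
    # the hypothesis "This text is about <label>."
    scores = []
    for label in candidate_labels:
        tokens = roberta.encode(text, f'This text is about {label}.')
        log_probs = roberta.predict('mnli', tokens)  # [contradiction, neutral, entailment]
        scores.append(log_probs[0, 2].item())        # log-probability of entailment
    return max(zip(candidate_labels, scores), key=lambda pair: pair[1])

print(zero_shot_classify(roberta, 'The team clinched the championship in the final minutes.',
                         ['sports', 'politics', 'technology']))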

Further Discussion

Transformers are extremely memory intensive. Hence, there is quite a high probability that we will run out of memory or exceed the runtime limit when training larger models or training for more epochs.

Hence, in my next blog, I'll be discussing and implementing some promising, well-known, and impactful out-of-the-box optimization strategies to speed up transformers and reduce training time.

Creating bonds, stories, and magic together!

That’s me connecting with you through the screens! source

I am aware that technology is expanding, that minds are becoming more inquisitive, and that the results are becoming more interesting. And it is because of this that we continue to learn and grow!

Congratulations on making it to the end of the blog; and yes, that was me above!

If you enjoyed learning about RoBERTa today, let me know what tech stacks you use in the comments below, and feel free to reach out for any queries.
