A Robustly Optimized BERT Pretraining Approach

What is BERT?

Edward Ma
Sep 11 · 4 min read

BERT (Devlin et al., 2018) is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.

Liu et al. (2019) studied the impact of many key hyper-parameters and of training data size on BERT pretraining. They found that BERT was significantly undertrained and that, with better training, it can match or exceed the performance of every model published after it. They introduce RoBERTa (Robustly optimized BERT approach), whose performance matches or exceeds that of the original BERT.

BERT Training Objective

BERT uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) to learn text representations. MLM masks some of the tokens and uses the remaining tokens to predict the masked ones. NSP predicts whether the second sentence of a pair follows the first in the original text. If you want to learn more about BERT, you may visit this story.
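To make the MLM idea concrete, here is a minimal sketch in plain Python. It is a simplification and not the original implementation: the helper name `mask_tokens` is hypothetical, the input is split on whitespace instead of into subwords, and every selected position is replaced by [MASK]. The 15% masking rate follows Devlin et al. (2018), who in addition sometimes keep the selected token or swap in a random one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one MLM training example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would only be computed on masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, seed=0)
print(masked)   # the input with one position replaced by [MASK]
print(labels)   # the original token at that position, None elsewhere
```

The model only ever sees `masked`; `labels` is what it is asked to recover.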

Model Setting

As RoBERTa is developed based on BERT, they share lots of configs. The following items are different between RoBERTa and BERT:

  • Reserved Token: BERT uses [CLS] and [SEP] as the starting token and separator token respectively, while RoBERTa uses <s> and </s> to mark the start and end of sentences.
  • Size of Subword Vocabulary: BERT has around 30k subwords while RoBERTa has around 50k subwords.
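The reserved-token difference can be sketched as follows. The two helper functions are hypothetical and work on whitespace-split words rather than real subwords. The doubled </s> between the two segments mirrors what the commonly used RoBERTa implementations produce; treat that detail as an assumption and check your tokenizer's output.

```python
def wrap_bert(sent_a, sent_b=None):
    """BERT-style input: [CLS] A [SEP] (B [SEP])."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"]
    if sent_b is not None:
        tokens += sent_b + ["[SEP]"]
    return tokens

def wrap_roberta(sent_a, sent_b=None):
    """RoBERTa-style input: <s> A </s> (</s> B </s>)."""
    tokens = ["<s>"] + sent_a + ["</s>"]
    if sent_b is not None:
        # Assumption: a second </s> separates the two segments.
        tokens += ["</s>"] + sent_b + ["</s>"]
    return tokens

a = "how are you".split()
b = "fine thanks".split()
print(wrap_bert(a, b))
print(wrap_roberta(a, b))
```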


RoBERTa performs better than BERT by applying the following adjustments:

  1. Bigger training data (161G for RoBERTa vs 16G for BERT)
  2. Using a dynamic masking pattern (BERT uses a static masking pattern)
  3. Removing the next sentence prediction training objective
  4. Training on longer sequences

RoBERTa uses BookCorpus plus English Wikipedia (16G), CC-NEWS (76G), OpenWebText (38G) and Stories (31G), while BERT uses only the first of these (BookCorpus plus English Wikipedia) as training data.

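The original BERT implementation masks the training data once during preprocessing for the MLM objective, so a sequence carries the same static mask in every epoch. (For their static baseline, Liu et al. duplicated the training data 10 times so that each sequence is masked in 10 different ways.) RoBERTa instead uses dynamic masking, generating a new masking pattern every time a sequence is fed to the model. In the following experiment, dynamic masking performs comparably to or slightly better than static masking and the reference (BERT).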
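The difference between the two strategies can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' code: the function names are hypothetical, tokens are whitespace-split words, `<mask>` stands in for the mask token, and the static variant keeps a single fixed mask per sequence.

```python
import random

MASK = "<mask>"

def random_mask(tokens, rng, mask_prob=0.15):
    """Replace a random subset of a non-empty token list with MASK."""
    n = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

def static_epochs(tokens, n_epochs, seed=0):
    """Mask once during preprocessing, then reuse the same masked copy."""
    fixed = random_mask(tokens, random.Random(seed))
    return [fixed for _ in range(n_epochs)]

def dynamic_epochs(tokens, n_epochs, seed=0):
    """Draw a new mask every time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [random_mask(tokens, rng) for _ in range(n_epochs)]

tokens = "roberta samples a new masking pattern every time".split()
for name, fn in [("static", static_epochs), ("dynamic", dynamic_epochs)]:
    patterns = {tuple(seq) for seq in fn(tokens, n_epochs=20)}
    print(name, "distinct masked versions:", len(patterns))
```

With static masking the model sees one masked version of the sequence no matter how many epochs are run, while dynamic masking exposes it to several.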

Comparison of masking method (Liu et al., 2019)

Liu et al. proposed the following input formats to evaluate whether the NSP objective is useful. No NSP training is applied to the FULL-SENTENCES and DOC-SENTENCES approaches.

  • SEGMENT-PAIR with NSP: A pair of segments, each of which can contain multiple natural sentences. It is the same as the original BERT input format. The total number of tokens is at most 512.
  • SENTENCE-PAIR with NSP: A pair of natural sentences, either sampled from a contiguous portion of one document or from separate documents. It is slightly different from the original BERT approach. The total number of tokens is significantly less than 512.
  • FULL-SENTENCES without NSP: Inputs are packed with full sentences sampled contiguously from one or more documents. When the end of a document is reached, sampling continues with sentences from the next document. The total number of tokens is at most 512.
  • DOC-SENTENCES without NSP: Same as FULL-SENTENCES except that inputs do not cross document boundaries.
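The two formats without NSP differ only in whether an input may span a document boundary. Below is a rough sketch of greedy packing under assumptions of my own: `pack_sentences` and its parameters are hypothetical names, overlong sentences are simply truncated, and a </s> token marks the document boundary inside a FULL-SENTENCES input.

```python
def pack_sentences(documents, max_tokens=512, cross_documents=True, sep="</s>"):
    """Greedily pack tokenized sentences into inputs of at most max_tokens.

    documents: list of documents; each is a list of sentences; each
    sentence is a list of tokens.
    cross_documents=True  -> FULL-SENTENCES style (inputs may span documents).
    cross_documents=False -> DOC-SENTENCES style (inputs never span documents).
    """
    inputs, current = [], []
    for doc in documents:
        # Mark the document boundary inside a partially filled input.
        if cross_documents and current and len(current) < max_tokens:
            current = current + [sep]
        for sentence in doc:
            sentence = sentence[:max_tokens]  # truncate overlong sentences
            if current and len(current) + len(sentence) > max_tokens:
                inputs.append(current)
                current = []
            current = current + sentence
        # DOC-SENTENCES: close the input at the end of every document.
        if not cross_documents and current:
            inputs.append(current)
            current = []
    if current:
        inputs.append(current)
    return inputs

docs = [[["a", "b"], ["c"]], [["d", "e"]]]
print(pack_sentences(docs, max_tokens=6, cross_documents=True))
print(pack_sentences(docs, max_tokens=6, cross_documents=False))
```

In the toy example, the first call yields one input that continues into the second document, while the second call yields one shorter input per document.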

From the following experiments, we noticed that the approaches trained without NSP achieve better results.

Comparison of Training Objective (Liu et al., 2019)

BERT-BASE (Devlin et al., 2018) is trained for 1M steps with a batch size of 256 sequences, while Liu et al. trained for 125k steps with a batch size of 2k sequences and for 31k steps with a batch size of 8k sequences. The following experiment shows that 125k steps with 2k sequences achieves a better result.
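A quick calculation shows why these settings are comparable: steps multiplied by batch size gives roughly the same number of training sequences in all three cases. The sketch below reads "2k" and "8k" as 2,000 and 8,000, which is an assumption; if they stand for 2,048 and 8,192 the totals land even closer to BERT-BASE.

```python
# (training steps, batch size in sequences) as quoted in the text
configs = {
    "BERT-BASE, batch 256": (1_000_000, 256),
    "Liu et al., batch 2k": (125_000, 2_000),
    "Liu et al., batch 8k": (31_000, 8_000),
}

def sequences_seen(steps, batch_size):
    """Total number of sequences processed during training."""
    return steps * batch_size

for name, (steps, batch_size) in configs.items():
    total = sequences_seen(steps, batch_size)
    print(f"{name}: {total / 1e6:.0f}M sequences")
```

So the larger batches are not given more data overall; they trade many small updates for fewer, larger ones.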

Comparison of Hyper-parameters (Liu et al., 2019)

Take Away

  • RoBERTa improves on BERT only by increasing the data size and tuning hyper-parameters.

Like to learn?

I am a Data Scientist in the Bay Area, focusing on the state-of-the-art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or Github.
