Optimizing BERT with RoBERTa | Towards AI
A Robustly Optimized BERT Pretraining Approach
What is BERT?
BERT (Devlin et al., 2018) is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
Liu et al. studied the impact of many key hyper-parameters and of training data size on BERT. They found that BERT was significantly undertrained and, with better training choices, can match or exceed the performance of every model published after it.
The result, RoBERTa (a Robustly optimized BERT approach), matches or exceeds the performance of the original BERT.
BERT Training Objectives
BERT learns text representations with two training objectives:
- Masked Language Modeling (MLM): some input tokens are masked, and the remaining tokens are used to predict the masked ones.
- Next Sentence Prediction (NSP): the model predicts whether a pair of sentences is contiguous in the source text.
If you want to learn more about BERT, you may visit this story.
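The MLM objective can be sketched in a few lines of Python. `mlm_mask` is a hypothetical helper for illustration only; BERT's 80/10/10 mask/keep/replace refinement of the selected tokens is omitted for brevity:

```python
import random

def mlm_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Simplified MLM masking: select ~15% of positions as prediction
    targets and replace them with the mask token. The model is trained
    to predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # original token becomes the target
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, labels
```

During pre-training the loss is computed only over the positions recorded in `labels`; the unmasked tokens provide the bidirectional context.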
As RoBERTa is developed based on BERT, they share most of their configuration. The following items differ between RoBERTa and BERT:
- Reserved Tokens: BERT uses
[CLS]and
[SEP]as starting token and separator token respectively, while RoBERTa uses
<s>and
</s>to wrap sentences.
- Size of Subword Vocabulary: BERT has around 30k subwords (WordPiece) while RoBERTa has around 50k subwords (byte-level BPE).
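The difference in reserved tokens can be illustrated with two toy helpers (hypothetical names; this sketches only how each model wraps a sentence pair, not the actual tokenizer code):

```python
def bert_format(sent_a, sent_b):
    # BERT wraps a sentence pair as: [CLS] A [SEP] B [SEP]
    return ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]

def roberta_format(sent_a, sent_b):
    # RoBERTa wraps a sentence pair as: <s> A </s></s> B </s>
    return ["<s>"] + sent_a + ["</s>", "</s>"] + sent_b + ["</s>"]
```

For example, `bert_format(["hello"], ["world"])` yields `["[CLS]", "hello", "[SEP]", "world", "[SEP]"]`.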
RoBERTa performs better than BERT by applying the following adjustments:
- Bigger training data (16GB vs. 161GB)
- Dynamic masking patterns (BERT uses a static masking pattern)
- Removing the next sentence prediction training objective
- Training with larger batches on longer sequences
Bigger Training Data
RoBERTa uses BookCorpus plus English Wikipedia (16GB), CC-NEWS (76GB), OpenWebText (38GB) and Stories (31GB), while BERT uses only BookCorpus plus English Wikipedia (16GB) as training data.
Static Masking vs Dynamic Masking
BERT masks the training data once during preprocessing for the MLM objective (static masking; in the improved baseline, the data is duplicated 10 times and each copy is masked differently). RoBERTa instead uses dynamic masking: a new masking pattern is generated every time a sequence is fed to the model. In the reported experiments, dynamic masking performs slightly better than static masking and the original BERT reference.
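The two strategies can be contrasted in a short sketch (`mask_once` and `dynamic_mask` are hypothetical helpers; the 80/10/10 replacement rule is again omitted):

```python
import random

def mask_once(tokens, seed, prob=0.15):
    """Apply one random masking pattern, deterministic for a given seed."""
    rng = random.Random(seed)
    return ["[MASK]" if rng.random() < prob else t for t in tokens]

sequence = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the data is duplicated 10 times, so the same 10 fixed
# patterns are reused for this sequence across all training epochs.
static_patterns = [mask_once(sequence, seed) for seed in range(10)]

# Dynamic masking (RoBERTa): a fresh pattern is generated every time the
# sequence is fed to the model, so the model rarely sees the same pattern.
def dynamic_mask(sequence, training_step):
    return mask_once(sequence, seed=training_step)
```

With static masking the model revisits identical masked inputs; with dynamic masking the effective training signal keeps changing, which matters more as training runs longer.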
Different Training Objective
Liu et al. proposed the following input formats to evaluate the usefulness of the
NSP objective. No
NSP loss is applied in the FULL-SENTENCES and DOC-SENTENCES formats:
- SEGMENT-PAIR with NSP: A pair of segments, each of which can contain multiple natural sentences. This is the original BERT input format. The total number of tokens is at most 512.
- SENTENCE-PAIR with NSP: A pair of natural sentences, sampled either from a contiguous portion of one document or from separate documents. This differs slightly from the original BERT approach, and the total number of tokens is significantly less than 512.
- FULL-SENTENCES without NSP: Inputs are packed with full sentences sampled contiguously from one or more documents; when the end of a document is reached, sentences from the next document are added. The total number of tokens is at most 512.
- DOC-SENTENCES without NSP: Same as FULL-SENTENCES, except inputs do not cross document boundaries.
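The FULL-SENTENCES packing strategy can be sketched roughly as follows (`pack_full_sentences` is a hypothetical helper operating on pre-tokenized sentences, and it assumes every sentence is shorter than the length limit):

```python
def pack_full_sentences(docs, max_len=512, sep="</s>"):
    """FULL-SENTENCES packing (sketch): fill each training input with
    consecutive sentences, crossing document boundaries (marked with a
    separator token) until the next sentence would exceed max_len tokens.

    docs: list of documents, each a list of sentences (lists of tokens).
    """
    inputs, current = [], []
    for doc in docs:
        for sent in doc:
            # +1 reserves room for the separator appended at the doc end.
            if current and len(current) + len(sent) + 1 > max_len:
                inputs.append(current)
                current = []
            current.extend(sent)
        if current:
            current.append(sep)  # document boundary marker
    if current:
        inputs.append(current)
    return inputs
```

DOC-SENTENCES would differ only in flushing `current` at every document boundary instead of continuing into the next document.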
From these experiments, training without the
NSP objective achieves better results.
Training on Longer Sequences
BERT-BASE (Devlin et al., 2018) was trained for 1M steps with a batch size of 256 sequences. Liu et al. instead trained for 125k steps with a batch size of 2k sequences, and for 31k steps with a batch size of 8k sequences. Their experiments show that 125k steps with a batch size of 2k sequences achieves the best result.
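A quick back-of-the-envelope calculation shows why these configurations are comparable: the number of sequences processed is steps × batch size in each case.

```python
# Sequences processed = training steps x batch size
# (sequence length is 512 tokens in all three setups).
setups = {
    "BERT-BASE: 1M steps, batch 256": 1_000_000 * 256,
    "RoBERTa: 125k steps, batch 2k": 125_000 * 2_000,
    "RoBERTa: 31k steps, batch 8k": 31_000 * 8_000,
}
for name, n_sequences in setups.items():
    print(f"{name} -> {n_sequences:,} sequences")
# -> 256,000,000 / 250,000,000 / 248,000,000 sequences respectively
```

All three setups see roughly the same number of sequences, so the comparison isolates the effect of batch size rather than the total amount of data.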
- In short, RoBERTa further tunes BERT by increasing the training data size and adjusting hyper-parameters only, without changing the model architecture.
Like to learn?
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or GitHub.
- Introduction to BERT
- Original implementation from Facebook (PyTorch)
- Hugging Face implementation (PyTorch)
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.