A hands-on tutorial with code

6 Steps to Build RoBERTa (a Robustly Optimised BERT Pretraining Approach)

Jin
Published in Analytics Vidhya
3 min read · Dec 28, 2019


You can learn how to use pretrained models for NLP classification tasks

Photo by Annie Spratt on Unsplash

In this article, a hands-on tutorial is provided to build RoBERTa (a Robustly Optimised BERT Pretraining Approach) for NLP classification tasks.

The code is uploaded on GitHub [Click Here].

The problem with using the latest state-of-the-art models is that their APIs are not easy to use, and there is little documentation and few tutorials (unlike XGBoost or LightGBM).

Here, I try to simplify the build steps and add as many comments as possible. If your task is to build a classification (binary/multi-class) model from text, you only need to change a few parameters/lines in Step 2.

Feel free to use and modify the code, and give feedback. Let’s learn by doing!

STEP 1 — IMPORT PACKAGES

In step 1, we need to import all packages as follows.

To simplify this step for people who use these packages for the first time, I highly recommend using Google Colab and storing files on Google Drive. Why? Most of the packages are pre-installed, and you can use a GPU on Colab for free.

The only package you need to install is ‘pytorch-transformers’.
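As a rough sketch, the setup might look like the following. The exact import list depends on the notebook; I only show the common data-handling packages and leave the pytorch-transformers imports as a comment, since that is the part you install yourself:

```python
# On Colab, the only extra install needed is:
#   !pip install pytorch-transformers
import numpy as np
import pandas as pd
# After installing, the model/tokenizer classes are imported from the package, e.g.:
# from pytorch_transformers import RobertaTokenizer, RobertaModel
```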

STEP 2 — SET UP CONFIG

Almost all changes should be made here in ‘config’. In particular, these are the model hyperparameters, the file paths and the column names.

To have a quick try and a simple model with fewer parameters, I suggest:

roberta_model_name: 'roberta-base'
max_seq_len: about 250
bs: 16 (you are free to use a larger batch size to speed up modelling)

To boost accuracy and have more parameters, I suggest:

roberta_model_name: 'roberta-large'
max_seq_len: over 300
bs: 4 (because of the limitation of GPU memory)

To have multiple different models and use ensemble learning in the future, I suggest changing:

seed
valid_pct
hidden_dropout_prob

To build binary models or multi-class models, you can change:

num_labels
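As an illustration, the kind of config this step sets up might look like the dictionary below. The parameter names mirror those discussed above; the file path and column names are placeholders of my own:

```python
# Hypothetical config mirroring the parameters discussed above.
config = {
    "roberta_model_name": "roberta-base",  # or "roberta-large" for more capacity
    "max_seq_len": 250,                    # ~250 for base, over 300 for large
    "bs": 16,                              # batch size; drop to 4 for roberta-large
    "seed": 42,                            # vary across runs for ensembling
    "valid_pct": 0.2,                      # validation split fraction
    "hidden_dropout_prob": 0.1,            # vary across runs for ensembling
    "num_labels": 2,                       # 2 for binary, >2 for multi-class
    "data_path": "data/train.csv",         # placeholder path
    "text_col": "text",                    # placeholder column names
    "label_col": "label",
}
```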

STEP 3 — SET UP TOKENIZER

You don’t need to change anything here :D
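To make the step concrete, here is a minimal stand-in (not the pytorch-transformers API) showing what the tokenizer step does for RoBERTa: wrap each token sequence in the special tokens `<s>` and `</s>`, and truncate to max_seq_len:

```python
def prepare_tokens(tokens, max_seq_len=250):
    """Truncate and add RoBERTa's special tokens, as the real tokenizer does."""
    tokens = tokens[: max_seq_len - 2]  # reserve two slots for <s> and </s>
    return ["<s>"] + tokens + ["</s>"]

print(prepare_tokens(["hello", "world"]))  # ['<s>', 'hello', 'world', '</s>']
```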

STEP 4 — SET UP DATABUNCH

As the path and column names have already been set up in step 2, you don’t need to change anything here.
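For intuition, the databunch essentially wraps a shuffled train/validation split driven by the seed and valid_pct values from the config. A simplified sketch with pandas (not the actual databunch API):

```python
import numpy as np
import pandas as pd

def split_train_valid(df, valid_pct=0.2, seed=42):
    """Shuffle rows and hold out valid_pct of them, as the databunch step does."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(df))
    n_valid = int(len(df) * valid_pct)
    return df.iloc[idx[n_valid:]], df.iloc[idx[:n_valid]]
```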

STEP 5 — TRAINING AND EVALUATION

Again, you don’t need to change anything here! I left some comments so that it is easy to modify the code if you want.
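As a small illustration of the evaluation side, classification accuracy over model outputs can be computed like this (a generic sketch, not tied to the notebook’s training loop):

```python
import numpy as np

def accuracy(logits, targets):
    """Fraction of rows whose highest-scoring class matches the target label."""
    preds = np.asarray(logits).argmax(axis=1)
    return float((preds == np.asarray(targets)).mean())

print(accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 1]))  # 0.5
```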

STEP 6 — PREDICTION

This is the last step! You can get the predictions, and if you want probabilities instead of 0/1, you need to change the return value in get_preds_as_nparray.
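If you do change get_preds_as_nparray to return raw scores, converting logits to probabilities is a row-wise softmax. A sketch (get_preds_as_nparray is the notebook’s function; the helper below is my own):

```python
import numpy as np

def to_probabilities(logits):
    """Row-wise softmax: turns raw class scores into probabilities summing to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```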

You can check 7 different evaluation metrics for classification models here.
