A hands-on tutorial with code
6 Steps to Build RoBERTa (a Robustly Optimised BERT Pretraining Approach)
Learn how to build models on top of pretrained language models for NLP classification tasks
In this article, I provide a hands-on tutorial on building RoBERTa (a Robustly Optimised BERT Pretraining Approach) for NLP classification tasks.
The code is available on GitHub [Click Here].
The problem with using the latest, state-of-the-art models is that their APIs are not easy to use, and documentation and tutorials are scarce (unlike for XGBoost or LightGBM).
Here, I try to simplify the steps and to comment the code as much as possible. If your task is to build a classification (binary or multi-class) model from text, you only need to change a few parameters/lines in step 2.
Feel free to use and modify the code, and to give feedback. Let’s learn by doing!
STEP 1 — IMPORT PACKAGES
In step 1, we need to import all required packages as follows.
To make this step easier for people using these packages for the first time, I highly recommend using Google Colab and storing your files on Google Drive. Why? Most of the packages are already installed on Colab, and you can use a GPU for free.
The only package you need to install is ‘pytorch-transformers’.
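Before running the rest of the notebook on Colab, it can help to confirm that this one missing dependency is in place. A small convenience check (my own addition, not part of the tutorial's code; note the package installs as ‘pytorch-transformers’ but imports as ‘pytorch_transformers’):

```python
import importlib.util

# 'pytorch-transformers' is the only extra package the tutorial needs;
# its Python import name is 'pytorch_transformers'.
if importlib.util.find_spec("pytorch_transformers") is None:
    print("Missing - run: !pip install pytorch-transformers")
else:
    print("pytorch-transformers is already installed")
```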
STEP 2 — SET UP CONFIG
Almost all changes happen here in ‘config’: the model hyperparameters, the file paths and the column names.
For a quick try with a simpler model (fewer parameters), I suggest:
roberta_model_name: 'roberta-base'
max_seq_len: about 250
bs: 16 (feel free to use a larger batch size to speed up training)
To boost accuracy with a larger model (more parameters), I suggest:
roberta_model_name: 'roberta-large'
max_seq_len: over 300
bs: 4 (because of GPU memory limits)
To train multiple different models for ensemble learning later, I suggest varying:
seed
valid_pct
hidden_dropout_prob
To switch between binary and multi-class models, change:
num_labels
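Putting the suggestions above together, the ‘quick try’ config might look like this (the key names follow the parameters listed above, but the exact names in the notebook may differ slightly):

```python
# Illustrative 'quick try' config using the 'roberta-base' suggestions above.
config = {
    "roberta_model_name": "roberta-base",  # or "roberta-large" for accuracy
    "max_seq_len": 250,
    "bs": 16,                   # drop to 4 with "roberta-large" (GPU memory)
    "seed": 42,                 # vary this for ensembling
    "valid_pct": 0.1,           # validation split fraction; vary for ensembling
    "hidden_dropout_prob": 0.1, # vary for ensembling
    "num_labels": 2,            # 2 = binary, >2 = multi-class
}
```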
STEP 3 — SET UP TOKENIZER
You don’t need to change anything here :D
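Still, it helps to know what the tokenizer wrapper is doing under the hood: unlike BERT’s [CLS]/[SEP], RoBERTa marks a sequence with &lt;s&gt; and &lt;/s&gt; tokens. A minimal sketch of that wrapping step (my own simplified stand-in, not the notebook’s actual class):

```python
def wrap_roberta_tokens(tokens, max_seq_len=250):
    """Truncate a token list to fit max_seq_len and add RoBERTa's
    special markers: <s> at the start and </s> at the end
    (RoBERTa's analogue of BERT's [CLS] and [SEP])."""
    return ["<s>"] + tokens[: max_seq_len - 2] + ["</s>"]

print(wrap_roberta_tokens(["This", "Ġis", "Ġgreat"]))
# -> ['<s>', 'This', 'Ġis', 'Ġgreat', '</s>']
```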
STEP 4 — SET UP DATABUNCH
As the path and column names have already been set up in step 2, you don’t need to change anything here.
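For intuition, the seeded validation split that the DataBunch performs can be sketched in plain Python (a simplified stand-in for what fastai does with the ‘seed’ and ‘valid_pct’ values from the config):

```python
import random

def train_valid_split(items, valid_pct=0.1, seed=42):
    """Shuffle indices reproducibly with `seed`, then hold out
    a `valid_pct` fraction of the data for validation."""
    rng = random.Random(seed)
    idx = list(range(len(items)))
    rng.shuffle(idx)
    cut = int(len(items) * valid_pct)
    valid = [items[i] for i in idx[:cut]]
    train = [items[i] for i in idx[cut:]]
    return train, valid
```

Changing ‘seed’ or ‘valid_pct’ (as suggested in step 2 for ensembling) therefore changes which rows each model trains on.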
STEP 5 — TRAINING AND EVALUATION
Again, you don’t need to change anything here! I left some comments so that it is easy to modify the code if you want.
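If the metric tracked during evaluation is plain accuracy (the usual default for fastai classifiers; I’m assuming that here), it reduces to this one-liner:

```python
def accuracy(preds, targets):
    """Fraction of predicted labels that match the target labels."""
    assert len(preds) == len(targets)
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # -> 0.75
```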
STEP 6 — PREDICTION
This is the last step! You can now get the predictions. If you want probabilities instead of 0/1 labels, you need to change the return value in get_preds_as_nparray.
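To see why that change gives probabilities: the model outputs one logit per class, a softmax turns a row of logits into class probabilities, and an argmax over those probabilities gives the hard 0/1 label. A self-contained sketch (pure Python; the function names are mine, not the notebook’s):

```python
import math

def softmax(logits):
    """Convert one row of class logits into probabilities that sum to 1.
    Subtracting the max first keeps the exponentials numerically stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3, 2.1]              # example binary-classifier output
probs = softmax(logits)          # probabilities, e.g. what you'd return
label = probs.index(max(probs))  # the hard 0/1 prediction (here: 1)
```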
You can check 7 different evaluation metrics for classification models here.
RESOURCES
fast.ai: https://www.fast.ai/
transformers GitHub: https://github.com/huggingface/transformers
transformers documentation: https://huggingface.co/transformers/
An example of using RoBERTa for a binary classification problem on GitHub: https://github.com/devkosal/fastai_roberta/blob/master/fastai_roberta_imdb/Using%20RoBERTa%20with%20Fastai%20Tutorial.ipynb
fast.ai YouTube tutorials: https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9
BERT paper: https://arxiv.org/pdf/1810.04805.pdf
RoBERTa paper: https://arxiv.org/pdf/1907.11692.pdf