Using RoBERTa with Fastai for NLP

Implementing the current state-of-the-art NLP model in Fastai

Dev Sharma
Published in Analytics Vidhya
Sep 2, 2019


This tutorial will walk you through integrating Fairseq’s RoBERTa model using Hugging Face’s Transformers and the Fastai library. We will be building upon Keita Kurita’s article on Fine-Tuning BERT with Fast AI, and we will be using the IMDB dataset.

Update 2020.11: Fastai has upgraded to v2 since the release of this article. The steps below have not been tested with v2, so using v1 is recommended when following along.

Fastai provides a streamlined interface to build datasets and train models. However, it doesn’t offer built-in support for current state-of-the-art NLP models such as RoBERTa, BERT or XLNet (as of Sep 2019). Integrating these into Fastai lets you combine the convenience of Fastai methods with the strong predictive power of these pretrained models.

Transfer learning is still relatively new to NLP and is developing at a very rapid pace, so it is promising to see a model such as RoBERTa perform so well on the SuperGLUE benchmark across a variety of NLP tasks.

RoBERTa vs. other models on SuperGLUE tasks. source

In essence, RoBERTa builds upon BERT by pretraining longer, on more data and with bigger batch sizes, while pretraining only on masked language modeling rather than on next sentence prediction as well. The underlying architecture remains unchanged, and both models utilize masked language model pretraining. You can read here for more information on the differences.

0. Prerequisites

You will need both the Fastai and Transformers libraries installed, preferably with access to a GPU. For Fastai, you can follow the instructions provided here. For Transformers:

pip install transformers

1. Setting Up the Tokenizer

First, let’s import relevant Fastai tools:

from fastai.text import *
from fastai.metrics import *

and Roberta’s Tokenizer from Transformers:

from transformers import RobertaTokenizer
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

RoBERTa uses different default special tokens from BERT. Instead of [CLS] and [SEP] for the starting and ending tokens, it uses <s> and </s> respectively. For example, a tokenized movie review may look like:

“the movie was great” → [<s>, the, Ġmovie, Ġwas, Ġgreat, </s>]

We will now create a Fastai wrapper around RobertaTokenizer.
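A minimal sketch of what such a wrapper might look like follows. It is an illustration under assumptions rather than the exact code behind the results in this article. In Fastai v1 the class would normally subclass `BaseTokenizer` from `fastai.text`; here it is a plain class that exposes the same methods, and it is exercised with a hypothetical stand-in tokenizer (`WhitespaceStandIn`) so that the example runs without either library installed.

```python
from typing import List


class FastAiRobertaTokenizer:
    """Sketch of a Fastai-style wrapper around a RoBERTa tokenizer.

    Assumption: with Fastai v1 installed this would subclass
    fastai.text.BaseTokenizer; a plain class keeps the sketch dependency-free.
    """

    def __init__(self, tokenizer, max_seq_len: int = 128, **kwargs):
        self._pretrained_tokenizer = tokenizer  # any object with .tokenize(str)
        self.max_seq_len = max_seq_len

    def __call__(self, *args, **kwargs):
        # Fastai's Tokenizer calls tok_func(lang) to build a tokenizer;
        # returning self lets an already-built instance be passed as tok_func.
        return self

    def add_special_cases(self, toks):
        # No-op: RoBERTa's own vocabulary already defines its special tokens.
        pass

    def tokenizer(self, t: str) -> List[str]:
        # Reserve two positions for <s> and </s>, then truncate the rest.
        body = self._pretrained_tokenizer.tokenize(t)[: self.max_seq_len - 2]
        return ["<s>"] + body + ["</s>"]


class WhitespaceStandIn:
    """Hypothetical stand-in for RobertaTokenizer (NOT real byte-level BPE)."""

    def tokenize(self, text: str) -> List[str]:
        words = text.split()
        # Mimic the 'Ġ' marker RoBERTa puts on tokens preceded by a space.
        return words[:1] + ["Ġ" + w for w in words[1:]]


wrapped = FastAiRobertaTokenizer(WhitespaceStandIn(), max_seq_len=6)
print(wrapped.tokenizer("the movie was great"))
print(wrapped.tokenizer("the movie was really very great"))  # truncated to 6 tokens
```

Subtracting two from `max_seq_len` before truncating keeps the final sequence, including both special tokens, within the length limit.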

Now, we can initialize our Fastai tokenizer: (Note: we have to wrap our Fastai wrapper within the Tokenizer class for Fastai compatibility)

fastai_tokenizer = Tokenizer(tok_func = FastAiRobertaTokenizer(roberta_tok, max_seq_len=256), pre_rules=[], post_rules=[])

Next, we will load Roberta’s vocabulary.

import json

path = Path()
with open('vocab.json', 'r') as f:
    roberta_vocab_dict = json.load(f)  # maps token -> id

fastai_roberta_vocab = Vocab(list(roberta_vocab_dict.keys()))
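`Vocab` receives a list in which position i holds the token with id i, so `list(roberta_vocab_dict.keys())` gives the right mapping only if `vocab.json` lists its tokens in id order. The sketch below, which uses a toy dictionary in place of the real file (the toy entries and the helper name `itos_from_vocab` are hypothetical), sorts by id explicitly so that this assumption does not have to hold.

```python
import json

# Toy stand-in for the contents of vocab.json (token -> id), deliberately
# written out of id order. The entries are illustrative only.
toy_vocab_json = '{"Ġmovie": 5, "<s>": 0, "the": 4, "<pad>": 1, "</s>": 2, "<unk>": 3}'
toy_vocab = json.loads(toy_vocab_json)


def itos_from_vocab(vocab: dict) -> list:
    """Return tokens ordered by id, so that itos[i] is the token with id i."""
    itos = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    # Guard against gaps or duplicate ids, which would break the id mapping.
    assert all(vocab[tok] == i for i, tok in enumerate(itos)), "ids must be 0..n-1"
    return itos


itos = itos_from_vocab(toy_vocab)
print(itos)                    # tokens in id order
print(list(toy_vocab.keys()))  # file order, which differs for this toy input
```

With the real file, `Vocab(itos_from_vocab(roberta_vocab_dict))` would be the corresponding call.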

2. Setting Up the DataBunch

Before we can build our Fastai DataBunch, we need to create appropriate pre-processors for the tokenizer and vocabulary.
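Conceptually, the two pre-processors run in sequence: the first turns each raw string into tokens, and the second maps those tokens to integer ids. The sketch below shows that flow in plain Python. It does not use Fastai’s `TokenizeProcessor` or `NumericalizeProcessor`, and the names `tokenize_step`, `numericalize_step`, and `run_processors` are hypothetical. In Fastai v1 the tokenize step would presumably also need its own beginning-of-sequence marker switched off, since the wrapper already adds `<s>` and `</s>`.

```python
from typing import Callable, Dict, List

# Toy vocabulary (token -> id); illustrative entries, not the real vocab.json.
stoi: Dict[str, int] = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
                        "the": 4, "Ġmovie": 5, "Ġwas": 6, "Ġgreat": 7}


def tokenize_step(texts: List[str]) -> List[List[str]]:
    """Stand-in tokenizer: whitespace split plus RoBERTa-style markers."""
    out = []
    for text in texts:
        words = text.split()
        body = words[:1] + ["Ġ" + w for w in words[1:]]
        out.append(["<s>"] + body + ["</s>"])
    return out


def numericalize_step(token_lists: List[List[str]]) -> List[List[int]]:
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    unk = stoi["<unk>"]
    return [[stoi.get(tok, unk) for tok in toks] for toks in token_lists]


def run_processors(items, processors: List[Callable]):
    """Apply each processor in order, feeding the output of one to the next."""
    for proc in processors:
        items = proc(items)
    return items


ids = run_processors(["the movie was great", "the film was great"],
                     [tokenize_step, numericalize_step])
print(ids)  # "Ġfilm" is not in the toy vocabulary, so it maps to the <unk> id
```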

Now, we will create a DataBunch class specifically for Roberta.

And lastly, we will also need a Roberta specific TextList class:

class RobertaTextList(TextList):
    _bunch = RobertaDataBunch
    _label_cls = TextList

3. Loading the Data

Whew, now that we have finished the involved setup process, we can bring it all together to read in our IMDB data.

df = pd.read_csv("IMDB Dataset.csv")
feat_cols = "review"
label_cols = "sentiment"

We can now simply create a Fastai DataBunch with:

processor = get_roberta_processor(tokenizer=fastai_tokenizer, vocab=fastai_roberta_vocab)
data = RobertaTextList.from_df(df, ".", cols=feat_cols, processor=processor) \
    .split_by_rand_pct(seed=2019) \
    .label_from_df(cols=label_cols, label_cls=CategoryList) \
    .databunch(bs=4, pad_first=False, pad_idx=0)
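The `pad_first=False` and `pad_idx` arguments control how sequences of different lengths are batched: each sequence is extended on the right with the padding id until it matches the longest one in the batch. A plain-Python sketch of that behaviour follows; `pad_batch` is a hypothetical helper, not Fastai’s collate function. Note that in the `roberta-base` vocabulary `<s>` has id 0 and `<pad>` has id 1, so it is worth comparing the `pad_idx` you pass against `roberta_tok.pad_token_id`.

```python
from typing import List


def pad_batch(seqs: List[List[int]], pad_idx: int, pad_first: bool = False) -> List[List[int]]:
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        filler = [pad_idx] * (max_len - len(s))
        # pad_first=False appends padding; pad_first=True prepends it.
        padded.append(filler + s if pad_first else s + filler)
    return padded


batch = [[0, 4, 5, 6, 7, 2], [0, 4, 2]]
print(pad_batch(batch, pad_idx=1))                  # padding on the right
print(pad_batch(batch, pad_idx=1, pad_first=True))  # padding on the left
```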

4. Building a Custom Roberta Model

In this step, we will define the model architecture to pass to our Fastai learner. Essentially, we add a new final layer to the output of the RobertaModel. This layer will be trained specifically for the IMDB sentiment classification.
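One way such a model could be structured is sketched below. This is an assumption-laden illustration, not necessarily the architecture that produced the results reported later: the class name `RobertaClassifierSketch`, the dropout rate, and the `DummyEncoder` stand-in are all hypothetical. The encoder is passed in as an argument so that the example runs offline; the hidden size of 768 matches `roberta-base`.

```python
import torch
import torch.nn as nn


class RobertaClassifierSketch(nn.Module):
    """Encoder plus a single linear classification layer (illustrative sketch)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        # Named `roberta` so that calls like learn.model.roberta.train() resolve.
        self.roberta = encoder
        self.dropout = nn.Dropout(0.1)  # assumed rate, chosen for illustration
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        outputs = self.roberta(input_ids)
        pooled = outputs[1]  # assumed: second output is the pooled representation
        return self.classifier(self.dropout(pooled))


class DummyEncoder(nn.Module):
    """Hypothetical stand-in for RobertaModel so the sketch runs offline."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        seq = self.emb(input_ids)  # (batch, seq_len, hidden)
        return seq, seq[:, 0]      # (sequence output, first-token "pooled" output)


model = RobertaClassifierSketch(DummyEncoder())
logits = model(torch.tensor([[0, 4, 5, 2], [0, 6, 2, 1]]))
print(logits.shape)  # one row of logits per review, one column per label
```

With Transformers installed, `RobertaModel.from_pretrained("roberta-base")` could be passed as `encoder` in place of the stand-in.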

Initialize the model:

roberta_model = CustomRobertatModel()

5. Train the Model

Initialize our Fastai learner:

learn = Learner(data, roberta_model, metrics=[accuracy])

Start training:

learn.model.roberta.train() # set roberta into train mode
learn.fit_one_cycle(1, max_lr=1e-5)

After only a single epoch and without unfreezing layers, we achieve an accuracy of 94% on the validation set.

.941900 accuracy in a single epoch of training

You can now also utilize other Fastai methods such as:

# find an appropriate lr
learn.lr_find()
learn.recorder.plot()
# unfreeze layers
learn.unfreeze()
# train using half precision
learn = learn.to_fp16()

6. Creating Predictions

Since Fastai’s get_preds function does not return predictions in the original dataset order, we can use the following method to restore it.

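from typing import Tuple

def get_preds_as_nparray(ds_type) -> Tuple[np.ndarray, np.ndarray]:
    """Return (model outputs, predicted labels) in the dataset's original order."""
    preds = learn.get_preds(ds_type)[0].detach().cpu().numpy()
    # Indices of the dataset items in the order the sampler yielded them
    sampler = [i for i in data.dl(ds_type).sampler]
    # argsort gives the permutation that restores the original order
    reverse_sampler = np.argsort(sampler)
    ordered_preds = preds[reverse_sampler, :]
    pred_values = np.argmax(ordered_preds, axis=1)
    return ordered_preds, pred_values

# For Valid
preds, pred_values = get_preds_as_nparray(DatasetType.Valid)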
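The reordering step works because the sampler lists, for each output position, the index of the dataset item that was placed there; sorting those indices gives the permutation that puts the predictions back in dataset order. A small worked example in plain Python, where `argsort` is a hypothetical helper that mirrors `np.argsort` and the values are made up:

```python
from typing import List


def argsort(values: List[int]) -> List[int]:
    """Indices that would sort `values` (pure-Python analogue of np.argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


# Made-up example: the sampler visited dataset items in this order ...
sampler = [2, 0, 3, 1]
# ... so preds[k] belongs to dataset item sampler[k].
preds = ["pred_for_2", "pred_for_0", "pred_for_3", "pred_for_1"]

reverse_sampler = argsort(sampler)
ordered_preds = [preds[i] for i in reverse_sampler]
print(reverse_sampler)
print(ordered_preds)  # predictions for items 0, 1, 2, 3 in that order
```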

Note: if we had a test set, we could have easily added a test set during step 3 earlier by initializing “data” like this:

data = RobertaTextList.from_df(df, ".", cols=feat_cols, processor=processor) \
    .split_by_rand_pct(seed=2019) \
    .label_from_df(cols=label_cols, label_cls=CategoryList) \
    .add_test(RobertaTextList.from_df(test_df, ".", cols=feat_cols, processor=processor)) \
    .databunch(bs=4, pad_first=False, pad_idx=0)

Hence, if we had a test set, we could derive preds via:

test_preds, test_pred_values = get_preds_as_nparray(DatasetType.Test)

Now you can train on almost any text-based dataset using RoBERTa with Fastai, combining two very powerful tools to produce effective results. You can access this tutorial’s Jupyter notebook along with the data on my GitHub page or Kaggle kernel. (If you have trouble viewing the notebook on GitHub, use this link.) If you are interested in seeing a similar implementation for SuperGLUE tasks, see my follow-up article, Using RoBERTa with Fastai for SuperGLUE Task CB.