Taking advantage of multiple GPUs to train larger models such as RoBERTa-Large on NLP datasets
This article is co-authored by Saichandra Pandraju.
This tutorial will help you implement Model Parallelism (splitting the model layers across multiple GPUs) to train larger models. We will be using the RoBERTa-Large model from Hugging Face Transformers. This approach achieves Model Parallelism with plain PyTorch, without using any wrappers such as PyTorch Lightning.
Lastly, we will be using the IMDB dataset of 50K Movie Reviews for fine-tuning our RoBERTa-Large model.
First, you will need a machine/VM with multiple GPUs. You can confirm that multiple GPUs are available using the nvidia-smi command.
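You can also run a quick check from Python; the device names printed will depend on your machine:

```python
import torch

# Confirm that PyTorch can see more than one GPU before attempting model parallelism
print(torch.cuda.is_available())     # True if CUDA is usable
print(torch.cuda.device_count())     # should print 2 (or more) on a multi-GPU machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```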
Note: Though we have 2 × 32 GB of GPU memory available, and we should be able to fine-tune the RoBERTa-Large model on a single 32 GB GPU with smaller batch sizes, we wanted to share this idea/method to help people with less GPU memory available.
The versions of the libraries used for this demonstration are:
The IMDB Dataset can be downloaded from Kaggle, and the notebook is available on my GitHub.
IMDB Dataset Task
For each movie review, the model has to predict a sentiment label. We evaluate the model's outputs using classification accuracy.
0 → Negative
1 → Positive
1. Loading the Data
The data is loaded into a dataframe using pandas. We load only the top 10k samples to keep the training time manageable.
We also map the values in the sentiment column from positive and negative to 1 and 0 respectively, store them in a new label column, and drop the original sentiment column.
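A minimal sketch of this step, assuming the CSV filename and column names of the Kaggle download (IMDB Dataset.csv with review and sentiment columns):

```python
import pandas as pd

# Load only the first 10k reviews to keep training time manageable
df = pd.read_csv("IMDB Dataset.csv").head(10000)

# Map the string sentiment to an integer label and drop the original column
df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})
df = df.drop(columns=["sentiment"])
print(df.head())
```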
2. Initializing RoBERTa Model from Transformers
As mentioned in Maximilien Roberti’s article, in transformers, each model architecture is associated with three main types of classes:
- A model class to load/store a particular pre-trained model.
- A tokenizer class to pre-process the data and make it compatible with a particular model.
- A configuration class to load/store the configuration of a particular model.
For example, if you want to use the RoBERTa architecture for text classification, you would use RobertaForSequenceClassification for the model class, RobertaTokenizer for the tokenizer class, and RobertaConfig for the configuration class.
Later, you will see that those classes share a common class method, from_pretrained(pretrained_model_name, ...). In our case, the parameter pretrained_model_name is a string with the shortcut name of the pre-trained model/tokenizer/configuration to load, e.g. roberta-large. We can find all the shortcut names in the transformers documentation.
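For our task, initializing these classes looks roughly like the following (num_labels=2 for binary sentiment classification is our assumption; in step 4 we will split a RobertaModel across the GPUs manually rather than using RobertaForSequenceClassification as-is):

```python
from transformers import RobertaConfig, RobertaTokenizer, RobertaForSequenceClassification

PRE_TRAINED_MODEL_NAME = "roberta-large"

# Configuration and tokenizer for roberta-large
config = RobertaConfig.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=2)
tokenizer = RobertaTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

# The model class for single-GPU fine-tuning; step 4 replaces this with a custom split model
model = RobertaForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, config=config)
```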
3. Creating Torch Dataset and DataLoader
Now, we create the building blocks for a PyTorch dataset.
The tokenizer does most of the heavy lifting for us. We also return the review text, so it’ll be easier to inspect the predictions from our model.
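Here is a sketch of such a dataset class; the IMDBDataset name and the max_len value are our choices and may differ from the notebook:

```python
import torch
from torch.utils.data import Dataset

class IMDBDataset(Dataset):
    def __init__(self, reviews, labels, tokenizer, max_len=256):
        self.reviews = reviews
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        # The tokenizer truncates/pads the review and builds the attention mask for us
        encoding = self.tokenizer.encode_plus(
            review,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "review_text": review,  # returned to make inspecting predictions easier later
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```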
Now, let’s split the data into train and validation sets:
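A sketch of the split (we assume a 70/30 ratio and an arbitrary random seed; the notebook's exact values may differ):

```python
from sklearn.model_selection import train_test_split

# 70/30 train/validation split, stratified on the label
df_train, df_val = train_test_split(df, test_size=0.3, random_state=42, stratify=df["label"])
print(df_train.shape, df_val.shape)
```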
We also need to create a couple of dataloaders.
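For example (the batch size here is an assumption; pick one that fits in your GPU memory):

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 8

train_dataset = IMDBDataset(df_train.review.to_numpy(), df_train.label.to_numpy(), tokenizer)
val_dataset = IMDBDataset(df_val.review.to_numpy(), df_val.label.to_numpy(), tokenizer)

train_data_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
val_data_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```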
4. Create the Multi GPU Classifier
In this step, we will define our model architecture. We create a custom model class since we’re interested in splitting the roberta-large layers across the 2 available GPUs.
The roberta-large model consists of 1 embedding layer, 24 encoder layers, and 1 classifier (RobertaClassificationHead) layer.
In this implementation, we split the layers of the roberta-large model as follows (a sketch of the split is shown after the list):
- Embedding Layer → cuda : 0
- Encoder Layers → cuda : 1
- Classifier Layer → cuda : 1
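Below is a sketch of what this split can look like in plain PyTorch. The SentimentClassifier name and the simple linear head standing in for RobertaClassificationHead are our simplifications; the notebook's exact implementation may differ:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class SentimentClassifier(nn.Module):
    """roberta-large split across two GPUs: embeddings on cuda:0, encoder and classifier on cuda:1."""

    def __init__(self, n_classes=2):
        super().__init__()
        roberta = RobertaModel.from_pretrained("roberta-large")
        self.embeddings = roberta.embeddings.to("cuda:0")  # Embedding layer -> cuda:0
        self.encoder = roberta.encoder.to("cuda:1")        # Encoder layers  -> cuda:1
        # A simple linear head stands in for RobertaClassificationHead here
        self.classifier = nn.Linear(roberta.config.hidden_size, n_classes).to("cuda:1")

    def forward(self, input_ids, attention_mask):
        # Token embeddings are computed on the first GPU...
        embedding_output = self.embeddings(input_ids=input_ids.to("cuda:0"))

        # ...then the activations are moved to the second GPU for the encoder.
        # The encoder expects an "extended" mask: 0 where tokens attend,
        # a large negative value where they are padding.
        mask = attention_mask.to("cuda:1")[:, None, None, :].float()
        extended_mask = (1.0 - mask) * -10000.0
        encoder_output = self.encoder(embedding_output.to("cuda:1"), attention_mask=extended_mask)

        # Classify from the representation of the first (<s>) token
        pooled = encoder_output[0][:, 0, :]
        return self.classifier(pooled)

model = SentimentClassifier(n_classes=2)
```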
Note: Since all 24 encoder layers are placed on GPU 1, it is likely that more of GPU 1’s resources will be utilized during training. We are working on a way to split the individual encoder layers so that usage is equalized across the GPUs.
Upon initializing our model, we should be able to see memory being used on both GPUs, instead of a single GPU, using nvidia-smi.
5. Training the Model
We’ll use the AdamW optimizer provided by Hugging Face. It corrects weight decay, so it’s similar to the original BERT paper. We’ll also use a linear learning-rate scheduler with no warmup steps, along with cross-entropy loss as the loss function.
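A sketch of that setup (the learning rate is an assumption; AdamW is imported from transformers here, as in older releases of the library, while newer releases favour torch.optim.AdamW):

```python
import torch.nn as nn
from transformers import AdamW, get_linear_schedule_with_warmup

EPOCHS = 2

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)

total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)

# The model outputs live on cuda:1, so the loss is computed there as well
loss_fn = nn.CrossEntropyLoss().to("cuda:1")
```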
Now, we create a helper function for training our model:
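A sketch of such a helper, assuming the model, dataloaders and loss function from the previous steps:

```python
import numpy as np
import torch

def train_epoch(model, data_loader, loss_fn, optimizer, scheduler):
    model.train()
    losses, correct = [], 0
    for batch in data_loader:
        # input_ids/attention_mask are moved to the right GPUs inside the model;
        # labels live with the model outputs on cuda:1
        labels = batch["label"].to("cuda:1")
        outputs = model(batch["input_ids"], batch["attention_mask"])

        loss = loss_fn(outputs, labels)
        preds = torch.argmax(outputs, dim=1)
        correct += torch.sum(preds == labels).item()
        losses.append(loss.item())

        loss.backward()
        # Gradient clipping keeps training stable
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    return correct / len(data_loader.dataset), np.mean(losses)
```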
and another helper function for evaluating our model:
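And a corresponding evaluation helper, under the same assumptions:

```python
def eval_model(model, data_loader, loss_fn):
    model.eval()
    losses, correct = [], 0
    with torch.no_grad():
        for batch in data_loader:
            labels = batch["label"].to("cuda:1")
            outputs = model(batch["input_ids"], batch["attention_mask"])
            loss = loss_fn(outputs, labels)
            preds = torch.argmax(outputs, dim=1)
            correct += torch.sum(preds == labels).item()
            losses.append(loss.item())
    return correct / len(data_loader.dataset), np.mean(losses)
```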
Using the training and evaluation helper functions, we can create our training loop. We’ll also store our training history:
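A sketch of the loop, with EPOCHS = 2 as in the run shown below:

```python
from collections import defaultdict

history = defaultdict(list)

for epoch in range(EPOCHS):
    print(f"Epoch {epoch + 1}/{EPOCHS}")

    train_acc, train_loss = train_epoch(model, train_data_loader, loss_fn, optimizer, scheduler)
    print(f"Train Loss: {train_loss} ; Train Accuracy: {train_acc}")

    val_acc, val_loss = eval_model(model, val_data_loader, loss_fn)
    print(f"Val Loss: {val_loss} ; Val Accuracy: {val_acc}")

    history["train_acc"].append(train_acc)
    history["train_loss"].append(train_loss)
    history["val_acc"].append(val_acc)
    history["val_loss"].append(val_loss)
```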
Epoch 1/2
Train Loss: 0.3271573152278683 ; Train Accuracy: 0.8830000000000001
Val Loss: 0.19180121847156745 ; Val Accuracy: 0.9593333333333333
Epoch 2/2
Train Loss: 0.14536982661844897 ; Train Accuracy: 0.9690000000000001
Val Loss: 0.2031460166494362 ; Val Accuracy: 0.9633333333333333
CPU times: user 21min 51s, sys: 8min 6s, total: 29min 57s
Wall time: 30min 8s
After only a single epoch, we achieve an accuracy of 95.93% on our validation set and get 96.33% after 2 epochs utilizing both our GPUs.
6. Visualizing Model Performance
We can also plot our training vs validation accuracy using the history variable.
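For example, with matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(history["train_acc"], label="train accuracy")
plt.plot(history["val_acc"], label="validation accuracy")
plt.title("Training history")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.ylim([0, 1])
plt.legend()
plt.show()
```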
We’ll define a helper function to get the predictions from our model:
This is similar to the evaluation function, except that we’re storing the text of the reviews and the predicted probabilities:
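A sketch of such a helper, reusing the review_text field returned by our dataset:

```python
import torch
import torch.nn.functional as F

def get_predictions(model, data_loader):
    model.eval()
    review_texts, predictions, prediction_probs, real_values = [], [], [], []
    with torch.no_grad():
        for batch in data_loader:
            outputs = model(batch["input_ids"], batch["attention_mask"])
            probs = F.softmax(outputs, dim=1)
            preds = torch.argmax(outputs, dim=1)

            review_texts.extend(batch["review_text"])
            predictions.extend(preds.cpu())
            prediction_probs.extend(probs.cpu())
            real_values.extend(batch["label"])
    return (
        review_texts,
        torch.stack(predictions),
        torch.stack(prediction_probs),
        torch.stack(real_values),
    )

y_review_texts, y_pred, y_pred_probs, y_test = get_predictions(model, val_data_loader)
```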
Let’s have a look at the classification report and the confusion matrix.
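For example, with scikit-learn and seaborn (using the y_test and y_pred tensors from the prediction helper above):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["negative", "positive"]
print(classification_report(y_test, y_pred, target_names=class_names))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=class_names, yticklabels=class_names)
plt.xlabel("Predicted sentiment")
plt.ylabel("True sentiment")
plt.show()
```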
The confusion matrix shows us that the model makes mistakes in classifying both positive and negative reviews roughly equally.
In this article, we primarily intend to explain splitting the roberta-large layers over multiple GPUs to implement Model Parallelism. A point to keep in mind is that, since both the encoder and classifier layers of the roberta model are moved to GPU 1, it is likely that more memory will be utilized on GPU 1 while training.
The transformers library is becoming an essential tool for NLP, and the pretrained models are getting bigger in size and better in performance.
We are also working on splitting the layers in a way that consumes roughly equal memory on both GPUs, but until then, this tutorial should be a good starting point for splitting the architecture as per your requirements.
Here’s the link to the entire notebook.
- Hugging Face, Transformers GitHub (Nov 2019), https://github.com/huggingface/transformers
- Sentiment Analysis with BERT and Transformers by Hugging Face using PyTorch and Python
- Maximilien Roberti, Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)