Model Parallelism using Transformers and PyTorch

Sakthi Ganesh
Jan 26 · 6 min read

Taking advantage of multiple GPUs to train larger models such as RoBERTa-Large on NLP datasets

This article is co-authored by Saichandra Pandraju.

This tutorial shows how to implement model parallelism (splitting a model's layers across multiple GPUs) so that larger models can be trained on multiple GPUs. We use the RoBERTa-Large model from Hugging Face Transformers. The approach relies on plain PyTorch only, without wrappers such as PyTorch Lightning.

We fine-tune our RoBERTa-Large model on the IMDB dataset of 50K movie reviews.


First, you will need a machine/VM with multiple GPUs. You can ensure multiple GPUs are available using the nvidia-smi command.
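The same check can be done from Python. This is a minimal sketch using PyTorch's own device query; the helper name `list_gpus` is ours and not part of the notebook.

```python
import torch


def list_gpus():
    """Return the names of all CUDA devices visible to PyTorch."""
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


gpus = list_gpus()
print(f"{len(gpus)} GPU(s) visible to PyTorch")
for index, name in enumerate(gpus):
    print(f"  cuda:{index} -> {name}")
```

Model parallelism as described here needs at least two entries in this list.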

Note: We have 2 × 32GB GPUs available, and RoBERTa-Large can be fine-tuned on a single 32GB GPU with a smaller batch size. We still wanted to share this method to help people who have less GPU memory available.

Then, you will need to install the transformers and pytorch libraries by following the official installation instructions for each.

The library versions used for this demonstration are transformers==4.2.2 and torch==1.7.0.

The IMDB dataset can be downloaded from Kaggle, and the notebook is available on my GitHub.

IMDB Dataset Task

For each text movie review, the model has to predict a label for the sentiment. We evaluate the outputs of the model on classification accuracy.

0 → Negative

1 → Positive

1. Loading the Data

The data is loaded into a dataframe using pandas. We load only the first 10k samples to keep training time manageable.

We also map the values in the sentiment column from positive and negative to 1 and 0 respectively, store them in a new column, and drop the original sentiment column.
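A minimal sketch of this step is below. The column names `review` and `sentiment`, the new column name `label`, and the file name in the usage note are assumptions about the Kaggle CSV and the notebook, so adjust them to your copy of the data.

```python
import pandas as pd


def prepare_reviews(df, n_samples=10_000):
    """Keep the first n_samples rows and turn the text sentiment into a 0/1 label."""
    subset = df.head(n_samples).copy()
    # "label" is a hypothetical column name: positive -> 1, negative -> 0
    subset["label"] = subset["sentiment"].map({"positive": 1, "negative": 0})
    return subset.drop(columns=["sentiment"])


# Tiny in-memory stand-in for the real CSV
demo = pd.DataFrame(
    {
        "review": ["Loved it", "Terrible plot", "A classic"],
        "sentiment": ["positive", "negative", "positive"],
    }
)
print(prepare_reviews(demo, n_samples=2))
```

With the real file, the call would look like `prepare_reviews(pd.read_csv("IMDB Dataset.csv"))`, assuming that is the name of the downloaded CSV.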

2. Initializing RoBERTa Model from Transformers

As mentioned in Maximilien Roberti’s article [3], each model architecture in transformers is associated with three main types of classes:

  • A model class to load/store a particular pre-trained model.
  • A tokenizer class to pre-process the data and make it compatible with a particular model.
  • A configuration class to load/store the configuration of a particular model.

For example, if you want to use the RoBERTa architecture for text classification, you would use RobertaForSequenceClassification for the model class, RobertaTokenizer for the tokenizer class and RobertaConfig for the configuration class.

These classes share a common class method, from_pretrained(pretrained_model_name, ...). In our case, the parameter pretrained_model_name is a string with the shortcut name of a pre-trained model/tokenizer/configuration to load, e.g. roberta-large. All shortcut names are listed in the transformers documentation.
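As a sketch of how the three classes fit together, the hypothetical helper below loads a configuration, tokenizer and model from one shortcut name. The helper name and the `num_labels=2` default are our assumptions. Calling it downloads the pretrained weights, so the import is kept inside the function to keep the definition itself cheap.

```python
def load_roberta_components(pretrained_model_name="roberta-large", num_labels=2):
    """Load config, tokenizer and classification model for one shortcut name."""
    # Imported here so that defining the helper does not pull in transformers
    from transformers import (
        RobertaConfig,
        RobertaForSequenceClassification,
        RobertaTokenizer,
    )

    # Keyword arguments passed to from_pretrained override config attributes
    config = RobertaConfig.from_pretrained(pretrained_model_name, num_labels=num_labels)
    tokenizer = RobertaTokenizer.from_pretrained(pretrained_model_name)
    model = RobertaForSequenceClassification.from_pretrained(
        pretrained_model_name, config=config
    )
    return config, tokenizer, model
```

Usage would be `config, tokenizer, model = load_roberta_components()`, which requires network access or a local cache on the first call.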

3. Creating Torch Dataset and DataLoader

Now, we create the building blocks to create a PyTorch dataset.

The tokenizer does most of the heavy lifting for us. We also return the review text with each sample, so it’ll be easier to inspect the predictions from our model.

Now, let’s split the data into train and val sets:
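A dataset along these lines could look like the sketch below. The class name, the dictionary keys and `max_len=256` are assumptions on our part. The tokenizer is expected to be a Hugging Face tokenizer, called with its standard `max_length`, `padding`, `truncation` and `return_tensors` arguments.

```python
import torch
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    """Wraps reviews and labels; tokenizes one review per __getitem__ call."""

    def __init__(self, reviews, targets, tokenizer, max_len=256):
        self.reviews = list(reviews)
        self.targets = list(targets)
        self.tokenizer = tokenizer
        self.max_len = max_len  # assumption: pick what fits your GPU memory

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        encoding = self.tokenizer(
            review,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "review_text": review,  # kept so predictions can be inspected later
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "targets": torch.tensor(self.targets[idx], dtype=torch.long),
        }
```

Each item is a dictionary, so the default DataLoader collation stacks the tensors and gathers the review texts into a list.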
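One way to do the split with pandas alone is sketched here. The validation fraction and the random seed are assumptions, not the values from the notebook.

```python
import pandas as pd


def split_train_val(df, val_frac=0.15, seed=42):
    """Shuffle the rows, then hold out the last val_frac of them for validation."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_val = round(len(shuffled) * val_frac)
    n_train = len(shuffled) - n_val
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


demo = pd.DataFrame({"review": [f"review {i}" for i in range(20)], "label": [0, 1] * 10})
df_train, df_val = split_train_val(demo, val_frac=0.25)
print(len(df_train), len(df_val))
```

Fixing the seed keeps the split reproducible between runs.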

We also need to create a couple of dataloaders.
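A small helper for this might look as follows. The function name and the batch size are assumptions; the dataset can be any map-style dataset that returns dictionaries of tensors.

```python
import torch
from torch.utils.data import DataLoader


def create_data_loader(dataset, batch_size=8, shuffle=False):
    """Batch a map-style dataset; shuffle only for the training split."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)


# Stand-in samples with the same keys a tokenized review dataset would return
samples = [
    {
        "review_text": text,
        "input_ids": torch.tensor([i, i + 1, 0, 0]),
        "attention_mask": torch.tensor([1, 1, 0, 0]),
        "targets": torch.tensor(i % 2),
    }
    for i, text in enumerate(["a", "b", "c"])
]
loader = create_data_loader(samples, batch_size=2)
first_batch = next(iter(loader))
print(first_batch["input_ids"].shape)
```

Note that the batches are created on the CPU. With the model split across GPUs, each tensor is moved to the right device inside the forward pass or the training step.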

4. Create the Multi GPU Classifier

In this step, we will define our model architecture. We create a custom method since we’re interested in splitting the roberta-large layers across the 2 available GPUs.

The roberta-large model consists of one embedding block, 24 encoder layers (indexed 0 to 23) and one classifier layer (RobertaClassificationHead).

In this implementation, we split the layers of the roberta-large model as follows:

  • Embedding Layer → cuda:0
  • Encoder Layers → cuda:1
  • Classifier Layer → cuda:1

Note: Since all 24 encoder layers sit on GPU 1, it is likely that more of GPU 1’s resources will be used during training. We are working on a way to split the individual encoder layers to balance usage across the GPUs.
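The placement above can be illustrated with a small stand-in model. This is a sketch of the technique rather than the notebook's classifier: it uses a tiny generic PyTorch Transformer instead of roberta-large, all sizes are made up, and it falls back to the CPU when fewer than two GPUs are present. In transformers, the matching parts of RobertaForSequenceClassification are the `roberta.embeddings`, `roberta.encoder` and `classifier` attributes.

```python
import torch
from torch import nn


def pick_devices():
    """Use cuda:0 and cuda:1 when available, otherwise run everything on the CPU."""
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    return torch.device("cpu"), torch.device("cpu")


class TwoDeviceClassifier(nn.Module):
    """Embedding on the first device, encoder and classifier on the second."""

    def __init__(self, vocab_size=1000, hidden=64, n_layers=2, n_classes=2):
        super().__init__()
        self.dev0, self.dev1 = pick_devices()
        self.embedding = nn.Embedding(vocab_size, hidden).to(self.dev0)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, dim_feedforward=128)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers).to(self.dev1)
        self.classifier = nn.Linear(hidden, n_classes).to(self.dev1)

    def forward(self, input_ids, attention_mask):
        # Inputs go to the device that holds the embedding weights
        x = self.embedding(input_ids.to(self.dev0))  # (batch, seq, hidden)
        # Activations are copied to the second device before the encoder runs
        x = x.to(self.dev1).transpose(0, 1)  # (seq, batch, hidden)
        padding_mask = attention_mask.to(self.dev1) == 0  # True marks padding
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.classifier(x[0])  # classify from the first token


model = TwoDeviceClassifier()
ids = torch.randint(0, 1000, (3, 8))
mask = torch.ones(3, 8, dtype=torch.long)
print(model(ids, mask).shape)
```

The key point is the explicit `.to(...)` calls: each submodule lives on one device, and the forward pass moves the activations across at the boundary. The logits end up on the second device, so the labels must be moved there before the loss is computed.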

After initializing the model, the nvidia-smi command should show memory in use on both GPUs instead of a single GPU.
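A rough equivalent from inside Python is sketched below; the helper name is ours. `torch.cuda.memory_allocated` counts only tensors held by the current process, so the numbers will be lower than what nvidia-smi reports.

```python
import torch


def gpu_memory_mib():
    """Map each visible GPU index to the MiB of tensors allocated by this process."""
    return {
        index: torch.cuda.memory_allocated(index) / 1024**2
        for index in range(torch.cuda.device_count())
    }


for index, used in gpu_memory_mib().items():
    print(f"cuda:{index}: {used:.1f} MiB allocated")
```

After the split model is created, both entries should be non-zero.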

5. Training the Model

We’ll use the AdamW optimizer provided by Hugging Face. It implements the weight decay fix (weight decay decoupled from the gradient update), which keeps the setup close to the original BERT paper. We’ll also use a linear scheduler with no warmup steps, with cross-entropy loss as the loss function.
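The sketch below builds the three objects with PyTorch only. Note the swap: it uses `torch.optim.AdamW` and a `LambdaLR` linear decay in place of the Hugging Face optimizer and scheduler named above, and the learning rate of 2e-5 is an assumption.

```python
import torch
from torch import nn


def build_training_objects(model, total_steps, lr=2e-5):
    """Return AdamW, a linear-decay scheduler without warmup, and the loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # The factor goes linearly from 1.0 at step 0 to 0.0 at total_steps
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: max(0.0, 1.0 - step / total_steps)
    )
    loss_fn = nn.CrossEntropyLoss()
    return optimizer, scheduler, loss_fn
```

Here `total_steps` would be the number of batches per epoch multiplied by the number of epochs. The loss module has no parameters, so it does not need to be placed on a device; only the logits and targets must be on the same one.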

Now, we create a helper function for training our model:
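A training helper could be sketched like this. It assumes the model takes `input_ids` and `attention_mask` and handles device placement itself, and that batches are dictionaries with a `targets` key; those names are our assumptions.

```python
import torch
from torch import nn


def train_epoch(model, data_loader, loss_fn, optimizer, scheduler=None):
    """Run one pass over the data; return (mean loss, accuracy)."""
    model.train()
    losses, correct, seen = [], 0, 0
    for batch in data_loader:
        logits = model(batch["input_ids"], batch["attention_mask"])
        # Labels must live on the device that holds the logits
        targets = batch["targets"].to(logits.device)
        loss = loss_fn(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()

        losses.append(loss.item())
        correct += (logits.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    return sum(losses) / len(losses), correct / seen
```

Autograd handles the cross-device backward pass on its own: gradients flow back from the classifier's GPU to the embedding's GPU without extra code.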

and another helper function for evaluating our model:
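The evaluation counterpart, under the same assumptions about batch keys and the model's forward signature, might look like this:

```python
import torch
from torch import nn


def eval_model(model, data_loader, loss_fn):
    """Evaluate without gradient tracking; return (mean loss, accuracy)."""
    model.eval()
    losses, correct, seen = [], 0, 0
    with torch.no_grad():
        for batch in data_loader:
            logits = model(batch["input_ids"], batch["attention_mask"])
            targets = batch["targets"].to(logits.device)
            losses.append(loss_fn(logits, targets).item())
            correct += (logits.argmax(dim=1) == targets).sum().item()
            seen += targets.size(0)
    return sum(losses) / len(losses), correct / seen
```

Switching to `model.eval()` disables dropout, and `torch.no_grad()` avoids storing activations, which lowers memory use on both GPUs.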

Using the training and evaluation helper functions, we can create our training loop. We’ll also store our training history:
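The loop itself can be sketched independently of the model. Here `train_step` and `eval_step` are hypothetical zero-argument callables that wrap the two helpers and return a `(loss, accuracy)` pair; the history keys are our naming.

```python
from collections import defaultdict


def run_training(epochs, train_step, eval_step):
    """Alternate training and evaluation, recording metrics per epoch."""
    history = defaultdict(list)
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        train_loss, train_acc = train_step()
        val_loss, val_acc = eval_step()
        print(f"Train Loss: {train_loss} ; Train Accuracy: {train_acc}")
        print(f"Val Loss: {val_loss} ; Val Accuracy: {val_acc}")
        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
    return history
```

Passing callables keeps the loop small: in practice each one would be a `lambda` that closes over the model, its data loader, the loss function and, for training, the optimizer and scheduler.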

Epoch 1/2 
Train Loss: 0.3271573152278683 ; Train Accuracy: 0.8830000000000001 Val Loss: 0.19180121847156745 ; Val Accuracy: 0.9593333333333333
Epoch 2/2
Train Loss: 0.14536982661844897 ; Train Accuracy: 0.9690000000000001 Val Loss: 0.2031460166494362 ; Val Accuracy: 0.9633333333333333
CPU times: user 21min 51s, sys: 8min 6s, total: 29min 57s Wall time: 30min 8s

After only a single epoch, we reach a validation accuracy of 95.93%, rising to 96.33% after 2 epochs, with the model split across both GPUs.

6. Visualizing Model Performance

We can also plot our training vs validation accuracy using the history variable.

Training vs Validation Accuracy

We’ll define a helper function to get the predictions from our model:

This is similar to the evaluation function, except that we’re storing the text of the reviews and the predicted probabilities:
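A sketch of such a helper follows. It assumes batches carry a `review_text` key alongside the tensors, as in our dataset sketch, which is an assumption about the notebook's naming.

```python
import torch
import torch.nn.functional as F


def get_predictions(model, data_loader):
    """Collect review texts, predicted classes, class probabilities and true labels."""
    model.eval()
    texts, preds, probs, targets = [], [], [], []
    with torch.no_grad():
        for batch in data_loader:
            logits = model(batch["input_ids"], batch["attention_mask"])
            batch_probs = F.softmax(logits, dim=1)
            texts.extend(batch["review_text"])
            # Results are moved to the CPU so they can be concatenated and analysed
            preds.append(batch_probs.argmax(dim=1).cpu())
            probs.append(batch_probs.cpu())
            targets.append(batch["targets"].cpu())
    return texts, torch.cat(preds), torch.cat(probs), torch.cat(targets)
```

The softmax turns the two logits into probabilities that sum to 1 for each review.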

Let’s have a look at the classification report and the confusion matrix.
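For reference, both can be computed by hand for the two-class case. This is a dependency-free sketch and not necessarily the library call used in the notebook; rows of the matrix are true labels and columns are predicted labels.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Count (true, predicted) pairs for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def precision_recall(matrix, label):
    """Precision and recall for one label, read off the confusion matrix."""
    true_positive = matrix[label][label]
    predicted = matrix[0][label] + matrix[1][label]  # column sum
    actual = matrix[label][0] + matrix[label][1]  # row sum
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


demo_matrix = confusion_matrix_2x2([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(demo_matrix)
print(precision_recall(demo_matrix, label=1))
```

The off-diagonal cells are the two kinds of mistakes: negative reviews predicted as positive, and positive reviews predicted as negative.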

Classification Report
Confusion Matrix

The confusion matrix shows that the model misclassifies positive and negative reviews at roughly the same rate.


In this article, our main goal was to explain how to split the roberta-large layers over multiple GPUs to implement model parallelism. A point to keep in mind is that, since both the encoder and classifier layers of the model are moved to GPU 1, it is likely that more memory will be used on GPU 1 during training.

The transformers library is becoming an essential tool for NLP, and pretrained models keep getting bigger and better.

We are also working on splitting the layers in a way that consumes almost equal memory on both GPUs. Until then, this tutorial should be a good starting point for splitting the architecture as required.

Here’s the link to the entire notebook.


[1] Hugging Face, Transformers GitHub (Nov 2019)

[2] Sentiment Analysis with BERT and Transformers by Hugging Face using PyTorch and Python

[3] Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)


ML Enthusiast looking to solve real world problems and make AI/ML more efficient and accessible
