Model Parallelism vs Data Parallelism for Unet Speedup

Alexander Statsenko
Deelvin Machine Learning
5 min read · Jun 29, 2021

Introduction

The ever-increasing volume of accumulated data that we witness these days has two major implications for data scientists. On the positive side, it means that we can train Machine Learning (ML) algorithms on more data and make their predictions better. The downside of a large amount of data, however, is longer training time and, consequently, fewer hypotheses tested, because each model takes too long to train.

Distributed ML is one of the solutions for speeding up neural network training. Its essence is the use of multiple GPUs for training, and there are two approaches within it: Data Parallelism and Model Parallelism.

In this article we describe an experiment that compares the speedup obtained with Data Parallelism and with Model Parallelism when training a Unet model. We first describe each of the Distributed ML approaches and then present the results of the comparative experiment.

Data parallelism

How Data Parallelism works (Source: Deep Learning on Supercomputers)

In Data Parallelism, the dataset is divided into N parts (where N is the number of GPUs; in the figure above N = 4). A copy of the model is then placed on each GPU and trained on the corresponding data chunk. After that, gradients are calculated for each copy of the model, all the copies exchange their gradients, and the gradients are averaged.

There are two common implementations of Data Parallelism: Parameter Server (PS) and Ring-allreduce. In the Parameter Server approach, worker agents compute gradients locally, and the Parameter Server is a dedicated node responsible for aggregating and synchronizing those gradients. The main disadvantage of this implementation is that the bandwidth available on the PS channel per agent decreases as more agents are added to training.

Ring-allreduce is designed to solve this problem: with this approach, the required channel bandwidth does not depend on the number of agents participating in the training.
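PyTorch's NCCL backend implements ring-allreduce under the hood, so in practice gradient averaging boils down to one all_reduce call per gradient tensor. The helper below is a minimal illustration (not code from the experiment repository); it assumes the process group has already been initialized:

```python
import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after loss.backward().

    Assumes torch.distributed is already initialized (e.g. with the
    NCCL backend, whose all_reduce uses a ring-allreduce algorithm).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across all workers, then take the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

You would call this after loss.backward() and before optimizer.step(); DistributedDataParallel, discussed next, performs an equivalent reduction for you automatically during the backward pass.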

Data parallelism implementation in PyTorch

The PyTorch Tutorial discusses two implementations: Data Parallel and Distributed Data Parallel. The difference between them is that the first is multithreaded while the second is multi-process. PyTorch recommends the second method, so we focus on it.

In general terms, to switch from single-GPU to DistributedDataParallel training, you need to:

  1. Initialize the process group
  2. Fix the random seed so that the model starts from identical weights on all devices
  3. Split the dataset into N equal parts using DistributedSampler
  4. Wrap the model in DistributedDataParallel
  5. Run the code using the torch.distributed.launch utility (or torch.distributed.run for PyTorch 1.9.0+)

These steps may sound daunting, but they all amount to adding just a few lines of code, as the sketch below illustrates. You can also verify this by comparing the single_gpu and data_parallel code in the repository with this experiment.
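To make the steps concrete, here is a minimal sketch of a DistributedDataParallel training script. The file name (train_ddp.py), the synthetic dataset, and the small convolutional model are illustrative placeholders, not the actual Unet code from the repository:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # 1. Initialize the process group; the launcher sets LOCAL_RANK,
    #    RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Fix the random seed so every process starts from identical weights.
    torch.manual_seed(0)

    # 3. Split the dataset between processes with DistributedSampler.
    #    A tiny synthetic dataset stands in for the real segmentation data.
    dataset = TensorDataset(torch.randn(64, 3, 64, 64),
                            torch.randint(0, 2, (64, 1, 64, 64)).float())
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    # 4. Wrap the model in DistributedDataParallel.
    #    A small conv net stands in for Unet with a resnet34 backbone.
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1)).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters())

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, masks in loader:
            images, masks = images.to(local_rank), masks.to(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()       # gradients are all-reduced here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The script is launched with, for example, python -m torch.distributed.launch --use_env --nproc_per_node=2 train_ddp.py (or python -m torch.distributed.run --nproc_per_node=2 train_ddp.py on PyTorch 1.9.0+), which is what makes the LOCAL_RANK environment variable available to the processes.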

Model parallelism

How Model Parallelism works (Source: Deep Learning on Supercomputers)

In Model Parallelism, one model is divided into N parts (where N equals the number of GPUs; in the figure above N = 4). Each part of the model is placed on a separate GPU, and a batch is then processed sequentially on GPU#0, GPU#1, …, GPU#N-1, which completes forward propagation. Backward propagation is done in the reverse order, starting with GPU#N-1 and ending with GPU#0.

The obvious advantage of this approach is that it lets us train a model that does not fit into the memory of a single GPU. However, there is also a drawback: in Model Parallelism, while computation on GPU#i is in progress, all the other GPUs are idle. This problem is mitigated by switching to asynchronous GPU operation (more about it in the next section).

Model parallelism implementation in PyTorch

In general terms, to switch from single-GPU to Model Parallelism training with asynchrony, you need to:

  1. Divide the model into N parts (where N is the number of GPUs)
  2. Change the forward method so that forward propagation runs asynchronously across the GPUs, minimizing their idle time (see the sketches below)
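As a minimal sketch of step 1 (with a toy two-block model standing in for Unet, and an arbitrary split point), placing the two halves of a model on different GPUs looks like this:

```python
import torch
import torch.nn as nn


class TwoGpuModel(nn.Module):
    """Toy stand-in for Unet: the first part lives on cuda:0, the second on cuda:1."""

    def __init__(self):
        super().__init__()
        # Part 1 of the model on GPU#0 (e.g. the encoder half of the network).
        self.part1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                   nn.ReLU()).to("cuda:0")
        # Part 2 of the model on GPU#1 (e.g. the decoder half).
        self.part2 = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1)).to("cuda:1")

    def forward(self, x):
        # Synchronous version: GPU#1 waits for GPU#0, so one GPU is always idle.
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))
```

For a real Unet the split is less clean, because skip connections between the encoder and decoder also have to be copied between devices.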

Dividing the model into N parts can be tricky because it is not always obvious which split will improve performance and which will degrade it. In the experiment described in this article, all possible splitting options were enumerated to find the fastest one.

To increase GPU utilization, processing in Model Parallelism is done asynchronously, and this requires the split_size parameter, which splits each batch into several micro-batches. Without it, the GPUs simply sit idle for a significant share of the time. For example, the PyTorch Tutorial shows that without asynchrony Model Parallelism ran about 7% slower than a single GPU, while with asynchrony it ran almost 50% faster.
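Following the pattern from the PyTorch Tutorial, a pipelined forward cuts each batch into micro-batches of size split_size so that GPU#0 can start on the next micro-batch while GPU#1 is still processing the previous one. The sketch below reuses the toy two-GPU model from above; it illustrates the technique rather than the exact splitting used in the experiment:

```python
import torch
import torch.nn as nn


class PipelinedTwoGpuModel(nn.Module):
    """Toy two-GPU model with a pipelined forward pass."""

    def __init__(self, split_size: int = 8):
        super().__init__()
        self.split_size = split_size
        self.part1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                                   nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1)).to("cuda:1")

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        # Run the first micro-batch on GPU#0 and ship the result to GPU#1.
        s_prev = self.part1(s_next.to("cuda:0")).to("cuda:1")
        outputs = []

        for s_next in splits:
            # GPU#1 processes the previous micro-batch...
            outputs.append(self.part2(s_prev))
            # ...while GPU#0 already starts on the next one
            # (CUDA kernel launches are asynchronous).
            s_prev = self.part1(s_next.to("cuda:0")).to("cuda:1")

        outputs.append(self.part2(s_prev))
        return torch.cat(outputs, dim=0)
```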

The split_size parameter has to be swept to find the value at which the model trains fastest. Given that there are usually several ways to split a model, you will have to select split_size for each splitting option separately. This makes Model Parallelism more complex to use than Data Parallelism.
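A simple way to choose split_size is to time an identical forward/backward step for several candidate values and keep the fastest one. The sketch below uses the toy pipelined model from above; the candidate values, batch size, and input shape are arbitrary:

```python
import time

import torch


def time_split_size(split_size, n_iters=20, batch_size=64):
    """Average forward+backward time for one candidate split_size."""
    model = PipelinedTwoGpuModel(split_size=split_size)
    criterion = torch.nn.BCEWithLogitsLoss()
    images = torch.randn(batch_size, 3, 64, 64)
    masks = torch.randint(0, 2, (batch_size, 1, 64, 64), device="cuda:1").float()

    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    start = time.time()
    for _ in range(n_iters):
        model.zero_grad(set_to_none=True)
        loss = criterion(model(images), masks)
        loss.backward()
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    return (time.time() - start) / n_iters


# Try a few candidate values and keep the fastest one.
timings = {s: time_split_size(s) for s in (2, 4, 8, 16, 32)}
best = min(timings, key=timings.get)
print(timings, "-> best split_size:", best)
```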

Experiment

We currently work with the Unet architecture a lot. For this reason, we decided to compare the two approaches to Distributed ML mentioned above using this architecture as an example.

We chose Unet with a resnet34 backbone, which solves a binary semantic segmentation problem, and trained it in three ways: on a single GPU, with Data Parallelism, and with Model Parallelism. We then compared the average time per training epoch for each approach. The results of the experiments are shown in Table 1 and Table 2, and the source code is available in the repository.

For the experiments, we had at our disposal a machine with 2 GPUs of the following configuration:

  • Two NVIDIA RTX 2070 SUPER GPUs (8 GB each)
  • Intel Core i9-9900K CPU (8 cores)
  • Ubuntu 18.04
  • Nvidia driver: 440.33.01
  • CUDA 10.2
  • CUDNN 7.6.5
  • PyTorch 1.8.1

Results

Table 1. Results of enumerating the split_size parameter for various partitions of the Unet model with resnet34 backbone (the best option is in bold)
Table 2. Speedup of the learning epoch for Model Parallelism and Data Parallelism

As the data in Table 2 demonstrate, Data Parallelism gives a greater speedup for Unet than Model Parallelism: they are faster than the single-GPU baseline by about 30% and 4%, respectively.

In addition, migrating the pipeline to Data Parallelism turned out to be much easier, since it required changing fewer than 10 lines of code, whereas for Model Parallelism we had to iterate over the options for splitting the model and then over the split_size parameter.

Nevertheless, if the task is to train a Unet too large to fit on a single GPU, Model Parallelism remains the only way to solve it (short of buying a GPU with more memory).

We hope the results of our experiment will be useful to you. Feel free to get in touch and ask questions.
