Multi-GPU Training in PyTorch with Code (Part 2): Data Parallel

Anthony Peng
Polo Club of Data Science | Georgia Tech
7 min read · Jul 7, 2023

In Part 1, we successfully trained a ResNet34 on CIFAR10 using a single GPU. In this article, we will explore how to launch the training on multiple GPUs using Data Parallel (DP).

Part 1. Single GPU Example — Training ResNet34 on CIFAR10

Part 2. Data Parallel (this article) — Training code & an issue with DP on GPUs without NVLink

Part 3. Distributed Data Parallel — Training code & analysis

Part 4. Torchrun — Fault tolerance

Before jumping into DP, I would like to describe my machine's setup. It has 8 GPUs, and the output below shows the topology between each pair of GPUs. Only GPUs 0&1, 2&3, 4&5, and 6&7 are connected by high-speed NVLinks; all other pairs have to communicate over PCIe. This information will be useful when we debug DP below.

$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV4 NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU1 NV4 X NODE NODE SYS SYS SYS SYS 0-31,64-95 0
GPU2 NODE NODE X NV4 SYS SYS SYS SYS 0-31,64-95 0
GPU3 NODE NODE NV4 X SYS SYS SYS SYS 0-31,64-95 0
GPU4 SYS SYS SYS SYS X NV4 NODE NODE 32-63,96-127 1
GPU5 SYS SYS SYS SYS NV4 X NODE NODE 32-63,96-127 1
GPU6 SYS SYS SYS SYS NODE NODE X NV4 32-63,96-127 1
GPU7 SYS SYS SYS SYS NODE NODE NV4 X 32-63,96-127 1
Legend:  X    = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
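
If you prefer to check connectivity from Python rather than nvidia-smi, here is a small sketch (not part of the training code) that queries peer-to-peer (P2P) access between every GPU pair with torch.cuda.can_device_access_peer. Note that it only reports whether P2P is possible, not whether the link is NVLink or PCIe; nvidia-smi topo -m is still needed for that.

import torch

# Sanity-check P2P access between all visible GPU pairs.
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    for j in range(num_gpus):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'enabled' if ok else 'disabled'}")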

Multi-GPU Data Parallel

  1. What is data parallel?
    The idea of data parallelism is to copy the same model weights onto multiple workers and assign each worker a fraction of the data, so that all fractions are processed at the same time.
  2. How is it implemented in PyTorch?
    DP splits the input across the specified devices by chunking in the batch dimension. In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backward pass, gradients from each replica are summed into the original module. For more info, please check the official API.
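
Conceptually, a single DP forward pass can be sketched with the lower-level primitives in torch.nn.parallel. The snippet below is a simplified illustration assuming at least two visible GPUs; the real DataParallel additionally handles kwargs, threading, and autograd bookkeeping.

import torch
from torch import nn

device_ids = [0, 1]
module = nn.Linear(10, 5).to(device_ids[0])
batch = torch.randn(8, 10, device=device_ids[0])

replicas = nn.parallel.replicate(module, device_ids)      # copy the model onto each GPU
chunks = nn.parallel.scatter(batch, device_ids)           # split the batch along dim 0: 4 + 4
replicas = replicas[: len(chunks)]                        # in case the batch is smaller than the GPU count
outputs = nn.parallel.parallel_apply(replicas, chunks)    # run each replica on its chunk in parallel
result = nn.parallel.gather(outputs, device_ids[0])       # concatenate the outputs on GPU 0
print(result.shape)  # torch.Size([8, 5])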

Wrapping your model in DP takes just one line of code.

model = torch.nn.parallel.DataParallel(model)
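
By default, DP uses all GPUs visible to the process. If you want to restrict it without touching CUDA_VISIBLE_DEVICES, DataParallel also accepts device_ids and output_device arguments. This is just a small sketch of the alternative; the trainer below does not use it.

# Optional: restrict DP to specific visible GPUs instead of using them all.
# output_device (default: device_ids[0]) is where the outputs are gathered.
model = torch.nn.parallel.DataParallel(model, device_ids=[0, 1], output_device=0)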

TrainerDP — wrapping our single-GPU trainer's model with DP. Note that we set gpu_id = “cuda” so that DP can use all visible GPUs. TrainerSingle is defined in Part 1 of this series.

from torch.nn.parallel import DataParallel


class TrainerDP(TrainerSingle):
    def __init__(
        self,
        model: nn.Module,
        trainloader: DataLoader,
        testloader: DataLoader,
    ):
        self.gpu_id = "cuda"  # generic device string so DP can manage all visible GPUs
        super().__init__(self.gpu_id, model, trainloader, testloader)

        self.model = DataParallel(self.model)  # wrap the single-GPU model in DP

    def _save_checkpoint(self, epoch: int):
        ckp = self.model.state_dict()
        model_path = self.const["trained_models"] / f"CIFAR10_dp_epoch{epoch}.pt"
        torch.save(ckp, model_path)
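
A small caveat about the checkpoint above: DataParallel stores the wrapped network under self.model.module, so self.model.state_dict() saves every key with a “module.” prefix. That is fine as long as you reload the checkpoint into a DP-wrapped model (as trainer.test does here), but if you ever want to load it into a plain single-GPU model, one option is to save the unwrapped weights instead, as in the sketch below.

def _save_checkpoint(self, epoch: int):
    # Save the unwrapped weights so the keys carry no "module." prefix
    # and the checkpoint loads directly into a non-DP model.
    ckp = self.model.module.state_dict()
    model_path = self.const["trained_models"] / f"CIFAR10_dp_epoch{epoch}.pt"
    torch.save(ckp, model_path)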

Main function. It’s almost identical to the single-GPU version; the only difference is that we construct a TrainerDP instead of a TrainerSingle.

def main_dp(final_model_path: str):
    const = prepare_const()
    train_dataset, test_dataset = cifar_dataset(const["data_root"])
    train_dataloader, test_dataloader = cifar_dataloader_single(
        train_dataset, test_dataset, const["batch_size"]
    )
    model = cifar_model()
    trainer = TrainerDP(
        model=model,
        trainloader=train_dataloader,
        testloader=test_dataloader,
    )
    trainer.train(const["total_epochs"])
    trainer.test(final_model_path)

Experiments

if __name__ == "__main__":
    final_model_path = Path("./trained_models/CIFAR10_dp_epoch14.pt")
    main_dp(final_model_path)

Output

$ CUDA_VISIBLE_DEVICES=6,7 python main.py 
Files already downloaded and verified
Files already downloaded and verified
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 0 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 2.0683 | Acc: 26.60%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 1 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 1.4562 | Acc: 46.30%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 2 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 1.2057 | Acc: 56.26%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 3 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 0.9887 | Acc: 64.74%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 4 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.8248 | Acc: 70.56%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 5 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.5498 | Acc: 80.81%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 6 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.4616 | Acc: 83.93%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 7 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.3924 | Acc: 86.34%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 8 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.3204 | Acc: 88.99%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 9 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.2484 | Acc: 91.59%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 10 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.1450 | Acc: 95.77%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 11 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.1216 | Acc: 96.52%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 12 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.1070 | Acc: 96.94%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 13 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0951 | Acc: 97.32%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 14 | Batchsize: 128 | Steps: 391 | LR: 0.0001 | Loss: 0.0856 | Acc: 97.63%
[GPUcuda] Test Acc: 74.7900%

As mentioned above, GPUs 6&7 are connected by NVLink. Let’s run the same code on a pair of GPUs that must communicate over PCIe, e.g., GPUs 5&6.

$ CUDA_VISIBLE_DEVICES=5,6 python main.py 
Files already downloaded and verified
Files already downloaded and verified
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 0 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: nan | Acc: 10.00%
------------------------------------------------------------------------------------------
[GPUcuda] Epoch 1 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: nan | Acc: 9.99%
...

As expected, execution is significantly slower since GPUs 5&6 communicate over PCIe. The full output is omitted because the loss becomes NaN in the first epoch, which is unexpected given that we are running exactly the same code. This issue has also been reported on the PyTorch forum.
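
If you want to check whether the GPU-to-GPU copies themselves are the problem rather than the training code, one quick sanity check (a sketch I am assuming here, not something from the original debugging session) is to push a tensor across the suspicious pair and compare it on the host; a corrupted transfer would show up as a mismatch.

import torch

# With CUDA_VISIBLE_DEVICES=5,6 the suspicious pair appears as cuda:0 and cuda:1.
src = torch.randn(1024, 1024, device="cuda:0")
dst = src.to("cuda:1")                    # device-to-device copy across the link
ok = torch.equal(src.cpu(), dst.cpu())    # compare both copies on the host
print("GPU-to-GPU copy intact:", ok)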

We verify this phenomenon with another GitHub repo for training CIFAR10 models, which also uses DP for multi-GPU training. Below are the results of running its code on GPUs with and without NVLink. Only part of the first epoch’s output is shown, which is sufficient to make the observation.

$ CUDA_VISIBLE_DEVICES=6,7 python main.py --lr 0.1 
==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
Epoch: 0
[>................................................................] Step: 6s679ms | Tot: 0ms | Loss: 2.34
[>................................................................] Step: 90ms | Tot: 90ms | Loss: 2.321
[>................................................................] Step: 43ms | Tot: 133ms | Loss: 2.349
[>................................................................] Step: 41ms | Tot: 175ms | Loss: 2.353
[>................................................................] Step: 43ms | Tot: 218ms | Loss: 2.364
[>................................................................] Step: 44ms | Tot: 263ms | Loss: 2.390
[>................................................................] Step: 40ms | Tot: 304ms | Loss: 2.423
[=>...............................................................] Step: 42ms | Tot: 346ms | Loss: 2.412
[=>...............................................................] Step: 44ms | Tot: 391ms | Loss: 2.399
[=>...............................................................] Step: 45ms | Tot: 436ms | Loss: 2.426
[=>...............................................................] Step: 44ms | Tot: 481ms | Loss: 2.410
...

$ CUDA_VISIBLE_DEVICES=5,6 python main.py --lr 0.1
==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
Epoch: 0
[>................................................................] Step: 7s169ms | Tot: 0ms | Loss: 2.33
[>................................................................] Step: 236ms | Tot: 236ms | Loss: 2.33
[>................................................................] Step: 203ms | Tot: 439ms | Loss: 11.9
[>................................................................] Step: 203ms | Tot: 643ms | Loss: 3640
[>................................................................] Step: 200ms | Tot: 843ms | Loss: nan
[>................................................................] Step: 203ms | Tot: 1s47ms | Loss: nan
[>................................................................] Step: 203ms | Tot: 1s251ms | Loss: nan
...

Clearly, the loss explodes all the way to NaN after switching to GPUs without NVLink. Since DP sums the gradients from multiple GPUs, we test whether reducing the lr helps the training converge.

$ CUDA_VISIBLE_DEVICES=5,6 python main.py --lr 0.05 
==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
Epoch: 0
[>................................................................] Step: 6s842ms | Tot: 0ms | Loss: 2.29
[>................................................................] Step: 305ms | Tot: 305ms | Loss: 2.30
[>................................................................] Step: 203ms | Tot: 509ms | Loss: 15.5
[>................................................................] Step: 202ms | Tot: 711ms | Loss: 6728
[>................................................................] Step: 203ms | Tot: 914ms | Loss: 3828
[>................................................................] Step: 203ms | Tot: 1s118ms | Loss: nan
[>................................................................] Step: 202ms | Tot: 1s320ms | Loss: nan
[=>...............................................................] Step: 203ms | Tot: 1s523ms | Loss: nan
...

$ CUDA_VISIBLE_DEVICES=5,6 python main.py --lr 0.001
==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
Epoch: 0
[>................................................................] Step: 6s790ms | Tot: 0ms | Loss: 2.335
[>................................................................] Step: 224ms | Tot: 224ms | Loss: 2.315
[>................................................................] Step: 201ms | Tot: 425ms | Loss: 4.280
[>................................................................] Step: 200ms | Tot: 625ms | Loss: 6.375
[>................................................................] Step: 200ms | Tot: 826ms | Loss: 7.297
[>................................................................] Step: 201ms | Tot: 1s27ms | Loss: 11.276
[>................................................................] Step: 200ms | Tot: 1s228ms | Loss: 13.932
[=>...............................................................] Step: 200ms | Tot: 1s429ms | Loss: 12.478
[=>...............................................................] Step: 203ms | Tot: 1s632ms | Loss: 11.350
[=>...............................................................] Step: 200ms | Tot: 1s833ms | Loss: 10.503
...

Since we use two GPUs, we first halve the lr. However, the training still diverges rapidly after a few steps. Only when we reduce the lr to 0.001, 1/100 of the original, does the training avoid NaN, and even then the loss climbs for several steps before gradually coming back down. At the time of writing, it is still unclear why this happens, and whether it is a hardware/software issue in how DP interacts with the interconnect (NVLink vs. PCIe). In the next article (Part 3), we will bypass the issue with DDP.
