Multi-GPU Training in PyTorch with Code (Part 1): Single GPU Example

Anthony Peng
Polo Club of Data Science | Georgia Tech
5 min read · Jul 7, 2023

This tutorial series covers how to launch deep learning training on multiple GPUs in PyTorch. We will discuss how to extend a single-GPU training example to multiple GPUs via Data Parallel (DP) and Distributed Data Parallel (DDP), compare their performance, analyze the details of the DDP distributed sampler, and ensure fault tolerance with torchrun. I spent a lot of time going over the official tutorials, the API manual, and other blogs to ensure the correctness of this tutorial. Feel free to comment below if anything is unclear.

Part 1. Single GPU Example (this article) — Training ResNet34 on CIFAR10

Part 2. Data Parallel — Training code & issue between DP and NVLink

Part 3. Distributed Data Parallel — Training code & Analysis

Part 4. Torchrun — Fault tolerance

In this article, we provide an example of training ResNet34 on CIFAR10 with a single GPU. If any of the code below is unfamiliar to you, please check the official PyTorch Basics tutorial.

Basics

Necessary packages & hyperparameters. All trained models are saved at “./trained_models”, and the CIFAR10 dataset is saved at “./data”. These hyperparameters will stay the same during multi-GPU training.

from pathlib import Path
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision.models import resnet34
from torchvision.transforms import transforms
from torchvision.datasets import CIFAR10
import torch.optim as optim
from torch import Tensor
from typing import Iterator, Tuple
import torchmetrics

def prepare_const() -> dict:
    """Data and model directory + Training hyperparameters"""
    data_root = Path("data")
    trained_models = Path("trained_models")

    if not data_root.exists():
        data_root.mkdir()

    if not trained_models.exists():
        trained_models.mkdir()

    const = dict(
        data_root=data_root,
        trained_models=trained_models,
        total_epochs=15,
        batch_size=128,
        lr=0.1,  # learning rate
        momentum=0.9,
        lr_step_size=5,
        save_every=3,
    )

    return const

ResNet34 model. Torchvision ResNets are designed for ImageNet, which has a much higher resolution than CIFAR10, so we shrink the kernel_size of the stem convolution from 7 to 3 and remove the max-pooling layer.

def cifar_model() -> nn.Module:
    model = resnet34(num_classes=10)
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False)
    model.maxpool = nn.Identity()
    return model
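
As a quick sanity check (a minimal sketch, not part of the original training script), we can confirm that the modified model still produces one logit per class for 32×32 CIFAR10 inputs, assuming the imports and cifar_model defined above:

# Hypothetical sanity check: forward a fake CIFAR10 batch through the modified model.
model = cifar_model()
dummy = torch.randn(4, 3, 32, 32)  # batch of 4 CIFAR10-sized images
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # expected: torch.Size([4, 10])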

CIFAR10 dataset.

def cifar_dataset(data_root: Path) -> Tuple[Dataset, Dataset]:
    transform = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize(
                mean=(0.49139968, 0.48215827, 0.44653124),
                std=(0.24703233, 0.24348505, 0.26158768),
            ),
        ]
    )

    trainset = CIFAR10(root=data_root, train=True, transform=transform, download=True)
    testset = CIFAR10(root=data_root, train=False, transform=transform, download=True)

    return trainset, testset
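
The mean and std passed to transforms.Normalize are the per-channel statistics of the CIFAR10 training set. If you want to reproduce them yourself, a rough sketch (not part of the original post, and it loads the whole training set into memory) looks like this:

# Hypothetical helper: recompute per-channel mean/std of the CIFAR10 training images.
raw_train = CIFAR10(root="data", train=True, transform=transforms.ToTensor(), download=True)
imgs = torch.stack([img for img, _ in raw_train])  # shape: [50000, 3, 32, 32]
print(imgs.mean(dim=(0, 2, 3)))  # ~ (0.4914, 0.4822, 0.4465)
print(imgs.std(dim=(0, 2, 3)))   # ~ (0.2470, 0.2435, 0.2616)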

Single GPU Training

Dataloader for a single GPU.

def cifar_dataloader_single(
    trainset: Dataset, testset: Dataset, bs: int
) -> Tuple[DataLoader, DataLoader]:
    trainloader = DataLoader(trainset, batch_size=bs, shuffle=True, num_workers=8)
    testloader = DataLoader(testset, batch_size=bs, shuffle=False, num_workers=8)

    return trainloader, testloader
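
With batch_size=128 and 50,000 training images, the train loader yields 391 batches per epoch (ceil(50000 / 128)), which matches the Steps column in the training log below. A quick check (a sketch, not from the original post, assuming the functions defined above):

# Hypothetical check of loader length and batch shapes.
const = prepare_const()
trainset, testset = cifar_dataset(const["data_root"])
trainloader, testloader = cifar_dataloader_single(trainset, testset, const["batch_size"])
print(len(trainloader))      # 391 == ceil(50000 / 128)
src, tgt = next(iter(trainloader))
print(src.shape, tgt.shape)  # torch.Size([128, 3, 32, 32]) torch.Size([128])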

Trainer class. We use torchmetrics to compute the classification accuracy since it supports distributed scenarios; we will verify its correctness in the DDP article. Note that torchmetrics.Accuracy keeps internal state tensors, so it has to be moved to the GPU as well. The code is pretty straightforward: _run_batch takes care of each batch and _run_epoch takes care of each epoch. The lr_scheduler decreases the learning rate (lr) by a factor of 10 every 5 epochs. Since _run_epoch calls lr_scheduler.step() right before printing the epoch summary, the logged LR is the rate that will be used in the next epoch.

class TrainerSingle:
    def __init__(
        self,
        gpu_id: int,
        model: nn.Module,
        trainloader: DataLoader,
        testloader: DataLoader,
    ):
        self.gpu_id = gpu_id

        self.const = prepare_const()
        self.model = model.to(self.gpu_id)
        self.trainloader = trainloader
        self.testloader = testloader
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.SGD(
            self.model.parameters(),
            lr=self.const["lr"],
            momentum=self.const["momentum"],
        )
        self.lr_scheduler = optim.lr_scheduler.StepLR(
            self.optimizer, self.const["lr_step_size"]
        )
        self.train_acc = torchmetrics.Accuracy(
            task="multiclass", num_classes=10, average="micro"
        ).to(self.gpu_id)

        self.valid_acc = torchmetrics.Accuracy(
            task="multiclass", num_classes=10, average="micro"
        ).to(self.gpu_id)

    def _run_batch(self, src: Tensor, tgt: Tensor) -> float:
        self.optimizer.zero_grad()

        out = self.model(src)
        loss = self.criterion(out, tgt)
        loss.backward()
        self.optimizer.step()

        self.train_acc.update(out, tgt)
        return loss.item()

    def _run_epoch(self, epoch: int):
        loss = 0.0
        for src, tgt in self.trainloader:
            src = src.to(self.gpu_id)
            tgt = tgt.to(self.gpu_id)
            loss_batch = self._run_batch(src, tgt)
            loss += loss_batch
        self.lr_scheduler.step()

        print(
            f"{'-' * 90}\n[GPU{self.gpu_id}] Epoch {epoch:2d} | Batchsize: {self.const['batch_size']} | Steps: {len(self.trainloader)} | LR: {self.optimizer.param_groups[0]['lr']:.4f} | Loss: {loss / len(self.trainloader):.4f} | Acc: {100 * self.train_acc.compute().item():.2f}%",
            flush=True,
        )

        self.train_acc.reset()

    def _save_checkpoint(self, epoch: int):
        ckp = self.model.state_dict()
        model_path = self.const["trained_models"] / f"CIFAR10_single_epoch{epoch}.pt"
        torch.save(ckp, model_path)

    def train(self, max_epochs: int):
        self.model.train()
        for epoch in range(max_epochs):
            self._run_epoch(epoch)
            if epoch % self.const["save_every"] == 0:
                self._save_checkpoint(epoch)
        # save last epoch
        self._save_checkpoint(max_epochs - 1)

    def test(self, final_model_path: str):
        self.model.load_state_dict(torch.load(final_model_path))
        self.model.eval()
        with torch.no_grad():
            for src, tgt in self.testloader:
                src = src.to(self.gpu_id)
                tgt = tgt.to(self.gpu_id)
                out = self.model(src)
                self.valid_acc.update(out, tgt)
        print(
            f"[GPU{self.gpu_id}] Test Acc: {100 * self.valid_acc.compute().item():.4f}%"
        )
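
If you have not used torchmetrics before, the key point is that Accuracy accumulates statistics across every update call until you call compute and reset, which is exactly how the trainer reports a per-epoch accuracy. A standalone sketch with CPU toy tensors (not from the original post):

# Hypothetical toy example of the update/compute/reset cycle used in TrainerSingle.
acc = torchmetrics.Accuracy(task="multiclass", num_classes=10, average="micro")
logits1, logits2 = torch.randn(8, 10), torch.randn(8, 10)
labels1, labels2 = torch.randint(0, 10, (8,)), torch.randint(0, 10, (8,))
acc.update(logits1, labels1)  # batch 1: internal counters updated, nothing returned
acc.update(logits2, labels2)  # batch 2: counters keep accumulating
print(acc.compute())          # accuracy over all 16 samples seen so far
acc.reset()                   # clear the counters before the next epoch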

Main function. Finally, we load all hyperparameters, build the dataset and dataloaders, and train the model. After training, we evaluate the model on the test set. Note that CIFAR10 has 50,000 training samples and 10,000 test samples.

def main_single(gpu_id: int, final_model_path: str):
    const = prepare_const()
    train_dataset, test_dataset = cifar_dataset(const["data_root"])
    train_dataloader, test_dataloader = cifar_dataloader_single(
        train_dataset, test_dataset, const["batch_size"]
    )
    model = cifar_model()
    trainer = TrainerSingle(
        gpu_id=gpu_id,
        model=model,
        trainloader=train_dataloader,
        testloader=test_dataloader,
    )
    trainer.train(const["total_epochs"])
    trainer.test(final_model_path)
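
One detail worth double-checking before looking at the training log: StepLR with step_size=5 and the default gamma=0.1 keeps the learning rate at 0.1 for epochs 0-4, 0.01 for epochs 5-9, and 0.001 for epochs 10-14. A minimal sketch, decoupled from the trainer (hypothetical, using a dummy parameter):

# Hypothetical demo of the schedule produced by StepLR(step_size=5) with the default gamma=0.1.
param = nn.Parameter(torch.zeros(1))
opt = optim.SGD([param], lr=0.1, momentum=0.9)
sched = optim.lr_scheduler.StepLR(opt, step_size=5)
for epoch in range(15):
    # ... one epoch of training would run here ...
    print(f"epoch {epoch:2d} trains with lr={opt.param_groups[0]['lr']:.4f}")
    sched.step()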

Experiments

It’s time to roll! The model is trained for 15 epochs, and the weights are saved every 3 epochs (plus once more at the final epoch).

if __name__ == "__main__":
    gpu_id = 0
    final_model_path = Path("./trained_models/CIFAR10_single_epoch14.pt")
    main_single(gpu_id, final_model_path)

Output

$ CUDA_VISIBLE_DEVICES=0 python main.py 
Files already downloaded and verified
Files already downloaded and verified
------------------------------------------------------------------------------------------
[GPU0] Epoch 0 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 2.0430 | Acc: 25.20%
------------------------------------------------------------------------------------------
[GPU0] Epoch 1 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 1.3259 | Acc: 51.54%
------------------------------------------------------------------------------------------
[GPU0] Epoch 2 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 1.0207 | Acc: 63.55%
------------------------------------------------------------------------------------------
[GPU0] Epoch 3 | Batchsize: 128 | Steps: 391 | LR: 0.1000 | Loss: 0.8059 | Acc: 71.22%
------------------------------------------------------------------------------------------
[GPU0] Epoch 4 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.6558 | Acc: 76.87%
------------------------------------------------------------------------------------------
[GPU0] Epoch 5 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.3658 | Acc: 87.46%
------------------------------------------------------------------------------------------
[GPU0] Epoch 6 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.2757 | Acc: 90.42%
------------------------------------------------------------------------------------------
[GPU0] Epoch 7 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.2090 | Acc: 92.95%
------------------------------------------------------------------------------------------
[GPU0] Epoch 8 | Batchsize: 128 | Steps: 391 | LR: 0.0100 | Loss: 0.1468 | Acc: 95.16%
------------------------------------------------------------------------------------------
[GPU0] Epoch 9 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0967 | Acc: 97.10%
------------------------------------------------------------------------------------------
[GPU0] Epoch 10 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0514 | Acc: 98.65%
------------------------------------------------------------------------------------------
[GPU0] Epoch 11 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0407 | Acc: 98.95%
------------------------------------------------------------------------------------------
[GPU0] Epoch 12 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0366 | Acc: 99.14%
------------------------------------------------------------------------------------------
[GPU0] Epoch 13 | Batchsize: 128 | Steps: 391 | LR: 0.0010 | Loss: 0.0339 | Acc: 99.21%
------------------------------------------------------------------------------------------
[GPU0] Epoch 14 | Batchsize: 128 | Steps: 391 | LR: 0.0001 | Loss: 0.0319 | Acc: 99.23%
[GPU0] Test Acc: 78.3000%

Cool! We have successfully trained a ResNet34 on CIFAR10 with a single GPU. In the next article, we will explore how to leverage multiple GPUs to accelerate our training.
