Training YOLOv5 on AWS with PyTorch and SageMaker Distributed Data Parallel Library

Site Cao
8 min read · May 6, 2022


  • A step-by-step tutorial to train the PyTorch YOLOv5 model on Amazon SageMaker using the SageMaker distributed data parallel library.

Object detection is a computer vision task where the goal is to identify and locate objects in pictures and videos. It has a wide range of applications today. For example, in autonomous driving, it can be used to allow the vehicle to recognize lanes, obstacles, traffic signs, and pedestrians to make critical decisions. The list of applications of object detection is long: medical imaging, surveillance systems, traffic monitoring, and so on.

State-of-the-art deep-neural-network-based techniques, such as YOLOv5 (You Only Look Once), provide significant improvements in both model accuracy and inference speed. However, the training process can be computationally expensive, consuming a significant amount of time and compute resources. The longer it takes to train a model, the longer each experimentation and development cycle becomes, and the higher the training cost. One way to train faster is to use a cluster of machines and GPUs to parallelize your training tasks, which greatly speeds up the development process and brings your project to fruition much faster.

Luckily, at AWS, we have the SageMaker machine learning platform that makes access to such resources convenient and pain-free. The EC2 p4d instances have 96 vCPUs, 8 A100 GPUs, and 400 Gbps of network bandwidth. What’s more, our team at AWS has developed the SageMaker Distributed Data Parallel (SMDDP) library. It exploits the characteristics of AWS infrastructure and network topology to deliver near-linear scaling efficiency when you train using multiple nodes in a cluster. This is possible through the library’s highly optimized inter-node communication mechanism. You can learn more about SMDDP in its publication, Herring: Rethinking the Parameter Server at Scale for the Cloud.

In this article, we are going to use the YOLOv5 model and show how easily you can adapt the source code to train on the Amazon SageMaker platform using our SMDDP PyTorch distributed backend. This tutorial also explores the scaling efficiency of distributed training on up to 8 machines and the impact of varying the number of data-loading workers and the batch size on training speed.

YOLOv5 Overview

YOLOv5 is a representative one-stage detector (as opposed to two-stage region-based detectors such as Mask R-CNN, which separate region proposals and classification into two steps). Since Joseph Redmon introduced the original YOLO in 2015 with an implementation in Darknet, the model has gone through a series of enhancements. YOLOv5 is the latest version, authored by Glenn Jocher and implemented in PyTorch. With a mainstream framework implementation, it is now more accessible than ever. For a deep dive into its architectural contributions, this article by Roboflow makes an excellent read.

Three important metrics for evaluating object detection models are mAP (mean average precision: is it accurate enough?), FPS (frames per second: does it run inference fast enough for real-time use cases?), and model size (is it portable enough for environments such as mobile devices?).

From the source repo, we can see how the different variants of the model compare and how to choose a variant based on the mAP/FPS tradeoff for your particular use case:

YOLOv5 models comparison. Image by Ultralytics via GitHub

Research has also shown that across different hardware, YOLOv5 has an edge over MobileNet SSD v2 in terms of both mAP and inference speed. Although it is less accurate than the YOLOv3 model, it has more than 20% higher FPS, which makes it more suitable for real-time use cases.

Comparison Between YOLOv5 and Other Models on Different Hardware. Image by Rakkshab Varadharajan Iyer et al. in Comparison of YOLOv3, YOLOv5s and MobileNet-SSD V2 for Real-Time Mask Detection

In terms of model size, we downloaded the pre-trained weights and found that YOLOv5s is the smallest at only 14 MB, whereas YOLOv5x is the largest at 166 MB. You can also quantize and prune the weights to further reduce the model size before deployment, which makes it a highly portable object detector.
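
As a rough illustration of the pruning idea, here is a minimal sketch using PyTorch's torch.nn.utils.prune on a hub-loaded checkpoint; the 30% amount is an arbitrary example, and the zeroed weights only shrink the saved file after compression or a sparse export.

import torch
from torch.nn.utils import prune

# Load a pre-trained YOLOv5s checkpoint via torch.hub (downloads weights on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Zero out 30% of the smallest-magnitude weights in every convolution,
# then make the pruning permanent by removing the reparametrization.
for module in model.modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

torch.save(model.state_dict(), "yolov5s_pruned.pt")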

YOLOv5 Models Weight Size. Image by author.

Adapting YOLOv5 to Use SMDDP for Distributed Model Training

With a model that already has a PyTorch DDP implementation, it is very easy to modify your scripts to use the SMDDP backend, following the Guide for PyTorch in the SageMaker Python SDK documentation. Below, we highlight the most important changes to make in a PyTorch training script. See here for the complete code.

Using SMDDP as a PyTorch Distributed Backend

Simply import our module and set the PyTorch distributed backend to SMDDP. Make sure to comment out the original line of code that initializes the process group with the NCCL backend.

import torch.distributed as dist
...
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend='smddp')
...
#dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")

Make Changes in the barrier() Call

Note that the device_ids argument in the torch.distributed.barrier() call is only supported by the NCCL backend; SMDDP does not support it. In DDP training the rank-to-GPU binding is one-to-one, so we do not need to specify device_ids. If your training script was set up for the NCCL backend, make sure you remove the device_ids arguments when adapting it to the SMDDP backend.

#dist.barrier(device_ids=[local_rank])
dist.barrier()
...
#dist.barrier(device_ids=[0])
dist.barrier()
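
Because each process drives exactly one GPU, the one-to-one binding is usually made explicit near the top of the training script by selecting the device from the local rank. Below is a minimal sketch, assuming the launcher exposes the LOCAL_RANK environment variable (as torchrun does and as the YOLOv5 script expects); it is not a verbatim excerpt from the repo.

import os
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")

# One process per GPU: pin this process to the GPU matching its local rank,
# so collectives such as dist.barrier() need no device_ids argument.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)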

Training YOLOv5 with SMDDP on SageMaker

We provide an example notebook that you can jump onto and start training the YOLOv5 model in SageMaker. You do not have to set up anything beside the step for uploading your dataset from Amazon S3 to Amazon FSx for Lustre. Note that the training script will generate cache files (*.cache) for the images that embed the absolute path to the dataset when downloaded to your local machine. Do not upload these cache files along with the original dataset to Amazon S3 because there will be a mismatch between the cached dataset path and the actual dataset path mounted onto the compute instances, which might cause problems.
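
For reference, launching such a job from the notebook boils down to a SageMaker PyTorch estimator with the data parallel distribution enabled and an FSx for Lustre input channel. The snippet below is a minimal sketch, not the notebook's exact code; the entry point, framework versions, file system ID, mount path, VPC settings, and hyperparameters are placeholders you would replace with your own.

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import FileSystemInput

role = sagemaker.get_execution_role()  # when running inside a SageMaker notebook

estimator = PyTorch(
    entry_point="train.py",            # YOLOv5 training script
    source_dir="yolov5",               # directory containing the adapted source code
    role=role,
    framework_version="1.10",          # example versions; use ones supported by SMDDP
    py_version="py38",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    subnets=["subnet-0123456789abcdef0"],          # must be able to reach the FSx file system
    security_group_ids=["sg-0123456789abcdef0"],
    hyperparameters={"data": "coco.yaml", "batch-size": 256, "epochs": 3},
)

# Mount the FSx for Lustre file system that holds the dataset.
train_fsx = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/fsxmount/coco",   # FSx mount name plus the path to the dataset
    file_system_access_mode="ro",
)

estimator.fit({"train": train_fsx})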

Due to legal restrictions, we are unable to use the COCO dataset in our notebook example and use the BCCD dataset instead. It is, however, trivial to substitute the data.yaml file and the data to reproduce the results in this blog post.

Achieving a Faster Training Throughput

Parameter Settings for Apples-to-Apples Comparisons

As with any experiment, before discussing the results we need to describe the environment in which we ran them. We keep the following parameters consistent throughout the experiments.

  • AWS EC2 Instance type: ml.p4d.24xlarge
  • Model: YOLOv5x
  • Dataset: COCO2017 (in the Jupyter notebook we provide, the default dataset is BCCD for legal reasons. But it is easy to use the coco.yaml file instead as the data parameter to reproduce the results below)
  • PyTorch DDP backend used for comparison: NCCL
  • Automatic Mixed Precision (AMP) enabled: it allows some operations to run in half precision instead of full precision, which can speed up computation significantly (see the sketch after this list)
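
For context, this is roughly what AMP looks like in a PyTorch training loop. It is a generic sketch using torch.cuda.amp with a toy model and random data, not the actual YOLOv5 training code.

import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Toy model and data purely to illustrate the AMP pattern.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss to avoid gradient underflow in half precision

for _ in range(10):
    inputs = torch.randn(32, 128, device="cuda")
    labels = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with autocast():                  # forward pass runs selected ops in half precision
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then takes the optimizer step
    scaler.update()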

We experiment with the following training parameters:

  • The number of nodes: 2, 4, and 8 ml.p4d.24xlarge instances
  • Data loading: num_workers and data source
  • The batch size per GPU

Metrics

We measure the total throughput (samples/second) and scaling efficiency to evaluate the training performance.
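
As a point of reference, scaling efficiency can be computed by comparing per-node throughput against a baseline cluster size. The helper below is a hypothetical illustration of that calculation with made-up numbers, not code or results from the experiments.

def scaling_efficiency(throughput, nodes, base_throughput, base_nodes=2):
    """Per-node throughput relative to the baseline cluster (1.0 = linear scaling)."""
    return (throughput / nodes) / (base_throughput / base_nodes)

# Hypothetical numbers: if 2 nodes reach 1000 samples/s and 8 nodes reach 3800 samples/s,
# the 8-node run retains 95% of the per-node throughput of the 2-node baseline.
print(scaling_efficiency(3800, 8, 1000, 2))  # 0.95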

Experiment 1: Change the Number of Nodes and Measure Scaling Efficiency

The following table shows the result of scaling to different numbers of GPU nodes using the SMDDP backend compared with the NCCL backend.

SMDDP vs. NCCL YOLOv5x Training Throughput Numbers from 2 to 8 Nodes. Image by author.
SMDDP vs. NCCL YOLOv5x Training Throughput Visualization from 2 to 8 Nodes. Image by author.

With SMDDP we achieve near-linear scaling, outperforming NCCL’s scaling efficiency.

SMDDP vs. NCCL YOLOv5x Training Scaling Efficiency from 2 to 8 Nodes. Image by author.

Experiment 2: Change the Number of Workers (num_workers) and Measure Data Loading Speed

With the SageMaker training platform, it is recommended to use FSx for Lustre as the data source. When setting up FSx, be sure to provision enough throughput so that it does not become a bottleneck during data loading (we ran into this problem in our experiments). A safe throughput limit for this model is 1440 MB/s. At 240 MB/s, FSx throughput was a bottleneck even with a per-GPU batch size of 16. When we doubled it to 480 MB/s, that was enough for a per-GPU batch size of 16 but not for 32. Since the per-GPU batch size can go as high as 48 (three times 16) for this model, 480 * 3 = 1440 MB/s is a safe value.

It is also very important to tune the num_workers value of the PyTorch DataLoader (see this bug we discovered and fixed in the source repo as an example). While there is no general formula for the optimal number, a good guideline is to experiment with values between 0 and (# CPUs / # GPUs) to find the best fit for your hardware and model. That is what we did for this model:

SMDDP vs. NCCL YOLOv5x Training Throughput Numbers w.r.t. num_workers. Image by author.
SMDDP vs. NCCL YOLOv5x Training Throughput Visualization w.r.t. num_workers. Image by author.

A value of 0 for num_workers means that data is loaded in the main process with no extra worker processes, which results in low throughput. From our experiments, once this number reaches 4 we do not gain anything by increasing it further, so we set num_workers = 4 for the other experiments.
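
To make the guideline concrete, here is a minimal sketch of how a per-GPU DataLoader might cap num_workers at the CPUs-per-GPU budget; the tiny random dataset and the batch size are placeholders, not the YOLOv5 data pipeline.

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny placeholder dataset standing in for the real image dataset.
dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.zeros(256, dtype=torch.long))

# Guideline: stay between 0 and (# CPUs / # GPUs) workers per GPU process.
cpus = os.cpu_count()                       # 96 on ml.p4d.24xlarge
gpus = max(torch.cuda.device_count(), 1)    # 8 on ml.p4d.24xlarge
num_workers = min(4, cpus // gpus)          # 4 worked best in our experiments

loader = DataLoader(
    dataset,
    batch_size=16,            # per-GPU batch size
    num_workers=num_workers,  # extra worker processes for data loading
    pin_memory=True,          # speeds up host-to-GPU transfers
    shuffle=True,
)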

Experiment 3: Change the Batch Size Per GPU and Measure Performance Gain

To reach the highest throughput possible, we want to maximize the per-GPU batch size.

With the TF32 data type, we get a CUDA out-of-memory error when we push this number to 32, which can be mitigated by taking advantage of AMP. Below we compare the throughput of NCCL vs. SMDDP using per-GPU batch sizes of 16, 32, and 48.

SMDDP vs. NCCL YOLOv5x Training Throughput Numbers w.r.t. batch size. Image by author.
SMDDP vs. NCCL YOLOv5x Training Throughput Visualization w.r.t. batch size. Image by author.

From our experiments, pushing this parameter to the GPU’s limit yields a significant gain.

Summary

YOLOv5 is an excellent single-stage detector that is definitely worth trying out for your object detection project. At AWS, we have the SageMaker ML platform and the SageMaker Distributed Data Parallel library to make the training process easier, faster, and cheaper.

From our experiments and observations, using the SageMaker platform and SMDDP for distributed training can greatly speed up development iterations, which translates to savings in time and money. With only a few lines of code changes, SMDDP delivers close-to-linear scaling efficiency and superior performance for models implemented with PyTorch DDP.

We have several other PyTorch and TensorFlow examples available so you can further explore SMDDP. We also encourage you to take what you have learned here and use SMDDP to accelerate the training of your own models. For any issues or feedback, you can raise an issue in the SMDDP-Examples GitHub repository.
