Speed up EfficientNet model training on Amazon SageMaker with PyTorch and SageMaker distributed data parallel library

A deep-dive tutorial on how the SageMaker distributed data parallel library (SMDDP) speeds up model training for the state-of-the-art EfficientNet model. By changing one line of code, we’ll switch from the default NCCL backend to the SMDDP backend and speed up training by 14%.

Daniel Gomez Antonio
5 min read · Mar 21, 2022

Introduction

It is well known that technology evolves at a fast pace, which keeps us, as users, continuously learning and adopting new tools. Deep learning is no exception. In fact, deep learning practice is not what it was a few years ago: we have moved from being limited to a single computer for training to scalable setups that use several machines, each with multiple GPUs. Taking advantage of this hardware requires coordinating many devices to run distributed parallel training so that training and experimentation finish faster. That, in turn, requires ML practitioners to adapt their training scripts, which involves software engineering work to deal with the complexity of both ML models and computing infrastructure. To lift this workload and complexity, distributed training frameworks provide tools that make our lives easier. In the end, that is the purpose of technology, right? To make our lives easier.

In this post, we will talk about the SageMaker distributed data parallel library (SMDDP) and how to easily update a PyTorch training script to perform distributed training more efficiently with a few changes in the code. Spoiler alert: you only need to tell torch.distributed to use SMDDP as the collective communication backend, which takes just a few lines of code.

Distributed data parallel training and Amazon SageMaker

Let’s get into some more details. PyTorch Distributed Data Parallel (DDP) training relies on the single-program multiple-data paradigm: the model is replicated on every process, and each process works on a different shard of the dataset. DDP synchronizes gradients across processes to keep every model replica in sync. PyTorch, one of the most popular frameworks, has native support for DDP. For more information, see the PyTorch Distributed Overview.
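To make the paradigm concrete, here is a minimal sketch of a typical PyTorch DDP setup. It is not taken from the EfficientNet script; MyModel and train_dataset are placeholders, and rank/world size are assumed to come from the launcher (for example, torchrun).

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# One process per GPU; the launcher sets the rank and world size
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
# Replicate the model on this process; DDP all-reduces gradients during backward
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])
# Each process reads a different shard of the dataset
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)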

Model training and integrating machine learning models into applications can be cumbersome, which is why we at AWS have developed Amazon SageMaker, a fully managed end-to-end machine learning (ML) platform. SageMaker provides tooling and manages infrastructure so that ML scientists and developers can focus solely on model development. You can browse the SageMaker Examples GitHub repository for insights into how SageMaker can simplify your machine learning pipeline. Using partitioning algorithms, SageMaker’s distributed training libraries automatically split large deep learning models and training datasets across AWS GPU instances in a fraction of the time it takes to do so manually. One of the benefits of the SageMaker distributed data parallel library (SMDDP) is its seamless integration with common frameworks such as PyTorch and TensorFlow, which means that making an existing DDP script work with SMDDP is relatively easy.

Let’s see how to use SMDDP in your PyTorch script. For convenience, we provide a step-by-step guide to train an EfficientNet model using the SageMaker distributed data parallel library. We recommend launching a SageMaker notebook instance to run the example notebook without having to do any setup; the notebook is already part of the pre-installed SageMaker examples.
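The notebook launches the training job through the SageMaker Python SDK. As a rough, hypothetical sketch of what that launch looks like with SMDDP enabled (the entry point, framework versions, and instance count below are illustrative, not the exact values from the notebook):

import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",               # placeholder for the EfficientNet training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",      # EFA-enabled instances with 8 A100 GPUs each
    instance_count=2,
    framework_version="1.10",
    py_version="py38",
    # Turn on the SageMaker distributed data parallel library for this job
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()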

The training script used here is part of the NVIDIA Deep Learning Examples and can be used for various image classification models; in this post we focus on EfficientNet. EfficientNet proposes a more structured way to scale up CNNs uniformly and delivers up to 10x better efficiency than similar models while surpassing state-of-the-art accuracy. We won’t go into the details of the model, but you can read more about EfficientNet in the original paper.

Modify the script to use SMDDP

In the snippet below, the script checks whether the user requested distributed training; if so, it sets the CUDA device and initializes the PyTorch distributed package. The backend parameter indicates the engine that runs under the hood to achieve distributed data parallelism. The original example uses NCCL, the NVIDIA Collective Communication Library, which provides several communication primitives optimized for high bandwidth and low latency.

if args.distributed:
    # Pin each process to a single GPU based on its local rank
    args.gpu = args.local_rank % torch.cuda.device_count()
    torch.cuda.set_device(args.gpu)
    # Initialize torch.distributed with NCCL as the communication backend
    dist.init_process_group(backend="nccl")

SageMaker’s distributed data parallel library uses several novel techniques to reduce training time, rethinking the parameter-server-based approach and leveraging recent developments in cloud networking technologies such as Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD).

To modify a PyTorch training script to use SMDDP, the first thing to do is to import the smdistributed.dataparallel.torch.torch_smddp module at the top of the script:

import smdistributed.dataparallel.torch.torch_smddp

Then, simply specify smddp as the backend parameter when initializing the distributed process:

dist.init_process_group(backend="smddp")

That’s it! The original script has been modified to use SageMaker DDP. As mentioned above, you can run the modified script in the notebook to train EfficientNet with SMDDP and see it in action.
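Putting both changes together, the initialization block from the earlier snippet now reads as follows (a sketch of the same block, with only the import and the backend name changed):

import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend with torch.distributed
import torch
import torch.distributed as dist

if args.distributed:
    args.gpu = args.local_rank % torch.cuda.device_count()
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend="smddp")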

Performance benchmarks with SMDDP

Now, let’s compare performance in terms of throughput to evaluate the scaling efficiency of SMDDP. The throughput metric is defined as the number of images processed per second; higher throughput means more data can be processed, which leads to faster training. The following tables show the results of training the EfficientNet-B0 and EfficientNet-B4 models on different numbers of distributed nodes (each with 8 GPUs) with NCCL and SMDDP. All training benchmarks used TensorFloat-32 (TF32) and AMP for precision. The benchmarking was executed on EC2 P4d instances, which contain NVIDIA A100 GPUs.

Throughput results after training EfficientNet-B4 with default DDP (NCCL) and SMDDP. Image by author.
Throughput results after training EfficientNet-B0 with default DDP (NCCL) and SMDDP. Image by author.
Performance gain comparison between default DDP and SMDDP when training EfficientNet. Image by author.
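For context, the throughput reported above is simply the number of images processed per unit of wall-clock time. A minimal sketch of how such a metric could be measured around a training loop follows; this is not the benchmarking code from the NVIDIA example, and train_step is a placeholder for the forward/backward/optimizer step.

import time

def measure_throughput(loader, train_step):
    start, images = time.time(), 0
    for batch, labels in loader:
        train_step(batch, labels)   # one optimization step on this batch
        images += batch.size(0)
    return images / (time.time() - start)   # images per second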

As the benchmark results show, up to 14% higher throughput can be achieved with SMDDP, which translates into correspondingly shorter training time and, hence, lower cost and faster experimentation. Also note that SMDDP’s scaling efficiency stands out at larger node counts: SMDDP achieves better throughput and communication on EFA-enabled instances, in part as a consequence of using Scalable Reliable Datagram. As a result, the performance gain of SMDDP over NCCL increases as you scale to larger clusters.

Conclusion

We have demonstrated the continuous efforts of distributed training framework teams to make training faster and easier to adopt, and how libraries like SageMaker’s distributed data parallel library contribute to making training even faster.

We have several other PyTorch and TensorFlow examples available for you to further play around with SMDDP. We also encourage you to take what you have learned here and use SMDDP to accelerate the training of your own models. To reach out to us regarding any issues or feedback, you may raise an issue in the SMDDP-Examples GitHub repository.
