Accelerating PyTorch DDP by 10X With PowerSGD

Published in PyTorch · Nov 4, 2021

Authors: Yi Wang (Facebook AI), Alex Iankoulski (Amazon AWS), Pritam Damania (Facebook AI), Sundar Ranganathan (Amazon AWS)

As PyTorch DDP has been widely adopted for fully synchronous data-parallel distributed training, we often see that the scalability of multi-node training can be limited by the communication costs of exchanging gradients between nodes, despite the fact that DDP can overlap computation with communication to some extent under the hood. The communication bottleneck can become more severe in the absence of technologies like AWS Elastic Fabric Adapter (EFA), InfiniBand, or RDMA.

To improve the communication efficiency of distributed training, one popular approach is gradient compression, which can substantially reduce the data transfer cost while preserving satisfactory accuracy. Among the many compression approaches (e.g., quantization, sketching, sparsification), PyTorch supports one state-of-the-art algorithm, PowerSGD, as a DDP communication hook. DDP communication hooks were released as a stable feature in PyTorch 1.10 and work with multiple communication backends, including NCCL, Gloo, and MPI.

We demonstrate that PowerSGD can accelerate distributed training in an NLP use case by 10X+ on AWS without any loss in accuracy.

PowerSGD in PyTorch

What is PowerSGD?

The PowerSGD gradient compression algorithm was published at NeurIPS 2019 and NeurIPS 2020 by Thijs Vogels et al. PowerSGD views each non-vector gradient tensor as a matrix and compresses it based on power iteration, which factorizes a tensor of shape MxN into two low-rank tensors of shape MxR and RxN, respectively (R is a configurable parameter). In contrast to singular value decomposition (SVD), the power compressor is much more computationally lightweight and improves the matrix approximation progressively over the course of training. Quoting the PowerSGD paper:

“With gradients shaped as in POWERSGD, computing the SVD of a stochastic gradient takes 673ms, the equivalent of computing 6 mini-batch gradients. In contrast, one full step of rank-2 POWERSGD, including communication between 16 workers, takes only 105ms.”
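To make the factorization concrete, below is a minimal, illustrative sketch of a single rank-R power-iteration step in plain PyTorch. This is not the code inside the DDP hook (which additionally applies error feedback, warm start, and allreduce of the factors); the function and variable names here are our own.

```python
import torch

def power_iteration_step(grad_matrix, q):
    """One rank-R power-iteration step for a gradient viewed as an M x N matrix.

    grad_matrix: the gradient tensor reshaped to M x N.
    q:           the N x R low-rank factor carried over from the previous step (warm start).
    Returns the factors p (M x R) and q (N x R); the decompressed approximation is p @ q.t().
    """
    p = grad_matrix @ q        # M x R; in DDP this tensor would be all-reduced across workers
    p, _ = torch.linalg.qr(p)  # orthogonalize p so the approximation improves over iterations
    q = grad_matrix.t() @ p    # N x R; in DDP this tensor would also be all-reduced
    return p, q

# Example: approximate a 1024 x 512 gradient with rank-1 factors
# (R = 1, analogous to matrix_approximation_rank=1).
grad = torch.randn(1024, 512)
q = torch.randn(512, 1)
p, q = power_iteration_step(grad, q)
approx = p @ q.t()             # low-rank approximation of grad
```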

Why PowerSGD?

PowerSGD has a few nice properties: 1) its compressor is linear, so it can leverage bandwidth-optimal ring-based allreduce; and 2) it is natively supported by PyTorch’s communication backends (i.e., there is no need for any custom operator).

DeepSpeed provides techniques like 1-bit Adam and 1-bit LAMB to reduce overall communication volume. However, these approaches are tightly coupled with the optimizer algorithm and only work for the targeted optimizers. PowerSGD, on the other hand, is a general gradient compression algorithm that can be applied irrespective of the optimizer. In addition, PowerSGD can achieve a higher compression ratio than 1-bit compression.

Ease of Use

PyTorch provides customizable DDP Communication Hooks that allow users to completely override how gradients are communicated and aggregated in DDP. These hooks can be used to implement async SGD algorithms as well as gradient compression techniques such as FP16 compression and PowerSGD. To apply PowerSGD in DDP, once the DDP model is created, the user only needs to register the PowerSGD communication hook as shown below. The details can be found in the PyTorch docs.
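Following the PyTorch docs, registration looks roughly like the sketch below, where ddp_model stands for a model already wrapped in DistributedDataParallel and the default process group has been initialized:

```python
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

# Create the hook state; matrix_approximation_rank and start_powerSGD_iter are
# the two main knobs discussed in the next section.
state = powerSGD.PowerSGDState(
    process_group=None,          # None means the default process group
    matrix_approximation_rank=1,
    start_powerSGD_iter=1_000,
)
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```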

Algorithm Hyperparameters

The PowerSGD communication hook implementation has two major hyperparameters, matrix_approximation_rank and start_powerSGD_iter, which can be used to balance the trade-off between training speed and accuracy.

  • matrix_approximation_rank (i.e., the configurable knob “R” in the matrix factorization above) controls the size of the compressed low-rank tensors, which determines the compression rate: the lower the rank, the stronger the compression, but the more likely it is that accuracy will suffer. To tune this hyperparameter, we suggest starting from 1 and increasing by factors of 2 (like an exponential grid search: 1, 2, 4, …) until a satisfactory accuracy is reached.
  • start_powerSGD_iter defers PowerSGD compression until step start_powerSGD_iter; vanilla allreduce (or a more conservative compression) runs before that point. Deferring PowerSGD compression via this hyperparameter can be critical for achieving satisfactory accuracy in complex industrial use cases, mainly because the early training phase is often very sensitive to inaccurate gradients, and aggressive gradient compression at that stage can have an irrecoverable impact on accuracy. To tune start_powerSGD_iter, we suggest starting with 10% of the total training steps and increasing it until a satisfactory accuracy is reached. If there is a warm-up stage in the training, start_powerSGD_iter typically should be no less than the number of warm-up steps.

You can refer to our detailed docs for guidance on how to tune these hyperparameters; an illustrative tuning sketch follows below.
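As a concrete sketch of that recipe, the loop below runs an exponential grid search over matrix_approximation_rank while starting compression at 10% of the total training steps. build_ddp_model, train_and_evaluate, and target_accuracy are hypothetical placeholders for your own training setup.

```python
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

total_steps = 100_000
start_iter = total_steps // 10                # begin compression after ~10% of training
for rank in (1, 2, 4, 8):                     # exponential grid: 1, 2, 4, ...
    ddp_model = build_ddp_model()             # placeholder: a freshly DDP-wrapped model
    state = powerSGD.PowerSGDState(
        process_group=None,
        matrix_approximation_rank=rank,
        start_powerSGD_iter=start_iter,
    )
    ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
    accuracy = train_and_evaluate(ddp_model)  # placeholder: your training/validation loop
    if accuracy >= target_accuracy:           # keep the lowest rank that is accurate enough
        break
```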

PowerSGD Variants

  • Batched PowerSGD: In contrast to the original PowerSGD implementation, which compresses gradients layer by layer, PyTorch provides a variant that compresses a single flattened tensor batching all the gradients. This variant is faster, but can result in an accuracy loss that is unacceptable for some applications, unless the matrix approximation rank is 1.
  • FP16/BF16 + PowerSGD/Batched PowerSGD: If the input gradients are in torch.float32, an orthogonal form of compression is to cast them to torch.float16 or torch.bfloat16. Such casting is supported by the compression wrappers fp16_compress_wrapper and bf16_compress_wrapper. PyTorch therefore also supports a useful variant that combines a PowerSGD or batched PowerSGD hook with one of these wrappers, as sketched after this list. Note that bf16_compress_wrapper is only supported after PyTorch 1.10 and requires NCCL version > 2.9.6.
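For instance, assuming the same ddp_model as before, combining the layer-wise PowerSGD hook with the FP16 wrapper can be done roughly as follows; swap in bf16_compress_wrapper or batched_powerSGD_hook for the other combinations:

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
import torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook as powerSGD

state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)
# Gradients are cast to FP16 before compression and back to FP32 afterwards.
ddp_model.register_comm_hook(state, default.fp16_compress_wrapper(powerSGD.powerSGD_hook))
```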

Demonstration

Now we demonstrate that PowerSGD can accelerate distributed training in an NLP use case by 10X+ on AWS.

Experimental Setup

We evaluate the compression efficiency of PowerSGD on a RoBERTa model from a fork of the fairseq repository (script). The model has a total of 1 billion parameters and is trained on a French WikiText dataset containing 25M examples (download script). The batch size per worker is 8. The algorithm hyperparameters matrix_approximation_rank and start_powerSGD_iter are set to 1 and 2, respectively.

The experiments are conducted on a standard EKS cluster (Kubernetes version 1.20) on AWS (us-east-1). All workloads run on a node group of p4d.24xlarge instances co-located in a single availability zone. The number of GPUs is varied from 32 to 128, where each machine has 8 NVIDIA A100 GPUs with 40 GB memory per GPU, connected through a 100 Gbit/s Ethernet network. Each instance also features 96 vCPUs, 1152 GB RAM, and a 300 GB EBS gp3 volume. Data is stored on a shared FSx volume with 240 MB/s throughput. In order to test the efficiency of PowerSGD independently, EFA was not enabled.

We evaluated both the average training time per step and the compression rate of five gradient compression schemes supported by DDP communication hooks, and compared them against the baseline (no compression) performance. These compression schemes, listed in ascending order of compression rate, are:

  1. FP16 (fp16_compress_hook)
  2. PowerSGD (powerSGD_hook)
  3. FP16 + PowerSGD (fp16_compress_wrapper + powerSGD_hook)
  4. Batched PowerSGD (batched_powerSGD_hook)
  5. FP16 + Batched PowerSGD (fp16_compress_wrapper + batched_powerSGD_hook)

Experimental Results

The graph below shows the average training times per step across different numbers of GPUs and compression schemes. The baseline training time is ~4.8 seconds per step, and simple FP16 compression results in a speedup of 1.4X to 2.1X. In comparison, the different PowerSGD variants can achieve a training time per step as low as 0.27, 0.35, and 0.51 seconds, resulting in speedups of up to 17.8X, 13.9X, and 9.6X on 32, 64, and 128 GPUs, respectively. With such gradient compression (matrix_approximation_rank=1), we also find that after one epoch the perplexity reaches the same level as the baseline.

We also compared the compression rates of these schemes. While simple FP16 compression yields a 2X compression rate, PowerSGD can achieve a compression rate between ~1K and ~4K.

When Shall I Consider PowerSGD?

Bottlenecked by AllReduce Communication

Since PowerSGD pays an extra computation cost for compression, it is designed for scenarios where the major performance bottleneck is allreduce communication. This often implies that: 1) the model has a large number of parameters in its dense layers, 2) the communication is not largely overlapped with computation, and 3) the interconnect is not very fast. Usually, 1) can be inferred from the model architecture, 2) can be observed from the GPU traces collected by the PyTorch profiler, and 3) applies when the network is not equipped with advanced technologies like AWS EFA or InfiniBand.

In practice, the training performance of transformers can often be bottlenecked by allreduce communication.
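One way to check condition 2) above is to capture a short trace with the PyTorch profiler and see how much GPU time NCCL allreduce kernels take relative to compute kernels. A minimal sketch, where run_one_training_step is a placeholder for your forward/backward/optimizer step:

```python
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):              # profile a few representative training steps
        run_one_training_step()      # placeholder for your training step

# If NCCL allreduce entries dominate the CUDA time, allreduce communication is
# likely the bottleneck and PowerSGD is worth trying.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```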

Feasibility of Matrix Factorization

As mentioned above, the power compressor relies on matrix factorization. However, there is no guarantee that gradient tensors can always be well approximated by low-rank tensors. Although it is very difficult to know a priori whether this premise holds for a given model, the ease of use of the PowerSGD communication hook makes empirical validation much easier. Note that sometimes applying bf16_compress_wrapper may lead to better accuracy, as it can mitigate potential floating-point underflow during compression.

Availability of Extra Memory

PowerSGD typically requires extra memory of the same size as the model’s gradients to improve accuracy. Therefore, this algorithm may not work for use cases with tight memory constraints.

Acknowledgements

We would like to thank PyTorch teammates Rohan Varma, Yanli Zhao, Shen Li, Hongyi Jia, Mingzhe Li, Omkar Salpekar, Luca Wehrstedt, Serhat Yilmaz, Pavel Belevich, Can Balioglu, Howard Huang, Wanchao Liang, Alexander Golynski, Lei Tian, Guoqiang Jerry Chen, Boris Valkov, Min Ni, and PowerSGD paper author Thijs Vogels for the code reviews and feedback on DDP communication hooks; Sinan Nasir for prototyping an initial version of the DDP communication hook; Amy Yang, Jiecao Yu, Peter Tang, and Jongsoo Park for the insightful discussions on gradient compression; Myle Ott for providing the benchmarking instructions on fairseq; and George Guanheng Zhang for the help with early exploration on another repository.

We would also like to thank our AWS collaborators Fabio Nonato DePaula, Maxime Hughes, Pierre-Yves Aquilanti, Ahn Tran, Arun Subramaniyan, and Karthik Raman for their help with setting up and conducting the PowerSGD experiments on AWS infrastructure.
