Will it scale to N GPUs?
Multi-node training & cluster network bandwidth
Why scale to multiple machines?
Using multiple GPUs to train neural networks has become quite common, and all major deep learning frameworks (PyTorch Distributed, Horovod, DeepSpeed, etc.) provide optimized multi-GPU and multi-machine training. Multi-machine distributed training plays an important role in speeding up model development.
- Faster training times: Single-node multi-GPU training can be slow, especially for large datasets or complex models. Distributed training across multiple nodes allows further parallelization of computation, leading to faster convergence and shorter training times.
- Larger models: With multi-machine distributed training, you can train larger models. For example, LLaMA-7B took about 180k A100 GPU hours. That's roughly equivalent to 30 machines, each with 8 A100 GPUs, running for a month.
- Optimal utilization of cluster resources: Multi-node distributed training also helps minimize fragmentation in a training cluster. For example, scheduling 4 GPUs on each of 2 nodes (8 GPUs total) is easier than finding a single node with all 8 GPUs free.
In practice though, setting up efficient multi-machine training is hard. Scaling efficiency refers to how training throughput increases as you add GPUs. One of the key bottlenecks is the inter-node bandwidth available in the cluster: with low bandwidth, multi-node training can end up slower than a single node.
What happens during distributed training?
Step 0: Data is fetched from the store to all nodes participating in distributed training.
Step 1: During the forward pass, each copy of the model runs on the batch of data it receives.
Step 2: A backward pass then computes the gradients, but the gradients are NOT yet used to update the weights.
Step 3: An all-reduce operation runs across all processes (gradients are averaged and broadcast back to every process).
Step 4: The final all-reduced gradients are used to update each model replica.
By allowing each GPU to train on different batches of data, and all-reducing the gradients, you’re effectively training on a larger batch and therefore speeding up training.
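As a concrete (if simplified) illustration of these steps, here is a minimal sketch of one synchronous data-parallel training step written directly against torch.distributed. It assumes the process group is already initialized (e.g. via torchrun) and that each process owns one GPU; in practice PyTorch DDP buckets and overlaps these all-reduces with the backward pass instead of looping over parameters afterwards.

```python
import torch
import torch.distributed as dist

def train_step(model, batch, optimizer, loss_fn):
    inputs, targets = batch                 # step 0/1: this rank's shard of the global batch
    loss = loss_fn(model(inputs), targets)  # step 1: forward pass on the local replica
    loss.backward()                         # step 2: local gradients, weights untouched
    world_size = dist.get_world_size()
    for param in model.parameters():        # step 3: all-reduce = sum across ranks, then average
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()                        # step 4: identical update on every replica
    optimizer.zero_grad()
    return loss.item()
```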
- If storage -> compute node bandwidth is low, step 0 becomes a bottleneck, since data has to be pushed to the compute nodes.
- If compute node -> compute node bandwidth is low, the all-reduce in step 3 becomes a bottleneck.
- If the all-reduce algorithm's latency increases linearly with the number of nodes, it becomes a bottleneck beyond a point.
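If you suspect the node-to-node link is the limiter, a quick sanity check is to time a large all-reduce directly (nccl-tests is the standard tool; the torch.distributed sketch below is a rough stand-in). Launch it with torchrun, one process per GPU; the payload size and iteration count are arbitrary assumptions.

```python
import os
import time
import torch
import torch.distributed as dist

def measure_allreduce_gbps(numel: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun
    torch.cuda.set_device(local_rank)
    payload = torch.ones(numel, dtype=torch.float16, device="cuda")  # ~0.5 GB of fp16
    for _ in range(5):                                   # warm-up
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    per_call_s = (time.time() - start) / iters
    payload_gbits = numel * 2 * 8 / 1e9                  # fp16 bytes -> gigabits
    return payload_gbits / per_call_s                    # rough "algorithm bandwidth" in Gb/s

if __name__ == "__main__":
    print(f"all-reduce bandwidth: {measure_allreduce_gbps():.1f} Gb/s")
    dist.destroy_process_group()
```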
In this article, we will focus on bandwidth requirements and optimizations, since bandwidth is a hard constraint that takes significant hardware upgrades to change. We will assume the all-reduce algorithm is bandwidth-optimal and low latency. As an exercise, let's estimate the GPU-node-to-GPU-node bandwidth required for distributed training.
Estimating bandwidth requirements:
In order to overlap communication and compute for faster training, communication time has to be less than or equal to compute time per step.
communication time per step ≤ compute time per step
Consider a model with P parameters being trained on A100 GPUs with mixed precision, using a vanilla distributed data parallel strategy across N nodes.
How fast is compute per step for a model with P parameters?
Compute in one step = forward pass + backward pass + optimization
As an abstraction, we can think of a neural network as a stack of matrix multiplications traversed during these passes, interleaved with cheaper non-linear operations. The scaling-laws paper [1] estimates per-step compute as C ≈ 6PD FLOPs, where P is the total number of parameters in the model and D is the number of tokens in one batch.
Take a P = 10B parameter model with a global batch of 320 sequences of 128 tokens each (about 41k tokens), trained over N nodes with 8 GPUs each; the global batch stays fixed as nodes are added. Note this is a rough approximation.
- Compute per step ≈ 6 × (10 × 10⁹) × 320 × 128 = 24.576 × 10¹⁴ FLOPs
- Peak performance of a single A100 GPU (mixed precision) ≈ 300 × 10¹² = 3 × 10¹⁴ FLOP/s.
- At 50% utilization, 8 × N A100 GPUs deliver 3 × 10¹⁴ × 0.5 × 8 × N = 12 × N × 10¹⁴ FLOP/s.
- Time taken to complete one step: 24.576 × 10¹⁴ / (12 × N × 10¹⁴) ≈ 2/N s.
- To keep communication ≤ compute per step, the gradients must be transferred within 2/N s. With fp16 gradients, a 10B parameter model has about 10 × 10⁹ × 2 bytes = 20 GB = 160 Gb of gradients per step, so the required inter-node bandwidth (IB) for a 10B parameter model scales with the number of nodes as: IB ≥ 160 Gb / (2/N s) = N × 80 Gb/s.
- More generally, for a step time of roughly 2/N s, IB ≥ N × P GB/s, where P is the number of parameters in billions (equivalently, 8 × N × P Gb/s).
From the formula above, to train a 10B parameter model over 4 nodes, we need a minimum inter-node bandwidth of 4 × 10 × 8 Gbps = 320 Gbps, non-blocking.
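The same back-of-the-envelope estimate can be packaged as a small helper. The sketch below simply re-implements the arithmetic above (C ≈ 6PD, 50% utilization of ~300 TFLOP/s per A100, fp16 gradients); it is not a precise model and, like the estimate above, it ignores the roughly 2× traffic factor of a ring all-reduce.

```python
def min_internode_bandwidth_gbps(
    params_billion: float,              # model size P, in billions of parameters
    tokens_per_batch: int,              # global batch size D, in tokens
    num_nodes: int,
    gpus_per_node: int = 8,
    peak_flops_per_gpu: float = 3e14,   # ~A100 mixed-precision peak
    utilization: float = 0.5,
    bytes_per_gradient: int = 2,        # fp16 gradients
) -> float:
    """Inter-node bandwidth (Gb/s) needed to hide the gradient all-reduce
    behind one training step, using the C ~ 6*P*D FLOPs approximation."""
    flops_per_step = 6 * params_billion * 1e9 * tokens_per_batch
    cluster_flops_per_s = peak_flops_per_gpu * utilization * gpus_per_node * num_nodes
    step_time_s = flops_per_step / cluster_flops_per_s
    gradient_gbits = params_billion * 1e9 * bytes_per_gradient * 8 / 1e9
    return gradient_gbits / step_time_s

# 10B parameters, ~41k-token global batch, 4 nodes -> ~312 Gb/s, i.e. the ~320 Gbps above
print(min_internode_bandwidth_gbps(10, 40960, 4))
```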
Comparing this requirement against the bandwidth of common interconnects (plain TCP/IP sockets over Ethernet, IB/RoCE fabrics, NVLink-connected DGX systems) makes it clear which network setups will and won't scale for large models.
Will it scale to N GPUs?
The key factor is compute time versus communication time. If the compute time per step is long, scaling across multiple nodes is easy. If it is short (a simple architecture on a fast GPU, a small batch size), communication and synchronization time will dominate, and training performance will degrade as you keep adding GPUs to the mix.
Optimizations to scale with poor network bandwidth:
- Increase the batch size (more compute per step, making communication easier to hide)
- Gradient accumulation (aggregate gradients locally over several micro-batches and all-reduce less often, reducing network traffic; see the sketch below)
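As a rough sketch of the second point, gradient accumulation with PyTorch DDP can use the no_sync() context manager to skip the all-reduce on most micro-batches; the accumulation factor of 8 below is an arbitrary assumed value.

```python
import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUMULATION_STEPS = 8  # assumed value; tune so compute per sync stays well above communication time

def train_epoch(ddp_model: DDP, loader, optimizer, loss_fn):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        sync_now = (i + 1) % ACCUMULATION_STEPS == 0
        # ddp_model.no_sync() suppresses DDP's gradient all-reduce, so gradients
        # only accumulate locally; on every ACCUMULATION_STEPS-th micro-batch we
        # skip the context and backward() triggers a single all-reduce.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets) / ACCUMULATION_STEPS
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```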
Conclusion:
- Distributed training requires high inter-node bandwidth to scale.
- If you want to train multiple large models (≥10B parameters) simultaneously on a multi-node GPU cluster, you need high internal cluster bandwidth, roughly 200 Gbps to 1 Tbps (the higher the better, depending on your workloads).
- Beyond a certain scale, you also need to consider a more efficient all-reduce implementation to speed up training.
Appendix:
- [1] Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
- AWS EFA: https://aws.amazon.com/hpc/efa/
- Inter-GPU communication (GTC session): https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31880/
- NCCL GitHub issue: https://github.com/NVIDIA/nccl/issues/318