Will it scale to N GPUs?
Multi-node training & cluster network bandwidth
Why scale to multiple machines?
Using multiple GPUs to train neural networks has become quite common, and all major deep learning frameworks (PyTorch Distributed, Horovod, DeepSpeed, etc.) provide optimized multi-GPU and multi-machine training. Multi-machine distributed training plays an important role in speeding up model development.
- Faster training times: Single-node multi-GPU training can be slow, especially for large datasets or complex models. Distributed training across multiple nodes allows further parallelization of computation, leading to faster convergence and shorter training times.
- Larger models: With multi-machine distributed training, you can train larger models. For example, LLaMA-7B took about 180k A100 GPU hours. That's roughly equivalent to 30 machines, each with 8 A100 GPUs, running for a month.
- Optimal utilization of cluster resources: Multi-node distributed training also helps minimize fragmentation in a training cluster. For example, scheduling 4 GPUs on each of 2 nodes (8 GPUs total) is easier than finding a single node with all 8 GPUs free.
In practice though, setting up efficient multi-machine training is hard. Scaling efficiency refers to how training throughput increases as you add GPUs. One of the key bottlenecks is the inter-node bandwidth available in the cluster: with low bandwidth, multi-node training can end up slower than a single node.
What happens during distributed training?
Step 0: Data is fetched from the store to all nodes participating in distributed training.
Step 1: During the forward pass, each copy of the model runs on the batch of data it receives.
Step 2: A backward pass then computes the gradients, but the gradients are NOT yet used to update the weights.
Step 3: An all-reduce operation runs across all processes (gradients are averaged and broadcast back to every process).
Step 4: The final all-reduced gradients are used to update each model replica.
By allowing each GPU to train on different batches of data, and all-reducing the gradients, you’re effectively training on a larger batch and therefore speeding up training.
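As a concrete (if simplified) illustration of these steps, here is a minimal sketch of one synchronous data-parallel training step written directly against torch.distributed. It assumes the process group is already initialized (e.g. via torchrun) and that each process owns one GPU; in practice PyTorch DDP buckets and overlaps these all-reduces with the backward pass instead of looping over parameters afterwards.

```python
import torch
import torch.distributed as dist

def train_step(model, batch, optimizer, loss_fn):
    inputs, targets = batch                 # step 0/1: this rank's shard of the global batch
    loss = loss_fn(model(inputs), targets)  # step 1: forward pass on the local replica
    loss.backward()                         # step 2: local gradients, weights untouched
    world_size = dist.get_world_size()
    for param in model.parameters():        # step 3: all-reduce = sum across ranks, then average
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()                        # step 4: identical update on every replica
    optimizer.zero_grad()
    return loss.item()
```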
- If storage -> compute node bandwidth is low, step 0 becomes a bottleneck, since data has to be pushed to the compute nodes.
- If compute node -> compute node bandwidth is low, the all-reduce in step 3 becomes a bottleneck.
- If the all-reduce algorithm's latency increases linearly with the number of nodes, it becomes a bottleneck beyond a point.
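If you suspect the node-to-node link is the limiter, a quick sanity check is to time a large all-reduce directly (nccl-tests is the standard tool; the torch.distributed sketch below is a rough stand-in). Launch it with torchrun, one process per GPU; the payload size and iteration count are arbitrary assumptions.

```python
import os
import time
import torch
import torch.distributed as dist

def measure_allreduce_gbps(numel: int = 256 * 1024 * 1024, iters: int = 20) -> float:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun
    torch.cuda.set_device(local_rank)
    payload = torch.ones(numel, dtype=torch.float16, device="cuda")  # ~0.5 GB of fp16
    for _ in range(5):                                   # warm-up
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    per_call_s = (time.time() - start) / iters
    payload_gbits = numel * 2 * 8 / 1e9                  # fp16 bytes -> gigabits
    return payload_gbits / per_call_s                    # rough "algorithm bandwidth" in Gb/s

if __name__ == "__main__":
    print(f"all-reduce bandwidth: {measure_allreduce_gbps():.1f} Gb/s")
    dist.destroy_process_group()
```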
In this article, we will focus on bandwidth requirements and optimizations, since bandwidth is a hard constraint that takes significant hardware upgrades to change. We will assume the all-reduce algorithm is bandwidth-optimal and low latency. As an exercise, let's estimate the GPU-node-to-GPU-node bandwidth required for distributed training.
Estimating bandwidth requirements:
In order to overlap communication and compute for faster training, communication time has to be less than or equal to compute time per step.
communication time per step ≤ compute time per step
Consider a model with P parameters being trained on A100 GPUs with mixed precision, using a vanilla distributed data parallel strategy across N nodes.
How fast is compute per step for a model with P parameters?
Compute in one step = forward pass + backward pass + optimization
As an abstraction, we can think of a neural network as a stack of matrix multiplications traversed during these passes, interleaved with cheaper non-linear operations. The scaling-laws paper [1] estimates per-step compute as C ≈ 6PD FLOPs, where P is the total number of parameters in the model and D is the number of tokens in one batch.
Take a P = 10B parameter model with a global batch of 320 sequences of 128 tokens each (about 41k tokens), trained over N nodes with 8 GPUs each; the global batch stays fixed as nodes are added. Note this is a rough approximation.
- Compute per step ≈ 6 × (10 × 10⁹) × 320 × 128 = 24.576 × 10¹⁴ FLOPs
- Peak performance of a single A100 GPU (mixed precision) ≈ 300 × 10¹² = 3 × 10¹⁴ FLOP/s.
- At 50% utilization, 8 × N A100 GPUs deliver 3 × 10¹⁴ × 0.5 × 8 × N = 12 × N × 10¹⁴ FLOP/s.
- Time taken to complete one step: 24.576 × 10¹⁴ / (12 × N × 10¹⁴) ≈ 2/N s.
- To keep communication ≤ compute per step, the gradients must be transferred within 2/N s. With fp16 gradients, a 10B parameter model has about 10 × 10⁹ × 2 bytes = 20 GB = 160 Gb of gradients per step, so the required inter-node bandwidth (IB) for a 10B parameter model scales with the number of nodes as: IB ≥ 160 Gb / (2/N s) = N × 80 Gb/s.
- More generally, for a step time of roughly 2/N s, IB ≥ N × P GB/s, where P is the number of parameters in billions (equivalently, 8 × N × P Gb/s).
From the formula above, to train a 10B parameter model over 4 nodes, we need a minimum inter-node bandwidth of 4 × 10 × 8 Gbps = 320 Gbps, non-blocking.
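The same back-of-the-envelope estimate can be packaged as a small helper. The sketch below simply re-implements the arithmetic above (C ≈ 6PD, 50% utilization of ~300 TFLOP/s per A100, fp16 gradients); it is not a precise model and, like the estimate above, it ignores the roughly 2× traffic factor of a ring all-reduce.

```python
def min_internode_bandwidth_gbps(
    params_billion: float,              # model size P, in billions of parameters
    tokens_per_batch: int,              # global batch size D, in tokens
    num_nodes: int,
    gpus_per_node: int = 8,
    peak_flops_per_gpu: float = 3e14,   # ~A100 mixed-precision peak
    utilization: float = 0.5,
    bytes_per_gradient: int = 2,        # fp16 gradients
) -> float:
    """Inter-node bandwidth (Gb/s) needed to hide the gradient all-reduce
    behind one training step, using the C ~ 6*P*D FLOPs approximation."""
    flops_per_step = 6 * params_billion * 1e9 * tokens_per_batch
    cluster_flops_per_s = peak_flops_per_gpu * utilization * gpus_per_node * num_nodes
    step_time_s = flops_per_step / cluster_flops_per_s
    gradient_gbits = params_billion * 1e9 * bytes_per_gradient * 8 / 1e9
    return gradient_gbits / step_time_s

# 10B parameters, ~41k-token global batch, 4 nodes -> ~312 Gb/s, i.e. the ~320 Gbps above
print(min_internode_bandwidth_gbps(10, 40960, 4))
```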
Comparing this requirement against the bandwidth of common interconnects (plain TCP/IP sockets over Ethernet, IB/RoCE fabrics, NVLink-connected DGX systems) makes it clear which network setups will and won't scale for large models.
Will it scale to N GPUs?
The key factor is compute time versus communication time. If the compute time per step is long, scaling across multiple nodes is easy. If it is short (a simple architecture on a fast GPU, a small batch size), communication and synchronization time will dominate, and training performance will degrade as you keep adding GPUs to the mix.
Optimizations to scale with poor network bandwidth:
- Increase the batch size (more compute per step, making communication easier to hide)
- Gradient accumulation (aggregate gradients locally over several micro-batches and all-reduce less often, reducing network traffic; see the sketch below)
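As a rough sketch of the second point, gradient accumulation with PyTorch DDP can use the no_sync() context manager to skip the all-reduce on most micro-batches; the accumulation factor of 8 below is an arbitrary assumed value.

```python
import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP

ACCUMULATION_STEPS = 8  # assumed value; tune so compute per sync stays well above communication time

def train_epoch(ddp_model: DDP, loader, optimizer, loss_fn):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        sync_now = (i + 1) % ACCUMULATION_STEPS == 0
        # ddp_model.no_sync() suppresses DDP's gradient all-reduce, so gradients
        # only accumulate locally; on every ACCUMULATION_STEPS-th micro-batch we
        # skip the context and backward() triggers a single all-reduce.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets) / ACCUMULATION_STEPS
            loss.backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```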
Conclusion:
- Distributed training requires high inter-node bandwidth to scale.
- If you want to train multiple large models (≥10B parameters) simultaneously on a multi-node GPU cluster, you need high internal cluster bandwidth, roughly 200 Gbps to 1 Tbps (the higher the better, depending on your workloads).
- Beyond a certain scale, you also need to consider a more efficient all-reduce implementation to speed up training.
Appendix:
- [1] Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
- AWS EFA: https://aws.amazon.com/hpc/efa/
- Inter-GPU communication (GTC session): https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31880/
- NCCL GitHub issue: https://github.com/NVIDIA/nccl/issues/318