How to measure inter-GPU connection speed (single node)?
Key GPU Knowledge for ML Researchers Series:
- Multi-GPU Training in PyTorch with Code
- How to measure inter-GPU connection speed (single node)? (this article)
In this article, we measure the communication speed between GPUs within a single node. To do so, we introduce two tests: the P2P bandwidth latency test and the NCCL tests. The former measures the communication speed of each pair of GPUs individually, while the latter evaluates collective communication across all GPUs simultaneously.
What is Nvidia CUDA Peer-to-Peer (P2P)?
In very simplified terms, P2P is a feature of Nvidia GPUs that lets CUDA programs access and transfer data directly from one GPU’s memory to another, without staging it through the pool of system memory attached to a CPU. The image below illustrates a P2P connection.
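To build intuition for why P2P matters, here is a back-of-envelope sketch (my own simplification, not an Nvidia formula): without P2P, a GPU-to-GPU copy is staged through host memory as two sequential PCIe copies, so the effective bandwidth is roughly halved.

```python
# Back-of-envelope model: a staged GPU->host->GPU copy performs two
# sequential transfers of the same data, so the effective bandwidth is
# the harmonic combination of the two hop bandwidths.
def staged_copy_bandwidth(d2h_gbps: float, h2d_gbps: float) -> float:
    """Effective GB/s of a copy staged through host memory (no P2P)."""
    return 1 / (1 / d2h_gbps + 1 / h2d_gbps)

# Two ~12 GB/s PCIe hops yield roughly half the single-hop bandwidth:
print(staged_copy_bandwidth(12.0, 12.0))  # 6.0
```

This is only a first-order model, but it matches the intuition that disabling P2P costs you about a factor of two on PCIe-only paths.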
What is Nvidia GPU topology?
Simply put, GPU topology describes how the GPUs within a computer system connect to each other and to other components, e.g., the CPU and memory. The full list of GPU link types and their explanations is provided here (screenshot below).
Several terms, such as NUMA node and PCIe host bridge, might be unfamiliar to many ML researchers. I’ve gathered some links that provide straightforward explanations of these concepts. In short, the link types get progressively faster as you move from SYS to NV12.
- NUMA node: video
- A better explanation of these terms, especially PCIe: link
- NVLink and NVSwitch: Nvidia’s blog
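The link-type hierarchy that `nvidia-smi topo -m` reports can be summarized as a small lookup. The ordering below follows nvidia-smi’s own legend; the helper function and its name are my own illustration:

```python
# Rough ranking of the link types reported by `nvidia-smi topo -m`,
# from slowest to fastest, following nvidia-smi's legend. Treat this as
# a rule of thumb, not a measured hierarchy.
LINK_SPEED_RANK = {
    "SYS": 0,   # PCIe plus the SMP interconnect between NUMA nodes (slowest)
    "NODE": 1,  # PCIe plus the link between PCIe host bridges within a NUMA node
    "PHB": 2,   # PCIe plus a PCIe host bridge (typically the CPU)
    "PXB": 3,   # multiple PCIe bridges, without crossing a PCIe host bridge
    "PIX": 4,   # at most a single PCIe bridge
    "NV#": 5,   # a bonded set of # NVLinks (fastest)
}

def faster_link(a: str, b: str) -> str:
    """Return whichever of two nvidia-smi link labels is expected to be faster."""
    def rank(label: str) -> int:
        # Any NVLink entry (NV1, NV4, NV12, ...) outranks every PCIe path.
        return LINK_SPEED_RANK["NV#"] if label.startswith("NV") else LINK_SPEED_RANK[label]
    return a if rank(a) >= rank(b) else b

print(faster_link("SYS", "NV12"))  # NV12
print(faster_link("PIX", "PHB"))   # PIX
```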
What is the P2P bandwidth latency test?
It measures the bandwidth and latency between each pair of GPUs individually. The test is provided as part of the CUDA samples.
How to run the test?
I followed the instructions here, with minor changes to work around a reported build bug on Ubuntu 20/22.
git clone https://github.com/NVIDIA/cuda-samples.git
# build the samples; restricting SMS (the target GPU architectures)
# works around the reported Ubuntu 20/22 build bug
export SMS="50 52 60 61 70 75 80 86"
cd cuda-samples && make -k
# run test
cd bin/x86_64/linux/release
./p2pBandwidthLatencyTest
What is NCCL?
NCCL stands for “NVIDIA Collective Communications Library.” It is an optimized set of primitives for inter-GPU communication, and one of the three backends supported by PyTorch Distributed. The following description is taken from the NCCL GitHub repo.
NCCL (pronounced “Nickel”) is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
How can we use NCCL to assess the performance of GPU communications?
In short, Nvidia also provides NCCL Tests to check both the performance and the correctness of NCCL operations.
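nccl-tests reports both an algorithm bandwidth (algbw = message size / time) and a bus bandwidth (busbw) that is comparable across collectives and GPU counts. Below is a minimal sketch of the conversion, with factors following the nccl-tests PERFORMANCE.md (the function name is mine):

```python
# Convert the algorithm bandwidth (algbw = message size / time) reported
# by nccl-tests into bus bandwidth (busbw). The scaling factors follow
# the nccl-tests PERFORMANCE.md.
def busbw(algbw_gbps: float, n_gpus: int, op: str) -> float:
    factors = {
        "all_reduce": 2 * (n_gpus - 1) / n_gpus,
        "all_gather": (n_gpus - 1) / n_gpus,
        "reduce_scatter": (n_gpus - 1) / n_gpus,
        "broadcast": 1.0,
        "reduce": 1.0,
    }
    return algbw_gbps * factors[op]

# Example: an 8-GPU all-reduce reporting algbw = 160 GB/s
print(busbw(160.0, 8, "all_reduce"))  # 280.0
```

busbw is the number to compare against the hardware’s link bandwidth, since it accounts for the extra data movement each collective performs.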
How to run the test?
# build NCCL
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..
# build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make NCCL_HOME=<your path to nccl>/nccl/build
# run NCCL tests
export LD_LIBRARY_PATH=<your path to nccl>/nccl/build/lib
# all reduce on 4 gpus scanning from 1GB to 8GB
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 4
Remember to replace “<your path to nccl>” with the path to the NCCL you built in the first step. Note that I point NCCL_HOME at “nccl/build” rather than “nccl” as stated in the official readme, following this solution to the GitHub issue.
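As a sanity check on the flags, `-b`, `-e`, and `-f` define a geometric scan of message sizes. The sketch below (my own helper, mirroring the documented flag semantics) lists the sizes the command above will run:

```python
# Message sizes scanned by nccl-tests for `-b 1G -e 8G -f 2`:
# start at the begin size and multiply by the step factor (-f)
# until the end size is exceeded.
def scan_sizes(begin: int, end: int, factor: int) -> list:
    sizes = []
    size = begin
    while size <= end:
        sizes.append(size)
        size *= factor
    return sizes

GiB = 1 << 30
print([s // GiB for s in scan_sizes(1 * GiB, 8 * GiB, 2)])  # [1, 2, 4, 8]
```

So the command above benchmarks the collective at 1 GB, 2 GB, 4 GB, and 8 GB messages on 4 GPUs.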
Experiments
We ran the above two tests on two different servers:
- M1: 8 * Nvidia A100-SXM4-80GB (DGX A100)
- M2: 4 * Nvidia RTX A6000
System topology
- M1: Every GPU pair is connected with third-generation NVLink via NVSwitch (NV12, i.e., each connection traverses a bonded set of 12 NVLinks). We also present the system block diagram below.
Print out the GPU topology. (The Mellanox ConnectX-5 InfiniBand cards (mlx5) are omitted.)
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 48-63,176-191 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 48-63,176-191 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 16-31,144-159 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 16-31,144-159 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 112-127,240-255 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 112-127,240-255 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 80-95,208-223 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 80-95,208-223 5
- M2: Only GPU pairs 0–1 and 2–3 are connected with NVLink (NV4, a bonded set of four links). All other connections must traverse the interconnect between NUMA nodes (SYS). Print out the GPU topology.
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X NV4 SYS SYS 0-9,20-29 0
GPU1 NV4 X SYS SYS 0-9,20-29 0
GPU2 SYS SYS X NV4 10-19,30-39 1
GPU3 SYS SYS NV4 X 10-19,30-39 1
P2P bandwidth latency test
- M1 (CPU results dropped as irrelevant)
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1533.37 14.72 16.97 17.01 17.42 17.08 17.49 17.07
1 14.90 1530.36 16.92 17.09 17.39 17.55 16.97 17.62
2 16.91 17.24 1537.89 14.67 17.46 16.96 17.49 17.00
3 17.24 17.14 14.83 1536.38 17.34 17.53 16.95 17.63
4 16.86 17.27 17.10 17.30 1525.88 15.62 17.51 17.17
5 16.85 17.26 17.11 17.32 15.76 1568.78 17.49 17.09
6 16.77 17.36 17.15 17.37 16.87 17.65 1571.93 15.80
7 16.65 17.06 17.14 17.36 17.00 17.68 15.96 1568.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1528.86 268.68 269.92 271.64 273.68 274.18 274.02 274.24
1 268.56 1547.03 264.01 274.54 274.08 271.68 275.02 273.81
2 268.52 270.91 1540.93 273.65 272.05 273.71 274.78 274.01
3 269.19 271.50 269.56 1586.29 272.85 275.42 274.77 275.20
4 270.60 272.12 269.99 275.84 1584.69 273.95 274.78 273.51
5 270.16 271.49 270.05 274.75 274.78 1579.88 273.10 273.68
6 272.31 268.64 270.10 275.19 274.23 274.48 1581.48 272.35
7 271.57 271.91 272.25 274.82 274.54 275.16 275.90 1584.69
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1555.50 15.87 18.73 18.87 19.14 19.17 19.17 19.18
1 16.04 1599.28 18.83 18.69 19.10 18.95 19.14 19.16
2 18.28 18.26 1573.51 15.68 19.30 19.09 19.19 19.24
3 18.37 18.50 15.68 1596.83 19.26 19.23 19.16 19.23
4 18.63 18.86 18.84 18.90 1599.28 17.88 19.01 19.06
5 18.53 18.83 18.81 18.90 17.94 1593.57 18.96 19.04
6 18.50 18.84 18.77 18.90 18.75 19.02 1596.02 18.14
7 18.42 18.83 18.74 18.88 18.72 18.99 18.16 1593.57
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1552.41 419.61 419.23 418.82 417.81 418.04 418.82 420.02
1 421.44 1606.68 451.71 517.44 517.10 518.13 518.30 519.33
2 419.93 452.19 1578.28 449.81 448.26 449.42 449.55 449.55
3 421.31 518.47 451.43 1604.21 520.71 517.27 518.76 519.63
4 420.26 516.09 450.97 518.62 1599.28 519.17 518.48 517.29
5 421.31 517.46 451.61 518.66 517.80 1597.65 517.65 519.01
6 421.63 520.33 451.43 515.58 519.52 517.29 1593.57 517.57
7 421.85 516.26 451.74 519.01 519.87 519.87 518.83 1588.71
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.64 24.63 25.16 24.60 24.61 24.74 24.52 24.70
1 24.96 2.14 24.78 24.61 24.24 24.60 24.51 24.62
2 25.47 24.69 2.39 24.62 24.61 24.70 24.55 24.60
3 25.45 24.60 24.60 2.09 24.60 24.61 24.55 24.60
4 25.19 23.84 24.90 24.59 2.13 24.50 24.60 24.52
5 25.62 24.59 25.25 24.53 24.58 2.64 24.67 24.67
6 24.61 24.57 24.86 24.48 24.60 24.60 2.34 24.58
7 25.37 24.57 25.28 24.49 24.60 24.61 24.57 2.11
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.64 3.50 3.43 3.44 3.42 3.50 3.50 3.42
1 2.96 2.13 2.96 2.97 2.95 2.96 3.00 2.96
2 3.41 3.39 2.38 3.47 3.37 3.36 3.45 3.36
3 2.96 3.01 3.02 2.09 3.03 2.97 3.09 3.00
4 2.98 2.96 2.97 2.96 2.13 3.09 3.05 3.04
5 3.59 3.60 3.53 3.52 3.55 2.64 3.58 3.60
6 3.01 2.92 3.00 2.92 3.01 3.00 2.30 2.99
7 2.96 3.00 3.02 2.97 2.97 2.96 2.95 2.10
- M2 (CPU results dropped as irrelevant)
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 672.91 11.64 11.55 11.51
1 11.60 673.49 11.54 11.53
2 11.52 11.55 673.20 11.50
3 11.54 11.54 11.50 674.15
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 673.49 52.78 8.27 9.12
1 52.79 675.53 8.27 8.47
2 8.69 8.83 675.53 52.78
3 7.98 7.98 52.75 674.95
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.87 16.35 16.30 16.28
1 16.33 678.46 16.30 16.27
2 16.25 16.26 678.17 14.85
3 16.28 16.29 15.31 678.02
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.87 101.76 17.21 16.65
1 101.83 678.32 16.61 16.75
2 17.30 16.64 678.91 101.66
3 16.56 16.75 101.68 678.00
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 1.60 11.00 11.49 15.74
1 10.31 1.56 12.04 15.38
2 10.75 11.66 1.54 11.00
3 11.93 11.85 10.43 1.58
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 1.60 1.36 1.31 1.31
1 1.31 1.56 1.27 1.33
2 1.30 1.32 1.51 1.28
3 1.33 1.32 1.30 1.58
Clearly, when P2P is enabled, data travels much faster wherever NVLink is present; otherwise it traverses one or more PCIe hops. For reference, here are the speeds of the different PCIe generations: M1 has Gen 4 PCIe and M2 has Gen 3.
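For a rough yardstick, the theoretical one-direction bandwidth of a x16 PCIe link can be computed from the per-lane transfer rate and the 128b/130b encoding used since Gen 3. The helper below is my own sketch; real transfers also pay protocol overhead, which is why the measured PCIe numbers above fall below these maxima:

```python
# Approximate theoretical one-direction bandwidth of a x16 PCIe link
# per generation, in GB/s. Per-lane raw rates: Gen3 = 8 GT/s,
# Gen4 = 16 GT/s, Gen5 = 32 GT/s, all with 128b/130b encoding.
def pcie_x16_bandwidth_gbps(gen: int) -> float:
    gigatransfers = {3: 8, 4: 16, 5: 32}[gen]      # GT/s per lane
    encoding = 128 / 130                           # 128b/130b for Gen 3+
    bytes_per_lane = gigatransfers * encoding / 8  # GB/s per lane
    return 16 * bytes_per_lane                     # x16 link

print(round(pcie_x16_bandwidth_gbps(3), 1))  # 15.8
print(round(pcie_x16_bandwidth_gbps(4), 1))  # 31.5
```

These are link maxima per direction; the ~11.5 GB/s measured on M2 (Gen 3) and ~17 GB/s on M1 (Gen 4, with multiple hops) are consistent with them.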
NCCL Tests
We scanned from 1 GB to 8 GB for the all-reduce, broadcast, and all-to-all collective communication operations. We also set NCCL_P2P_LEVEL=LOC to disable the P2P connection. The results for the different topology combinations are presented in the table below.
Again, when P2P is enabled, data travels faster wherever NVLink is present; otherwise it traverses one or more PCIe hops. M1 (DGX A100) shows significantly higher communication speed and bandwidth than M2.