How to measure inter-GPU connection speed (single node)?

Anthony Peng
Polo Club of Data Science | Georgia Tech
8 min read · Oct 12, 2023

Key GPU Knowledge for ML Researchers Series:

  1. Multi-GPU Training in PyTorch with Code
  2. How to measure inter-GPU connection speed (single node)? (this article)

In this article, we aim to measure the communication speed between GPUs within a single node. To achieve this, we will introduce two tests: the P2P bandwidth latency test and the NCCL test. The former assesses the communication speed for each pair of GPUs individually, while the latter evaluates data flow across all pairs simultaneously.

Nvidia DGX A100

What is Nvidia CUDA Peer-to-Peer (P2P)?
In a very simplified description, P2P is a feature of Nvidia GPUs that allows CUDA programs to access and transfer data from one GPU's memory to another directly, without going through a shared pool of system memory attached to the CPU. The image below illustrates the P2P connection.

Image credit to Dr. Donald Kinghorn’s blog
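Before running any benchmark, you can quickly check whether your driver actually exposes P2P between GPU pairs. On reasonably recent drivers, nvidia-smi can print a P2P capability matrix (the exact option syntax may vary with driver version):

# P2P read capability between every GPU pair (OK = supported, NS = not supported);
# replace "r" with "w" for write capability or "n" for NVLink status
nvidia-smi topo -p2p r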

What is Nvidia GPU topology?
Simply put, the GPU topology defines how the GPUs within a computer system connect to each other and to other components, e.g., the CPU and memory. The full list of GPU link types and their explanations is provided here (screenshot below).

Several terms, such as NUMA node and PCIe host bridge, might be unfamiliar to many ML researchers. I've gathered some links that provide straightforward explanations for these concepts. In short, the link types in that list are roughly ordered from slowest (SYS, which crosses PCIe plus the interconnect between NUMA nodes) to fastest (NV#, a direct NVLink).
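To inspect the topology of your own machine, print the link matrix with nvidia-smi; the topology printouts in the Experiments section below are in exactly this format:

# Print the GPU connection matrix (link type for each GPU pair) plus CPU and NUMA affinity;
# a legend explaining SYS, NODE, PHB, PXB, PIX, and NV# is appended to the output
nvidia-smi topo -m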

What is the P2P bandwidth latency test?
It assesses the bandwidth and latency for each pair of GPUs individually. The test is provided as part of the CUDA samples.

How to run the test?
I followed the instructions here and made minor changes due to a reported bug on Ubuntu 20/22.

git clone https://github.com/NVIDIA/cuda-samples.git

# build samples (explicitly listing the target SM architectures works around the reported build bug on Ubuntu 20/22)
export SMS="50 52 60 61 70 75 80 86"
cd cuda-samples && make -k

# run test
cd bin/x86_64/linux/release
./p2pBandwidthLatencyTest
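If you only care about a subset of GPUs, you can restrict the test with CUDA_VISIBLE_DEVICES (a standard CUDA environment variable, not specific to this sample):

# run the same test, but only over GPU 0 and GPU 1
CUDA_VISIBLE_DEVICES=0,1 ./p2pBandwidthLatencyTest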

What is NCCL?
NCCL stands for “NVIDIA Collective Communication Library.” It serves as an optimized set of primitives for inter-GPU communication. It’s also one of the three backends supported in PyTorch Distributed. The following information is extracted from the NCCL GitHub repo.

NCCL (pronounced “Nickel”) is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
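Since NCCL is also what PyTorch Distributed uses for GPU collectives, a quick sanity check (assuming a CUDA build of PyTorch) is to print the NCCL version bundled with your installation:

# prints whether the NCCL backend is available and the bundled NCCL version
python -c "import torch; print(torch.distributed.is_nccl_available(), torch.cuda.nccl.version())"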

How can we use NCCL to assess the performance of GPU communications?
In short, Nvidia also provides NCCL Tests to check both the performance and the correctness of NCCL operations.

How to run the test?

# build NCCL
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..

# build NCCL tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make NCCL_HOME=<your path to nccl>/nccl/build

# run NCCL tests
export LD_LIBRARY_PATH=<your path to nccl>/nccl/build/lib
# all reduce on 4 gpus scanning from 1GB to 8GB
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 4

Remember to replace "<your path to nccl>" with the path to the NCCL you built in the first step. Note that NCCL_HOME points to "nccl/build" rather than "nccl" as in the official readme, following this solution to a GitHub issue.
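The broadcast and all-to-all scans in the experiments below can be run analogously with the corresponding binaries from the same build, using the same flags as the all-reduce case:

# broadcast and all-to-all, scanning from 1GB to 8GB on 4 GPUs
./build/broadcast_perf -b 1G -e 8G -f 2 -g 4
./build/alltoall_perf -b 1G -e 8G -f 2 -g 4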

Experiments

We ran the above two tests on two different servers:

  • M1: 8 × Nvidia A100-SXM4-80GB (DGX A100)
  • M2: 4 × Nvidia RTX A6000

System topology

  • M1: All GPUs are connected with third-generation NVLinks (NV12) via NVSwitch. We also present the system block diagram below.
DGX A100 SXM4 system block diagram. (Credit to ServeTheHome)

GPU topology printout (Mellanox InfiniBand ConnectX-5 (mlx5) cards are omitted):

      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity     NUMA Affinity
GPU0   X    NV12  NV12  NV12  NV12  NV12  NV12  NV12  48-63,176-191    3
GPU1  NV12   X    NV12  NV12  NV12  NV12  NV12  NV12  48-63,176-191    3
GPU2  NV12  NV12   X    NV12  NV12  NV12  NV12  NV12  16-31,144-159    1
GPU3  NV12  NV12  NV12   X    NV12  NV12  NV12  NV12  16-31,144-159    1
GPU4  NV12  NV12  NV12  NV12   X    NV12  NV12  NV12  112-127,240-255  7
GPU5  NV12  NV12  NV12  NV12  NV12   X    NV12  NV12  112-127,240-255  7
GPU6  NV12  NV12  NV12  NV12  NV12  NV12   X    NV12  80-95,208-223    5
GPU7  NV12  NV12  NV12  NV12  NV12  NV12  NV12   X    80-95,208-223    5
  • M2: Only GPUs 0 & 1 and 2 & 3 are connected with NVLink; all other pairs must traverse PCIe and the interconnect between NUMA nodes (SYS). GPU topology printout:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0   X    NV4   SYS   SYS   0-9,20-29     0
GPU1  NV4    X    SYS   SYS   0-9,20-29     0
GPU2  SYS   SYS    X    NV4   10-19,30-39   1
GPU3  SYS   SYS   NV4    X    10-19,30-39   1

P2P bandwidth latency test

  • M1 (CPU rows dropped, as they are irrelevant here)
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1533.37 14.72 16.97 17.01 17.42 17.08 17.49 17.07
1 14.90 1530.36 16.92 17.09 17.39 17.55 16.97 17.62
2 16.91 17.24 1537.89 14.67 17.46 16.96 17.49 17.00
3 17.24 17.14 14.83 1536.38 17.34 17.53 16.95 17.63
4 16.86 17.27 17.10 17.30 1525.88 15.62 17.51 17.17
5 16.85 17.26 17.11 17.32 15.76 1568.78 17.49 17.09
6 16.77 17.36 17.15 17.37 16.87 17.65 1571.93 15.80
7 16.65 17.06 17.14 17.36 17.00 17.68 15.96 1568.78
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1528.86 268.68 269.92 271.64 273.68 274.18 274.02 274.24
1 268.56 1547.03 264.01 274.54 274.08 271.68 275.02 273.81
2 268.52 270.91 1540.93 273.65 272.05 273.71 274.78 274.01
3 269.19 271.50 269.56 1586.29 272.85 275.42 274.77 275.20
4 270.60 272.12 269.99 275.84 1584.69 273.95 274.78 273.51
5 270.16 271.49 270.05 274.75 274.78 1579.88 273.10 273.68
6 272.31 268.64 270.10 275.19 274.23 274.48 1581.48 272.35
7 271.57 271.91 272.25 274.82 274.54 275.16 275.90 1584.69

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1555.50 15.87 18.73 18.87 19.14 19.17 19.17 19.18
1 16.04 1599.28 18.83 18.69 19.10 18.95 19.14 19.16
2 18.28 18.26 1573.51 15.68 19.30 19.09 19.19 19.24
3 18.37 18.50 15.68 1596.83 19.26 19.23 19.16 19.23
4 18.63 18.86 18.84 18.90 1599.28 17.88 19.01 19.06
5 18.53 18.83 18.81 18.90 17.94 1593.57 18.96 19.04
6 18.50 18.84 18.77 18.90 18.75 19.02 1596.02 18.14
7 18.42 18.83 18.74 18.88 18.72 18.99 18.16 1593.57
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1552.41 419.61 419.23 418.82 417.81 418.04 418.82 420.02
1 421.44 1606.68 451.71 517.44 517.10 518.13 518.30 519.33
2 419.93 452.19 1578.28 449.81 448.26 449.42 449.55 449.55
3 421.31 518.47 451.43 1604.21 520.71 517.27 518.76 519.63
4 420.26 516.09 450.97 518.62 1599.28 519.17 518.48 517.29
5 421.31 517.46 451.61 518.66 517.80 1597.65 517.65 519.01
6 421.63 520.33 451.43 515.58 519.52 517.29 1593.57 517.57
7 421.85 516.26 451.74 519.01 519.87 519.87 518.83 1588.71

P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.64 24.63 25.16 24.60 24.61 24.74 24.52 24.70
1 24.96 2.14 24.78 24.61 24.24 24.60 24.51 24.62
2 25.47 24.69 2.39 24.62 24.61 24.70 24.55 24.60
3 25.45 24.60 24.60 2.09 24.60 24.61 24.55 24.60
4 25.19 23.84 24.90 24.59 2.13 24.50 24.60 24.52
5 25.62 24.59 25.25 24.53 24.58 2.64 24.67 24.67
6 24.61 24.57 24.86 24.48 24.60 24.60 2.34 24.58
7 25.37 24.57 25.28 24.49 24.60 24.61 24.57 2.11
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.64 3.50 3.43 3.44 3.42 3.50 3.50 3.42
1 2.96 2.13 2.96 2.97 2.95 2.96 3.00 2.96
2 3.41 3.39 2.38 3.47 3.37 3.36 3.45 3.36
3 2.96 3.01 3.02 2.09 3.03 2.97 3.09 3.00
4 2.98 2.96 2.97 2.96 2.13 3.09 3.05 3.04
5 3.59 3.60 3.53 3.52 3.55 2.64 3.58 3.60
6 3.01 2.92 3.00 2.92 3.01 3.00 2.30 2.99
7 2.96 3.00 3.02 2.97 2.97 2.96 2.95 2.10
  • M2 (CPU rows dropped, as they are irrelevant here)
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 672.91 11.64 11.55 11.51
1 11.60 673.49 11.54 11.53
2 11.52 11.55 673.20 11.50
3 11.54 11.54 11.50 674.15
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 673.49 52.78 8.27 9.12
1 52.79 675.53 8.27 8.47
2 8.69 8.83 675.53 52.78
3 7.98 7.98 52.75 674.95

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.87 16.35 16.30 16.28
1 16.33 678.46 16.30 16.27
2 16.25 16.26 678.17 14.85
3 16.28 16.29 15.31 678.02
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 677.87 101.76 17.21 16.65
1 101.83 678.32 16.61 16.75
2 17.30 16.64 678.91 101.66
3 16.56 16.75 101.68 678.00

P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 1.60 11.00 11.49 15.74
1 10.31 1.56 12.04 15.38
2 10.75 11.66 1.54 11.00
3 11.93 11.85 10.43 1.58
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 1.60 1.36 1.31 1.31
1 1.31 1.56 1.27 1.33
2 1.30 1.32 1.51 1.28
3 1.33 1.32 1.30 1.58

Clearly, when P2P is enabled, data moves much faster between GPU pairs connected by NVLink. Otherwise, the data traverses one or more PCIe hops. Here is the speed of different PCIe generations for reference; M1 has Gen 4 PCIe and M2 has Gen 3. Note that on M2, enabling P2P actually lowers the unidirectional bandwidth between GPUs on different NUMA nodes (roughly 8-9 GB/s versus ~11.5 GB/s with P2P disabled), presumably because direct P2P writes across the inter-socket link are slower than staging transfers through system memory.

NCCL Tests

We scanned from 1 GB to 8 GB for the all-reduce, broadcast, and all-to-all collective operations. Results for the different topology combinations are presented in the table below. To disable the P2P connection, we toggled NCCL_P2P_LEVEL to LOC.
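For reference, here is a sketch of the P2P toggle on the command line, using the all-reduce scan; setting NCCL_P2P_LEVEL=LOC tells NCCL to never use P2P, so transfers fall back to going through host memory:

# P2P enabled (default: NCCL uses NVLink or PCIe P2P where available)
./build/all_reduce_perf -b 1G -e 8G -f 2 -g 4

# P2P disabled: LOC means "never use P2P"
NCCL_P2P_LEVEL=LOC ./build/all_reduce_perf -b 1G -e 8G -f 2 -g 4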

Again, when P2P is enabled, data moves faster between GPU pairs connected by NVLink; otherwise, it traverses one or more PCIe hops. M1 (DGX A100) achieves significantly higher bandwidth and shorter operation times than M2.
