Optcast: Open-Source In-Network Aggregation for Distributed Deep Learning

Nariaki Tateiwa
Published in nttlabs
Feb 2, 2024

Challenges of Data Parallelism Training

Distributed training across multiple GPUs has become increasingly important, regardless of the size of the deep learning model. One of the most basic distributed strategies is data parallelism, which can be applied to any model. In data parallelism, each GPU maintains a replica of the model, and the aggregated gradient values are shared among GPUs to update the model. The aggregation of gradient values is typically done using Allreduce collective communication, and the performance of the Allreduce limits workload throughput, especially at larger scales.
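The gradient exchange described above can be sketched in a few lines. This is a hedged illustration with hypothetical values, not NCCL's implementation: each "GPU" is modeled as a list of gradients, and the Allreduce is modeled as a sum across replicas followed by division by the replica count.

```python
# Minimal sketch of data-parallel gradient averaging (hypothetical values).
# Each "GPU" holds a model replica and computes gradients on its own
# mini-batch; an Allreduce (sum, then divide by the GPU count) gives every
# replica the same averaged gradient to apply.

def allreduce_mean(grads_per_gpu):
    """Average the gradient vectors held by each GPU."""
    n = len(grads_per_gpu)
    length = len(grads_per_gpu[0])
    summed = [sum(g[i] for g in grads_per_gpu) for i in range(length)]
    return [s / n for s in summed]

# Four GPUs, each with a 3-element gradient from its local mini-batch.
grads = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
print(allreduce_mean(grads))  # every replica applies the same update
```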

The NVIDIA Collective Communications Library (NCCL) is the current de facto standard for collective communication among GPUs. This library provides two main Allreduce algorithms, ring and double-binary-tree. Collective communication performance is generally evaluated using two metrics: the amount of data transferred and latency. The table below summarizes these metrics for each algorithm, where n is the number of GPUs and L is the data size of Allreduce.

The ring and double-binary-tree algorithms are composed of multiple peer-to-peer communications and aggregation operations on GPUs. Therefore, their latency depends on the number of GPUs n. In addition, the amount of data each GPU transfers is approximately twice the theoretical minimum of L.
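The "twice the minimum" figure can be derived from a simple cost model of ring Allreduce: the payload is split into n chunks, a reduce-scatter phase takes n-1 steps, and an allgather phase takes another n-1 steps, with each GPU sending one chunk per step. The function below is an illustrative model of those well-known costs, not code from NCCL:

```python
# Hedged cost model of ring Allreduce: n GPUs, payload of L bytes,
# split into n chunks of L/n bytes each.

def ring_allreduce_cost(n, L):
    chunk = L / n
    # reduce-scatter: n-1 steps, each GPU sends one chunk per step
    # allgather:      n-1 steps, each GPU sends one chunk per step
    steps = 2 * (n - 1)
    sent_per_gpu = steps * chunk  # = 2 * (n-1)/n * L, approaching 2L
    return sent_per_gpu, steps

sent, steps = ring_allreduce_cost(8, 1024)
print(sent, steps)  # 1792.0 bytes per GPU (1.75 * L), 14 steps
```

The step count grows linearly with n, which is why latency depends on the number of GPUs, and the per-GPU traffic approaches 2L as n grows.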

In-Network Aggregation

To mitigate the data transfer and latency of Allreduce, offloading aggregation to network components has gained attention. For example, NVIDIA SHARP performs data aggregation on switch ASICs using a virtually constructed aggregation tree. In Google Vertex AI, aggregation is performed by multiple dedicated servers called reducers (you can find more detailed information in this post). These technologies are collectively called in-network aggregation.

However, existing tools have proprietary implementations, are limited to InfiniBand, and require specific hardware.

Optcast: Open-Source In-Network Aggregation

Therefore, we have implemented Optcast, an open-source in-network aggregation plugin for NCCL that supports the Socket, RoCE, and InfiniBand transport protocols. This allows in-network aggregation to be carried out in various environments. The table below shows that Optcast supports a variety of execution environments.

https://github.com/osrg/optcast

Please read the following documents to build and test this project.

Our implementation, Optcast, is similar to the reduction server of Vertex AI. Each GPU splits its data into segments and sends them to multiple reduction servers. When a reduction server has received all of its segments, it aggregates them and returns the result to the GPUs.

This figure shows the data flow of in-network aggregation in Optcast and SHARP. Each GPU holds data A through Z. In Optcast, the data is divided and sent from each GPU to multiple reduction servers, shown as dashed arrows (the arrows to the right reduction server are omitted). The reduction servers aggregate the received data and return the result to the GPUs, shown as bold arrows. The figure depicts the case with two reduction servers.
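The flow in the figure can be modeled in a few lines. This is a hedged sketch of the segment-sharding idea only; the function names and the contiguous segment assignment are illustrative, not Optcast's actual wire protocol:

```python
# Sketch of the Optcast-style data flow: each GPU splits its payload into
# one segment per reduction server; each server sums the segments it
# receives from all GPUs and returns the reduced segment to every GPU.

def shard(data, num_servers):
    """Split one GPU's data into contiguous segments, one per server."""
    k = len(data) // num_servers
    return [data[i * k:(i + 1) * k] for i in range(num_servers)]

def allreduce_via_servers(gpu_data, num_servers):
    # Each GPU sends segment j to reduction server j (the dashed arrows).
    shards = [shard(d, num_servers) for d in gpu_data]
    # Server j sums segment j across all GPUs and returns it (bold arrows).
    reduced = [
        [sum(s[j][i] for s in shards) for i in range(len(shards[0][j]))]
        for j in range(num_servers)
    ]
    # Every GPU concatenates the returned segments into the full result.
    return [x for seg in reduced for x in seg]

data = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two GPUs' payloads
print(allreduce_via_servers(data, 2))    # [11, 22, 33, 44]
```

Note that each GPU sends only L bytes in total (its own payload, sharded across servers) and receives L bytes back, rather than the roughly 2L of the ring algorithm.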

Evaluation Results

Finally, we compare the Allreduce performance of Optcast with existing libraries. Our experimental environment has four GPU servers, each equipped with two NVIDIA V100 GPUs and two 100G NICs (ConnectX-6). NCCL's Allreduce was performed only among GPUs. For in-network aggregation, SHARP additionally used an NVIDIA Switch-IB 2, and Optcast used four reduction servers.

The experimental environment is described in more detail below.

The figure shows the throughput of FP32 Allreduce with 8 GPUs, for data sizes ranging from 8 MB to 1024 MB. Optcast shows higher throughput than the standard NCCL implementation and is competitive with SHARP for data sizes of 128 MB and above; Optcast and SHARP can theoretically achieve up to 1.75 times the throughput of NCCL for an 8-client Allreduce, and the measured throughput approaches this bound. On the other hand, Optcast's throughput is lower than SHARP's when the data size is 64 MB or less, due to the overhead of sending and receiving. However, we believe this is not an issue for data parallelism, since the Allreduce data size exceeds 64 MB in many cases on that workload.
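The 1.75x bound follows directly from the traffic model: with ring Allreduce each GPU sends 2(n-1)/n * L bytes, while with in-network aggregation it sends only L, so at the same link bandwidth the theoretical throughput ratio is 2(n-1)/n. A one-line check:

```python
# Theoretical throughput gain of in-network aggregation over ring
# Allreduce: ring sends 2*(n-1)/n * L bytes per GPU, in-network
# aggregation sends only L, so the ratio is 2*(n-1)/n.

def in_network_speedup(n):
    return 2 * (n - 1) / n

print(in_network_speedup(8))  # 1.75 for an 8-client Allreduce
```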

As a next step, we will incorporate Optcast, our in-network aggregation tool, into distributed deep learning training workloads such as PyTorch's DistributedDataParallel, DeepSpeed (ZeRO), and Horovod.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities like WebAssembly, containers, and so on. Please visit https://www.rd.ntt/e/sic/recruit/ to see how to join us.
