There are many ways to do distributed training for a neural network, such as using Horovod, BytePS, or PyTorch's own distributed package. This post looks into how PyTorch implements the distributed package.
Assume you have an application that uses data parallelism to train a network on a single node. All you need to do is modify the code:
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl", rank=rank, world_size=world_size)
model = DDP(model, device_ids=[rank])
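For context, here is a minimal, self-contained sketch of the whole pattern; the linear model, the random data, and the one-GPU-per-rank setup are stand-ins for illustration, not part of any real application:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    # Rendezvous info for init_process_group; values here are placeholders.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each process drives one GPU; DDP wraps the local model replica.
    model = nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(32, 10, device=rank)
    targets = torch.randn(32, 1, device=rank)

    loss = nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()   # gradients are averaged across ranks during backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)

Note that DDP is applied once per process: every rank keeps a full model replica, and the only communication is the gradient allreduce that DDP triggers inside backward().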
Stalled Ranks in Horovod
Horovod is a very popular framework for distributed training. With the communication primitives it provides, we can easily implement multi-GPU training.
Recently we encountered a bug in production: some training jobs complained about stalled ranks, and strangely the failures happened only intermittently. After analyzing the logs and the source code, we found the suspect: a piece of code that uses allreduce to average metrics across all ranks at the end of each epoch. One of the ranks issued one extra allreduce, because the condition that decides whether to report that metric happened to be true only for some random batches of data. The other ranks never joined that allreduce, so a deadlock occurred.
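The following is a distilled reconstruction of the buggy pattern, not the actual product code; the function and metric names are hypothetical:

import torch
import horovod.torch as hvd

hvd.init()


def saw_special_batches():
    # Stand-in for the real data-dependent condition; here it differs by
    # rank on purpose to reproduce the mismatch.
    return hvd.rank() == 0


loss = torch.tensor(0.5)
# Every rank participates in this allreduce, so it completes fine.
avg_loss = hvd.allreduce(loss, name="epoch_loss")

# Only some ranks take this branch, so only they issue the collective.
# Horovod matches collectives across ranks by name and order; the ranks
# that skipped it wait forever at their next collective -> stalled ranks.
if saw_special_batches():
    extra = torch.tensor(1.0)
    avg_extra = hvd.allreduce(extra, name="epoch_extra_metric")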
Just like in multi-threaded programs, if the order of Horovod collective operations is not the same across all ranks, the job can deadlock.
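One possible fix, assuming the metric can be combined as a weighted average, is to have every rank always join the same allreduce and contribute zero when it has nothing to report:

import torch
import horovod.torch as hvd

hvd.init()

# The same data-dependent condition as before; it may still differ across
# ranks, but now every rank issues an identical sequence of collectives.
has_metric = hvd.rank() == 0  # stand-in for the real condition
value = torch.tensor(0.7) if has_metric else torch.tensor(0.0)
count = torch.tensor(1.0 if has_metric else 0.0)

# Sum instead of averaging, then divide by how many ranks contributed.
total = hvd.allreduce(value, name="extra_metric_sum", op=hvd.Sum)
n = hvd.allreduce(count, name="extra_metric_count", op=hvd.Sum)
avg_extra = (total / n).item() if n.item() > 0 else None

Because the collective sequence no longer depends on the data a rank happens to see, all ranks stay in lockstep and the stall disappears.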