
There are many ways to do distributed training for your neural network, such as using Horovod, BytePS, or the distributed package in PyTorch. This post looks into how PyTorch implements its distributed package.

How to launch distributed data parallel training in PyTorch?

Assume you have an application that uses data parallel to train the network on a single node. All you need to do is modify the code:

  • to initialize a process group for communication
  • wrap the network in distributed data parallel instead of data parallel
  • use a distributed data sampler to split the data across the different ranks
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# rank and world_size come from the launcher (e.g. environment variables)
dist.init_process_group("nccl", rank=rank, world_size=world_size)
model = DDP(model, device_ids=[rank])
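To see what the distributed data sampler does for the last step, here is a minimal sketch in plain Python of how such a sampler shards indices across ranks. It is a simplification: the real torch.utils.data.DistributedSampler also reshuffles per epoch and the helper name shard_indices is hypothetical.

```python
def shard_indices(dataset_len, rank, world_size):
    # Pad to a multiple of world_size (wrapping around) so every rank
    # gets the same number of samples, then take every world_size-th
    # index starting at this rank's offset.
    total = ((dataset_len + world_size - 1) // world_size) * world_size
    indices = [i % dataset_len for i in range(total)]
    return indices[rank:total:world_size]

# With 10 samples and 4 ranks, each rank sees 3 indices:
for r in range(4):
    print(r, shard_indices(10, r, 4))
```

Because every rank sees the same number of batches, the gradient allreduce inside DDP is called the same number of times on every rank, which is exactly the invariant the deadlock story below depends on.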

Stalled Ranks in Horovod

Horovod is a very popular framework for distributed training. With the communication primitives it provides, we can easily implement multi-GPU training.

Recently we encountered a bug in our product: some training jobs complained about stalled ranks, and strangely, the bug happened only at random. After analyzing the logs and the source code, we found the suspicious code: after each epoch, an allreduce is used to average the metrics across all the ranks. One of the ranks had one extra metric, because the code that decides whether to use that metric was triggered only for some random batches of data. That rank issued one more allreduce than the others, so a deadlock occurred.

Just like in multi-threaded programs, if the order of Horovod operations is not the same across all the ranks, it might lead to deadlock.
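The stall can be reproduced without Horovod at all. The sketch below models each allreduce as a barrier that every "rank" (here, a thread) must reach; the names fake_allreduce, "loss", and "extra_metric" are hypothetical stand-ins for the collective calls in our training code. Rank 0 issues one extra collective that the other rank never matches, so it stalls.

```python
import threading

WORLD_SIZE = 2
barrier = threading.Barrier(WORLD_SIZE)
stalled = []

def fake_allreduce(rank, name):
    # Model an allreduce as a rendezvous: returns False if the other
    # ranks never issue the matching call (i.e. this rank stalls).
    try:
        barrier.wait(timeout=0.5)
        return True
    except threading.BrokenBarrierError:
        stalled.append((rank, name))
        return False

def worker(rank):
    fake_allreduce(rank, "loss")        # every rank averages the loss...
    if rank == 0:                       # ...but a data-dependent branch makes
        fake_allreduce(rank, "extra_metric")  # only rank 0 issue one more call

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stalled)  # rank 0 stalls waiting on "extra_metric"
```

In real code the fix is to make the collective unconditional on every rank, for example by always reducing the metric and sending a zero (plus a count) from ranks that did not compute it.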
