RiseML supports distributed deep learning with Horovod and TensorFlow

Henning Peters
Published in RiseML Blog
Mar 9, 2018

Most deep learning models are trained on a single node because parallelisation is difficult due to strong computational dependencies. Training a model in a distributed fashion is still possible, however, and can reduce training time significantly. Distributed TensorFlow provides a framework for distributed training, but adapting a model to it is an extremely tedious and time-consuming job.

A few days ago we released RiseML version 1.1.0 with support for Horovod, a new open-source library by Uber that simplifies distributed training with TensorFlow. The benefit over Distributed TensorFlow is that Horovod requires only small changes to the machine learning code while achieving similar improvements in training time. In the official benchmarks, Horovod achieved a speed-up of up to 115x on 128 GPUs.

The official Horovod benchmarks, run on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network, show 90% scaling efficiency for both Inception V3 and ResNet-101, and 79% scaling efficiency for VGG-16.

How does Horovod work?

Horovod works like Distributed TensorFlow in synchronous data-parallel mode: each node computes parameter updates to the model in parallel. In Distributed TensorFlow, the parameters are collected after every batch by dedicated nodes called parameter servers. This can quickly become a bottleneck as the number of nodes grows, and choosing the optimal number of parameter servers for an experiment is itself not trivial. In Horovod, nodes instead communicate directly: gradients are averaged and exchanged by logically organizing the nodes in a ring and running a ring-allreduce algorithm, in which each node only talks to its two neighbors and the amount of data each node sends is essentially independent of the number of nodes.

Thus, no parameter servers are needed. Additionally, Horovod builds on low-level communication libraries such as the Message Passing Interface (MPI) and NVIDIA's NCCL to speed up the exchange of gradients.
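
To make the communication pattern concrete, here is a toy, single-process simulation of ring-allreduce in plain Python and NumPy. It is purely illustrative (no MPI, no Horovod; the function name and data are made up) and just shows the two phases, scatter-reduce and allgather, that the real implementation runs over the network.

import numpy as np

def ring_allreduce(grads):
    """Toy in-memory simulation of ring-allreduce (not Horovod code).

    grads: list of N equal-length arrays, one per simulated node.
    Returns N arrays, each holding the element-wise sum of all inputs.
    """
    n = len(grads)
    # Each node splits its gradient into n chunks.
    chunks = [list(np.array_split(g.astype(float).copy(), n)) for g in grads]

    # Phase 1: scatter-reduce. In each of n-1 steps, node i passes one
    # chunk to node (i+1) % n, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += sent[i]

    # Now node i holds the fully reduced chunk (i+1) % n.
    # Phase 2: allgather. Pass the finished chunks around the ring so
    # every node ends up with every reduced chunk.
    for step in range(n - 1):
        sent = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = sent[i]

    return [np.concatenate(c) for c in chunks]

# 4 simulated nodes, each with its own gradient vector of length 8.
grads = [np.full(8, i + 1.0) for i in range(4)]
print(ring_allreduce(grads)[0])  # every node ends up with [10. 10. ... 10.]

Note that in every step each node sends exactly one chunk of size K/n (for a gradient of size K), so the total data each node transfers stays around 2K no matter how many nodes participate.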

Why use RiseML for Horovod?

Setting up and tuning an MPI cluster with GPU support is annoying. Additionally, launching an experiment looks quite different from running a regular Python script. For example, training a model with Horovod on 4 nodes with 4 GPUs each could be, let's say … a little more intuitive:

$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
-mca pml ob1 -mca btl ^openib \
python train.py

There is no reason why running an experiment should be so cumbersome. With RiseML, Horovod is available to all users by default and doesn't require any additional setup or tuning. Installing the right versions of TensorFlow, Horovod, and the system libraries, as well as setting up MPI, NCCL, and other low-level details, is done automatically.

Usage

To run a Horovod experiment you need to make a couple of small changes to your machine learning program and add a Horovod section to the riseml.yml file.

Here is an excerpt from the Horovod docs with the required code additions (a sketch that applies them to a minimal training script follows the list):

  1. Run hvd.init().
  2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list.
  3. Scale the learning rate by the number of workers.
  4. Wrap optimizer in hvd.DistributedOptimizer.
  5. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes.
  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.
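
Applied to a bare-bones TensorFlow 1.x training loop, those six steps look roughly like the sketch below. The toy model, learning rate, and checkpoint path are placeholders; the Horovod calls follow the patterns from the Horovod documentation.

import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod.
hvd.init()

# 2. Pin one GPU to this process, selected by its local rank on the node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model on random data -- stands in for your real input pipeline.
features = tf.random_normal([32, 784])
labels = tf.random_uniform([32], maxval=10, dtype=tf.int32)
logits = tf.layers.dense(features, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# 3. Scale the learning rate by the number of workers.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())

# 4. Wrap the optimizer so gradients are averaged across workers.
opt = hvd.DistributedOptimizer(opt)

global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# 5. Broadcast the initial variable states from rank 0 to all processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

# 6. Let only rank 0 write checkpoints so workers don't corrupt them.
checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks,
                                       config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)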

The riseml.yml with a Horovod section:

project: imagenet
train:
  framework: tensorflow
  tensorflow:
    version: 1.5.0
    horovod:
      workers: 32
  resources:
    gpus: 4
    cpus: 12
    mem: 10240
  run:
    - python train.py

Here, we chose a configuration with 32 workers, each equipped with 4 GPUs, which yields 128 MPI processes with one GPU each. Workers communicate with each other over the network, so it is recommended to maximize the number of GPUs per worker in order to minimize network overhead. Check out the RiseML docs on Horovod for more details.

Needless to say, you want to have a fast network, preferably InfiniBand. High-bandwidth network options in the cloud vary: AWS supports up to 25 Gbit/s, Azure up to 30 Gbit/s, and GCE only 10 Gbit/s.

It should also be noted that you can use Horovod to scale single-GPU programs to multiple GPUs. Check out this article on training a single-GPU Keras model with Horovod on 4 GPUs.
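
For Keras, the changes follow the same pattern. Here is a minimal sketch in the spirit of the Horovod Keras example; the toy model, data, and checkpoint filename are placeholders, not code from the article linked above.

import numpy as np
import tensorflow as tf
import keras
import horovod.keras as hvd

# Initialize Horovod and pin one GPU per process.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Toy data and model -- placeholders for your real single-GPU Keras code.
x = np.random.rand(1024, 784).astype('float32')
y = keras.utils.to_categorical(np.random.randint(10, size=(1024, 1)), 10)
model = keras.models.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(keras.optimizers.SGD(lr=0.01 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# Broadcast the initial weights from rank 0; checkpoint only on rank 0.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))

model.fit(x, y, batch_size=32, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)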

Try it out yourself!

With Horovod support in RiseML you can easily make use of distributed training. RiseML can be deployed on-premises or in your cloud environment. You can even combine Horovod with autoscaling in the cloud to stay on budget.

Install RiseML with Horovod now!

Thanks to Nischal Padmanabha for reading drafts of this.
