SUPERCOMPUTING FOR ARTIFICIAL INTELLIGENCE — 05
Distributed Deep Learning with Horovod
Scaling Deep Learning on a Supercomputer using Horovod
[This post will be used in the master course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]
In the previous post we explored how to scale training across multiple GPUs in a single server with TensorFlow using tf.distribute.MirroredStrategy(). Now, in this post, we will use the Horovod API to scale training across multiple servers following a data-parallelism strategy.
1. Horovod
Uber Engineering introduced Michelangelo, an internal ML-as-a-service platform that makes it easy to build and deploy machine learning systems at scale. Horovod, a component of Michelangelo, is an open-source distributed training framework for TensorFlow, PyTorch, and MXNet. Its goal is to make distributed Deep Learning fast and easy to use: it is built around the ring-allreduce algorithm and requires only a few lines of modification to existing user code. Horovod is available under the Apache 2.0 license.
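To give a sense of how few changes are involved, below is a minimal sketch of a Horovod-enabled Keras training script. The tiny two-layer model and the hyperparameters are placeholders for illustration, not code from this post; the Horovod calls (hvd.init(), hvd.DistributedOptimizer, the broadcast callback) are the standard ones from the horovod.tensorflow.keras API.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod (typically one process per GPU, launched with horovodrun/mpirun).
hvd.init()

# Pin each process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model: any Keras model would work here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, then wrap the optimizer
# so that gradients are averaged across workers with ring-allreduce.
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Broadcast initial variables from rank 0 so every worker starts from the same state.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# ... load the dataset and call model.fit(x, y, callbacks=callbacks, ...)
```

Apart from these additions, the training loop is the same single-GPU Keras code; this is the sense in which Horovod asks for "only a few lines of modification".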
A data-parallel distributed training paradigm
Conceptually, the data-parallel distributed training paradigm under Horovod is straightforward: