Distributed training on multiple high performance computing instances can reduce the time to train modern deep neural networks on large data from weeks to hours or minutes, making this training technique paramount for practical use of deep learning. However, scaling model training from a single instance to multiple instances is not a trivial task for users. For example, users need to understand how to share and synchronize data across multiple instances, which has a big impact on scaling efficiency. In addition, users also need to know how to adapt a training script running on a single instance to run on multiple instances.
In this blog post, we present a fast and easy way to perform distributed training using the open source deep learning library Apache MXNet with the Horovod distributed learning framework. Specifically, we showcase the performance benefits of the Horovod framework and demonstrate how to update an MXNet training script so that it runs in a distributed manner with Horovod.
What is Apache MXNet
Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts much of the complexity involved in implementing neural networks, is highly performant and scalable, and offers APIs across popular programming languages such as Python, C++, Clojure, Java, Julia, R, Scala, and more.
Distributed training in MXNet with parameter server
The standard distributed training module in MXNet takes a parameter server approach. It uses a set of parameter servers to collect gradients from each worker, perform aggregation, and distribute updated gradients back to each worker for the next iteration of optimization. However, the server-to-worker ratio has a big impact on scaling efficiency, and choosing it well is not straightforward. If only one parameter server is used, it may become the computational bottleneck. If too many parameter servers are used, the “all-to-all” communication pattern may saturate network interconnects.
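To make the push/aggregate/pull cycle concrete, here is a minimal single-process sketch in plain Python. It is not MXNet's parameter server implementation: the function names are hypothetical, gradients are plain lists, and aggregation is assumed to be an element-wise sum on one server.

```python
def server_aggregate(worker_grads):
    """Element-wise sum of the gradients pushed by every worker.

    Hypothetical helper: stands in for the aggregation step that a
    single parameter server would perform.
    """
    return [sum(values) for values in zip(*worker_grads)]


def parameter_server_round(worker_grads):
    """One synchronous round: workers push, the server aggregates, workers pull."""
    aggregated = server_aggregate(worker_grads)      # push: workers -> server
    return [list(aggregated) for _ in worker_grads]  # pull: server -> workers


# Three toy workers, each holding a gradient for a two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pulled = parameter_server_round(grads)
print(pulled)
```

Every worker ends the round with the same aggregated gradient. In this toy version all traffic flows through one server, which is the situation in which a single parameter server can become the bottleneck.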
What is Horovod
Horovod is an open-source distributed deep learning framework created at Uber. It leverages efficient inter-GPU and inter-node communication methods such as NVIDIA Collective Communications Library (NCCL) and Message Passing Interface (MPI) to distribute and aggregate model parameters between workers. It optimizes the use of network bandwidth and scales very well with dense deep neural network models. It currently supports several mainstream deep learning frameworks such as MXNet, TensorFlow, Keras, and PyTorch.
Integrating MXNet with Horovod
MXNet is integrated with Horovod through the common distributed training APIs defined in Horovod. The Horovod communication APIs, such as horovod.allreduce(), are implemented as asynchronous callback functions executed by the MXNet engine as part of its task graph. By doing this, the data dependencies between communication and computation are handled seamlessly by the MXNet execution engine, which avoids performance loss due to synchronization. The distributed optimizer object defined in Horovod, horovod.DistributedOptimizer, extends the MXNet Optimizer so that it invokes the corresponding Horovod APIs to perform parameter updates in a distributed manner. All these implementation details are transparent to end users.
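The wrapping idea can be illustrated with a toy, single-process sketch. The classes below are hypothetical stand-ins, not the Horovod or MXNet classes, and the sketch assumes that the allreduce step averages gradients across workers before the wrapped optimizer applies its usual update.

```python
class ToySGD:
    """Hypothetical stand-in for a framework optimizer (not mx.optimizer.SGD)."""

    def __init__(self, lr):
        self.lr = lr

    def update(self, weight, grad):
        # Plain SGD step on a list of parameters.
        return [w - self.lr * g for w, g in zip(weight, grad)]


def allreduce_average(per_worker_grads):
    """Simulated allreduce: every worker would receive this same average."""
    n = len(per_worker_grads)
    return [sum(values) / n for values in zip(*per_worker_grads)]


class ToyDistributedOptimizer:
    """Wraps an optimizer so gradients are averaged across workers first."""

    def __init__(self, optimizer):
        self._optimizer = optimizer

    def update(self, weight, per_worker_grads):
        grad = allreduce_average(per_worker_grads)
        # Delegate the actual parameter update to the wrapped optimizer.
        return self._optimizer.update(weight, grad)


opt = ToyDistributedOptimizer(ToySGD(lr=0.5))
new_weight = opt.update([1.0, 1.0], [[2.0, 4.0], [4.0, 8.0]])
print(new_weight)
```

The user-facing optimizer interface stays the same; only the gradient it sees changes, which is why an existing training script needs so few modifications.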
You can quickly try out training a small convolutional network on the MNIST data set using MXNet with Horovod on your MacBook.
First, install mxnet and horovod from PyPI:
pip install mxnet
pip install horovod
Note: if you see an error during the pip install horovod step, you might need to set the variable MACOSX_DEPLOYMENT_TARGET=10.vv, where vv is the minor version of your macOS. For example, for macOS Sierra:
MACOSX_DEPLOYMENT_TARGET=10.12 pip install horovod
Second, install OpenMPI from here.
Finally, download the example script mxnet_mnist.py from here and run the following commands from your MacBook terminal in your work directory:
mpirun -np 2 -H localhost:2 -bind-to none -map-by slot python mxnet_mnist.py
This will launch a training job using two CPUs on your MacBook and here is a sample output:
INFO:root:Epoch Batch [0-50] Speed: 2248.71 samples/sec accuracy=0.583640
INFO:root:Epoch Batch [50-100] Speed: 2273.89 samples/sec accuracy=0.882812
INFO:root:Epoch Batch [50-100] Speed: 2273.39 samples/sec accuracy=0.870000
When training a ResNet50-v1 model on the ImageNet data set using 64 GPUs across eight p3.16xlarge EC2 instances, each equipped with eight NVIDIA Tesla V100 GPUs, on AWS cloud, we achieved a training throughput (i.e. the number of samples trained per second) of 45,000 images/s. The training converges in 44 minutes after 90 epochs with a top-1 accuracy of 75.7%.
We compare this with MXNet distributed training using parameter servers on 8, 16, 32 and 64 GPUs, with a single parameter server, a server-to-worker ratio of 1:1, and a server-to-worker ratio of 2:1 respectively. The result is shown in Fig. 1 below. The columns represent the number of trained images per second, shown on the y-axis on the left, and the lines represent the scaling efficiency (i.e. actual throughput vs ideal), shown on the y-axis on the right. As we can see, the number of parameter servers affects the scaling efficiency significantly. If only one instance is used as the parameter server, scaling efficiency drops to 38% at 64 GPUs. In order to achieve the same scaling efficiency as Horovod, the number of parameter server instances needs to be twice the number of worker instances.
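As a rough illustration of the scaling efficiency metric, the sketch below assumes that "ideal" throughput means perfectly linear scaling from a single-GPU baseline. The function name and the throughput numbers are hypothetical placeholders, not measurements from our experiments.

```python
def scaling_efficiency(throughput, num_gpus, single_gpu_throughput):
    """Ratio of measured throughput to ideal (linear) throughput.

    Assumes ideal throughput = num_gpus * single_gpu_throughput.
    """
    ideal = num_gpus * single_gpu_throughput
    return throughput / ideal


# Hypothetical numbers for illustration only.
efficiency = scaling_efficiency(throughput=720.0, num_gpus=8,
                                single_gpu_throughput=100.0)
print(f"{efficiency:.0%}")
```

An efficiency of 100% would mean that doubling the number of GPUs doubles the number of images processed per second.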
In Table 1 below, we compare the total instance cost when running different experiments on 64 GPUs. Using MXNet with Horovod achieves the best throughput with the least cost.
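The cost comparison can be sketched under simple assumptions: every instance (worker or parameter server) is billed at the same hourly rate, and cost is proportional to run time. The function name, the hourly rate, and the durations below are hypothetical placeholders, not the figures behind Table 1.

```python
def total_instance_cost(worker_instances, server_instances,
                        hourly_rate_usd, minutes):
    """Total cost of a run, assuming one flat hourly rate for all instances."""
    instances = worker_instances + server_instances
    return instances * hourly_rate_usd * minutes / 60.0


# Placeholder rate and duration, chosen only to keep the arithmetic readable.
RATE = 10.0      # hypothetical USD per instance-hour
MINUTES = 60.0   # hypothetical run time

# Allreduce-style run: eight workers, no dedicated server instances.
no_servers = total_instance_cost(8, 0, RATE, MINUTES)
# Parameter-server run with a 2:1 server-to-worker ratio: sixteen extra instances.
with_servers = total_instance_cost(8, 16, RATE, MINUTES)
print(no_servers, with_servers)
```

Under these assumptions, dedicated parameter server instances add directly to the bill even when throughput is the same.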
Steps to reproduce
The following steps explain how to reproduce our distributed training result using MXNet with Horovod. Please refer to this blog post for distributed training using MXNet with parameter server.
Create a cluster of homogeneous instances installed with MXNet version 1.4.0 or above and Horovod version 0.16.0 or above in order to run distributed training. You also need to install the required libraries for training with GPUs. Our instances are installed with Ubuntu 16.04 Linux, GPU Driver 396.44, CUDA 9.2, cuDNN 7.2.1 library, NCCL 2.2.13 communicator and OpenMPI 3.1.1. You may also use the Amazon Deep Learning AMI which comes with these libraries pre-installed.
Prepare your MXNet training script with Horovod APIs. The script below provides a simple skeleton based on the MXNet Gluon API. The lines that use the hvd module are the additions needed by Horovod if you already have an existing training script. Three of these changes are critical for training with Horovod:
- Set a context based on Horovod local rank (Line 8) to make sure the training process is run on the correct GPU.
- Broadcast initial parameters from one worker to all other workers (Line 18) to make sure all workers start with the same initial parameters.
- Create a Horovod DistributedOptimizer (Line 25) to perform parameter updates in a distributed manner.
1 import mxnet as mx
2 import horovod.mxnet as hvd
4 # Horovod: initialize Horovod
5 hvd.init()
7 # Horovod: pin the GPU to be used to the local rank
8 context = mx.gpu(hvd.local_rank())
10 # Build model
11 model = ...
13 # Initialize parameters
14 model.initialize(initializer, ctx=context)
15 params = model.collect_params()
17 # Horovod: broadcast parameters
18 hvd.broadcast_parameters(params, root_rank=0)
20 # Create optimizer
21 optimizer_params = ...
22 opt = mx.optimizer.create('sgd', **optimizer_params)
24 # Horovod: wrap optimizer with DistributedOptimizer
25 opt = hvd.DistributedOptimizer(opt)
27 # Create trainer and loss function
28 trainer = mx.gluon.Trainer(params, opt, kvstore=None)
29 loss_fn = ...
31 # Train model
32 for epoch in range(num_epoch):
33     ...
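To see why the GPU is selected with the local rank rather than the global rank, the following sketch enumerates process placement on a hypothetical cluster of four hosts with four GPUs each. It assumes ranks are assigned host by host, filling each host's slots in order; the helper name is made up for illustration.

```python
def assign_ranks(hosts):
    """Return (global_rank, host, local_rank) for every process.

    `hosts` is a list of (hostname, slots) pairs. Assumes processes fill
    the slots of one host before moving on to the next host.
    """
    placements = []
    rank = 0
    for host, slots in hosts:
        for local_rank in range(slots):
            placements.append((rank, host, local_rank))
            rank += 1
    return placements


placements = assign_ranks([("server1", 4), ("server2", 4),
                           ("server3", 4), ("server4", 4)])
for rank, host, local_rank in placements:
    print(f"rank {rank:2d} -> {host}, GPU {local_rank}")
```

Global ranks run from 0 to 15, but each host only has GPUs 0 to 3, so indexing the device by the local rank keeps every process on a GPU that exists on its own host.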
Log in to one of the worker instances to launch the distributed training using an MPI directive. In this example, a distributed training job is launched on four instances with four GPUs each, for a total of 16 GPUs in the cluster. It uses the Stochastic Gradient Descent (SGD) optimizer with the hyperparameters below:
- mini-batch size: 256
- learning rate: 0.1
- momentum: 0.9
- weight decay: 0.0001
As we scaled from one GPU to 64 GPUs, we linearly scaled the learning rate by the number of GPUs used (0.1 for one GPU to 6.4 for 64 GPUs) while keeping the number of images per GPU constant at 256 (mini-batch size of 256 for one GPU to 16,384 for 64 GPUs). The weight decay and momentum parameters were not altered as the number of GPUs increased. We use mixed precision in the training with float16 as the data type in the forward pass and float32 as the data type of the gradients to leverage the accelerated float16 computation supported by NVIDIA Tesla GPUs.
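The linear scaling rule described above can be written as a small helper. The function and constant names are hypothetical; the base values are the single-GPU hyperparameters listed earlier.

```python
BASE_LEARNING_RATE = 0.1   # learning rate for one GPU
BATCH_SIZE_PER_GPU = 256   # images per GPU, kept constant
MOMENTUM = 0.9             # not scaled with the number of GPUs
WEIGHT_DECAY = 0.0001      # not scaled with the number of GPUs


def scaled_hyperparameters(num_gpus):
    """Scale learning rate and global batch size linearly with GPU count."""
    return {
        "learning_rate": BASE_LEARNING_RATE * num_gpus,
        "global_batch_size": BATCH_SIZE_PER_GPU * num_gpus,
        "momentum": MOMENTUM,
        "weight_decay": WEIGHT_DECAY,
    }


for n in (1, 16, 64):
    print(n, scaled_hyperparameters(n))
```

For 64 GPUs this reproduces the values quoted in the text: a learning rate of 6.4 and a global mini-batch size of 16,384.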
$ mpirun -np 16 \
-H server1:4,server2:4,server3:4,server4:4 \
-bind-to none -map-by slot \
-mca pml ob1 -mca btl ^openib \
In this blog post, we introduced a scalable approach to distributed model training using Apache MXNet and Horovod. We demonstrated its scaling efficiency and cost efficiency compared to the parameter server-based approach, using a ResNet50-v1 model trained on the ImageNet data set. We also showed the steps for converting an existing script to run on multiple instances using Horovod.
If you are just starting with MXNet and deep learning, please head to the MXNet install page to build MXNet first. We also highly recommend the MXNet in 60 minutes blog post as a great way to get started.
If you are already an MXNet user and would like to try out distributed training using Horovod, please head to the Horovod install page, build Horovod with MXNet, and follow the MNIST or ImageNet example.
*cost is calculated based on hourly rate of AWS on-demand EC2 instances