Going distributed

Fabiana Clemente
YData
Apr 23, 2020

Demystifying the distribution of Deep Neural Networks

Deep Learning (DL) has been one of the hottest trends in AI. Its applications range from Uber’s self-driving cars to Google’s Duplex voice assistant.

To achieve the maturity and complexity of such solutions, we need to train complex DL models. Moreover, as data volumes grow and use cases become more sophisticated, many organizations face the challenge of training Machine Learning models on large amounts of data while keeping training times short. In many cases, it is hard to fit the training of those models on a single GPU, or even on multiple GPUs within the same server.

This leads not only to issues around the scalability of DL solutions but, in the worst case, to a drop in the productivity of data teams, which may have to wait several days to see the effect of their changes on a model’s performance. This hurts not only the team’s productivity but also the organization’s business value, as it may be unable to deliver analytics and insights promptly. Distributed training of models can help us tackle these challenges.

Moving towards distribution

Distributed DL can be achieved through two different methods of parallelism:

Methods of parallelism — Model vs Data
  1. Data Parallelism: Here we partition the data and send each part to a computational node (worker machine). The number of parts equals the number of worker machines. Each machine performs the necessary computations independently and, at the end, the learned parameters are synchronized. A toy sketch of this idea follows the list.
  2. Model Parallelism: As the name suggests, here we partition the model itself instead of the data. Each machine computes a specific part of the model and synchronizes the learned parameters at the end. This is harder to implement because there is no general rule for partitioning a model; we need to come up with a different approach for each one.
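If data parallelism still feels abstract, here is a toy sketch of the idea in plain Python/NumPy. The linear model, the loss, and the “workers” are all simulated in a single process, purely to illustrate the compute-locally-then-average-gradients pattern; none of this is from a specific framework.

    import numpy as np

    # Toy, single-process illustration of data parallelism: each "worker"
    # computes gradients on its own shard of the data, and the shared
    # parameters are updated with the average of those gradients.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
    w = np.zeros(3)                  # shared model parameters (linear model)
    n_workers, lr = 4, 0.1

    for step in range(100):
        shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
        # 1. each worker computes its local gradient independently...
        grads = [2 * Xi.T @ (Xi @ w - yi) / len(yi) for Xi, yi in shards]
        # 2. ...then the gradients are synchronized (here: a simple average)
        w -= lr * np.mean(grads, axis=0)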

If you need more details, check this article.

If the concepts above sound too complex, don’t worry! There are several tools available that make it easier to scale the training of DL models while keeping the solution simple.

Next, I’ll be covering some of them.

Horovod

Uber’s Distributed Training framework — Horovod

Developed by Uber, Horovod is a distributed training framework that works seamlessly with TensorFlow, Keras, PyTorch, and MXNet. The paradigm of data-parallel distributed training under Horovod is quite straightforward: it first runs multiple copies of the training script, one on each machine, then averages the gradients among those copies and updates a centralized model until training is considered complete. In terms of coding, getting started with Horovod is quite easy. Here you can check some code examples.
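As an illustration, a minimal data-parallel setup with Horovod’s PyTorch bindings could look like the sketch below; the model and learning-rate choices are placeholders for this example, not part of the original article.

    import torch
    import horovod.torch as hvd

    hvd.init()                                  # one process per GPU / worker
    torch.cuda.set_device(hvd.local_rank())     # pin each process to its GPU

    model = torch.nn.Linear(10, 1).cuda()       # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across all workers,
    # and make sure every worker starts from the same initial state.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # ...the usual training loop follows, unchanged.

The script would then typically be launched with horovodrun, e.g. horovodrun -np 4 python train.py, which starts one copy of the script per GPU.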

It was one of the first open-sourced frameworks for distributing the training of DL models. It has become extremely popular within the Machine Learning community and has been adopted by research teams and AI-based companies such as DeepMind and OpenAI.

But there are challenges when using this framework for multinode architectures, as it requires additional hardware and networking configuration.

PyTorch

Facebook’s Pytorch provides a solution for distributed DL

PyTorch was developed by Facebook and has gained huge popularity in recent years. It offers elegant, easy-to-use APIs to help you speed up training by distributing it. PyTorch provides two main approaches for parallelizing training: DataParallel (DP) and DistributedDataParallel (DDP).

DP is the simpler and more straightforward approach, requiring minimal changes to the training code. It works by slicing a training batch into smaller sub-batches (one per GPU) within a single process. DDP involves multiple Python processes that must be synchronized at appropriate points in the code. DDP provides a better balance of workload across workers and lower communication overhead. In DP, local gradients are collected on the master device for aggregation and the updated model is then broadcast back to the workers, whereas in DDP, gradients are aggregated (all-reduced) directly across all GPUs.
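Below is a minimal sketch of both approaches with a placeholder model; the DDP part assumes the script is launched with torchrun, which sets the LOCAL_RANK environment variable for each process.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # DP: a one-line wrapper, single process, splits each batch across GPUs.
    # model = torch.nn.DataParallel(model)

    # DDP: one process per GPU (launched e.g. with torchrun); each process
    # joins a process group and wraps its own replica of the model.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # From here, the usual forward/backward/step loop works unchanged;
    # gradients are all-reduced across processes during backward().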

These methods are quite straightforward to include in your code. You can check some examples here.

TensorFlow

Google’s TensorFlow provides a solution for distributed DL

This is arguably the most popular tool for DL and distributed DL. TensorFlow evolved from DistBelief (one of the earliest distributed DL tools at Google) and retains concepts such as the computation graph and the parameter server.

TensorFlow 2.0 introduced a new API, tf.distribute.Strategy, that can be used to distribute training across multiple GPUs, multiple machines, or TPUs. It allows you to distribute your existing models and training code with minimal changes. You can check some code examples here.

This API offers several different strategies for different purposes, depending on your requirements. With it, you can go from one machine with a single device, to one machine with multiple devices, and finally to multiple machines with multiple devices each, connected over a network.
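As a rough sketch, switching a Keras model to multi-GPU training with MirroredStrategy could look like this; the model itself is just a placeholder for the example.

    import tensorflow as tf

    # Single machine, multiple GPUs: MirroredStrategy replicates the model
    # on each device and averages the gradients across replicas.
    # (For multiple machines, tf.distribute.MultiWorkerMirroredStrategy
    # plays the same role, configured via the TF_CONFIG environment variable.)
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():             # variables are created under the strategy
        model = tf.keras.Sequential(
            [tf.keras.layers.Dense(1, input_shape=(10,))])  # placeholder model
        model.compile(optimizer="sgd", loss="mse")

    # model.fit(...) then runs the same Keras training code, now distributed.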

With TensorFlow 2.0, Google’s team has also integrated Keras, flattening the steep learning curve previously associated with TensorFlow. Plus, the framework offers built-in support for scalable data preprocessing (TensorFlow Extended, TFX) and privacy-preserving training (TensorFlow Privacy).

In this article, we’ve discussed some of the most widely used toolkits for distributing the training of your DL models. In general, the tools available already provide a straightforward code-level path to distributed training; nevertheless, all of them come with the overhead of choosing the machines and architecture wisely.

The next articles will cover in more depth the code implementation of distributed DL models using PyTorch and TensorFlow 2.0.

Fabiana Clemente is Chief Data Officer at YData
