
Scaling Deep Learning Applications
Disclaimer: The statements made and opinions expressed here are solely those of the author and do not necessarily reflect the views and opinions of my current and former employers, family and friends, or any dogs and cats that have lived with me during their lifetimes and mine!
The increasing complexity of deep neural networks (DNNs), massive amounts of data, and numerous parameters have made it challenging to exploit existing large-scale data processing pipelines for training and inference. A significant amount of research has gone into scaling DNN workloads and leveraging existing “big data production pipelines”.
Let’s look at deep neural nets as a standalone application to better understand their compute requirements.

DNNs derive their powerful nonlinear modeling capability from multiple hidden layers. Training DNNs generally requires a large volume of data and a huge amount of computational resources, which leads to long training times, ranging from several hours to many days, even with general-purpose GPU acceleration. Access to this level of resources is likely beyond what is available to most deep learning practitioners, and it is even less clear how to continue scaling significantly beyond a certain network size.
Though some techniques, such as locally connected networks and improved optimizers, have enabled scaling through algorithmic advances, the other main approach has been to achieve scale through greater computing power.
We can achieve scalability in two ways:
- Using multiple machines in a large cluster to increase the available computing power (“scaling out”).
- Leveraging graphics processing units (GPUs), which can achieve greater computational throughput than typical CPUs (“scaling up”).
Scaling-Out
Let’s talk about the first point, where we use a large-scale cluster of machines to distribute training and inference in deep networks. Frameworks have evolved that enable model parallelism within a machine (via multithreading) and across machines (via message passing), with the details of parallelism, synchronization and communication managed by the framework.
The performance benefits of distributing a deep network across multiple machines depend on the connectivity structure and computational needs of the model. Models with a large number of parameters or high computational demands typically benefit from access to more CPUs and memory, up to the point where communication costs dominate.
To understand this distribution better, let’s look at different modes of implementing parallelism/concurrency in a deep learning framework.
- Data Parallelism: A map-reduce-style distribution mechanism in which the same task operates on different subsets of the data across multiple nodes.

- Model Parallelism: The model is partitioned across several machines, so that responsibility for computing different parts of the network is assigned to different machines.

The use of GPUs has been a significant advance in recent years, making the training of modestly sized deep networks practical. A clear advantage could be obtained if we could combine improvements in scale-out architectures with many GPUs. However, efficient implementations of distributed optimization algorithms for machine learning applications are not easy; a major challenge is inter-machine data communication.
Let’s look at this a bit deeper.
Scaling-Out, GPUs, and Scaling-Up
Distributed optimization is becoming a key tool for solving large-scale machine learning problems. The data, the workload, or a combination of both is partitioned across worker machines, which access the globally shared model as they simultaneously perform local computations to refine it.
Bringing GPUs into the picture makes the computation of modestly sized networks practical. However, a known limitation of the GPU approach is that the training speed-up is small when the model does not fit in GPU memory (typically a few gigabytes). To use a GPU effectively, practitioners often reduce the size of the data or parameters so that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction work well for small problems (e.g., acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).
Now, to employ GPUs in a “data parallel” mode, where each GPU keeps a complete copy of the neural network parameters but computes a gradient using a different subset of the training data, the network parameters must fit on a single GPU, limiting us to the storage available on that GPU and hence to a certain number of parameters. The GPU code can compute a gradient for these parameters in just milliseconds per training image, yet copying the parameters or gradients to other machines over commodity Ethernet takes at least a few seconds, several orders of magnitude longer than the computation itself.
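To make the data-parallel pattern concrete, here is a minimal sketch, assuming PyTorch and its torch.distributed package with one process per GPU (any framework with a collective all-reduce would do; the function name and learning rate are illustrative). Each worker holds a full parameter copy, computes a gradient on its own shard of the batch, and the gradients are averaged across workers before every update:

```python
# Minimal data-parallel sketch. Assumes PyTorch, torch.distributed already
# initialized, and one process per GPU (e.g. started by a standard launcher).
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets, lr=0.01):
    """One step: local gradient on this worker's shard, then a global all-reduce."""
    loss = loss_fn(model(inputs), targets)
    model.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        # Sum gradients from every worker and average them. This is the transfer
        # whose cost balloons from milliseconds to seconds over commodity Ethernet.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
        p.data -= lr * p.grad          # plain SGD; every replica stays identical
    return loss.item()
```

The manual loop makes the communication cost explicit: every step ships the full gradient vector across the interconnect, which is exactly the transfer that commodity Ethernet stretches into seconds.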
Parallelizing the gradient computations with “model parallelism”, on the other hand, where each GPU is responsible for only a piece of the whole neural network, reduces bandwidth requirements considerably but also requires frequent synchronization (usually once for each forward- or backward-propagation step). This approach works well for GPUs in a single server (which share a high-speed bus) but is still too inefficient to be used with Ethernet networks.
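A correspondingly minimal model-parallel sketch, again assuming PyTorch and two CUDA devices in the same server (layer sizes are arbitrary), places each half of the network on a different GPU; only the boundary activations cross devices, but they do so on every forward and backward pass:

```python
# Minimal model-parallel sketch: the two halves of a network live on different
# GPUs in the same server. Assumes PyTorch and devices "cuda:0" and "cuda:1".
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(2048, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        # Only the boundary activations move between devices, but this copy (and
        # its mirror image in the backward pass) happens on every single step.
        return self.part2(h.to("cuda:1"))
```

Inside one server this boundary copy rides the PCIe bus and is cheap; split the same model across servers and it becomes the per-step synchronization over Ethernet described above.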
Communication Costs and Networking
The most important feature for deep learning performance is memory bandwidth. GPUs are optimized for memory bandwidth at the expense of memory access time (latency). Building large clusters of GPUs is difficult because of communication bottlenecks.
Worker machines must frequently read and write the globally shared parameters. This massive data access requires an enormous amount of network bandwidth. However, bandwidth is one of the scarcest resources in datacenters and the cloud, often 10–100 times smaller than memory bandwidth and shared among all running applications and machines. This leads to a huge communication overhead and becomes a bottleneck for distributed optimization algorithms.
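A back-of-the-envelope calculation shows why this becomes the bottleneck. All of the numbers below are illustrative assumptions (a 100-million-parameter model, float32 values, 10 Gb Ethernet, and a few milliseconds of gradient compute per step), not measurements:

```python
# Back-of-the-envelope numbers (all assumed, not measured): how long does one
# full gradient exchange take compared to the GPU compute it synchronizes?
params = 100e6                    # 100 M parameters (assumption)
payload_bytes = params * 4        # float32 -> ~400 MB per exchange

ethernet_bw = 10e9 / 8            # 10 GbE ~ 1.25 GB/s
comm_time = payload_bytes / ethernet_bw
print(f"communication: {comm_time:.2f} s per exchange")                       # ~0.32 s

gpu_step = 0.005                  # a few milliseconds of gradient compute (assumption)
print(f"compute fraction of a step: {gpu_step / (gpu_step + comm_time):.1%}")  # ~1.5%
```

Even with these generous assumptions, the workers spend far longer exchanging gradients than computing them, which is why the interconnect, not the GPU, sets the pace.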
It is not uncommon, in the early stages of a DNN proof of concept with a small dataset and model, to start with a single HPC node, which removes the need to deal with framework complexity. Thanks to the high speed of intra-machine GPU-to-GPU communication (say, over PCIe 3.0), such a setup can attain quite a high speedup.
For most cluster workloads, the network connectivity between nodes is quite slow. Even with a high-speed network card, communication often cannot keep up with the processing speed of the GPUs, and in those cases it can make sense to stick with a CPU-based cluster.
Using high-end networking infrastructure to remove the communication bottleneck between servers enables us both to exploit fast GPU computation and to “scale out” to many servers. Incorporating InfiniBand interconnects, which are dramatically faster (in both bandwidth and latency) than typical Ethernet networks, alleviates the bandwidth issue to a large extent. Specialized networking hardware and drivers, such as GPUDirect RDMA, can track and access GPU memory addresses and let GPUs on two different nodes communicate directly, which can also give a significant boost in throughput.
Balancing CPUs and GPUs
Choosing a configuration that balances the number of GPUs against CPUs is important for large-scale deep learning. Multi-GPU systems have demonstrated their ability to rapidly train very large neural networks (usually convolutional neural networks). Such systems rely on high-speed access to other GPUs across the host PCI bus to avoid communication bottlenecks, so it makes sense to put many GPUs into a single server in this case. But this approach scales only to a few GPUs before the host machine becomes overburdened by I/O, power, cooling, and CPU compute demands.
AI-Accelerated Libraries
Using a library of optimized routines provides a clear separation of concerns and allows specialization: library providers can take advantage of their deep understanding of parallel architectures to provide optimal efficiency.
Deep learning frameworks can focus on higher-level issues rather than close optimization of parallel kernels to specific hardware platforms. Also, as parallel architectures evolve, library providers can provide performance portability, in much the same way as the BLAS routines provide performance portability to diverse applications on diverse hardware.
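As a small illustration of this separation of concerns, the snippet below leans on whatever optimized BLAS backend (MKL, OpenBLAS, and so on) the local NumPy build links against instead of hand-writing the matrix multiply; the same principle applies when a framework delegates convolutions to cuDNN. Matrix sizes are arbitrary:

```python
# The heavy lifting is delegated to the BLAS library this NumPy build links
# against (MKL, OpenBLAS, ...); the application code stays hardware-agnostic.
import numpy as np

a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)

c = a @ b                # dispatches to an optimized, multithreaded GEMM kernel

np.__config__.show()     # reports which BLAS/LAPACK backend is actually in use
```

Swapping in a faster backend (or a new CPU generation) changes nothing in the application code, which is precisely the performance-portability argument above.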
Frameworks
The user typically defines the computation that takes place at each node in each layer of the model, and the messages that should be passed during the upward and downward phases of computation. For large models, the user may partition the model across several machines, so that responsibility for the computation for different nodes is assigned to different machines. The framework automatically parallelizes computation in each machine using all available cores, and manages communication, synchronization and data transfer between machines during both training and inference.
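As a hedged example of this division of labor, PyTorch's DistributedDataParallel wrapper plays the role described above: the user defines only the per-layer computation, and the wrapper handles gradient communication during the backward pass. The launcher details (rendezvous environment variables, one process per GPU) are assumed to be handled by standard tooling:

```python
# Sketch: framework-managed data parallelism, the counterpart of the manual
# all-reduce loop sketched earlier. Assumes one process per GPU, started by a
# standard PyTorch launcher that sets the rendezvous environment variables.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # NCCL exploits NVLink/GPUDirect when present
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 10).to(local_rank)
model = DDP(model, device_ids=[local_rank])        # gradient all-reduce now runs inside backward()
# The rest of the training loop (forward, loss, backward, optimizer step) is unchanged.
```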
The data sets, the compute resources available, ease of use, and SLA requirements may dictate which optimization method(s) are suitable. There are a few things to consider when picking a framework:
- Data set (images/videos or text)
- Accuracy requirements for training
- Trade-offs between SLAs (e.g., number of epochs) and accuracy requirements
- Distributed optimization methods available (or implementable) via the framework:
  - Model parallelism, data parallelism, or a combination of both
  - Message passing interfaces:
    - MVAPICH
    - OpenMPI
    - Aeron
  - Data parallelism choices:
    - Synchronous vs. asynchronous updates
    - Centralized vs. distributed synchronization
    - Parameter averaging vs. update (gradient)-based approaches (see the sketch after this list)
- Support for high-speed network interfaces
- Whether the workload is training-heavy, inference-heavy, or a combination of both
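To illustrate the last distinction in the list, parameter averaging versus gradient-based updates, here is a toy NumPy sketch. With plain SGD and a single local step per worker the two schemes coincide; they diverge once workers take several local steps or use stateful optimizers such as momentum or Adam. The worker count, learning rate, and values are arbitrary assumptions:

```python
# Toy comparison: parameter averaging vs. gradient averaging for one plain-SGD step.
import numpy as np

rng = np.random.default_rng(0)
w0 = rng.normal(size=4)                            # shared starting parameters
grads = [rng.normal(size=4) for _ in range(3)]     # one local gradient per worker
lr = 0.1

# (a) average the gradients across workers, then take one global step
w_grad_avg = w0 - lr * np.mean(grads, axis=0)

# (b) let each worker take its own local step, then average the resulting parameters
w_param_avg = np.mean([w0 - lr * g for g in grads], axis=0)

print(np.allclose(w_grad_avg, w_param_avg))        # True: identical for a single plain-SGD step
```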
Accessing the parameters requires an enormous amount of network bandwidth. Many machine learning algorithms are sequential, and the resulting barriers hurt performance when the cost of synchronization and machine latency is high. At scale, fault tolerance is also critical: learning tasks are often performed in large clusters or in the cloud, where machines can be unreliable or preempted.
For these reasons, as well as the relative ease of providing fault tolerance, data parallelism has received more research attention and more framework support.
ASICs (Application-Specific Integrated Circuits)
Although most current algorithmic, framework, and hardware development has centered around CPUs, GPUs, or a combination of the two (one handling training, the other inference), I would be remiss if I did not broach the topic of ASICs and deep learning. Most of this blog focuses on GPUs (or a mix of CPUs and GPUs), but a field of compute units built specifically for deep learning applications is ramping up.
ASIC-based compute units can deliver performance far exceeding that of GPUs. Intel’s Nervana, recently announced and to be integrated into the upcoming Knights Mill CPU architecture, provides optimizations for single-, half-, and quarter-precision instructions. Nervana and Google’s TPUs are a few of the emerging technologies in this area that provide excellent performance for targeted applications. Certain elements of a GPU that are needed for graphics processing are unnecessary in a DL application; by leaving them out and re-engineering the memory, the respective manufacturers claim, these new processing units can attain a significant speedup compared with GPUs. They also claim faster speeds and higher bandwidth across nodes, and Knights Mill specifically is to have addressable memory of up to 400 GB, far more than a GPU.
FPGAs, on the other hand, are configurable, and several vendors offer a range of options, such as integrated CPU cores, programmable logic and control, and accelerated data flow, that can be programmed specifically for inference and are claimed to be faster than a CPU alternative.
Current architectures also become obsolete faster as algorithms advance. Depending on cost, this may end up yielding cheaper (and still faster) clusters optimized for lower-precision deep learning workloads, compared with using GPUs for multipurpose (DNN, HPC, and graphics) applications.
Summary
Deep Learning, with its explosive growth in frameworks and proven accuracy across a wide range of data sets, has become an integral part of machine learning and big data analytical pipelines. Nowadays we can choose from an array of hardware configurations conditioned on the dataset, SLA, and accuracy requirements, without sacrificing the security or the flow of raw data through the rest of the data processing pipeline.
The research and development around Deep Learning is rapidly evolving with respect to algorithms as well as hardware and computing resources. What is apt today may not be as relevant or optimal tomorrow, based on cost, SLA, and accuracy needs. GPUs are great for DNN computing today, especially with the development of high-speed node interconnects and accelerated driver libraries. However, application-targeted ASICs are emerging as a viable and perhaps more effective alternative.
Future work: Over the next few blogs I will continue with benchmarks based on testing across data sets, optimization techniques, and a few chosen frameworks.
References
- “Google Research Publication: Large Scale Distributed Deep Networks.” https://research.google.com/archive/large_deep_networks_nips2012.html. Accessed 26 Aug. 2017.
- “Deep learning with COTS HPC systems.” http://proceedings.mlr.press/v28/coates13.html. Accessed 26 Aug. 2017.
- “On Optimization Methods for Deep Learning — Andrew Ng.” http://www.andrewng.org/portfolio/on-optimization-methods-for-deep-learning/. Accessed 26 Aug. 2017.
- “Scaling Distributed Machine Learning with the Parameter Server.” https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf. Accessed 26 Aug. 2017.
- “NVIDIA cuDNN | NVIDIA Developer.” https://developer.nvidia.com/cudnn. Accessed 26 Aug. 2017.
- “Developer Reference for Intel® Math Kernel Library 2017 — C | Intel ….” 10 May. 2017, https://software.intel.com/en-us/mkl-developer-reference-c. Accessed 26 Aug. 2017.
- “MVAPICH :: Home — The Ohio State University.” http://mvapich.cse.ohio-state.edu/. Accessed 26 Aug. 2017.
- “Open MPI.” https://www.open-mpi.org/. Accessed 26 Aug. 2017.
- “real-logic/aeron — GitHub.” https://github.com/real-logic/aeron. Accessed 26 Aug. 2017.
- “Intel + Nervana — Intel Nervana.” 9 Aug. 2016, https://www.intelnervana.com/intel-nervana/. Accessed 26 Aug. 2017.
- “How ‘Knights Mill’ Gets Its Deep Learning Flops — HPCwire.” 22 Jun. 2017, https://www.hpcwire.com/2017/06/22/knights-mill-gets-deep-learning-flops/. Accessed 26 Aug. 2017.
- “An in-depth look at Google’s first Tensor Processing Unit (TPU ….” 12 May. 2017, https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu. Accessed 26 Aug. 2017.
- “Hans Pabst Technical Computing Enabling … — CERN Indico.” 20 Mar. 2017, https://indico.cern.ch/event/595059/contributions/2499304/attachments/1430242/2196659/Intel_and_ML_Talk_HansPabst.pdf. Accessed 26 Aug. 2017.
- “Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable ….” 29 Jun. 2017, https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/. Accessed 26 Aug. 2017.