
Scaling in Deep Learning Applications

Vartika Singh
Aug 27, 2017

Disclaimer: The statements made and opinions expressed here are solely those of the author and do not necessarily reflect the views and opinions of my current and former employers, family and friends, and any dogs and cats that have lived with me during their and my lifetimes!

The increasing complexity of deep neural networks (DNNs), the massive amounts of data, and the sheer number of parameters have made it challenging to exploit existing large-scale data processing pipelines for training and inference. A significant amount of research has gone into scaling DNN workloads and leveraging existing “big data” production pipelines.

Let’s look at deep neural nets as a standalone application to better understand their compute requirements.

DNNs derive their powerful nonlinear modeling capability from multiple hidden layers. Training them generally requires a large volume of data and a huge amount of computational resources, leading to long training times, ranging from several hours to many days, even with general-purpose GPU acceleration. Access to this level of resources is likely beyond what is available to most deep learning practitioners, and it is even less clear how to continue scaling significantly beyond a certain network size.

Though some techniques, such as locally connected networks and improved optimizers, have enabled scaling through algorithmic advances, the other main approach has been to achieve scale through greater computing power.

We can achieve scalability in two ways:

  1. Using multiple machines in a large cluster to increase the available computing power (“scaling out”).
  2. Leveraging graphics processing units (GPUs), which can achieve greater computational throughput than typical CPUs (“scaling up”).

Scaling-Out

The performance benefits of distributing a deep network across multiple machines depend on the connectivity structure and computational needs of the model. Models with a large number of parameters or high computational demands typically benefit from access to more CPUs and memory, up to the point where communication costs dominate.

To understand this distribution better, let’s look at different modes of implementing parallelism/concurrency in a deep learning framework.

  • Data Parallelism: A map-reduce-style distribution mechanism in which the same model operates on different subsets of the data across multiple nodes (see the sketch after this list).
  • Model Parallelism: The model itself is partitioned across several machines, so that responsibility for computing different parts of the network is assigned to different machines.
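To make the data-parallel case concrete, here is a minimal NumPy sketch, not tied to any particular framework, in which each simulated worker computes a gradient on its own shard of the data for a simple linear model, and the gradients are averaged before a single synchronized parameter update. All names (`num_workers`, `local_gradient`, the toy problem itself) are illustrative assumptions, not any specific framework's API.

```python
import numpy as np

# Toy linear-regression problem: y = X @ w_true + noise
rng = np.random.default_rng(0)
num_samples, num_features, num_workers = 1024, 16, 4
X = rng.normal(size=(num_samples, num_features))
w_true = rng.normal(size=num_features)
y = X @ w_true + 0.01 * rng.normal(size=num_samples)

w = np.zeros(num_features)   # every worker holds a full copy of the parameters
lr = 0.1

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one worker's data shard."""
    residual = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ residual / len(y_shard)

for step in range(100):
    # Each "worker" sees a different shard of the data (data parallelism).
    shards = zip(np.array_split(X, num_workers), np.array_split(y, num_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # Synchronization point: average the per-worker gradients, then apply
    # the same update to the shared parameters on every worker.
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```

In a real distributed setting, the `np.mean(grads, axis=0)` line is where the expensive cross-machine communication happens; everything else is local computation.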

The use of GPUs has been a significant advance in recent years, making the training of modestly sized deep networks practical. A clear advantage could be obtained if we could combine improvements in scale-out architectures with many GPUs. However, efficient implementations of distributed optimization algorithms for machine learning applications are not easy; a major challenge is inter-machine data communication.

Let’s look at this a bit deeper.

Scaling-out, GPUs and Scaling-up

Bringing GPUs into the picture makes the computation of modestly sized networks practical. However, a known limitation of the GPU approach is that the training speed-up is small when the model does not fit in GPU memory (typically a few gigabytes). To use a GPU effectively, practitioners often reduce the size of the data or parameters so that CPU-to-GPU transfers are not a significant bottleneck. While data and parameter reduction work well for small problems (e.g., acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g., high-resolution images).

Now, to employ GPUs in a “data parallel” mode, where each GPU keeps a complete copy of the neural network parameters but computes a gradient using a different subset of the training data, the network parameters must fit on a single GPU, limiting us to the storage available on that GPU and hence to a certain number of parameters. The GPU can compute a gradient for these parameters in just milliseconds per training image, yet copying the parameters or gradients to other machines over commodity Ethernet takes at least a few seconds, several orders of magnitude slower.

Parallelizing the gradient computations with “model parallelism”, on the other hand, where each GPU is responsible for only a piece of the whole neural network, reduces bandwidth requirements considerably but requires frequent synchronization (usually once per forward or backward propagation step). This approach works well for GPUs in a single server (which share a high-speed bus) but is still too inefficient to be used over Ethernet networks.
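As a rough illustration of the synchronization pattern model parallelism imposes, the NumPy sketch below splits the output units of one hidden layer across two simulated devices; the partial activations have to be gathered (the per-step synchronization point) before the next layer can run. The layer shapes and the two-way split are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d_in, d_hidden, d_out = 32, 64, 256, 10

# Full weight matrices, partitioned column-wise across two "devices":
# each device owns half of the hidden units (model parallelism).
W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))
W1_dev0, W1_dev1 = np.hsplit(W1, 2)          # (d_in, 128) each

x = rng.normal(size=(batch, d_in))

# Each device computes activations only for the hidden units it owns.
h_dev0 = np.maximum(x @ W1_dev0, 0.0)        # ReLU on device 0's half
h_dev1 = np.maximum(x @ W1_dev1, 0.0)        # ReLU on device 1's half

# Synchronization point: the partial activations must be exchanged and
# concatenated before the next layer can be computed. On real hardware this
# is the communication that happens once per forward (and backward) pass.
h = np.concatenate([h_dev0, h_dev1], axis=1)
logits = h @ W2
print(logits.shape)                          # (32, 10)
```

Over a fast intra-server bus this per-step exchange is cheap; over commodity Ethernet it happens too often to hide.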

Communication Costs and Networking

The most important feature for deep learning performance is memory bandwidth. GPUs are optimized for memory bandwidth at the cost of memory access time (latency). Building large clusters of GPUs is difficult because of communication bottlenecks.

Worker machines must frequently read and write the global shared parameters. This massive data access requires an enormous amount of network bandwidth. However, network bandwidth is one of the scarcest resources in datacenters and clouds, often 10 to 100 times lower than memory bandwidth and shared among all running applications and machines. This leads to a huge communication overhead and becomes a bottleneck for distributed optimization algorithms.
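A quick back-of-the-envelope calculation makes the gap concrete. The numbers below (a 100M-parameter model in 32-bit floats, 10 Gb/s Ethernet, a few hundred GB/s of GPU memory bandwidth) are illustrative assumptions, not measurements:

```python
# Rough comparison of the time to move one full set of parameters/gradients
# over the network versus through GPU memory. All figures are assumptions
# chosen only to show the orders of magnitude involved.
num_params = 100_000_000            # 100M parameters
bytes_per_param = 4                 # float32
payload_bytes = num_params * bytes_per_param           # 400 MB per exchange

ethernet_bytes_per_s = 10e9 / 8     # 10 Gb/s commodity Ethernet
gpu_mem_bytes_per_s = 500e9         # ~500 GB/s GPU memory bandwidth (assumed)

print(f"over 10 Gb/s Ethernet : {payload_bytes / ethernet_bytes_per_s:.2f} s")
print(f"through GPU memory    : {payload_bytes / gpu_mem_bytes_per_s * 1e3:.2f} ms")
```

With these assumptions a single parameter exchange costs a few hundred milliseconds over the network versus under a millisecond through local GPU memory, which is exactly the imbalance that turns communication into the bottleneck.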

It is not uncommon, in the early stages of a DNN proof of concept, to start with a single HPC node, which removes the need to deal with framework complexity for a small data set or model. Thanks to the high speed of intra-machine GPU-to-GPU communication (say, over PCIe 3.0), such a setup can attain quite a high speed-up.

For most cluster workloads out there, the network connectivity between nodes is quite slow. Even with a high-speed network card, communication is not fast enough to keep up with the fast processing of GPUs. In these cases it can make sense to stick with a CPU-based cluster.

The use of high-end networking infrastructure to remove the communication bottleneck between servers enables us both to exploit fast GPU computation and to “scale out” to many servers. Incorporating InfiniBand interconnects, which are dramatically faster (in terms of both bandwidth and latency) than typical Ethernet networks, helps alleviate the bandwidth issue to a large extent. Specialized networking hardware and drivers, such as GPUDirect RDMA, can access GPU memory addresses directly and let GPUs on two different nodes communicate without staging through host memory, which can give a significant boost in throughput.

Balancing CPUs and GPUs

AI-Accelerated Libraries

With accelerated libraries such as NVIDIA cuDNN and Intel MKL handling the low-level kernels, deep learning frameworks can focus on higher-level issues rather than on closely optimizing parallel kernels for specific hardware platforms. As parallel architectures evolve, the library providers take care of performance portability, in much the same way that the BLAS routines provide performance portability to diverse applications on diverse hardware.
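The performance-portability point can be seen even at the NumPy level: `np.dot` / `@` dispatches to whatever optimized BLAS the installation was built against (MKL, OpenBLAS, etc.), so the same application code benefits from hardware-specific tuning. A minimal, unscientific timing sketch (the matrix size is arbitrary, and the naive loop will take several seconds):

```python
import time
import numpy as np

n = 256
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def naive_matmul(A, B):
    """Triple loop written directly in Python, no library acceleration."""
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.time(); C1 = naive_matmul(A, B); t_naive = time.time() - t0
t0 = time.time(); C2 = A @ B;              t_blas = time.time() - t0

print(f"naive loops: {t_naive:.2f} s, BLAS-backed: {t_blas * 1e3:.2f} ms")
print("results match:", np.allclose(C1, C2))
```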

Frameworks

The data sets, the compute resources available, ease of use, and SLA requirements may dictate which method(s) of optimization are suitable. There are a few things to consider when picking a framework:

  • Data set (images/videos or text)
  • Accuracy requirements for training
  • Trade-offs between SLAs (number of epochs) and accuracy requirements
  • Distributed optimization methods available (or implementable) via the framework
      • Model parallelism, data parallelism, or a combination of both
      • Message-passing interfaces (see the sketch after this list)
          • MVAPICH
          • OpenMPI
          • Aeron
      • Data parallelism
          • Synchronous vs. asynchronous updates
          • Centralized vs. distributed synchronization
          • Parameter averaging vs. update (gradient)-based approaches
  • Support for high-speed network interfaces
  • Whether the workload is training-heavy, inference-heavy, or a combination of both
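For the message-passing route, here is a minimal sketch of a synchronous, decentralized data-parallel step using mpi4py, which can run over OpenMPI or MVAPICH. It assumes mpi4py is installed and the script is launched with something like `mpirun -np 4 python train_step.py`; the model and the gradient computation are placeholders.

```python
# train_step.py -- minimal synchronous data-parallel step over MPI (a sketch).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

num_params = 1000
rng = np.random.default_rng(rank)            # each rank sees different "data"
params = np.zeros(num_params)                # identical full copy on every rank
lr = 0.01

for step in range(10):
    # Placeholder for the real gradient computed on this rank's data shard.
    local_grad = rng.normal(size=num_params)

    # Allreduce sums the gradients across all ranks (decentralized: no
    # parameter server); dividing by world_size gives the average.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    params -= lr * global_grad / world_size  # same update applied on every rank

if rank == 0:
    print("finished", step + 1, "synchronous steps on", world_size, "ranks")
```

The same loop restructured around a central parameter server, or around exchanging parameters instead of gradients, gives the centralized and parameter-averaging variants listed above.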

Accessing the parameters requires an enormous amount of network bandwidth. Many machine learning algorithms are sequential, and the resulting barriers hurt performance when the cost of synchronization and machine latency is high. Fault tolerance is also critical at scale: learning tasks are often performed in large clusters or in the cloud, where machines can be unreliable or preempted.

For these reasons, as well as the relative ease of providing fault tolerance, data parallelism has been the focus of more research and of more framework implementations.

ASICs (Application-Specific Integrated Circuits)

ASIC-based compute units can offer performance far exceeding that of GPUs. Intel’s Nervana technology, recently announced and slated to be integrated into the upcoming Knights Mill architecture, provides optimizations for single-, half-, and quarter-precision instructions. Nervana and Google’s TPUs are among the emerging technologies in this area that provide excellent performance for targeted applications. Certain elements of a GPU needed for graphics processing are unnecessary in a deep learning application; by leaving them out and re-engineering the memory, the respective manufacturers claim these new processing units can attain a significant speedup over GPUs. They also claim faster speeds and higher bandwidth across nodes, and Knights Mill specifically has addressable memory of up to 400 GB, far more than a GPU.
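Since much of the claimed advantage comes from dropping to lower precision, a tiny NumPy sketch shows the memory side of that trade-off; treating 8-bit integers as a stand-in for “quarter precision” is an assumption made for illustration, and the 100M-parameter model size is arbitrary.

```python
import numpy as np

num_params = 100_000_000   # illustrative 100M-parameter model

for name, dtype in [("single (float32)", np.float32),
                    ("half   (float16)", np.float16),
                    ("quarter (int8)  ", np.int8)]:
    size_gb = num_params * np.dtype(dtype).itemsize / 1e9
    print(f"{name}: {size_gb:.2f} GB of parameter storage")
```

Halving or quartering the bytes per parameter shrinks both the on-chip memory needed to hold a model and the volume of data that has to cross the network for every exchange.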

FPGAs, on the other hand, are configurable; several vendors offer a range of options such as integrated CPU cores, programmable logic, and control and accelerated data flow, specifically programmed for inference and claimed to be faster than a CPU alternative.

Current architectures become obsolete faster as algorithms advance. Depending on cost, this may end up yielding cheaper (and still faster) clusters targeted at lower-precision, deep-learning-optimized workloads, as compared to using GPUs for multipurpose (DNN, HPC, and graphics) applications.

Summary

Research and development around deep learning is rapidly evolving, with respect to algorithms as well as hardware and computing resources. What is apt today may not be as relevant or optimal tomorrow, depending on cost, SLA, and accuracy needs. GPUs are great for DNN computing today, especially with developments around high-speed node interconnects and accelerated driver libraries. However, application-targeted ASICs are emerging as a viable and perhaps more effective alternative.

Future work: Over the next few blogs I will continue with benchmarks based on testing done across data sets, optimization techniques, and a few chosen frameworks.

References

  1. “Deep learning with COTS HPC systems.” http://proceedings.mlr.press/v28/coates13.html. Accessed 26 Aug. 2017.
  2. “On Optimization Methods for Deep Learning — Andrew Ng.” http://www.andrewng.org/portfolio/on-optimization-methods-for-deep-learning/. Accessed 26 Aug. 2017.
  3. “Scaling Distributed Machine Learning with the Parameter Server.” https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf. Accessed 26 Aug. 2017.
  4. “NVIDIA cuDNN | NVIDIA Developer.” https://developer.nvidia.com/cudnn. Accessed 26 Aug. 2017.
  5. “Developer Reference for Intel® Math Kernel Library 2017 — C | Intel ….” 10 May. 2017, https://software.intel.com/en-us/mkl-developer-reference-c. Accessed 26 Aug. 2017.
  6. “MVAPICH :: Home — The Ohio State University.” http://mvapich.cse.ohio-state.edu/. Accessed 26 Aug. 2017.
  7. “Open MPI.” https://www.open-mpi.org/. Accessed 26 Aug. 2017.
  8. “real-logic/aeron — GitHub.” https://github.com/real-logic/aeron. Accessed 26 Aug. 2017.
  9. “Intel + Nervana — Intel Nervana.” 9 Aug. 2016, https://www.intelnervana.com/intel-nervana/. Accessed 26 Aug. 2017.
  10. “How ‘Knights Mill’ Gets Its Deep Learning Flops — HPCwire.” 22 Jun. 2017, https://www.hpcwire.com/2017/06/22/knights-mill-gets-deep-learning-flops/. Accessed 26 Aug. 2017.
  11. “An in-depth look at Google’s first Tensor Processing Unit (TPU ….” 12 May. 2017, https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu. Accessed 26 Aug. 2017.
  12. “Hans Pabst Technical Computing Enabling … — CERN Indico.” 20 Mar. 2017, https://indico.cern.ch/event/595059/contributions/2499304/attachments/1430242/2196659/Intel_and_ML_Talk_HansPabst.pdf. Accessed 26 Aug. 2017.
  13. “Reinders: “AVX-512 May Be a Hidden Gem” in Intel Xeon Scalable ….” 29 Jun. 2017, https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/. Accessed 26 Aug. 2017.
