A guest post by Mr Prabhat (NERSC) and Mike Houston (NVIDIA)
In a breakthrough achievement from 2018, our joint team from NERSC and NVIDIA succeeded in scaling a scientific Deep Learning application to 27,000+ Nvidia V100 Tensor Core GPUs, breaking the ExaFLOP barrier in the process. The accomplishment was awarded the ACM Gordon Bell Prize, the highest recognition in the field of high performance computing. Beyond the technical paper and press releases from 2018, this blog post provides a perspective on broader implications of this accomplishment for the field of AI and open challenges for the future.
DL Software: Performance and Productivity
There’s always been a trade off: low-level tools give developers precision, and high-level ones let them work fast. We’ve found a way around that. For our project, we used high productivity Python and TensorFlow to express the network architecture and the overall application workflow. TensorFlow, in turn, leveraged routines implemented in C and C++ for high performance, providing precision and freeing developers to move fast. As a result, over the course of eight months, our team prototyped a network from scratch, and optimized its performance and scaling on Summit, the world’s largest HPC system.
We believe that this project exemplifies what the long-speculated convergence of HPC and AI software stacks could actually look like: highly performant libraries (CuDNN) and frameworks (TensorFlow) written in C/C++, with a productive interface exposed through Python. Similarly, highly optimized, topology-aware communication collectives were implemented in NCCL and MPI, but exposed through the productive and simple Horovod interface. Going forward, we believe that transparent support for hybrid modes of parallelism (data, model, pipeline) will be critical for enabling scientists, and the broader research community, to explore even more complex architectures.
DL Hardware: GPUs and Mixed Precision
Our project utilized NVIDIA Volta GPUs for training the DeepLabv3+ segmentation network. The peak performance obtainable on a Volta is 125 teraflops in mixed-precision mode. This mode introduced by NVIDIA with Tensor Core GPU architecture performs computations in FP16 and accumulates results in FP32. Prior to this project, there was an open question in the field regarding whether realistic scientific applications could leverage FP16 (without loss of accuracy) and obtain a high fraction of peak performance. Our work conclusively demonstrated that for pattern recognition problems in science, 16-bit accuracy is likely sufficient. And moreover, for a complex application with over 4000 computational kernels, it is possible to achieve very high fraction of peak: our application obtained a peak performance of ~40 teraflops per GPU at massive scale.
We believe that these results open the floodgates for lower-precision accelerators for scientific applications. While the original datasets might be in high precision (64- or 32-bit), pattern recognition tasks can likely be conducted without loss of convergence or stability with low precision.
Achieving exascale level performance on a contemporary HPC system requires careful tuning of all components: hardware (CPUs, GPUs, NVLink, filesystem, network interconnect) and software. While our project was successful in tuning, optimizing and scaling to a large extent, we would like to call out two challenges that could leverage broader input from the industry and research community.
Data Management at Scale
Thanks to progress in GPU architectures on faster compute, we now enjoy 100+ teraflop level performance on a single silicon processor. As GPUs get faster, performance is limited by the ability to feed data to them. In our project, we analyzed a 20 terabyte dataset, which effectively required a level of ~4 terabyte per second sustained I/O rates across the Summit system. The GPFS filesystem on Summit was simply not up to the task; similar experiments conducted on the Lustre filesystem on NERSC’s Cori system have failed spectacularly. In both cases, staging the data on node local NVMe and the Burst Buffer technology turned out to be critical.
Traditional HPC file systems have been predominantly designed to support write-oriented workloads; DL workloads are read intensive, placing heavy demands on both data bandwidth and metadata operation rate. Transparent support for cache tiering, sharding and shuffling are likely going to be pre-requisites for supporting DL workloads at scale.
Convergence at Scale
Time to solution for deep learning comprises of two components: computational scaling efficiency and statistical scaling efficiency. Our work has demonstrated excellent computational scaling, with a number of pointers on system-level considerations. An open question is one of convergence properties of SGD (Stochastic Gradient Descent) in lieu of large batch sizes (27,000+ in our case).
The availability of a high degree of parallelism on HPC resources is a blessing and a curse. Running on a large resource at an unprecedented concurrency requires hyper-parameter tuning within a short time window. Heuristics on the behavior of the convergence algorithm may, or may not carry over from smaller scale runs. While empirical evaluation of the effect of choices of various hyper-parameters is reasonable for now, we anticipate the science community requiring better guiding principles, and convergence guarantees for novel problems.
We believe that further development of new algorithms, such as LARS and LARC, and potentially higher order optimization methods will be critical for tackling convergence issues.
In addition to Mike Houston and Prabhat, the Gordon Bell team comprises Thorsten Kurth (NERSC), Sean Treichler (NVIDIA), Josh Romero (NVIDIA), Massimiliano Fatica (NVIDIA), Nathan Luehr (NVIDIA), Everett Philips (NVIDIA), Ankur Mahesh (UC Berkeley), Mayur Mudigonda (UC Berkeley), Jack Deslippe (NERSC) and Michael Matheson (OLCF). We acknowledge advice and support provided by Paul Tucker, Rajat Monga and Thiru Palanisamy from the Google TensorFlow team.