LIBXSMM Brings Deep-learning “Lessons Learned” to Many HPC Applications

“Once we realized the performance limitation was memory bandwidth, we really had to get down to work”, Alexander Heinecke (Research Scientist, Intel Labs) stated in his presentation about LIBXSMM at the first Middle East meeting of the Intel Extreme Performance Users Group at KAUST (King Abdullah University of Science and Technology) in Saudi Arabia.

Many HPC applications can benefit from LIBXSMM, a library for Intel processors that targets small dense and sparse matrix multiplications as well as small convolutions. LIBXSMM is a cross-organization effort that includes scientists from UCSD and Intel. The library is freely available on GitHub.

Heinecke observes that HPC is moving in new directions, particularly as “Lower precision needs to play an important role in HPC these days as it sees enormous increase due to deep learning”, along with fused simulations, in which operations from multiple tasks run in parallel to exploit the SIMD capability of modern CPUs.

Lower precision needs to play an important role in HPC these days as it sees enormous increase due to deep learning — Alexander Heinecke, Research Scientist at Intel Labs

Deep-learning research is redefining our understanding of numerical precision. Running with lower precision increases the effective data item throughput from memory. For example, using single-precision (32-bit) numbers instead of double-precision (64-bit) numbers means the processor can effectively fetch twice as many floating point values per second.
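The bandwidth arithmetic behind this claim can be sketched in a few lines (a minimal NumPy illustration, not LIBXSMM code):

```python
import numpy as np

# The same one million values stored in double and single precision.
fp64 = np.ones(1_000_000, dtype=np.float64)
fp32 = fp64.astype(np.float32)

bytes_fp64 = fp64.nbytes  # 8 bytes per value
bytes_fp32 = fp32.nbytes  # 4 bytes per value

# At a fixed memory bandwidth, halving the bytes per value doubles
# the number of values the processor can fetch per second.
print(bytes_fp64 // bytes_fp32)  # 2
```

The same reasoning applies to 16-bit formats, which halve the footprint again relative to FP32.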

Running with reduced precision helps to speed the computationally intensive training of deep neural networks (DNNs).

LIBXSMM brings this same performance benefit to HPC applications, specifically to the widely used small and sparse matrix operations. Coupled with custom, architecture-specific JIT (Just-In-Time) compilation, the LIBXSMM library delivers both high performance and scaling across many generations of Intel processors. For example, LIBXSMM can achieve a very high percentage of peak theoretical performance for sparse matrix operations on Intel processors, ranging from 78% on the latest 72-core Intel Xeon Phi 7295 (codename Knights Mill) processor to 70% on a 28-core Intel Xeon Scalable 8180 processor. Older processors benefit from LIBXSMM as well.
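The dispatch-once, call-many pattern behind JIT code generation can be sketched as follows. This is a hypothetical Python stand-in for illustration only; LIBXSMM's real interface is a C/Fortran API that emits architecture-specific machine code for each matrix shape:

```python
import numpy as np
from functools import lru_cache

# Hypothetical stand-in for a JIT dispatcher: a cached closure plays the
# role of the generated kernel, keyed by the exact (m, n, k) shape.
@lru_cache(maxsize=None)
def dispatch_gemm(m, n, k):
    # In a real JIT library this step would emit specialized machine
    # code; here we just return a shape-checked small-GEMM update.
    def kernel(a, b, c):
        assert a.shape == (m, k) and b.shape == (k, n) and c.shape == (m, n)
        c += a @ b  # C := C + A*B, the GEMM update
    return kernel

kernel = dispatch_gemm(4, 4, 4)                      # dispatch once...
a = np.eye(4)
b = np.full((4, 4), 2.0)
c = np.zeros((4, 4))
kernel(a, b, c)                                      # ...call many times
print(c[0, 0])  # 2.0
```

The payoff is that the (possibly expensive) code-generation cost is amortized over the millions of small-matrix calls a typical solver makes with the same shapes.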

Extreme performance HPC examples

These gains, coupled with fused simulations, greatly benefit even high-precision HPC codes.

Heinecke made his performance and scaling claims concrete by reporting 64-bit floating-point performance exceeding 10 PF/s on the NERSC (National Energy Research Scientific Computing Center) Cori-II supercomputer along with scalability results from 32 to 3200 nodes on the Argonne Theta supercomputer.

More specifically, in collaboration with Intel, Alexander Breuer and Yifeng Cui at the San Diego Supercomputer Center (SDSC) at the University of California San Diego have developed a new seismic software package called EDGE (Extreme-Scale Discontinuous Galerkin Environment). The latest simulations use LIBXSMM and are the fastest in the world to date. Scientists can use this simulation to better predict ground motions to save lives and minimize property damage.

Figure 1: LOH.1 benchmark example mesh and material regions (Image courtesy ISC’16)

The UCSD blog reports that the UCSD researchers achieved 10.4 PFLOP/s (10.4 quadrillion calculations per second) on the NERSC Cori-II supercomputer. This broke the previous seismic performance record of 8.6 PFLOP/s set on China’s Tianhe-2 supercomputer.

In collaboration with Intel, Alexander Breuer and Yifeng Cui of SDSC broke the previous seismic performance record of 8.6 PFLOPS conducted on China’s Tianhe-2 supercomputer. — UCSD blog.

The research project is part of a collaboration announced in early 2016, under which Intel opened a computing center at SDSC to focus on seismic research.

The LIBXSMM GitHub wiki shows that LIBXSMM is a general HPC solution. Specifically, LIBXSMM has been used in CP2K (an open-source molecular dynamics package), spectral element codes, the innovative PyFR Python-based computational fluid dynamics (CFD) package, and the Eigen math library for automated-driving workloads[i]. Not surprisingly, the wiki shows that LIBXSMM benefits deep-learning packages as well.

Optimized matrix operations tie deep-learning to HPC via LIBXSMM

Heinecke observes that deep-learning is fast becoming the next “killer app” that exhibits very regular compute patterns that can be performed at lower precision. To benefit deep-learning, hardware vendors are introducing special function units like Intel’s QFMA to speed matrix multiplications. Intel notes that the Quad Fused Multiply Add (QFMA) “doubles the amount of single precision performance”. [ii]

This can benefit HPC applications, Heinecke points out. The most common deep-learning kernels are GEMM (general matrix-matrix multiplication) and convolution, which are well suited to parallelization because they map to long chains of many independent inner products. Even better, convolutions exhibit greater spatial and temporal locality than GEMM, which can benefit sparse operators.
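The mapping from convolution to GEMM can be made concrete with the classic im2col lowering, sketched below in NumPy with small illustrative shapes (not a LIBXSMM kernel):

```python
import numpy as np

# im2col: lower a small 2-D convolution to a matrix product, so every
# output pixel becomes one independent inner product.
def conv2d_as_gemm(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Gather every kh*kw patch into one row of a matrix ("im2col").
    cols = np.stack([image[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    # One matrix-vector product covers all patches at once.
    return (cols @ kernel.ravel()).reshape(oh, ow)

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2))
# Direct evaluation for comparison: each output is the sum of a 2x2 patch.
ref = np.array([[img[i:i + 2, j:j + 2].sum() for j in range(3)]
                for i in range(3)])
assert np.allclose(conv2d_as_gemm(img, k), ref)
```

Each row of `cols` is reused nowhere else, but neighboring rows share most of their image entries, which is the spatial locality the article refers to.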

He states, “We can harvest this for HPC by leveraging a high amount of lower-precision flop/s”. Examples include running preconditioners at very low precision, evaluating whether solvers can run in FP32 instead of FP64, and running fused simulations to exploit GEMM-like kernels, where locality inside the operator can benefit sparse operations.
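One standard way to exploit FP32 flop/s inside an FP64 solver is mixed-precision iterative refinement, sketched below. This is a textbook recipe used purely for illustration, not EDGE's actual solver:

```python
import numpy as np

# Mixed-precision iterative refinement: do the expensive solve in FP32,
# then correct the answer in FP64 until the residual is small.
def solve_mixed(A, b, iters=5):
    A32 = A.astype(np.float32)
    # Initial low-precision solve.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                     # FP64 residual
        dx = np.linalg.solve(A32, r.astype(np.float32))   # cheap FP32 fix
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)  # well-conditioned
b = rng.standard_normal(50)
x = solve_mixed(A, b)
assert np.linalg.norm(b - A @ x) < 1e-10  # FP64-quality answer
```

For well-conditioned systems the refined solution reaches FP64 accuracy, even though the heavy linear-algebra work ran at single precision; ill-conditioned systems are exactly where the convergence concern discussed next arises.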

Lowering precision does translate to faster time to solution.

A big unknown when using reduced precision is the effect on the convergence behavior of a numerical algorithm. The concern is that the simulation may take longer to find a solution (e.g. converge to an acceptable solution) in terms of wall clock time even though the use of lower precision arithmetic types means the code runs faster on the hardware and performs more flop/s.

Heinecke used results from the LOH.1 benchmark in EDGE to show that faster performance does translate to faster runtimes when using lower precision. Stated another way, convergence was not a problem when using reduced precision.

Figure 2: Exemplary illustration of EDGE’s fourth order solution for the ninth receiver and quantity $u$ of the LOH.1 wave propagation benchmark. Plots a) and b) show a comparison to the reference, using double precision arithmetic. Out of the eight fused, identical solutions of the setting, only the first one is shown. Plots c) and d) show a comparison of the almost identical single and double precision results, obtained when using a single forward simulation. Due to the low misfits, shown in d), the FP32 and FP64 solutions are visually indistinguishable in the raw receiver plot c). (Image courtesy Intel)

Extreme sparse matrix performance

EDGE exploits the ability of LIBXSMM to run fused simulations to increase performance. Without getting too technical, the EDGE solver can utilize the results of a number of different seismic sources in its forward solver, as illustrated below. Succinctly, the solver uses LIBXSMM to perform the many small, sparse matrix operations in parallel.

Figure 3: Incorporating multiple seismic sources into the solver using fused simulations (Image courtesy UCSD)

It’s common in the scientific literature to refer to this as parallelizing over multiple right-hand sides of the PDEs, but the EDGE authors prefer the term fused simulation to highlight the advantages of their approach, which are[iii]:

1. The ability to perform full vector operations even when using sparse matrix operations, by fusing multiples of the vector width. Non-fused operations can have up to 50% zero padding, which represents wasted flop/s.

2. Data structures are automatically aligned for vector operations.

3. Read-only data structures can substantially increase arithmetic intensities.

4. Fused simulations are less sensitive to memory latencies and conditional jumps, which reduces the performance penalty of start-up latencies and branch mispredictions. They also reduce sensitivity to network latencies, because the larger MPI messages are exchanged at identical frequencies.
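The first advantage, full-width vector operations over the fused dimension, can be sketched in NumPy (illustrative triplet storage and shapes; EDGE fuses eight simulations):

```python
import numpy as np

# Fused forward simulations: apply the same sparse operator to eight
# right-hand sides at once, so each nonzero entry drives a full-width
# vector operation over the fused dimension.
FUSED = 8
n = 6
# A tiny sparse operator stored as (row, col, value) triplets.
rows = np.array([0, 1, 2, 3, 4, 5, 0, 3])
cols = np.array([1, 2, 3, 4, 5, 0, 5, 1])
vals = np.array([2.0, -1.0, 0.5, 1.5, -2.0, 1.0, 3.0, 0.25])

x = np.arange(n * FUSED, dtype=float).reshape(n, FUSED)  # 8 fused states
y = np.zeros((n, FUSED))
for r, c, v in zip(rows, cols, vals):
    y[r] += v * x[c]   # one nonzero -> one vector op across all 8 runs

# Same result as applying the dense operator to each run separately.
A = np.zeros((n, n))
A[rows, cols] = vals
assert np.allclose(y, A @ x)
```

Because every nonzero touches a contiguous row of eight values, the SIMD lanes stay full regardless of the operator's sparsity pattern, which is exactly what a non-fused sparse kernel cannot guarantee.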

At the KAUST Extreme Computing Workshop, Heinecke reported results from the SDSC effort demonstrating that LIBXSMM can achieve a very high percentage of peak theoretical performance for sparse matrix operations, ranging from 78% on a 72-core Intel Xeon Phi 7295 (codename Knights Mill) processor to 70% on a 28-core Intel Xeon Scalable 8180 processor. This is within a few percent of the performance of a highly optimized dense matrix DGEMM (double-precision GEMM) operation, which highlights the benefits of LIBXSMM's JIT code generation and fused operations for sparse matrix operations.

Fused simulations help distributed-memory computations as well. The following image shows both high floating-point performance and peak efficiency when running the LOH.1 benchmark in EDGE on 16 nodes using various generations of Intel processors, namely Intel Xeon Scalable processors, Intel Xeon Phi (KNL), and Intel Xeon Phi for machine learning (KNM).

Figure 4: Dark grey: non-fused simulation, Light gray: fused simulation (Image courtesy Intel)

Summary

Small matrix multiplication kernels are very common in scientific applications. Having a library that generates code for the specific processor architecture and can achieve a high percentage of peak performance, particularly on sparse matrices, is invaluable. The forward-thinking use of fused simulations and of hardware-optimized, lower-precision deep-learning capabilities indicates that LIBXSMM will benefit HPC applications even more in the future as hardware vendors deploy greater numbers of deep-learning-optimized products. To date, LIBXSMM has been successfully integrated into other scientific codes that employ small matrix multiplications: the Gordon Bell-winning Nek5000/Box family, CP2K (Gordon Bell finalist), and SeisSol (Gordon Bell finalist and SC Best Paper winner), as well as the deep-learning framework TensorFlow and the linear algebra C++ package Eigen.

Rob Farber is a global technology consultant and author with an extensive background in HPC and advanced computational technology that he applies at national labs and commercial organizations. Rob can be reached at info@techenablement.com

[i] https://software.intel.com/sites/default/files/managed/4f/73/parallel-universe-issue-31.pdf

[ii] https://itpeernetwork.intel.com/accelerating-deep-learning-workloads/

[iii] For more information see http://hypocenter.usc.edu/research/SSA/breuera_ssa.pdf