AMD’s Open Source Deep Learning Strategy

Published in

Intuition Machine

8 min readSep 21, 2017

When a company starts using disruptive technology or a disruptive business model, the results can be spectacular and can leave the competition eating dust.

The reason for this is that although the company’s growth seems linear at first, it eventually reveals itself as being exponential. When a company reaches this point, it becomes very difficult, if not impossible, for competitors to catch up.

This article explores AMD’s open source deep learning strategy and explains the benefits of AMD’s ROCm initiative to accelerating deep learning development. It asks if AMD’s competitors need to be concerned with the disruptive nature of what AMD is doing.

AMD’s Open Source Deep Learning Stack

Before we get into the detail of AMD’s deep learning stack, let’s look at the philosophy behind the development tooling. AMD, having a unique position of being both a CPU and GPU vendor, has been promoting the concept of a Heterogeneous System Architecture (HSA) for a number of years. Unlike most development tools from other vendors, AMD’s tooling is designed to support both their x86 based CPU and their GPU. AMD shares the HSA design and implementations in the HSA foundation (founded in 2012), a non-profit organization that has members including other CPU vendors like ARM, Qualcomm and Samsung.

The HSA foundation has an informative graphic that illustrates the HSA stack:

As you can see, the middleware (i.e. HSA Runtime Infrastructure) provides an abstraction layer between the different kinds of compute devices that reside in a single system. One can think of this as a virtual machine that allows the same program to be run on both a CPU and a GPU.

In November 2015, AMD announced the ROCm initiative to support High Performance Computing (HPC) workloads, and to provide an alternative to Nvidia’s CUDA platform. The initiative released an open source 64-bit Linux driver (known as the ROCk Kernel Driver) and an extended (i.e. non-standard) HSA runtime (known as the ROCr Runtime). ROCm also inherits previous HSA innovations such as AQL packets, user-mode queues and context-switching.

ROCm also released a C/C++ compiler called the Heterogeneous Compute Compiler (HCC) targeted to support HPC applications. HCC is based on the open-source LLVM compiler infrastructure project. There are many other open source versions of languages that use LLVM. Some examples are Ada, C#, Delphi, Fortran, Haskell, Java bytecode, Julia, Lua, Objective-C, Python, R, Ruby, Rust, and Swift. This rich ecosystem opens the possibility of alternative languages on the ROCm platform. One promising development of this kind is the Python implementation called NUMBA.

Added to the compiler is an API called HC which provides additional control over synchronization, data movement and memory allocation. HCC supports other parallel programming APIs, but to avoid further confusion, I will not mention them here.

The HCC compiler is based on work at the HSA foundation. This allows CPU and GPU code to be written in the same source file and supports capabilities such as a unified CPU-GPU memory space.

To further narrow the capability gap, the ROCm Initiative created a CUDA porting tool called HIP (let’s ignore what it stands for). HIP provides tooling that scans CUDA source code and converts it into corresponding HIP source code. HIP source code looks similar to CUDA code, but compiled HIP code can support both CUDA and AMD based GPU devices.

AMD took the Caffe framework with 55,000 lines of optimized CUDA code and applied their HIP tooling. 99.6% of the 55,000 lines of code was translated automatically. The remaining code took a week to complete by a single developer. Once ported, the HIP code performed as well as the original CUDA version.

HIP is not 100% compatible with CUDA, but it does provide a migration path for developers to support an alternative GPU platform. This is great for developers who already have a large CUDA code base.

Early this year AMD decided to get even “closer to the metal” by announcing the “Lightning Compiler Initiative.” This HCC compiler now supports the direct generation of the Radeon GPU instruction set (known as GSN ISA) instead of HSAIL.

As we shall see later, directly targeting native GPU instructions is critical to get higher performance. All the libraries under ROCm support GSN ISA.

The diagram depicts the relationships between the ROCm components. The HCC compiler generates both the CPU and GPU code. It uses different LLVM back ends to generate x86 and GCN ISA code from a single C/C++ source. A GSN ISA assembler can also be used as a source for the GCN target.

The CPU and GPU code are linked with the HCC runtime to form the application (compare this with HSA diagram). The application communicates with the ROCr driver that resides in user space in Linux. The ROCr driver uses a low latency mechanism (packet based AQL) to coordinate with the ROCk Kernel Driver.

This raises two key points about what is required for high-performance computation:

1. The ability to perform work at the assembly language level of a device.

2. The availability of highly optimized libraries.

In 2015, Peter Warden wrote, “Why GEMM is at the heart of deep learning” about the importance of optimized matrix libraries. BLAS (Basic Linear Algebra Subprograms) are hand-optimized libraries that trace their origins way back to Fortran code. Warden writes:

The Fortran world of scientific programmers has spent decades optimizing code to perform large matrix to matrix multiplications, and the benefits from the very regular patterns of memory access outweigh the wasteful storage costs.

This kind of attention to every detailed memory access is hard to replicate despite our advances in compiler technology. Warden went even further in 2017 when he wrote, “Why Deep learning Needs Assembler Hackers”:

I spend a large amount of my time worrying about instruction dependencies and all the other hardware details that we were supposed to be able to escape in the 21st century.

Despite being a very recent technology, software that enables deep learning is a complex stack. A common perception is that most deep learning frameworks (i.e. TensorFlow, Torch, Caffe etc) are open source. These frameworks are however built on highly optimized kernels that are often proprietary. Developers can go to great lengths to squeeze every ounce of performance from their hardware.

As an example, Scott Gray of Nervana systems had to reverse engineer Nvidia’s instruction set to create an assembler:

I basically came to the conclusion that it was not possible to fully utilize the hardware I bought with the tools Nvidia provides. Nvidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to.

Gray used assembly language to write their kernels, thus creating algorithms that bested the proprietary alternatives. Now imagine how much less work he would have to do if the assembly language was available and documented. This is what AMD is bringing to the table.

The ROCm initiative provides the handcrafted libraries and assembly language tooling that will allow developers to extract every ounce of performance from AMD hardware.

This is implemented from scratch with a HIP interface. AMD has even provided a tool (i.e. Tensile) that supports the benchmarking of rocBLAS. AMD also provides an FFT library called rocFFT that is also written with HIP interfaces.

Deep learning algorithms continue to evolve at a rapid pace. In the beginning, frameworks exploited the available matrix multiplication libraries. These finely tuned algorithms have been developed over decades. As research continued, newer kinds of algorithms were proposed.

Thus came the need to go beyond generic matrix multiplication. Convolutional networks came along and this resulted in even more innovative algorithms. Today, many of these algorithms are crafted by hand using assembly language. These low-level tweaks can lead to remarkable performance improvements. For some operations (i.e. batch normalization), the performance increases 14 times compared to a non-optimized solution.

AMD released a library called MiOpen that includes handcrafted deep learning motivated optimizations. This library includes Radeon GPU-specific optimizations for operations and will likely include many of those described above. The MiOpencoin release coincided with the release of Caffe. This will allow application code that uses these frameworks to perform competitively on Radeon GPU hardware.

Many other state-of-the-art methods have not yet worked their way into proprietary deep learning libraries. These are proposed almost every day as new papers are published in Arxiv.

It would be very difficult for any vendor to keep up with such a furious pace. In the current situation, given the lack of transparency in development tools, developers are forced to wait, although they would rather be performing the coding and optimizations themselves. Fortunately, the open source ROCm initiative solves the problem.

Deployment

Throughout this article, we’ve discussed the promising aspects of the ROCm software stack. When the rubber meets the road, we need to discuss the kind of hardware that software will run on. There are many different scenarios where it makes sense to deploy deep learning. Contrary to popular belief, not everything needs to reside in the cloud. Self-driving cars or universal translation devices need to operate without connectivity.

Deep learning also has two primary modes of operation — “training” and “inference”. In the training mode, you would like to have the biggest, fastest GPUs on the planet and you want many of them. In inference mode, you still want fast, but the emphasis is on economic power consumption. We don’t want to drive our businesses to the ground by paying for expensive power.

In summary, you want a variety of hardware that operates in different contexts. That’s where AMD is in good position. AMD has recently announced some pretty impressive hardware that’s geared toward deep learning workloads. The product is called Radeon Instinct and it consists of several GPU cards: the MI6, MI8, and MI25. The number roughly corresponds to the number of operations the card can crank out. An MI6 can perform roughly 6 trillion floating-point operations per second (aka teraflops).

There is also promise at the embedded device level. AMD already supports custom CPU-GPU chips for Microsoft’s Xbox and Sony’s PlayStation. An AMD APU (i.e. CPUs with integrated GPUs) can also provide solutions for smaller form factor devices. The beauty of AMD’s strategy is that the same HSA based architecture is available for the developer in the smallest of footprints, as well as in the fastest servers. This breadth of hardware offerings allows deep learning developers a wealth of flexibility in deploying their solutions. Deep learning is progressing at breakneck speed and one can never predict the best way to deploy a solution.

Conclusion

Deep learning is a disruptive technology like the Internet and mobile computing that came before. Open source software has been the dominant platform that has enabled these technologies.

AMD combines these powerful principles with its open source ROCm initiative. On its own, this definitely has the potential of accelerating deep learning development. ROCm provides a comprehensive set of components that address the high performance computing needs, such as providing tools that are closer to the metal. These include hand-tuned libraries and support for assembly language tooling.

Future deep learning software will demand even greater optimizations that span many kinds of computing cores. In my view, AMD’s strategic vision of investing heavily in heterogeneous system architectures gives their platform a distinct edge.

AMD open source strategy is uniquely positioned to disrupt and take the lead in future deep learning developments.

Note: This is a excerpt of an article originally posted at: Radeon Instinct as “The Potential Disruptiveness of AMD’s Open Source Deep Learning Strategy”.

AMD’s Open Source Deep Learning Strategy

Written by Carlos E. Perez