Glow Compiler Optimizes Neural Networks for Low-Power NXP MCUs

Published in PyTorch · 7 min read · Sep 2, 2020

Authored by Markus Levy, Director of AI and Machine Learning Technologies at NXP Semiconductors

This may be a first for you: Python and low-cost microcontrollers discussed in the same blog. Intrigued? Continue reading to learn about the relationship between them.

The world of machine learning, and more specifically deep learning, is a rapidly growing field, especially as deep learning moves to the edge. In my microcosm at NXP, I see the deep learning customer base increasing dramatically as more and more engineers build applications that include some form of vision- or voice-based machine learning technology. The number of deep learning frameworks, tools, and other capabilities available to help developers build and deploy neural network models also keeps expanding.

One example of such a tool is the Glow neural network (NN) model compiler. Aligned with the proliferation of deep learning frameworks such as PyTorch, NN compilers provide optimizations to accelerate inference performance on a range of hardware platforms. In May 2018, Facebook introduced Glow (the Graph Lowering compiler) as an open source community project and it has evolved significantly over the last two years thanks to the efforts of more than 130 worldwide contributors.

Recently, we rolled out our official support for the Glow compiler, and we're very excited about the performance and memory benefits it's delivering for our devices. We have tightly integrated Glow into our MCUXpresso SDK, which packages the Glow compiler and quantization tools into an easy-to-use installer, along with detailed documentation and labs to get up and running quickly with your own models.

For clarification as you read the remainder of this article, here are the definitions I'm working from:

  • PyTorch = an open-source machine learning framework for Python programs that facilitates building deep learning projects and products.
  • Glow Compiler = a backend for high-level machine learning frameworks, designed to allow compiler optimizations and code generation of neural network graphs.

Glow's Flexible Functionality

As an NN compiler, Glow takes in a computation graph and generates highly optimized machine code over two phases. In the first phase, it optimizes the operators and layers of the model using standard compiler techniques such as kernel fusion, lowering of complex operations to simple kernels, and transpose elimination. In the second, or backend, phase of the model compilation, the Glow compiler uses LLVM modules to enable target-specific optimizations.

Depending on the environment, Glow supports two compilation modes: 1) just-in-time (JIT) compilation, where compilation is performed at runtime just before the model is executed; and 2) ahead-of-time (AOT) compilation, where compilation is performed offline to generate an object file (called a Glow bundle) that is later linked with the user's application code. Facebook and other data center providers use the JIT mode for flexible processing of complex person- or face-recognition models on hardware accelerators deployed in the cloud. In the AOT mode, because an object file is generated, all unnecessary overhead is eliminated, reducing both the number of computations and the memory overhead. That makes the AOT mode ideal for deploying on memory-constrained, low-cost microcontrollers.
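To make the AOT flow concrete, here is a minimal sketch of the front half of that pipeline, using the PyTorch-to-ONNX route described later in this article. TinyNet and all file names are illustrative, and the model-compiler invocation shown in the comment follows Glow's documented bundle workflow:

```python
import torch
import torch.nn as nn

# A toy stand-in for a real model; any traceable PyTorch module works.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))

model = TinyNet().eval()
dummy_input = torch.randn(1, 3, 32, 32)  # ONNX export traces a fixed input shape

# Export to ONNX, one of the model formats Glow can ingest.
torch.onnx.export(model, dummy_input, "tinynet.onnx")

# The offline (AOT) step then uses Glow's model-compiler tool to emit the
# bundle (an object file plus weights) that gets linked into the application:
#
#   model-compiler -backend=CPU -model=tinynet.onnx -emit-bundle=bundle_dir
```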

Target-Specific Optimizations

Before diving into the implementation of Glow on MCUs, I'll provide a little background on the initial devices we targeted from the i.MX RT series of crossover MCUs. We use the label 'crossover' because, while these devices possess the functionality of an MCU, they offer the performance of an MPU, containing Arm® Cortex®-M cores running from 300–1000 MHz. While any device in our i.MX RT series will run a Glow-compiled model, we started our testing on the i.MX RT1060 because we also have TensorFlow running on this device, which allowed a direct performance comparison. We also started with the i.MX RT685 because it is a new device and the only one in our i.MX RT series with a DSP optimized for processing neural network operators. The i.MX RT1060 contains a 600 MHz Cortex-M7 and 1 MB of SRAM, as well as features ideal for real-time applications such as high-speed GPIO, CAN-FD, and a synchronous parallel NAND/NOR/PSRAM controller. The i.MX RT685 contains a 600 MHz Cadence® Tensilica® HiFi 4 DSP core paired with a 300 MHz Cortex-M33 core and 4.5 MB of on-chip SRAM, along with a variety of security-related features.

The standard version of Glow from GitHub is device agnostic; it can compile for the basic architectures of interest. For example, to cross-compile a bundle for the Arm Cortex-M7 core, use the command-line options -target=arm -mcpu=cortex-m7. Because Glow has an LLVM backend, it can cross-compile bundles for a variety of target architectures. NXP has taken advantage of this backend support by using Arm CMSIS-NN to leverage the full capability of the Cortex-M7 as well as the memory subsystem of the i.MX RT1060 device. CMSIS-NN is an Arm-developed library supporting the Arm Cortex-M0, -M3, -M4, -M7, and -M33 cores, and it implements standard NN operations such as convolution, fully connected, pooling, and activation. Simply use the compilation flag -use-cmsis when building quantized bundles, and you will see performance increase significantly over the standard compilation. For example, as measured by NXP on a CIFAR-10 model, performance increases by almost 2x when using the CMSIS-NN library to accelerate NN operations.
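Putting those flags together, a hedged sketch of the cross-compilation step (driven from Python, for consistency with the example above) might look like the following. The file and directory names are illustrative, the quantization profiling step is omitted for brevity, and -target, -mcpu, and -use-cmsis are the flags named above:

```python
import subprocess

# Cross-compile a (quantized) model into a Glow bundle for the Cortex-M7,
# routing supported operators through the CMSIS-NN library.
# Assumes model-compiler from a Glow build is on the PATH.
cmd = [
    "model-compiler",
    "-model=tinynet.onnx",      # illustrative model file
    "-backend=CPU",
    "-emit-bundle=bundle_m7",   # output directory for the bundle
    "-target=arm",
    "-mcpu=cortex-m7",          # cross-compile for the Cortex-M7 core
    "-use-cmsis",               # accelerate quantized ops with CMSIS-NN
]
subprocess.run(cmd, check=True)
```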

The RT685's HiFi 4 DSP core was originally designed to accelerate voice processing, but it can also accelerate a wide range of NN operators when used with Cadence's NN library (NNLib) as another LLVM backend for Glow. NNLib is similar to CMSIS-NN, except that it provides a much more comprehensive set of hand-tuned operators optimized for the HiFi 4 DSP. For the same CIFAR-10 example, this DSP delivers a 25x performance increase compared to the standard Glow implementation.

What happens when compiling a model that contains operators not supported by either CMSIS-NN or NNLib? Here, I draw an analogy to using intrinsic functions with a standard compiler: when building the machine code, the compiler uses an intrinsic where one is available for the specific function; otherwise, it generates code for the native architecture. Glow behaves the same way, falling back to its standard code generation for any operator the accelerated libraries don't cover.

PyTorch for Embedded Systems

Until recently, ONNX and Caffe2 were the only input model formats supported by Glow. PyTorch can directly export models into the ONNX format for use by Glow. Alternatively, since many well-known models were created in other formats (e.g., TensorFlow™), there are also open source tools to convert them to ONNX. The most widely used are MMdnn, a set of tools supported by Microsoft® to help users inter-operate among different deep learning frameworks, and tf2onnx, which converts TensorFlow models to ONNX. Furthermore, NXP has upstreamed to the Glow community support for importing TensorFlow Lite models directly into Glow. More recently, Glow can be accessed directly through PyTorch, allowing users to build and compile their models in the same development environment, thereby eliminating steps and simplifying the compilation process.
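As an example of the conversion path, tf2onnx can be driven from its module entry point. In this sketch, the paths are placeholders and the input is assumed to be a TensorFlow SavedModel:

```python
import subprocess
import sys

# Convert a TensorFlow SavedModel to ONNX using the tf2onnx package
# (pip install tf2onnx). "tf_model_dir" and "model.onnx" are placeholders.
subprocess.run(
    [sys.executable, "-m", "tf2onnx.convert",
     "--saved-model", "tf_model_dir",
     "--output", "model.onnx"],
    check=True,
)
# The resulting model.onnx can then be fed to Glow's model-compiler,
# as shown earlier.
```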

However, because of its broad use in data centers by companies such as Facebook, people have questioned PyTorch's ability to serve as a framework for embedded MCUs. With Glow becoming directly accessible from PyTorch, is there reason to be concerned that 'PyTorch, and hence Glow, is not targeted at MCUs'? The short answer is 'no', especially given the AOT implementation of Glow.

To explain this further: it would be a fair impression that PyTorch itself isn't targeted at MCUs, for several reasons. First, it's a community project, and no one has yet stepped up to develop and maintain this approach. For obvious reasons, Facebook does not use PyTorch on MCUs, but the embedded community is welcome to contribute end-to-end support for MCUs and embedded platforms in general. PyTorch Mobile includes an Arm-based runtime/interpreter with support for iOS and Android™. Essentially the same could be done for a standard embedded Linux® distro, taking advantage of PyTorch's unified Python and C++ front-ends with multiple back-ends; this would allow support of the JIT compilation approach. I suspect it is only a matter of time, given the growing attraction to PyTorch, especially among academic and research users. According to statistics [1], PyTorch's dominance is strongest at vision and language conferences (outnumbering TensorFlow by 2:1 and 3:1, respectively), and PyTorch has also become more popular than TensorFlow at general machine learning conferences like ICLR and ICML. Eventually some of these researchers will migrate into the industrial space and adapt PyTorch for the edge computing environment.

To specifically address the question of PyTorch as a good choice for MCUs: since it can generate ONNX models, which Glow can compile, restrictions on the processing platform are minimal. And with Glow as an extension of PyTorch, it will be even easier to generate bundles; the user can generate them directly from the Python script, without having to first generate ONNX models.

As mentioned at the outset, our official Glow support is tightly integrated into the MCUXpresso SDK, with several project examples, and the Glow compiler and quantization tools ship in an easy-to-use installer with detailed documentation and labs to get up and running quickly with your own models. With such significant performance and memory benefits, this compiler will be a great boon for embedded system developers deploying machine learning with the NXP i.MX RT family.

About the author:

Markus Levy joined NXP in 2017 as the Director of AI and Machine Learning Technologies. In this position, he is focused primarily on the technical strategy, roadmap, and marketing of AI and machine learning capabilities for NXP's microcontroller and i.MX applications processor product lines. Previously, Markus was chairman of the board of EEMBC, which he founded and led as president beginning in April 1997. He was also president of the Multicore Association, which he co-founded in 2005. Earlier, he was a senior analyst at Microprocessor Report and an editor at EDN magazine. Markus began his career at Intel Corporation as both a senior applications engineer and a customer training specialist for Intel's microprocessor and flash memory products. He also volunteered for thirteen years as a first responder, fighting fires and saving lives.
