It has become a consensus that the companies that enable real intelligence on edge devices, such as mobile and IoT devices, will define the future of computing. Racing toward this goal, many companies, whether giant technology firms such as Google, Microsoft, Amazon, Apple and Facebook, or startups, have been spending tens of billions of dollars each year on R&D. Assuming that hardware is the major constraint for enabling real mobile intelligence, the industry has dedicated most of its efforts to developing specialized hardware accelerators for machine learning inference. Billions of dollars have been spent to fuel this intelligent hardware race.
In this article, we respectfully disagree with this approach and strongly believe that software is still eating the world, even in the AI era. Our central thesis is that the potential of software optimization for deep learning applications is still far from fully exploited. Once software optimization is done right, we can immediately enable real-time deep learning on billions of existing mobile devices, unlocking a trillion-dollar market.
In the rest of this article, we review the landscape of AI hardware and different approaches to software optimization, and delve into what we consider the most promising approach: Compression-Compilation Co-Design. We conclude that software is still eating the world, even in the AI era, and that enabling real-time AI applications on billions of existing mobile devices and trillions of up-and-coming IoT devices through software-only compression-compilation co-design is the most tangible and feasible approach.
The Landscape of AI Hardware
Assuming that hardware is the major constraint for enabling real-time mobile intelligence, the industry has dedicated most of its efforts to developing specialized hardware accelerators for machine learning inference.
The arrival of startup silicon on the AI computing market follows several years of intense competition between Intel and rivals including NVIDIA, AMD, and several players advancing ARM technology. Today there are more than 100 AI chip startups in the US, Europe and Asia, from companies reinventing programmable logic and multi-core designs, to those developing entirely new architectures, to those using futuristic technologies such as neuromorphic architectures. Tens of billions of dollars of venture funding has been poured into this market to support these startups. In addition, fueling the competition among the major chip makers, we have seen the extremely costly acquisitions of Mobileye, Movidius, and Altera by Intel, the acquisition of DeePhi by Xilinx, the development of TPUs by Google, and significant investments in autonomous-driving processors by NVIDIA, Tesla, and others. Despite the tremendous investment poured into this market, the output has been disappointing thus far: we have yet to see any large-scale deployment of edge AI accelerators. This led us to ponder: is hardware acceleration the right approach, or is software still eating the world in the AI era?
After a careful study, we advocate that with effective compression-compilation co-design, it is feasible to enable real-time artificial intelligence (AI) on existing edge devices without special hardware accelerators. The principle of compression-compilation co-design is to design the compression of deep learning models and their compilation to executables in a hand-in-hand manner. This synergistic method can effectively optimize both the size and speed of deep learning models, and can also dramatically shorten the tuning time of the compression process, largely reducing the time to market of AI products.
When applied to models running on mainstream edge devices, the method can produce a real-time experience across a set of AI applications that had been broadly perceived as possible only with special AI accelerators. Forgoing the need for special hardware for real-time AI has profound implications, thanks to the multi-fold advantages of mainstream processors over special hardware:
• Time to market: special hardware often takes multiple years to reach the market. The creation of the associated compiler and system software for newly developed hardware accelerators further lengthens the process. Applications using such hardware often need to use special APIs and meet many special constraints (e.g., tiling computations to a certain size), which lengthens the time to market of AI products.
• Cost: developing a special ASIC processor is extremely costly, and adding it to existing systems incurs extra expense.
• Technology maturity: Unlike general-purpose processors, special hardware has a much smaller production volume; the technology available for its production is hence usually several generations behind that of general-purpose processors. Most AI accelerators, for instance, are based on 28 to 65nm CMOS technology, with a transistor density over 10× lower than state-of-the-art mobile CPUs or GPUs.
• Speed: As a consequence of the older process technology, special processors run much slower than general-purpose processors do.
• Eco-system: General-purpose processors have a well-developed eco-system (debugging tools, optimization tools, security measures), which makes the development of high-quality applications much easier than on special processors.
• Adoption: For all the above reasons, the adoption of a special processor is usually limited to the company that creates it and a few of its close customers. As a result, an AI application developed for such a processor can be adopted by only a limited number of devices.
The Software-Only Compression-Compilation Co-Design Approach
In this section, we present the details of the software-only Compression-Compilation Co-Design approach, which we believe will completely change the landscape of edge AI computing. Compression and compilation are the two key steps in fitting a deep learning model onto hardware for efficient execution. Model compression is a common technique for reducing the size and improving the speed of deep learning models. Compression techniques fall into two major categories: pruning and quantization. Pruning removes layers, convolution filters, or channels, while quantization reduces the precision of parameters (e.g., from floating point to short integer). Compilation refers to the process of generating executable code from a given deep learning model. In essence, compilation is a process of mapping the high-level operations in deep learning to the low-level instructions that the underlying hardware supports. The compilation process plays a critical role in optimizing the code for efficient execution.
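To make the two compression categories concrete, here is a minimal NumPy sketch, purely illustrative rather than our framework's actual algorithm, that applies magnitude-based pruning and then per-layer int8 quantization to a toy weight matrix:

```python
import numpy as np

# Hypothetical 2-D weight matrix standing in for one DNN layer.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8)).astype(np.float32)

# Pruning: zero out the weights with the smallest magnitudes
# (here, the bottom 75%), producing a sparse layer.
threshold = np.quantile(np.abs(weights), 0.75)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

# Quantization: map the remaining float32 weights to int8
# with a single per-layer scale factor.
scale = np.abs(pruned).max() / 127.0
quantized = np.round(pruned / scale).astype(np.int8)

# At inference time the weights are dequantized on the fly.
dequantized = quantized.astype(np.float32) * scale

print(f"sparsity: {np.mean(pruned == 0):.0%}")
print(f"max abs quantization error: {np.abs(pruned - dequantized).max():.4f}")
```

Real compression pipelines retrain the model after such steps to recover accuracy; this sketch shows only the size/precision reduction itself.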
The principle of compression-compilation co-design is to design the two components for AI hand in hand, and the synergy manifests itself at three levels.
• Demands/Preferences Level: at this level, the synergy lies in taking the preferences or demands of one component into consideration when designing the other. An example is that mainstream processors typically prefer code with certain computation patterns; if the model compression step takes that preference into account, it can create a scenario far more amenable to effective compilation.
• Perspective/Insight Level: at this level, the synergy lies in applying perspectives or insights from the domain of one component to problems in the domain of the other. An example is the principle of composability or modularity, which has played an essential role in keeping programming systems and compilers efficient and scalable.
• Methodology Level: At this level, the synergy lies in closely integrating the methodologies of the two components. For instance, through a compiler framework that automatically generates code to enable a new way of deep learning pruning, we can achieve speedups of up to 180×.
Specifically, we provide the Compression-Compilation Co-Design architecture in the above figure, which consists of the following components:
The pattern-based training stage performs effective kernel pattern and connectivity pruning in the training phase, in order to achieve the highest pruning (acceleration) rate without accuracy loss. First, we design a set of candidate patterns for each kernel to select from. Then we perform pattern pruning based on the designed pattern set, along with connectivity pruning, using an extended ADMM-based method.
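To illustrate just the pattern-assignment step, the sketch below uses a hypothetical 4-entry pattern set and assigns each 3×3 kernel the pattern that preserves the most weight magnitude. The actual framework designs its pattern set carefully and learns weights under these constraints via ADMM-based training, which this sketch omits:

```python
import numpy as np

# A hypothetical pattern set: each pattern keeps 4 of the 9 positions
# in a 3x3 kernel (1 = keep, 0 = prune).
PATTERNS = np.array([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
], dtype=np.float32)

def prune_kernel(kernel):
    """Keep the pattern that preserves the most weight magnitude."""
    scores = [(np.abs(kernel) * p).sum() for p in PATTERNS]
    best = int(np.argmax(scores))
    return kernel * PATTERNS[best], best

rng = np.random.default_rng(1)
kernel = rng.normal(size=(3, 3)).astype(np.float32)
pruned, pattern_id = prune_kernel(kernel)
print(pattern_id, int((pruned != 0).sum()))  # at most 4 non-zeros survive
```

Because every kernel ends up matching one of a handful of patterns, the compiler can later exploit this regularity, as described below.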
Fine-grained DNN layer-wise representation (LR) provides a high-level representation that enables our general optimizations on DNN models from various sources. In particular, the LR includes pattern and tuning related information. The compiler optimizations rely on a series of improvements to the LR to generate the compact model and the optimized execution code.
Filter kernel reorder addresses two challenges of pattern-based pruning (heavy control-flow instructions, and thread divergence with load imbalance) by grouping filters with similar lengths and patterns together. Because of the relatively limited number of patterns, kernels with similar patterns can be organized together through proper filter kernel reordering, thereby significantly reducing the control-flow instructions and improving instruction-level parallelism. Moreover, when different threads process different filters, thread divergence and load imbalance issues are properly resolved because the kernels in each filter have similar computation workloads, thereby enhancing thread-level parallelism.
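The grouping idea can be sketched in a few lines, assuming each kernel has already been assigned a pattern id by the pruning step (the indices below are made up for illustration):

```python
# Hypothetical list of (kernel_index, pattern_id) pairs produced by
# pattern pruning; kernels with the same pattern take the same code path.
kernels = [(0, 2), (1, 0), (2, 2), (3, 1), (4, 0), (5, 2)]

# Reorder so kernels sharing a pattern are contiguous: one branch per
# group instead of one branch per kernel, and threads assigned within a
# group get near-identical workloads.
reordered = sorted(kernels, key=lambda k: k[1])
groups = {}
for idx, pid in reordered:
    groups.setdefault(pid, []).append(idx)

print(groups)  # {0: [1, 4], 1: [3], 2: [0, 2, 5]}
```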
Compressed weight storage is specifically designed for our kernel pattern and connectivity pruning. Together with filter kernel reorder, this compact data structure yields much better compression rates than the conventional CSR (compressed sparse row) format.
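The following toy comparison (with a hypothetical kernel and pattern) illustrates why pattern-based storage beats CSR: since every kernel's non-zero positions come from a small shared pattern set, each kernel needs only a pattern index plus its values, whereas CSR must store explicit per-entry indices:

```python
# Kept (row, col) positions of one hypothetical 4-entry pattern.
PATTERN = [(0, 1), (1, 0), (1, 1), (1, 2)]
kernel = [[0.0, 0.3, 0.0],
          [0.7, -1.2, 0.5],
          [0.0, 0.0, 0.0]]

# CSR-style storage: values + column indices + row pointers.
values, col_idx, row_ptr = [], [], [0]
for row in kernel:
    for c, v in enumerate(row):
        if v != 0.0:
            values.append(v)
            col_idx.append(c)
    row_ptr.append(len(values))

# Pattern-based storage: one pattern id + the values in pattern order.
pattern_id = 0
packed = [kernel[r][c] for r, c in PATTERN]

print(len(values) + len(col_idx) + len(row_ptr))  # CSR: 4 + 4 + 4 = 12 entries
print(1 + len(packed))                            # pattern-based: 1 + 4 = 5 entries
```

The saving grows with model size, since the per-kernel index overhead of CSR is replaced by a single shared pattern table.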
Load redundancy elimination addresses the poor memory performance challenge of pattern-based pruning by exploring two novel register-level load redundancy opportunities during the kernel execution code generation. It is crucial, especially when the data movements between memory and cache have already been optimized with advanced tiling techniques.
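A simplified 1-D convolution conveys the register-level idea: adjacent output positions share most of their input loads, so generated code can keep those values in registers (modeled here as local variables) instead of reloading them from memory:

```python
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
taps = [0.5, 1.0, -0.5]   # a toy 3-tap convolution kernel

loads = 0
def load(i):
    """Model a memory load so we can count them."""
    global loads
    loads += 1
    return inputs[i]

# Naive code generation: 3 loads per output position.
naive = []
for i in range(len(inputs) - 2):
    naive.append(taps[0]*load(i) + taps[1]*load(i+1) + taps[2]*load(i+2))
naive_loads = loads

# With load redundancy elimination: rotate the window through
# "registers", so each new output needs only 1 new load.
loads = 0
x0, x1 = load(0), load(1)
optimized = []
for i in range(len(inputs) - 2):
    x2 = load(i + 2)
    optimized.append(taps[0]*x0 + taps[1]*x1 + taps[2]*x2)
    x0, x1 = x1, x2
opt_loads = loads

print(naive_loads, opt_loads)  # 12 vs 6 loads for identical results
```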
Parameter auto-tuning tests different configurations of the key performance parameters, including strategies for placing data in various GPU memories, different tiling sizes, and loop permutations for each DNN layer on each processing unit.
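A minimal sketch of such an auto-tuning loop, using the tile size of a toy matrix-vector kernel as the only tunable parameter (real tuning explores a much larger space, including memory placement and loop permutations):

```python
import time

N = 512
A = [[(i * j) % 7 * 0.1 for j in range(N)] for i in range(N)]
x = [1.0] * N

def tiled_matvec(tile):
    """Matrix-vector product with loop tiling over the columns."""
    y = [0.0] * N
    for jj in range(0, N, tile):
        for i in range(N):
            s = 0.0
            row = A[i]
            for j in range(jj, min(jj + tile, N)):
                s += row[j] * x[j]
            y[i] += s
    return y

# Auto-tuning: time each candidate configuration, keep the fastest.
best = None
for tile in (32, 64, 128, 256):
    start = time.perf_counter()
    tiled_matvec(tile)
    elapsed = time.perf_counter() - start
    if best is None or elapsed < best[1]:
        best = (tile, elapsed)

print(f"best tile size: {best[0]}")
```

The winning configuration depends on the cache hierarchy of the target device, which is exactly why tuning must be performed per layer and per processing unit.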
In summary, by allowing compilers to treat pruned kernels as special patterns, the Compression-Compilation Co-Design approach not only achieves a high pruning rate with high accuracy, but also effectively converts the patterns into performance improvements, thanks to their hardware-friendly properties.
Performance: Software vs. Hardware
To verify our central thesis that software is still eating the world, the key question we need to answer is whether, on off-the-shelf devices, the Compression-Compilation Co-Design approach outperforms specialized hardware accelerators. We deployed the Compression-Compilation Co-Design framework on a Samsung Galaxy S10 smartphone, and compared its performance against hardware accelerators implemented on ASIC and FPGA.
The results are summarized in the figure above: first, comparisons of performance and energy efficiency against special ASIC hardware, including Google’s cloud TPU-V2 and edge TPU, NVIDIA Jetson AGX Xavier, Cambricon MLU-100, and Eyeriss; and second, comparisons of accuracy and energy efficiency against the FPGA solution ESE from DeePhi. These are fair comparisons on the same network models, and weight quantization is not adopted in our solution.
We can clearly observe that our solution on an off-the-shelf mobile device consistently outperforms representative ASIC/FPGA solutions in terms of energy efficiency. This unusual result is attributed to three reasons:
(i) The smartphone itself has ultra-high energy efficiency. Smartphone computing chips are built using the most advanced technology (e.g., 7nm and 11nm processes) and are the key driving force of technological advancement, while FPGA/ASIC solutions are based on 28nm or 40nm technologies, which are inherently less energy-efficient. Moreover, ARM (for mobile CPUs) and Qualcomm (for mobile GPUs) are especially proficient in high-efficiency circuit/system design.
(ii) While prior mobile compiler frameworks have limited support for different neural networks (e.g., not supporting RNNs or large-scale DNNs), our compiler supports all of the major types of neural networks, thereby unleashing the full potential of mobile devices.
(iii) Our approach achieves consistently high performance on different DNN benchmarks thanks to the high flexibility of a software-based solution. In contrast, current ASIC/FPGA solutions are optimized for a specific DNN type or size, and thereby lack generality. Specifically, the edge TPU is optimized for small-scale DNNs, while the Cambricon MLU-100 is optimized for large-scale ones.
Performance: Compression-Compilation Co-Design vs. Other Software Approaches
The next question is whether our approach also outperforms existing software optimization techniques on the same hardware. In other words, is the Compression-Compilation Co-Design approach superior to other software optimization techniques?
We evaluated our approach on a Samsung Galaxy S10 cell phone with the latest Qualcomm Snapdragon 855 mobile platform, which consists of a Qualcomm Kryo 485 octa-core CPU and a Qualcomm Adreno 640 GPU. The following figure shows the CPU and GPU performance of our approach compared to TFLite, TVM, and MNN on six representative DNNs: VGG-16 (VGG), ResNet-50 (RNT), and MobileNet-V2 (MBNT), each trained on two datasets, ImageNet and CIFAR-10. The results verify that our approach (CoCo-Gen) outperforms all other frameworks in all cases. On the CPU, our approach achieves 12× to 44.5× speedup over TFLite, 2.3× to 8.1× over TVM, and 1.9× to 15.5× over MNN, respectively. On the GPU, our approach achieves 2.5× to 20×, 4.1× to 11.4×, and 2.5× to 6.2× speedup over TFLite, TVM, and MNN, respectively. For the largest DNN (VGG) and largest dataset (ImageNet), our approach completes the CONV layers of a single input within 18.9 ms on the GPU, meeting the real-time requirement (usually 30 frames/sec, i.e., 33 ms/frame).
Applications Made Possible
The last, and probably the most important question, is what applications can we immediately enable on existing off-the-shelf mobile devices? This question is directly linked to the commercial potential of the software-only Compression-Compilation Co-Design approach.
To answer this question, we enabled three interesting DNN applications: style transfer, DNN coloring, and super resolution. The style transfer model is based on a generative network trained on the Microsoft COCO dataset, and performs real-time styling on a live video stream. DNN coloring uses the Places scene dataset to train a novel architecture that can jointly extract and fuse global and local features to perform colorization on a live black-and-white video stream. The super resolution model mainly utilizes residual blocks with wider activation and linear low-rank convolutions, trained on the DIV2K dataset, and can be used to upscale a low-resolution video stream into a high-resolution one.
As shown in the figure above, with structured pruning and compiler optimization, we implemented these applications on a Samsung Galaxy S10 mobile phone. Our approach accelerates inference with speedups of 4.2×, 3.6×, and 3.7× for style transfer, coloring, and super resolution, respectively. These results demonstrate that the software-only Compression-Compilation Co-Design approach generates satisfying output at high speed on mobile devices. More specifically, all inference completes within 75 ms, showing the possibility of achieving real-time execution of complex DNN applications on existing off-the-shelf mobile devices without special hardware. Full video demos of these results can be accessed here:
Software is Still Eating the World, Even in the AI Era
Our central thesis is that software is still eating the world, even in the AI era. In this article, we hope to have convinced you that it is feasible and tangible to instill AI directly on existing commodity computing devices, while offering even higher speeds and better energy efficiency than special AI hardware accelerators. This opens new opportunities for democratizing AI capability on edge devices, while invalidating the common perception that special AI hardware is indispensable for real-time AI on end devices.
We believe that these results will prompt the industry to reexamine its directions and strategies in the pursuit of mobile AI. The promising progress opens up many potential directions for future development; we list two of them here. The first is to expand the scope of the co-design based optimizations. Thus far, the principle of Compression-Compilation Co-Design has been focused on DNN models. Besides the DNN, a real-world AI application often includes many other components, such as data collection, data preprocessing, the use of the DNN's predictions in follow-up operations, and so on. Even though the DNN may play an important role in the overall application, its optimization may not be sufficient for the entire application to meet users' needs. So an important direction is how to generalize the Compression-Compilation Co-Design principle into holistic optimizations of entire AI-based applications. The second is to increase the applicability of the co-design based optimizations. This direction relates to privacy and security. As these are two important factors in many AI model constructions and deployments, how to integrate them into the Compression-Compilation Co-Design process is worth pursuing. For instance, model pruning typically requires access to both the models and the training datasets, but there are scenarios where the datasets may not be accessible to the model optimizer due to privacy policies or artificial boundaries among corporations. Effective ways to circumvent these roadblocks could expand the applicability of the optimizations.
The software-only Compression-Compilation Co-Design approach can immediately enable real-time deep learning on billions of existing mobile devices and trillions of up-and-coming IoT devices, thus generating tremendous commercial value. To name just a few possibilities: this approach may enable great user experiences for streaming applications such as Netflix, YouTube, TikTok, and Snap, even under low-bandwidth conditions, since these applications can stream low-resolution videos to user devices and upscale them to high definition in real time. Similarly, video communication applications such as Zoom, Skype, and WebEx can utilize the Compression-Compilation Co-Design approach to deliver the best quality of service. In addition, the approach unlocks real-time deep learning applications that have never been possible before, such as enabling a mobile phone camera to show live video in an artistic style.
For more information, please contact us at email@example.com
This section provides more details for interested readers to understand how the Compression-Compilation Co-Design approach works. With this approach, one can easily support all main kinds of DNNs, from CNNs to RNNs, transformers, language models, and so on. Moreover, the approach provides the fastest DNN pruning and acceleration framework, up to 180× faster than DNN pruning on other frameworks such as TensorFlow Lite. As a result, the Compression-Compilation Co-Design approach enables AI applications to run in real time on off-the-shelf mobile devices, something previously regarded as possible only with special hardware support.
If you want to dig into more of the technical details, a thorough technical overview of the Compression-Compilation Co-Design approach can be found here
CocoPIE: Making Mobile AI Sweet As PIE --Compression-Compilation Co-Design Goes a Long Way
A demo of utilizing the Compression-Compilation Co-Design approach to achieve real-time video resolution upscaling on existing mobile devices can be found here: https://www.youtube.com/watch?v=UqaRtG5EVR4
For details on pattern-based pruning, the algorithm-level optimizations, and the results, please refer to the following research paper:
· [AAAI’2020] Xiaolong Ma, Fu-Ming Guo, Wei Niu, Xue Lin, Jian Tang, Kaisheng Ma, Bin Ren, and Yanzhi Wang, PCONV: The Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Device, The 34th AAAI Conference on Artificial Intelligence, February, 2020.
For details on the pattern-based compiler code-generation and optimization framework, the integration of algorithm-level and system-level optimizations, and the results, please refer to the following research paper:
· [ASPLOS’2020] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin Ren, PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning, The 25th International Conference on Architectural Support for Programming Languages and Operating Systems, March, 2020.
For details on the composability-based compiler framework that enables fast pruning of DNNs, and the results, please refer to the following research paper:
· [PLDI’2019] Hui Guan, Xipeng Shen, and Seung-Hwan Lim, Wootz: A Compiler-Based Framework for Fast CNN Pruning via Composability, ACM SIGPLAN Conference on Programming Language Design and Implementation, Phoenix, AZ, USA, June, 2019.