NPU Performance On Smartphones Explained

Craig Adebanji
7 min read · Dec 19, 2023

--

The advantages of NPUs, the challenges of integrating them with SoCs, and the best practices for measuring their performance

Neural Processing Units (NPUs) are specialized hardware components that accelerate the execution of deep neural networks (DNNs) on mobile devices. DNNs are the backbone of many artificial intelligence (AI) applications, such as face recognition, natural language processing, and computer vision. However, DNNs are also computationally intensive and require a lot of memory and power to run. This poses a challenge for smartphone users who want to enjoy the benefits of AI without compromising their battery life or privacy.

That's where NPUs come in. NPUs are designed to handle the heavy lifting of DNNs while consuming less energy and fewer resources than traditional processors. NPUs can also perform on-device AI, meaning that the data and computations stay on the smartphone rather than going to a remote server. This improves the speed, security, and reliability of AI applications, as well as the user experience.

In this article, we will explain how NPUs work, how their performance is measured, and how they compare to other types of processors. We will also look at some of the latest developments and trends in the NPU market, and how they affect the future of smartphone AI.

What is an NPU and how does it work?

An NPU is a type of processor that is optimized for DNNs. DNNs are composed of multiple layers of artificial neurons, which are mathematical functions that mimic the behavior of biological neurons. Each neuron takes a set of inputs, applies a weight and a bias, and produces an output. The outputs of one layer are then fed as inputs to the next layer, until the final output is obtained.
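
To make this concrete, here is roughly what a single artificial neuron computes, written in plain Python with NumPy; the input values, weights, and bias are made up for illustration, and the activation is a simple ReLU:

```python
import numpy as np

# One artificial neuron: a weighted sum of its inputs, plus a bias,
# passed through an activation function (ReLU here). All values are illustrative.
inputs = np.array([0.5, -1.2, 3.0])    # outputs of the previous layer
weights = np.array([0.8, 0.1, -0.4])   # learned weights for this neuron
bias = 0.2                             # learned bias for this neuron

pre_activation = np.dot(inputs, weights) + bias  # weighted sum plus bias
output = max(0.0, pre_activation)                # ReLU activation

print(pre_activation, output)  # about -0.72, then 0.0 after ReLU
```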

The main operation that NPUs accelerate is matrix multiplication, which is the process of multiplying two matrices (arrays of numbers) together. Matrix multiplication is essential for DNNs, as it is how the inputs are combined with the learned weights and biases and propagated from one layer to the next. However, matrix multiplication is also very demanding, as it involves a large number of arithmetic operations and memory accesses.
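
Stacking many such neurons side by side turns a whole layer into one matrix multiplication. Here is a minimal NumPy sketch, with arbitrary shapes chosen only to show the structure and the rough operation count:

```python
import numpy as np

batch = 8       # number of inputs processed at once
in_dim = 128    # neurons in the previous layer
out_dim = 64    # neurons in this layer

x = np.random.randn(batch, in_dim).astype(np.float32)    # layer inputs
W = np.random.randn(in_dim, out_dim).astype(np.float32)  # one weight column per neuron
b = np.random.randn(out_dim).astype(np.float32)          # one bias per neuron

# The whole layer is a single matrix multiplication plus a bias:
# each output element is the dot product of an input row with a weight column.
y = np.maximum(x @ W + b, 0.0)   # shape (batch, out_dim), ReLU activation

# Each output element needs in_dim multiplies and in_dim adds, so one pass
# of even this small layer costs about 2 * batch * in_dim * out_dim operations.
ops = 2 * batch * in_dim * out_dim
print(y.shape, ops)              # (8, 64) 131072
```

This kind of regular, repetitive arithmetic is what NPUs are built to parallelize.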

NPUs are able to speed up matrix multiplication by using specialized hardware architectures and techniques, such as:

- Parallelism: NPUs can execute multiple operations simultaneously, by using multiple cores, units, or engines. This increases the throughput and efficiency of the NPU.

- Quantization: NPUs can reduce the precision and size of the numbers used in the DNNs by representing them with fewer bits. This reduces the memory and power consumption of the NPU, as well as the latency and bandwidth of the data transfer. However, quantization can also affect the accuracy and quality of the DNNs, so it has to be done carefully and adaptively (a small sketch of this appears after this list).

- Sparsity: NPUs can exploit the fact that many of the weights and inputs in the DNNs are zero or close to zero, and skip the operations that involve them. This reduces the computation and communication overhead of the NPU, as well as the energy consumption. However, sparsity can also vary depending on the DNN and the input, so it has to be detected and handled dynamically.

- Compression: NPUs can compress the data and the weights in the DNNs, by using techniques such as pruning, clustering, or coding. This reduces the storage and bandwidth requirements of the NPU, as well as the power consumption. However, compression can also introduce some errors and noise in the DNNs, so it has to be balanced with the performance and quality.
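
To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization of a weight tensor and the rounding error it introduces. It uses only NumPy; real toolchains pick scales per channel and use calibration data, so treat this purely as an illustration:

```python
import numpy as np

# Illustrative float32 weights (in practice these come from a trained model).
w = np.random.randn(4, 4).astype(np.float32)

# Symmetric int8 quantization: map the largest absolute weight to 127.
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to see the values the hardware would effectively compute with.
w_dequant = w_int8.astype(np.float32) * scale

# The int8 tensor is 4x smaller than the float32 one, but rounding introduces
# a small error, which is why quantization has to be applied and validated carefully.
max_error = np.abs(w - w_dequant).max()
print(w.nbytes, w_int8.nbytes, max_error)  # 64 bytes -> 16 bytes, small max error
```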

How is NPU performance measured?

NPU performance can be measured by using different metrics and benchmarks, depending on the goal and the context of the evaluation. Some of the common metrics and benchmarks are:

- TOPS: Tera Operations Per Second, which is the number of arithmetic operations that the NPU can perform in one second, on the order of trillions. TOPS is a measure of the theoretical peak performance of the NPU, but it does not account for the actual workload, the efficiency, or the power consumption of the NPU.

- TOPS/W: Tera Operations Per Second per Watt, which is the ratio of the TOPS to the power consumption of the NPU, in watts. TOPS/W is a measure of the energy efficiency of the NPU, but it does not account for the actual workload, the precision, or the quality of the NPU.

- FPS: Frames Per Second, which is the number of images or frames that the NPU can process in one second, for a given DNN and a given input resolution. FPS is a measure of the practical performance of the NPU, but it depends on the specific DNN, the input resolution, and the hardware platform that the NPU is running on.

- Inference Time: The time that the NPU takes to process one image or frame, for a given DNN and a given input resolution. Inference time is the inverse of FPS, and it is a measure of the latency of the NPU, which affects the responsiveness of the AI application and the user experience (the short calculation after this list shows how these metrics relate).

- Accuracy: The percentage of correct predictions or outputs that the NPU produces, for a given DNN and a given input dataset. Accuracy is a measure of the quality and reliability of the NPU, but it depends on the specific DNN, the input dataset, and the precision and the techniques used by the NPU.
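
As a quick worked example of how these metrics fit together, take a hypothetical NPU with a 4 TOPS peak, running a model that needs about 5 billion operations per frame while drawing 1 W; every number is invented purely for illustration:

```python
# All numbers below are hypothetical, chosen only to show how the metrics relate.
peak_tops = 4.0              # advertised peak throughput, in tera-ops per second
ops_per_frame = 5e9          # operations one inference of the model needs
power_watts = 1.0            # average power draw during inference
measured_latency_s = 0.002   # measured inference time for one frame (2 ms)

fps = 1.0 / measured_latency_s                 # 500 frames per second
achieved_tops = ops_per_frame * fps / 1e12     # 2.5 TOPS actually delivered
utilization = achieved_tops / peak_tops        # 0.625: only ~62% of the peak is used
tops_per_watt = achieved_tops / power_watts    # 2.5 TOPS/W on this workload

print(fps, achieved_tops, utilization, tops_per_watt)
```

The gap between the advertised 4 TOPS and the 2.5 TOPS actually achieved is exactly why peak TOPS alone is a poor guide to real-world performance.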

There are several benchmarks that can be used to measure and compare the performance of NPUs, such as:

- MLPerf: A suite of standardized benchmarks for measuring the performance of machine learning systems, including NPUs. MLPerf covers different domains and tasks, such as image classification, object detection, natural language processing, and recommendation systems. It also defines different inference scenarios, such as single-stream, multi-stream, server, and offline, with separate divisions for datacenter and edge systems. MLPerf can be used to measure performance for both training and inference, and for both cloud and edge devices (a toy single-stream measurement in this spirit appears after this list).

- AI-Benchmark: A benchmarking tool for measuring the performance and capabilities of NPUs on Android devices. AI-Benchmark covers different DNNs and tasks, such as image classification, face recognition, image deblurring, image super-resolution, and semantic segmentation. It measures inference performance and can run the same workloads on the CPU and on the device's AI accelerators, which makes the speed-up from the NPU directly visible.

- EEMBC: A consortium of companies that develops benchmarks for embedded systems, including NPUs. EEMBC provides different benchmarks for different domains and applications, such as automotive, industrial, IoT, and mobile. EEMBC also provides a benchmark for measuring the performance and the accuracy of NPUs for image recognition, called MLMark.
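
You do not need a full benchmark suite to get a first-order number, though. Below is a toy, single-stream style harness in the spirit of these benchmarks (it is not the official MLPerf LoadGen): it times one inference at a time and reports mean latency, a latency percentile, and throughput. The `run_inference` function is a stand-in for whatever model call your framework exposes:

```python
import time
import statistics

def run_inference(frame):
    """Placeholder for a real model call (for example, a TFLite or ONNX Runtime session)."""
    time.sleep(0.004)  # pretend the accelerator takes about 4 ms per frame
    return frame

def single_stream_benchmark(num_frames=200):
    # Single-stream style: issue one inference at a time and record each latency.
    latencies = []
    for i in range(num_frames):
        start = time.perf_counter()
        run_inference(i)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    mean_s = statistics.mean(latencies)
    p90_s = latencies[int(0.9 * len(latencies)) - 1]  # 90th-percentile latency
    print(f"mean latency: {mean_s * 1e3:.2f} ms")
    print(f"90th percentile latency: {p90_s * 1e3:.2f} ms")
    print(f"throughput: {1.0 / mean_s:.1f} FPS")

if __name__ == "__main__":
    single_stream_benchmark()
```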

How do NPUs compare to other types of processors?

NPUs are not the only type of processors that can run DNNs on mobile devices. There are other types of processors that can also perform DNN inference, such as:

- CPU: Central Processing Unit, which is the general-purpose processor that executes the main logic and instructions of the device. CPUs are flexible and programmable, but they are not very efficient or fast for DNNs, as they have limited parallelism and memory bandwidth.

- GPU: Graphics Processing Unit, which is the processor that handles the graphics and the display of the device. GPUs are better suited to DNNs than CPUs, as they have higher parallelism and memory bandwidth, but they are still not very power-efficient for this work, as they are designed for graphics rather than DNNs.

- DSP: Digital Signal Processor, which is the processor that handles the audio and video signals of the device. DSPs are more power-efficient than GPUs, as they are optimized for signal processing rather than graphics, but they are less flexible and scalable, as they have fixed architectures and instruction sets.

- FPGA: Field Programmable Gate Array, which is a reconfigurable chip that can be programmed to implement any logic or function. FPGAs are very flexible and adaptable, as they can be customized for any DNN, but they are also very complex and expensive, as they require a lot of resources and expertise to design and program.

NPUs are different from these types of processors, as they are specifically designed and optimized for DNNs. NPUs have the following advantages over other types of processors:

- Higher performance: NPUs can achieve higher TOPS and FPS, and lower inference times, than other types of processors, as they can exploit the parallelism, quantization, sparsity, and compression of DNNs.

- Lower power consumption: NPUs can achieve lower power consumption and higher TOPS/W than other types of processors, as they can reduce the memory and the communication overhead of DNNs.

- Better user experience: NPUs can provide a better user experience than other types of processors, as they enable faster, smoother, and more reliable AI applications on mobile devices.

- Enhanced privacy and security: NPUs can enhance the privacy and security of the users and their data, as they can enable on-device AI, which does not require sending the data to a remote server or cloud.

What are the latest developments and trends in the NPU market?

The NPU market is a fast-growing and competitive market, with many players and products. According to a report by MarketsandMarkets, the NPU market is expected to grow from USD 2.02 billion in 2022 to USD 6.68 billion by 2027, at a compound annual growth rate (CAGR) of 27.1%. The report cites the increasing demand for high-performance and low-power AI applications, the adoption of NPUs in edge computing devices, and the emergence of 5G networks as the key drivers of this growth. It also identifies the major challenges and opportunities for the NPU market, such as high development costs, compatibility issues, and potential applications in various domains.

Reference:

https://inquisitiveuniverse.com/2023/12/11/npu-performance-on-socs-explained/
