Rafal Firlejczyk
3 min read · Nov 13, 2018

Computer vision applications on mobile devices: is it worth using dedicated hardware (a DSP) to boost inference performance? Measurements based on the Qualcomm SD660 SoC and the Snapdragon Neural Processing Engine SDK.

The Snapdragon SD660 can use the CPU, GPU, or DSP to execute computations. (Image source: Qualcomm)

Most upper-class mobile devices today include an additional hardware component intended to speed up on-device inference. Sometimes it is called an NPU (Neural Processing Unit), a Digital Signal Processor (DSP), a vector processor, or simply a hardware accelerator. All of them share the same idea of parallel computing and are expected to significantly speed up machine learning applications at runtime. In my experiment, I checked whether the DSP built into the Qualcomm SD660 boosts the performance of an image classification app enough to justify the extra effort when developing a machine learning application.

The SD660, built in a 14 nm process, was released last year and consists of 4+4 CPU cores forming an ARM big.LITTLE architecture, an Adreno 512 GPU, and a Hexagon 680 DSP. To facilitate the adoption of the new architecture, Qualcomm released the Snapdragon Neural Processing Engine (SNPE) SDK.

Source: Qualcomm

For the performance check, I decided to run the Qualcomm application included in the SDK documentation. After downloading the Inception_V3 model from Google, I converted the TensorFlow graph.pb to the graph-quantized.dlc format, which is required by the test program called snpe-net-run. This program lets you choose the hardware on which to run the test: CPU, GPU, or DSP. Since the model was trained on the 1000 classes of the ImageNet database, I downloaded one image per class for the performance test on the device. A set of 1000 cropped test images (resolution 299x299) should give the hardware enough time to show its strengths. And indeed it took a lot of time, especially on the CPU: classifying the 1000 images took 975 seconds on the CPU, 380 seconds on the GPU, and only 86 seconds on the DSP. The video below shows the speed of the process in all three cases: first on the CPU, then on the GPU, and finally on the DSP.
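The conversion and benchmark steps described above look roughly like this on the command line. This is a sketch based on the SNPE documentation: the exact flag names (and the Inception_V3 output node name) vary between SDK versions, so treat the invocations below as assumptions to be checked against your SDK's docs.

```shell
# Convert the frozen TensorFlow graph to SNPE's DLC format.
snpe-tensorflow-to-dlc --graph inception_v3.pb \
    --input_dim input "1,299,299,3" \
    --out_node "InceptionV3/Predictions/Reshape_1" \
    --dlc inception_v3.dlc

# Quantize to 8-bit, which the Hexagon DSP runtime requires.
snpe-dlc-quantize --input_dlc inception_v3.dlc \
    --input_list image_list.txt \
    --output_dlc graph-quantized.dlc

# Run the same workload on each runtime (CPU is the default).
snpe-net-run --container graph-quantized.dlc --input_list image_list.txt
snpe-net-run --container graph-quantized.dlc --input_list image_list.txt --use_gpu
snpe-net-run --container graph-quantized.dlc --input_list image_list.txt --use_dsp
```

Here image_list.txt is the plain-text list of preprocessed input files that the SNPE tools expect, one path per line.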

snpe-net-run on the SD660 for image classification: inference comparison for the CPU, GPU, and DSP cases
Measured classification time [sec] of 1000 images on SD660

Another important performance value we should care about is the energy consumption of the mobile device. My estimates showed that the DSP required the least energy: running the application on the GPU consumed about 7x more, and on the CPU as much as 15x more.
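Putting the measured times together, the relative speed-up of the DSP works out as follows (a quick sanity-check calculation using the numbers above):

```python
# Measured classification times for 1000 images on the SD660 (seconds).
times = {"CPU": 975, "GPU": 380, "DSP": 86}

# Speed-up of the DSP relative to each runtime.
for runtime, t in times.items():
    print(f"DSP is {t / times['DSP']:.1f}x faster than {runtime}")
# DSP is 11.3x faster than CPU
# DSP is 4.4x faster than GPU
```

So the DSP is roughly an order of magnitude faster than the CPU on this workload, while also using the least energy.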

Conclusion.

The DSP is a valuable hardware component when developing machine learning applications: both time and energy can be saved significantly.

However, in the case of the SD660, the developer has to put in extra effort to convert the model to the DLC format. Additionally, in order to use the DSP, Qualcomm's custom SNPE software framework has to be used; Android Oreo's Neural Networks API does not seem to be supported on this platform yet.
