Why VPUs are the best solution for IoT deep learning projects (with PyTorch)
A VPU (Vision Processing Unit) is a new class of processor for handling visual data on IoT devices.
Nowadays, the tech world offers a whole set of tools for serving AI applications on your IoT devices. For example, you may find yourself weighing an NVIDIA Jetson Nano against a Google Coral TPU.
There are so many different approaches and solutions.
But, while designing your project, bear in mind one simple piece of advice: with IoT devices, power consumption and efficiency are the leading requirements for your project to work, and thermal concerns are not secondary.
And this is why VPU solutions may come in handy.
What VPUs are made of
1. VLIW processors (an old technology revived)
VLIWs can achieve far higher performance per watt, extracting much more instruction-level parallelism (ILP) at a much lower silicon and power cost.
The VLIW architecture was introduced by J. A. Fisher back in 1983 in his paper “Very long instruction word architectures and the ELI-512” (Proc. 10th Annu. Int. Symp. Computer Architecture, June 1983, pp. 140–150):
“More formally, VLIW architectures have the following properties:
- There is one central control unit issuing a single long instruction per cycle.
- Each long instruction consists of many tightly coupled independent operations.
- Each operation requires a small, statically predictable number of cycles to execute.
- Operations can be pipelined.”
In a later article written by Fisher, Paolo Faraboschi, and Cliff Young, it is explained further how the VLIW compiler works and why VLIWs are named so:
VLIW’s compiler rearranges the program in advance, picking what to issue and when to issue it, in order to maximize the parallel execution while maintaining correct behavior. Other processors (called superscalar) rely on the hardware to do this while the program runs.
Since the VLIW compiler presents many operations to be issued at once, it usually is asked to bundle them into a single, very long instruction word — hence the description VLIW (64 to 1,024 bits).
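The idea can be sketched in a few lines of Python: a toy greedy scheduler (not a real compiler) that packs operations whose dependencies are already satisfied into bundles, the way a VLIW compiler statically forms its long instruction words ahead of time. All of the operation names and the `issue_width` parameter here are illustrative.

```python
# Toy illustration of static VLIW scheduling: pack independent
# operations into "long instruction words" at compile time.
def schedule_vliw(ops, deps, issue_width=4):
    """ops: ordered list of op names; deps: {op: set of ops it depends on}."""
    done, bundles = set(), []
    remaining = list(ops)
    while remaining:
        # An op is ready once everything it depends on has been issued.
        bundle = [op for op in remaining
                  if deps.get(op, set()) <= done][:issue_width]
        if not bundle:
            raise ValueError("cyclic dependency")
        bundles.append(bundle)
        done |= set(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles

ops = ["load_a", "load_b", "add", "mul", "store"]
deps = {"add": {"load_a", "load_b"}, "mul": {"add"}, "store": {"mul"}}
print(schedule_vliw(ops, deps))
# → [['load_a', 'load_b'], ['add'], ['mul'], ['store']]
```

The two independent loads land in the same bundle and issue together in one cycle; a superscalar processor would instead discover that independence in hardware, at run time.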
2. Homogeneous On-Chip Memory
In VPUs, your data is stored on-chip in order to minimize latency and off-chip data transfer.
The centralized on-chip memory architecture allows for up to 400 GB/s of internal bandwidth.
3. Vision accelerators
Vision accelerators are specialized processors designed to deliver high performance machine vision at ultra low power, without introducing additional compute overhead.
In other words, these are processors entirely dedicated to processing your video frames.
4. Neural compute engine
A dedicated hardware accelerator for running deep neural network inference on-device.
Are VPUs really efficient compared to GPUs?
Due to the lack of documentation and research, I will refer to one of the best studies out there, Exploring the Vision Processing Unit as Co-processor for Inference (https://core.ac.uk/download/pdf/185526545.pdf):
Using a pre-trained network model based on the GoogLeNet work by Szegedy et al., we have observed that the performance during inference on a single VPU chip is only 4× slower in comparison with reference CPU and GPU implementations. By employing a multi-VPU configuration, however, we have demonstrated equivalent performance results. Yet, the expected thermal-design power (TDP) can still be reduced by a factor of 8×.
VPU machine learning framework
While the Coral TPU is made by Google and only supports the TensorFlow framework, the Intel VPU (Neural Compute Stick 2) can support TensorFlow, Caffe, Apache MXNet, Open Neural Network Exchange (ONNX), PyTorch, and PaddlePaddle via an ONNX conversion.
If you want to know more about the Intel VPU, you can read the article published in IEEE Micro, March 2015.