Deep Learning on the Edge — First Impressions of the Movidius Neural Compute Stick

tl;dr It is cheap, fast and low-powered.

I still remember my excitement in April 2016 when Movidius announced the upcoming availability of the Fathom Neural Compute Stick, which promised low-power ML capabilities for end devices. I tried to move mountains to get hold of one, to no avail. Intel seemed to be even more interested than me, as they acquired the company in September, which definitely didn’t get me any closer to grabbing the hardware. I waited patiently and hinted at my excitement about the possibility of distributed deep learning in my Tensorflow for Janitors presentation at the CRAFT Conference. Then on July 20th 2017 I managed to order one of the first few hundred sticks made available to the general audience.

The good and the bad

Opening the box and playing with it for a while left me with the following pros and cons.


The good:

  • it’s a lovely piece of rugged hardware,
  • it delivers the speed it promises (although figuring that out was a bit trickier than expected),
  • Python bindings (okay, Python 3 only),
  • documentation (more like a sparse reference).


The bad:

  • you need an Ubuntu host to compile neural networks (no Windows, no OSX, although a VM could do wonders),
  • Raspberry Pi support is half-assed (you need Ubuntu to compile; the RasPi only runs the compiled network),
  • only Caffe is supported currently (no Tensorflow yet),
  • not open source, no GitHub repo.

I am really impressed by the idea of giving low-power end devices the chance to run neural networks, and I’m a big believer in hard facts, so I set out to benchmark the finally released Movidius Neural Compute Stick with great hopes.

Unfortunately it’s hard to come by any kind of industry standard for ML speed benchmarks, so after some consideration (the NCS runs only Caffe models for now) I decided to measure the execution time for different devices seeing the cat in ‘cat.jpg’ with Squeezenet.
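For consistency I timed every device with the same simple wall-clock approach. A minimal sketch of such a harness (the `classify` callable here is a hypothetical stand-in for whatever framework call does the actual forward pass; it is not part of any SDK):

```python
import time

def average_inference_ms(classify, image, runs=10, warmup=2):
    """Average wall-clock time of classify(image) in milliseconds.

    A couple of warm-up runs are discarded so one-off setup costs
    (lazy allocations, caches) don't skew the average.
    """
    for _ in range(warmup):
        classify(image)
    start = time.perf_counter()
    for _ in range(runs):
        classify(image)
    return (time.perf_counter() - start) / runs * 1000.0
```

With Caffe on the CPU this would be called as something like `average_inference_ms(lambda img: net.forward(), img)`, where `net` is whatever your script loaded.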

Getting Caffe running on plain vanilla OSX, Ubuntu and Raspbian Jessie is a horrible experience.

A major finding is the generally sad state of software deployment around Caffe. There are exactly zero working installation scripts for OSX, Ubuntu and Raspbian Jessie, so I had to spend quite some time cooking up my own recipes. All OSes were either freshly installed or close to it, so these plain vanilla recipes should work for most people. For now I won’t delve into how much crap you really need to install to get things running, although I tried to add dependencies one by one, so this can be considered a minimal setup.

Gist: Caffe install + benchmark script for OSX

The Ubuntu 16.04 installation doc is just missing the following lines.

$ find . -type f -exec sed -i -e 's^"hdf5.h"^"hdf5/serial/hdf5.h"^g' -e 's^"hdf5_hl.h"^"hdf5/serial/hdf5_hl.h"^g' '{}' \;
$ cd /usr/lib/x86_64-linux-gnu
$ sudo ln -s
$ sudo ln -s

Gist: Caffe install + benchmark for RPi3

Putting this aside I still wanted to see the hard numbers of milliseconds for devices seeing a cat in ‘cat.jpg’ with Squeezenet.

Caffe on the CPU gives me the following averages:

  • 150 ms (Macbook Air 13", early 2015, 1.6 GHz Intel Core i5, 8 GB RAM running macOS Sierra)
  • 190 ms (Lenovo Thinkpad T420S running Ubuntu 16.04)
  • 1100 ms (RasPi 3 running Raspbian Jessie, July 2017)

Now, hacking ncapi/py_examples/ to time the inference call, along these lines:

import time
start = time.time()
output, userobj = graph.GetResult()
print(str((time.time() - start) * 1000))

the Movidius stick, driven from Ubuntu, gives me an average of 307 ms on cat.jpg.

Running this test on the Raspberry Pi involves installing OpenCV, and compiling something like OpenCV on a RasPi is far from instant. I posted my preliminary results to the official forum, and after coming home from vacation I could carry on with the measurements.

Chrispete kindly served up a working OpenCV install script; I only had to pimp it with two extra lines to make it work on my device (source 1, source 2).

$ export CPATH="/usr/include/hdf5/serial/"
$ ln -s /usr/local/lib/python3.4/site-packages/

I finally got a working OpenCV on the Raspberry, and I can confirm that the Movidius stick behaves properly there too, showing the same timing benchmarks as when tested from Ubuntu.

Gist: OpenCV install script for Raspbian Jessie

Tome revealed that changing the batch size to 1 and using all 12 SHAVE processors simultaneously should speed things up.

In another forum post he shared some measurements, and based on those this should be a 5× speed-up, with “continuous inference speed from webcam is about 9.5 FPS for GoogleNet.”
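As a quick cross-check, that webcam throughput implies a per-frame latency (simple arithmetic on the 9.5 FPS figure quoted above):

```python
fps = 9.5                    # continuous GoogleNet inference from webcam
ms_per_frame = 1000.0 / fps  # latency implied by that throughput
print(round(ms_per_frame))   # 105 ms per frame
```

That puts GoogleNet on the stick in the same ballpark as Squeezenet on my laptop CPUs.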

In /ncapi/networks/Squeezenet/NetworkConfig.prototxt I changed

input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }

to

input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }

I added -s12 to all lines in ncapi/tools/, double-checked that all files are Squeezenet 1.1, recompiled, and updated the files on the Raspberry.

I had to do some re-runs to believe what I saw.

41 ms is definitely impressive. That’s almost 4 times faster than my Macbook Air, at 1/6th of the price and 1/7th of the energy consumption.

(The Macbook consumes approximately 35 W, while the Raspberry plus the Movidius should end up at 4 W + 1 W = 5 W.)
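These ratios are easy to sanity-check from the numbers above (the 35 W Macbook draw and the 4 W + 1 W figure are my own rough estimates, not measured values):

```python
macbook_ms, ncs_ms = 150, 41        # Squeezenet averages measured above
speedup = macbook_ms / ncs_ms       # roughly 4 times faster

macbook_w = 35                      # rough Macbook power draw
rpi_plus_ncs_w = 4 + 1              # RasPi 3 plus the Movidius stick
power_ratio = macbook_w / rpi_plus_ncs_w  # 1/7th the consumption

print(round(speedup, 1), power_ratio)  # 3.7 7.0
```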

I’m looking forward to Tensorflow support, and maybe it would be useful to have a precompiled Raspbian Jessie image at hand to spare all this mess; Tensorflow already has cross-compiled versions.