Deep Learning on the Edge — First Impressions of the Movidius Neural Compute Stick

Molnár Dániel
Sep 6, 2017 · 4 min read

tl;dr It is cheap, fast and low-powered.

(Updated with Raspberry PI and Tensorflow developments.)

I still remember my excitement in April 2016 when Movidius announced the upcoming availability of the Fathom Neural Compute Stick, which promised low-power ML capabilities for end devices. I tried to move mountains to get hold of one, to no avail. Intel seemed even more interested than me, as they acquired the company in September; this definitely didn't get me closer to grabbing the hardware. I waited patiently and hinted at my excitement about the possibility of distributed deep learning in my Tensorflow for Janitors presentation at the CRAFT Conference. Then, on July 20th 2017, I managed to order one of the first few hundred sticks made available to the general audience.

The good and the bad

Opening the box and playing with it a lot left me with the following pros and cons.


The good:

  • it’s a lovely piece of rugged hardware,
  • it delivers the speed it promises (although figuring that out was a bit trickier than expected),
  • Python bindings (okay, Python 3 only),
  • full-fledged Raspberry PI support,
  • Tensorflow support besides Caffe,
  • open source, now with a GitHub repo,
  • documentation (more like a sparse reference).

The bad:

  • you need an Ubuntu host to compile neural networks (no Windows, no OSX, although a VM could do wonders).

When I originally published this piece I had three more cons; now that’s the only one left.

I am really impressed by the idea of giving low-power end devices the chance to run neural networks, so I set out to test the finally released Movidius Neural Compute Stick with great hopes. I’m a big believer in hard facts, so I wanted to benchmark this little beast.

Unfortunately it’s hard to come by any kind of industry-standard ML speed benchmark, so after some consideration (the NCS runs just Caffe models for now) I settled on measuring the execution time for different devices seeing a cat in ‘cat.jpg’ with Squeezenet.
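Lacking a standard benchmark, the measurement itself can be as simple as averaging wall-clock time over repeated forward passes. A minimal sketch of such a harness, with a hypothetical `classify` stand-in (in the real runs this was Caffe’s forward pass on cat.jpg):

```python
import time

def average_inference_ms(classify, runs=10, warmup=2):
    """Average wall-clock time of one classify() call, in milliseconds.

    A couple of warm-up runs are discarded so one-off costs
    (model loading, caches) don't skew the average.
    """
    for _ in range(warmup):
        classify()
    start = time.perf_counter()
    for _ in range(runs):
        classify()
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in workload; with Caffe this would be net.forward() on cat.jpg.
fake_inference = lambda: sum(i * i for i in range(10000))
print('%.0f ms' % average_inference_ms(fake_inference))
```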

Getting Caffe running on plain vanilla OSX, Ubuntu and Raspbian Jessie is a horrible experience.

A major finding is the generally sad state of software deployment around Caffe. There are exactly zero working installation scripts for OSX, Ubuntu and Raspbian Jessie, so I had to spend quite some time cooking up my own recipes. All OSes were either freshly installed or close to it, so these plain vanilla recipes should work for most people. For now I won’t delve into how much cruft you really need to install to get things running, although I added dependencies one by one, so this can be considered a minimal setup.

Gist: Caffe install + benchmark script for OSX

The Ubuntu 16.04 installation doc is just missing the following lines (the two symlinks point the generic HDF5 library names at their serial variants; exact library names may differ on your system):

$ find . -type f -exec sed -i -e 's^"hdf5.h"^"hdf5/serial/hdf5.h"^g' -e 's^"hdf5_hl.h"^"hdf5/serial/hdf5_hl.h"^g' '{}' \;
$ cd /usr/lib/x86_64-linux-gnu
$ sudo ln -s libhdf5_serial.so libhdf5.so
$ sudo ln -s libhdf5_serial_hl.so libhdf5_hl.so

Gist: Caffe install + benchmark for RPi3

Putting this aside, I still wanted the hard numbers: milliseconds for each device seeing a cat in ‘cat.jpg’ with Squeezenet.

Caffe on the CPU gives me the following averages:

  • 150 ms (Macbook Air 13", early 2015, 1.6 GHz Intel Core i5, 8 GB RAM running macOS Sierra)
  • 190 ms (Lenovo Thinkpad T420S running Ubuntu 16.04)
  • 1100 ms (RasPi 3 running Raspbian Jessie July 2017)

Now hacking ncapi/py_examples/ to time the inference call as follows

import time
start = time.time()
output, userobj = graph.GetResult()
print(str((time.time() - start) * 1000) + ' ms')

the Movidius stick, driven from the Ubuntu host, gives me averages of 307 ms on cat.jpg.
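For reference, turning GetResult()’s output into an actual prediction is just an argmax over the returned probability vector. A sketch with toy data (the real `output` comes from the stick and the labels from the Squeezenet category file):

```python
def top1(probabilities, labels):
    """Return the most likely label and its probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

# Toy stand-ins for the real NCS output vector and ImageNet labels.
probs = [0.05, 0.90, 0.05]
labels = ['dog', 'cat', 'goldfish']
print(top1(probs, labels))  # → ('cat', 0.9)
```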

Running this test on the Raspberry Pi involves installing OpenCV, and compiling something like OpenCV on a RasPi is far from instant. I posted my preliminary results to the official forum, and after coming home from vacation I could carry on with the measurements.

Chrispete kindly shared a working OpenCV install script; I only had to pimp it with two extra lines to make it work on my device (source 1, source 2).

$ export CPATH="/usr/include/hdf5/serial/"
$ ln -s /usr/local/lib/python3.4/site-packages/

I finally got a working OpenCV on the Raspberry, and I can confirm that the Movidius stick behaves properly there too, showing the same timing benchmarks as it did from the Ubuntu host.

Gist: OpenCV install script for Raspbian Jessie

Tome revealed that changing the batch size to 1 and using all 12 SHAVE processors simultaneously should speed things up.

In another forum post he shared some measurements; based on those, this should be a 5x speed-up, and “continous inference speed from webcam is about 9.5 FPS for GoogleNet.”

In /ncapi/networks/Squeezenet/NetworkConfig.prototxt I changed

input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }

to

input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
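If you want to apply that batch-size edit without touching files by hand, a small sketch like this works on the prototxt text (my own helper, not part of the NCS tooling; it rewrites the first dim of each input shape):

```python
import re

def set_batch_size(prototxt_text, batch=1):
    """Rewrite the leading (batch) dim of each 'shape: { dim: N ... }'."""
    return re.sub(
        r'shape:\s*{\s*dim:\s*\d+',
        lambda m: re.sub(r'dim:\s*\d+', 'dim: %d' % batch, m.group(0)),
        prototxt_text)

line = 'input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }'
print(set_batch_size(line))
# → input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
```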

I added -s12 to all lines in ncapi/tools/, double-checked that all files are Squeezenet 1.1, recompiled, and updated the files on the Raspberry.

I needed some re-runs to believe what I saw.

41 ms is definitely impressive. That’s roughly 4 times faster than my Macbook Air, at 1/6th of the price and 1/7th of the power consumption.

(The Macbook draws approximately 35 watts, while the Raspberry plus the Movidius should end up at 4 W + 1 W = 5 W.)
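The ratios are easy to sanity-check from the numbers in the text (150 ms on the Macbook, 41 ms on the stick, 35 W vs. 4 W + 1 W):

```python
macbook_ms, ncs_ms = 150.0, 41.0    # average inference times from above
macbook_w, ncs_w = 35.0, 4.0 + 1.0  # power draw: Mac vs. RasPi + stick

print('speed-up: %.1fx' % (macbook_ms / ncs_ms))   # → speed-up: 3.7x
print('power ratio: %.0fx' % (macbook_w / ncs_w))  # → power ratio: 7x
```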

Soon after this write-up Tensorflow support arrived, and it’s still on my to-do list to build a precompiled Raspbian Jessie image so all of this mess is at hand in one place; Tensorflow already has cross-compiled versions.
