Benchmarking Edge Computing

Comparing Google, Intel, and NVIDIA accelerator hardware

Alasdair Allan
May 7 · 30 min read

Over the last year custom silicon, intended to speed up machine learning inferencing on the edge, has started to appear. No cloud needed. First to market was Intel with their Movidius-based hardware. However, over the last couple of months we’ve seen the arrival of both Google, with their EdgeTPU-based hardware called Coral, and NVIDIA, with their GPU-based offering, the Jetson Nano.

An edge computing hardware zoo. Here we have the Intel Neural Compute Stick 2 (left, top), a Movidius Neural Compute Stick (left, bottom), the NVIDIA Jetson Nano (middle, top), a Raspberry Pi 3, Model B+ (middle, bottom), a Coral USB Accelerator (right, top), and finally the Coral Dev Board (right, bottom).

The arrival of new hardware designed to run machine learning models at vastly increased speeds, and inside a relatively low power envelope, without needing a connection to the cloud, makes edge-based computing that much more of an attractive proposition. Especially as, alongside this new hardware, we’ve seen the release of TensorFlow 2.0 as well as TensorFlow Lite for microcontrollers and new ultra-low powered hardware like the SparkFun Edge.

The ecosystem around edge computing is starting to feel far more mature. Which means that the biggest growth area in machine learning practice over the next year or two could well be around inferencing, rather than training.

The only question now is which of the new acceleration platforms can run inferencing fastest. Time to run some benchmarks and find out.

We’re going to go ahead and compare inferencing on the following platforms: the Coral Dev Board, the NVIDIA Jetson Nano, the Coral USB Accelerator with a Raspberry Pi, the original Movidius Neural Compute Stick with a Raspberry Pi, and the second-generation Intel Neural Compute Stick 2, again with a Raspberry Pi. Finally, just to add a yardstick, we’ll also run the same models on my Apple MacBook Pro (2016), which has a quad-core 2.9 GHz Intel Core i7, and on a vanilla Raspberry Pi 3, Model B+ without any acceleration.

Inferencing speeds in milliseconds for MobileNet SSD V1 (orange) and MobileNet SSD V2 (red) across all tested platforms. Low numbers are good!

This initial benchmark run was with the MobileNet v2 SSD and MobileNet v1 SSD models, both models trained on the Common Objects in Context (COCO) dataset. A single 3888×2916 pixel test image was used which contained two recognisable objects in the frame, a banana🍌 and an apple🍎. The image was resized down to 300×300 pixels before presenting it to the model, and each model was run 10,000 times before an average inferencing time was taken.

Benchmarking results in milliseconds for the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300, alongside idle and peak current consumption for the platforms before and during extended testing.

Perhaps unsurprisingly results show that the two dedicated boards, the Coral Dev Board from Google and the Jetson Nano Developer Kit from NVIDIA, are the best performing out of our surveyed platforms. Of these two boards the Coral Dev Board ran faster, with inferencing times around ×4 shorter than the Jetson Nano for the same machine learning model.

ℹ️ Information While I’ve included the price for the platforms alongside the benchmark results, with the ‘extra’ $35 you’ll need for a Raspberry Pi added to the sticker price for the Coral USB Accelerator and Intel Neural Compute Sticks, that headline figure isn’t everything here. You should look at price with respect to the performance rather than just the initial sticker shock, and for edge computing that performance isn’t just going to be all about the inferencing speeds.

Part I — Benchmarking

Benchmarking was done using TensorFlow or, for the hardware-accelerated platforms that do not support TensorFlow, their native framework, using the same models converted to the appropriate format. For the Coral EdgeTPU-based hardware we used TensorFlow Lite, and for Intel’s Movidius-based hardware we used their OpenVINO toolkit. We also benchmarked NVIDIA’s Jetson Nano both with ‘vanilla’ TensorFlow (with GPU support), and then again with the same TensorFlow model optimised using NVIDIA’s TensorRT framework.

Detecting fruit on the workbench in our test image. 🍌🍎

Inferencing was carried out with the MobileNet v2 SSD and MobileNet v1 0.75 depth SSD models, both models trained on the Common Objects in Context (COCO) dataset. The 3888×2916 pixel test image was resized down to 300×300 pixels before presenting it to the model, and each model was run 10,000 times before an average inferencing time was taken. The first inferencing run, which takes longer due to loading overheads, was discarded.
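
For reference, here is a minimal sketch of the sort of timing loop used on the platforms that run TensorFlow directly. It isn’t the exact benchmark script, and the tensor names assume a standard TensorFlow Object Detection API frozen graph.

import time
import numpy as np
import tensorflow as tf
from PIL import Image

# Load the frozen TensorFlow graph (TF 1.x style).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')

# Resize the 3888×2916 test image down to the 300×300 model input size.
image = Image.open('fruit.jpg').resize((300, 300))
frame = np.expand_dims(np.array(image), axis=0)

times = []
with tf.Session(graph=graph) as sess:
    for i in range(10001):
        start = time.monotonic()
        sess.run(['detection_boxes:0', 'detection_scores:0',
                  'detection_classes:0', 'num_detections:0'],
                 feed_dict={'image_tensor:0': frame})
        elapsed = (time.monotonic() - start) * 1000.0  # milliseconds
        if i > 0:  # discard the first run, which includes loading overheads
            times.append(elapsed)

print('Mean inferencing time: %.1f ms' % (sum(times) / len(times)))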

Benchmarks were carried out twice on the NVIDIA Jetson Nano, first using vanilla TensorFlow models, and a second time using those models after optimisation using NVIDIA’s TensorFlow with TensorRT library.

Benchmarking results in milliseconds for the MobileNet v1 SSD 0.75 depth model and the MobileNet v2 SSD model, both trained using the Common Objects in Context (COCO) dataset with an input size of 300×300.

Unsurprisingly the unaccelerated Raspberry Pi fares the worst of any of the platforms we benchmarked, managing to sustain inferencing at just 1 to 2 fps.

The unaccelerated Raspberry Pi performs the worst of the benchmarked platforms.

If you’re interested in pushing the performance of the Raspberry Pi you could try building TensorFlow Lite for the Raspberry Pi. Unfortunately there are currently no binary distributions available, so it can’t be deployed using pip. If you want to try out TensorFlow Lite, you’re going to have to build it from source, either by cross-compiling or natively on the Raspberry Pi itself. I’m not going to go down that route right now, so instead we’ll drop the Raspberry Pi as an outlier and take a closer look at the other platforms.

🆕 Update (8/May) I’ve now sat down and taken a look at the performance difference between running TensorFlow and TensorFlow Lite on the Raspberry Pi. As you’d expect, things run faster. But I was surprised how much faster.

Benchmarking results with the outlying results from the unaccelerated Raspberry Pi removed showing the relative speeds for MobileNet SSD V1 with 0.75 depth (orange) and MobileNet SSD V2 (red) across platforms.

Our results from the Jetson Nano are particularly interesting when compared against the benchmarking results released by NVIDIA for the board.

Results from NVIDIA’s own benchmarking. (📊: NVIDIA)

We’re seeing significantly slower inferencing in our own benchmarking using TensorFlow than in the NVIDIA tests, around ×3 slower with MobileNet v2 SSD. However, going back to their original code, which was written in C++ and uses native TensorRT for inferencing, and following their benchmarking instructions, I was able to successfully reproduce their published MobileNet V2 benchmark performance times.

TensorFlow (dark blue) compared to TensorFlow with TensorRT optimisation (light blue) for MobileNet SSD V1 with 0.75 depth (left) and MobileNet SSD V2 (right).

While our models optimised using TensorRT run considerably faster on the Jetson Nano than vanilla TensorFlow models, they still don’t run as fast as those in the original NVIDIA C++ benchmark tests. Talking with NVIDIA, they tell me that this isn’t just the difference between a compiled and an interpreted language, between C++ and Python.

Instead they say that “…there is a bug with TensorFlow and TensorFlow with TensorRT that NVIDIA is now working to fix.” However, setting this problem to one side, what is clear from our results is that if you want to run a TensorFlow model on the Jetson Nano, then you really need to use TensorRT to increase inferencing speed. The difference in performance between vanilla and post-optimised models is more than a factor of ×4 in speed.

Comparing the Coral Dev Board against the Jetson Nano

Of the two dedicated boards, the Coral Dev Board and the NVIDIA Jetson Nano, inferencing was a factor of roughly ×4 faster on the Coral hardware for both models. However bear in mind that, while it’s still extremely early days, TensorFlow Lite has recently introduced support for GPU acceleration for inferencing. Running models using TensorFlow Lite with GPU support should reduce the time needed for inferencing on the Jetson Nano. This leaves open the possibility that the gap between the platforms might shrink in the future.

NVIDIA would also argue that, since their platform supports full TensorFlow rather than TensorFlow Lite, it has other advantages. But this edge has narrowed since Google lifted model restrictions from the Edge TPU Model Compiler earlier in the month.

However, flipping that argument entirely on its head: despite the fact that these two boards are always going to get directly compared, they really are built for different purposes. NVIDIA’s board is built around their existing GPU technology, while Google’s board is a custom ASIC aimed directly at running smaller quantised models. We’re seeing that the Edge TPU hardware is faster here because running smaller models at the edge is what it’s designed to do. NVIDIA’s GPU-based hardware is more flexible, but that extra capability comes with a speed penalty.

Results from Google’s own benchmarking. (📊: Google)

However, as well as the differences between NVIDIA’s benchmarks and our own, we also observe a big difference between our results and Google’s own published benchmarks for the Coral hardware. We’re seeing much slower inferencing times, by a factor of as much as ×8, for the Coral Dev Board.

The most obvious difference here is that we’re using different models: Google’s MobileNet benchmarks were done using models with a 224×224 input size, whereas I was using a 300×300 input size and MobileNet SSD.

But the real difference arises because we are measuring different things.

While we resize the image before we start timing our inferencing, we’re still passing an image to TensorFlow, which then has to take this and convert it into something that our model can use. Google’s numbers are for the inferencing stage on its own, which is a much cleaner (and shorter) operation. I’d argue that my benchmark is a bit more representative of what people will be doing out in the world. Google’s numbers are fine, but most people aren’t really interested in the time it takes between passing a tensor to the model and getting a result.

But if you really want things to run faster, there’s a lot of optimisation you can do to the inferencing loop, especially when it comes to converting the image into something TensorFlow will pass to the model.

The difference between our benchmarks and Google’s for the Coral USB Accelerator is even wider, by a factor of something like ×20, which is pretty crazy. Not all of that can be written off to the same problem as before. However, I think the rest of the gap between the results almost certainly comes down to the speed difference between USB 2.0 and USB 3.0.

While the Coral USB Accelerator is capable of USB 3.0 speeds, we’re throttling it by connecting it to a Raspberry Pi, which has a slower USB 2.0 bus. That’s going to slow everything down. This, combined with the fact that Google’s benchmarks and our own are just flat out measuring different things, is why these numbers are so far apart.

Comparing the USB connected accelerators.

Looking at the accelerator hardware, the Coral USB Accelerator is faster than the Intel Neural Compute Stick 2 by a factor of just under ×2, with Intel’s older, first-generation Movidius Neural Compute Stick falling into last place. While performance for the Intel Neural Compute Stick is also well below Intel’s own benchmarking, it is consistent with other results I’ve seen for the hardware.

However, what is impressive for both the Google and Intel platforms is the speed-up between the accelerated and original inferencing times on the Raspberry Pi, despite both being throttled by the Pi’s slower USB 2.0 bus.

We’re seeing a ×10 speed increase when using the Coral USB Accelerator compared to the original Raspberry Pi timings. That’s impressive.

Comparing the original Raspberry Pi inferencing times with the accelerated timings.

This is important when you consider what the USB Accelerator isn’t. Unlike the Coral Dev Board, which is intended as an evaluation board for the System-on-Module (SoM) that will be made available in volume later in the year, the USB Accelerator is more likely aimed at data scientists and makers, rather than embedded hardware developers. Data scientists will be using the accelerator with their Linux laptop to crunch on their data, while makers will be using it with the Raspberry Pi to build robots and autonomous vehicles. Both groups could benefit from the faster inferencing times it delivers.

While inferencing speed is probably our most important measure, these are devices intended to do machine learning at the edge. That means we also need to pay attention to environmental factors. Designing a smart object isn’t just about the software you put on it; you also have to pay attention to other factors, and here we’re especially concerned with heating and cooling, and the power envelope. It might be necessary to trade off inferencing speed against these other factors when designing for the Internet of Things.

Therefore, along with inferencing speed, when discussing edge computing devices it’s also important to ascertain the heat and power envelopes. So let’s go do that now.

Current measurements were made using a multimeter inline with the USB cable. Two different meters were used; the one used with the Raspberry Pi boards and NVIDIA Jetson Nano was designed for micro-USB connections, while the other, used with the Coral Dev Board and MacBook Pro, was designed for USB-C connections. Both meters had a reported accuracy of ±0.01 A (10mA).

Measuring the current flow for the Coral Dev Board during inferencing.

Measurements for the current for the NVIDIA Jetson Nano were carried out when the board was operating headless, without the use of a monitor, keyboard, or mouse.

Idle and peak current consumption for our benchmarked platforms before and during extended testing.

For pretty understandable reasons the MacBook Pro is an obvious outlier in this table. While both it and the Coral Dev Board are powered using USB-C, they’re very different beasts. While the Coral Dev Board runs at 5V like our other platforms, the MacBook Pro expects 20V. So while I went ahead and measured current draw for interest, going forward we’re going to ignore it.

$ ./osx-cpu-temp -f
Num fans: 2
Fan 0 - Left side   at 3767 RPM (100%)
Fan 1 - Right side  at 3487 RPM (100%)

I really only wanted to know how much those spinning fans were pulling.

Except for the MacBook Pro, all of our platforms take a nominal 5V input supply. However, in reality the voltage will bounce around somewhat due to demands made by the board, and most USB supplies actually sit at around +5.1 to +5.2V. So when doing rough calculations to get the power (in Watts) I’d normally take the voltage of a USB supply to be +5.15V, as a good supply will usually try and maintain the supplied voltage around this figure despite rapid fluctuations in current draw.

Those fluctuations in demand happen a lot when you’re using peripherals with the Raspberry Pi, and they often cause brownouts, something that a lot of USB chargers, designed to provide consistent current for charging cellphones, usually don’t cope with all that well.
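
As a quick worked example of that rough power calculation, converting a measured current draw into Watts at the nominal +5.15V is just a multiplication; the 450mA reading below is purely hypothetical.

# Rough power estimate assuming a nominal +5.15V USB supply.
SUPPLY_VOLTAGE = 5.15   # volts
current_draw = 0.45     # amps, a purely hypothetical meter reading

print('Approximate power draw: %.2f W' % (SUPPLY_VOLTAGE * current_draw))  # ≈ 2.32 W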

Idle current (in green, left hand bars) compared to peak current (in yellow, right hand bars).

Looking at the Coral Dev Board and its direct comparable, the NVIDIA Jetson Nano, we see that the Dev Board pulls more current while idle but less while it is inferencing. So it’s somewhat swings and roundabouts there, depending on the amount of time the device will spend idle.

ℹ️ Information The peak current for the NVIDIA Jetson Nano is less than you might expect given the advice around powering the board, which recommends a 4A power supply if you are running benchmarks or a heavy workload. This is probably because we are running the NVIDIA board headless, without a monitor, keyboard, or mouse. These accessories add an additional power overhead. For instance, attaching a monitor to the HDMI port uses 50mA, adding a camera module requires 250mA, and keyboards and mice can take as little as 100mA or over 1,000mA depending on the make and model.

On the other hand, while all of the Raspberry Pi-based accelerators add somewhat to the idle draw, they draw less current than an unaccelerated Raspberry Pi does during inferencing. In other words, they run much faster while drawing less current, so they consume less power overall.

External temperatures were measured using a laser infrared thermometer, which has an accuracy of ±2°C for temperatures ≤100°C, after an extended test run of 50,000 inferences was completed.

Measuring the external temperature of the NVIDIA Jetson Nano.

Temperatures were measured using a number of methods; in the case of the Coral Dev Board and NVIDIA Jetson Nano the temperature of the heatsinks was taken, while for the Coral USB Accelerator and the Intel and Movidius Neural Compute Sticks the temperature of the external cases was measured. For the Raspberry Pi the external temperature was measured from the package enclosing the SoC, while for the MacBook Pro the temperature of the underside of the laptop was measured.

Peak external and peak CPU temperatures during inferencing in °C.

The CPU temperatures were as reported by the operating system using the following command line invocation, on the Raspberry Pi, Coral Dev Board, and NVIDIA Jetson Nano.

$ paste <(cat /sys/class/thermal/thermal_zone*/type) <(cat /sys/class/thermal/thermal_zone*/temp) | column -s $'\t' -t | sed 's/\(.\)..$/.\1°C/'
AO-therm         38.0°C
CPU-therm        31.0°C
GPU-therm        30.5°C
PLL-therm        28.5°C
PMIC-Die         100.0°C
thermal-fan-est  31.0°C

The exact response varied depending on the number of sensors installed on the platform, from a single CPU sensor on the Raspberry Pi up to the six sensors present on the Jetson Nano. For convenience the MacBook Pro temperatures were measured using the osx-cpu-temp command line application.
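
If you’d rather collect these readings from a script than with the shell one-liner above, a small Python sketch that walks the same thermal zones looks like this.

import glob

# Each thermal zone exposes a name and a temperature in millidegrees Celsius.
for zone in sorted(glob.glob('/sys/class/thermal/thermal_zone*')):
    with open(zone + '/type') as f:
        name = f.read().strip()
    with open(zone + '/temp') as f:
        temp = int(f.read().strip()) / 1000.0
    print('%s %.1f°C' % (name, temp))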

Peak external (red, left hand bars) and peak CPU (purple, right hand bars) temps during inferencing in °C.

The unaccelerated Raspberry Pi board running vanilla TensorFlow reached a temperature of 74°C during the extended test, which meant that it suffered from thermal throttling of the CPU and came close to the 80°C point where additional incremental throttling would occur. I’d therefore recommend that, if you intend to run inferencing for extended periods using the Raspberry Pi, you should add at least a passive heatsink to avoid throttling the CPU. It’s even possible that a small fan might also be a good idea. Because let’s face it, CPU throttling can spoil your day.

Another outlier to pick out here is the Coral USB Accelerator. The exterior of the USB Accelerator stayed remarkably cool during inferencing compared to its direct competitor, the Intel Neural Compute Stick 2, which ran 10°C hotter. What’s really rather interesting, however, is that this came at the expense of the Raspberry Pi CPU temperature, which ran 8 to 9°C hotter than with the Neural Compute Stick. I’m actually rather intrigued to find out what’s going on there.

Measuring the external temperature of the Coral Dev Board.

Anecdotally at least the fan on the Coral Dev Board seems to spin up when the CPU temperature reaches ~65°C, which reduces the CPU temperature down to ~60°C. This drops the external temperature of the heatsink from ~50°C down to ~35°C. This is pretty reasonable, so the fan seems well specified for the board.

At least for now it looks like the Coral Dev Board and USB Accelerator have a clear lead, with MobileNet models running between ×3 and ×4 faster than on their direct competitors.

However, as we’ve already discussed, inferencing speed isn’t the only deciding factor for hardware intended to be deployed out into the world on the edge of things, and even this rather sprawling benchmarking review hasn’t touched on all of those factors. There’s also the issue of the potential bug in the TensorRT Python implementation, which will make these benchmarks worth revisiting once NVIDIA has addressed it, to see how much fixing it changes the results.

Part II — Methodology

If you’re interested in reproducing these results, or just want to get a much better understanding of my methodology, I’ve made all the resources you’ll need to duplicate the benchmarking available for download. Along with that, keep reading for a full walkthrough on preparing all the platforms, converting our models to run on them, and the code you’ll need to do it.

Setting up the Coral Dev board can be done by following my instructions which will walk you through the initial setup of the board, as well as running a few initial example models on the hardware.

Everything you need to get started setting up the Coral Dev Board.

ℹ️ Information If you don’t have a spare laptop running Linux to hand you can use a Raspberry Pi to flash new firmware onto the Coral Dev Board instead.

However there have been some changes to the Mendel OS distribution since my original instructions were written that include increased security for SSH authentication. You can no longer immediately SSH into the Coral Dev Board using the ‘mendel’ user because password authentication has been disabled by default. You must now first transfer an SSH key onto the board using the new Mendel Development Tool (MDT).

In a similar fashion to the Coral Dev Board you can set up the Coral USB Accelerator for use on the Raspberry Pi by following my instructions.

Everything you need to get started setting up the Coral USB Accelerator.

If you want to use the USB Accelerator with your laptop rather than with a Raspberry Pi, it is compatible with any computer with a USB port running Debian 6.0 or higher (or a derivative such as Ubuntu 10.0+). It can be used with machines with either an x86_64 processor or an ARM64 processor with the ARMv8 instruction set.

ℹ️ Information While the USB Accelerator can be used with any USB port, if you connect it to a USB 2.0 port (as we do with the Raspberry Pi) you will throttle throughput to and from the accelerator, as it has a USB 3.1 interface capable of 5Gb/s transfer speeds. This will slow inferencing. Yes, this affected our results here.

Setting up the Intel Neural Compute Stick 2 is very similar to the Coral USB Accelerator, and again you can follow my instructions to set it up and start running your first few example models.

Everything you need to get started setting up the Intel Neural Compute Stick 2.

ℹ️ Information Preparation of the Raspberry Pi to use the older first generation Movidius Neural Compute Stick should be exactly the same as for the Intel Neural Compute Stick 2 as the OpenVINO toolkit supports both generations of hardware.

Finally, you can set up the NVIDIA Jetson Nano by following my instructions. Initial setup and installation of the Jetson Nano is probably the most lengthy of any of the platforms, as TensorFlow isn’t pre-installed in the default image.

Everything you need to get started setting up the NVIDIA Jetson Nano.

⚠️Warning While the Jetson Nano OS image fits on to a 16GB card, using that small a card will usually result in the root filesystem filling up and ‘no space left on device’ errors during inferencing. Make sure you use at least a 32GB card when flashing the image and, if available, a 64GB card might be a better choice.

However, in addition to our initial setup, to run the benchmarking scripts we need to make sure the Pillow fork of the Python Imaging Library (PIL) is installed along with some missing dependencies.

$ sudo apt-get install libfreetype6 libfreetype6-dev
$ pip3 install Pillow

We also need to install the object_detection library which we’ll need for model optimisation, starting with its dependencies.

$ sudo apt-get install protobuf-compiler python-pil python-lxml python-tk
$ pip3 install --user Cython
$ pip3 install --user contextlib2
$ pip3 install --user jupyter
$ pip3 install --user matplotlib

Now go ahead and grab the TensorFlow Models repository, which contains the TensorFlow Object Detection API and associated files, along with the COCO API repository.

$ git clone https://github.com/tensorflow/models.git
$ git clone https://github.com/cocodataset/cocoapi.git

We now need to build the COCO API and copy the pycocotools subdirectory into place inside the TensorFlow Models distribution. You’ll need to install the Python setup tools if they are not already installed.

$ sudo apt-get install python3-setuptools
$ cd cocoapi/PythonAPI

Since we’re using Python 3.x rather than Python 2.x you’ll now need to make a quick edit to the Makefile, replacing occurrences of python with python3.

So go ahead and open the Makefile in your favourite editor, and afterwards it should look like this,

all:
	# install pycocotools locally
	python3 setup.py build_ext --inplace
	rm -rf build

install:
	# install pycocotools to the Python site-packages
	python3 setup.py build_ext install
	rm -rf build

then make pycocotools locally and copy them into the models directory,

$ make
$ cp -r pycocotools ~/models/research/

and go ahead and build the Protobuf libraries.

$ cd ~/models/research
$ protoc object_detection/protos/*.proto --python_out=.

⚠️Warning If you’re getting errors while compiling, you might be using an incompatible protobuf compiler. If that’s the case, you should use the manual installation instructions.

Go ahead and add the object_detection directories to your PYTHONPATH,

$ export PYTHONPATH=$PYTHONPATH:/home/jetson/models/research:/home/jetson/models/research/slim

and finally add it to your ~/.bashrc file.

$ echo 'export PYTHONPATH=$PYTHONPATH:/home/jetson/models/research:/home/jetson/models/research/slim' >> ~/.bashrc

You can test that you have correctly installed the TensorFlow Object Detection API by running the model_builder_test.py test script as below.

$ cd ~/models/research
$ python3 object_detection/builders/model_builder_test.py
   .
   .
   .
Ran 16 tests in 0.309s
OK (skipped=1)
$

Creating a swap file on the Jetson Nano may also help with larger models.

$ sudo fallocate -l 4G /var/swapfile
$ sudo chmod 600 /var/swapfile
$ sudo mkswap /var/swapfile
$ sudo swapon /var/swapfile
$ sudo bash -c 'echo "/var/swapfile swap swap defaults 0 0" >> /etc/fstab'

However, the decision to add one should be based on the size of your SD card. It’s probably not advisable to do so unless you’re using a card of 64GB or greater.

Installing TensorFlow on the Raspberry Pi used to be a difficult process; however, towards the middle of last year everything became a lot easier.

Everything you need to get started setting up the Raspberry Pi

Go ahead and download the latest release of Raspbian Lite and set up your Raspberry Pi. Unless you’re using wired networking, or have a display and keyboard attached to the Raspberry Pi, at a minimum you’ll need to put the Raspberry Pi on to your wireless network, and enable SSH.

Once you’ve set up your Raspberry Pi go ahead and power it on, and then open up a Terminal window on your laptop and SSH into the Raspberry Pi.

% ssh pi@raspberrypi.local

Once you’ve logged in you can install TensorFlow.

$ sudo apt-get install libatlas-base-dev
$ sudo apt-get install python3-pip
$ pip3 install tensorflow

It’ll take some time to install. So you might want to take a break and get some coffee. Once it has finished installing you can test the installation as follows.

$ python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

⚠️Warning Unless the official TensorFlow package has been updated recently you will receive Runtime Warnings when you import tensorflow. These aren’t a concern, and just indicate that the wheels were built under Python 3.4 and you’re using them with Python 3.5. They’re compatible with newer Python versions.

Now that TensorFlow has been successfully installed we’ll also need to go ahead and install OpenCV along with all its many dependencies,

$ sudo apt-get install libwebp6 libwebp-dev
$ sudo apt-get install libtiff5 libtiff5-dev
$ sudo apt-get install libjasper1 libjasper-dev
$ sudo apt-get install libilmbase12 libilmbase-dev
$ sudo apt-get install libopenexr22 libopenexr-dev 
$ sudo apt-get install libgstreamer0.10-0 libgstreamer0.10-dev
$ sudo apt-get install libgstreamer1.0-0 libgstreamer1.0-dev
$ sudo apt-get install libavcodec-dev
$ sudo apt-get install libavformat57 libavformat-dev
$ sudo apt-get install libswscale4 libswscale-dev
$ sudo apt-get install libqtgui4
$ sudo apt-get install libqt4-test
$ pip3 install opencv-python

as we’ll need OpenCV for our benchmarking script later. For the same reason we need to install the Pillow fork of the Python Imaging Library (PIL).

$ pip3 install Pillow

ℹ️ Information I installed Raspbian Lite as my distribution on my Raspberry Pi; if I’d instead installed a full version of Raspbian some of these dependencies would already be present and would not need to be installed.

Installing TensorFlow on your MacBook isn’t that hard, although it can be a fairly lengthy process if you install from source. However, depending on whether you need the C++ tools like summarize_graph installed, most people should be able to get away with installing with pip, which is a lot simpler.

The first step in preparing your MacBook is to install Homebrew.

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

After Homebrew is installed you can go ahead and install Python,

$ brew install python python3
$ brew link python
$ brew link python3
$ brew postinstall python3

⚠️Warning The normal pip3 install --user is disabled for Python installed using Homebrew. This is because of a bug in distutils. So when you see a -U install in the instructions, just go ahead and install into /usr/local/ instead.

and once Python has been installed you can then install OpenCV and its bindings which we’ll need for our benchmarking script.

$ brew install opencv
$ pip3 install numpy scipy scikit-image matplotlib scikit-learn

However this will take some time, so maybe go grab a cup of coffee.

After the installation is complete you can test to make sure that OpenCV has been installed correctly by attempting to import it into Python; if you don’t see an error the installation was a success.

$ python3
Python 3.7.3 (default, Mar 27 2019, 09:23:32)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv2
>>>

We can now install TensorFlow. You can either do that into a Python virtual environment using virtualenv, or directly into the system environment.

$ pip3 install tensorflow

It’ll take some time to install. So you might want to take a break and get some coffee. Once it has finished installing you can test the installation as follows.

$ python3 -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

⚠️Warning Note that if you’ve chosen to install TensorFlow from source, you can’t import the tensorflow Python model while your current working directory is the $INSTALL_DIR/tensorflow/ directory. This will raise an ImportError, “Could not import tensorflow. Do not import tensorflow.” If you get this error just change directory out of the source directory and try again.

I’m going to use pre-trained models to run our benchmarks. What we’re going to want to do is take these pre-trained TensorFlow models and then convert them, if needed, so that they work on each platform. At that point we should run inferencing in as naive a way as possible, without much, if any, optimisation beyond that needed to get the model running, to give us a baseline speed.

We can use our TensorFlow models more or less out of the box on the Jetson Nano, which runs a full TensorFlow installation with extensions to support NVIDIA’s TensorRT, as well as on our Raspberry Pi and MacBook Pro. Although, as we’ve seen, TensorFlow models that are optimised using NVIDIA’s TensorRT libraries will run significantly faster on the Jetson hardware.
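
As a rough illustration, a minimal sketch of optimising a frozen graph with TensorRT on the Jetson Nano might look something like the following. It assumes the TF 1.x tensorflow.contrib.tensorrt API shipped with NVIDIA’s TensorFlow builds for Jetson, and the output node names of a standard Object Detection API model; it isn’t the exact script used for these benchmarks.

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load the vanilla frozen graph.
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Replace supported subgraphs with TensorRT-optimised ops.
trt_graph = trt.create_inference_graph(
    input_graph_def=graph_def,
    outputs=['detection_boxes', 'detection_scores',
             'detection_classes', 'num_detections'],
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')

# Write the optimised graph back out so it can be loaded like any other model.
with tf.gfile.GFile('trt_frozen_inference_graph.pb', 'wb') as f:
    f.write(trt_graph.SerializeToString())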

However to use these TensorFlow models on Google’s EdgeTPU-based hardware, or on Intel’s Movidius-based hardware, we’ll have to convert them into appropriate native formats. So let’s do that now.

Google has a good overview of how to convert TensorFlow models to run on their Edge TPU hardware. You first have to take the model and convert it to TensorFlow Lite before compiling the TensorFlow Lite model for use on the EdgeTPU using the web compiler.

Work flow for converting an existing TensorFlow model to one that can run on Edge TPU-based hardware.

Let’s start out by grabbing the quantised version of our MobileNet SSD V1 model from the Coral Model Zoo along with the associated labels file. While there is a pre-converted version of this model available for download on the Coral site, let’s walk through the process of conversion for this model.

If you don’t already have TensorFlow installed on your laptop you should go and do that now, then download the model and uncompress it.

$ cd ~
$ wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18.tar.gz
$ tar -zxvf ssd_mobilenet_v1_quantized_300x300_coco14_sync_2018_07_18.tar.gz

To convert the model from TensorFlow to TensorFlow Lite you’ll need to know what the input and output nodes of the model are called. The easiest way to figure this out is to use the summarize_graph tool to inspect the model and provide guesses about likely input and output nodes. Unfortunately, if you’ve previously installed TensorFlow using pip then this tool isn’t going to be available; you’ll have to go back and install it from source to have access to the C++ tools.
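
If you’d rather not build TensorFlow from source just for this, a cruder alternative is to load the frozen graph in Python and print its node names; a minimal sketch, assuming a TF 1.x installation, is below.

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('tflite_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Placeholders are usually the inputs; the last node is usually a good guess at the output.
for node in graph_def.node:
    if node.op == 'Placeholder':
        print('possible input:', node.name)
print('last node:', graph_def.node[-1].name)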

⚠️Warning If you have LittleSnitch running you may have to temporarily turn the network monitor off if you get ‘Host is down’ errors during installation.

Then, from the TensorFlow source directory, you can go ahead and build the summarize_graph tool using bazel,

$ bazel build tensorflow/tools/graph_transforms:summarize_graph

and run it on the quantised version of our MobileNet v1 SSD model.

$ bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.pb

ℹ️ Information The Edge TPU supports only TensorFlow Lite models that are fully 8-bit quantised. If you have a floating point TensorFlow model, unless it has been trained using quantisation-aware training, you will not be able to convert it to work with the Edge TPU as there is no support for post-training quantisation.

After running the summarize_graph tool you should see something like this,

Found 1 possible inputs: (name=normalized_input_image_tensor, type=float(1), shape=[1,300,300,3])
No variables spotted.
Found 1 possible outputs: (name=TFLite_Detection_PostProcess, op=TFLite_Detection_PostProcess)
Found 4137705 (4.14M) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 451 Const, 389 Identity, 105 Mul, 94 FakeQuantWithMinMaxVars, 70 Add, 35 Sub, 35 Relu6, 35 Rsqrt, 34 Conv2D, 25 Reshape, 13 DepthwiseConv2dNative, 12 BiasAdd, 2 ConcatV2, 1 RealDiv, 1 Sigmoid, 1 Squeeze, 1 Placeholder, 1 TFLite_Detection_PostProcess

From here we can use the TensorFlow Lite Optimizing Converter (TOCO) to convert the quantised frozen graph to the TensorFlow Lite flat buffer format.

$ bazel run tensorflow/lite/toco:toco -- --input_file=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.pb --output_file=/Users/aa/Downloads/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync_2018_07_18/tflite_graph.tflite --input_shapes=1,300,300,3 --input_arrays=normalized_input_image_tensor --output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' --inference_type=QUANTIZED_UINT8 --mean_values=128 --std_values=128 --change_concat_input_ranges=false --allow_custom_ops

This command takes the input tensor normalized_input_image_tensor after resizing each camera image frame to 300×300 pixels. The outputs of the quantised model represent four arrays: detection_boxes, detection_classes, detection_scores, and num_detections.
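
Before compiling for the Edge TPU it’s worth sanity-checking the converted model with the stock TensorFlow Lite interpreter. The sketch below assumes a TensorFlow release that provides the tf.lite.Interpreter API, and that the output ordering follows the TFLite_Detection_PostProcess op (boxes, classes, scores, count); it isn’t the benchmark script itself.

import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path='tflite_graph.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# The quantised model expects a uint8 tensor of shape [1, 300, 300, 3].
image = np.array(Image.open('fruit.jpg').resize((300, 300)), dtype=np.uint8)
interpreter.set_tensor(input_details[0]['index'], np.expand_dims(image, axis=0))
interpreter.invoke()

boxes = interpreter.get_tensor(output_details[0]['index'])    # detection_boxes
classes = interpreter.get_tensor(output_details[1]['index'])  # detection_classes
scores = interpreter.get_tensor(output_details[2]['index'])   # detection_scores
count = interpreter.get_tensor(output_details[3]['index'])    # num_detections
print(int(count[0]), 'objects detected')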

The output from a successful run of the Coral Web Compiler.

We can then take our newly generated tflite_graph.tflite file and upload it to the Coral Web Compiler to make it compatible with the Edge TPU. Once conversion is complete we just need to hit the download button to download our Edge TPU-compatible model, which can be copied to the Coral Dev Board.
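
Once the compiled model is copied over, a minimal detection sketch on the Coral hardware, using the Edge TPU Python API that shipped with the board at the time, looks something like this; the model and image file names here are placeholders.

from edgetpu.detection.engine import DetectionEngine
from PIL import Image

# Load the Edge TPU-compiled model and the test image.
engine = DetectionEngine('tflite_graph_edgetpu.tflite')
image = Image.open('fruit.jpg')

# DetectWithImage handles resizing the image down to the model's input size.
results = engine.DetectWithImage(image, threshold=0.5, keep_aspect_ratio=True,
                                 relative_coord=False, top_k=10)
for result in results:
    print(result.label_id, result.score, result.bounding_box)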

However we still have to convert our models to the OpenVINO IR format so we can run our benchmarks on the Intel Neural Compute Stick 2.

However, this turns out to be a sticking point, as the software we need to convert TensorFlow models isn’t included as part of the cut-down version of the OpenVINO toolkit installed onto the Raspberry Pi. Which means we need an actual x86 machine running Ubuntu Linux with OpenVINO installed.

Fortunately, we don’t need to have a Neural Compute Stick attached. We just need to have a full OpenVINO installation, and we can do that in the cloud. Possibly the fastest way to do this is to spin up an instance on a cloud provider like Digital Ocean, then install the OpenVINO toolkit and run the model optimiser, the piece of software that can convert our TensorFlow model to Intel’s OpenVINO IR format, on the instance.

Creating an x86_64 server running Ubuntu 16.04.

Go ahead and download the OpenVINO toolkit for Linux either to your laptop, or directly to your cloud instance. If you downloaded it to your laptop you can copy it from your local machine up to your host in the cloud using scp.

$ scp l_openvino_toolkit_p_2019.1.094.tgz root@XXX.XXX.XXX.XXX:

Make sure you have an X Server running and have enabled X11 forwarding to your local machine, as you’ll need it during the installation process for the OpenVINO toolkit. Then go ahead and log in to your Digital Ocean instance with the password that got emailed to you, or using your SSH key.

$ ssh -XY root@XXX.XXX.XXX.XXX

You can test your SSH connection to make sure that you have X11 forwarding configured and working by installing and running an xterm on your remote machine. If everything goes okay it should pop up on your laptop’s desktop.

# apt-get install xterm
# xterm

You can then go ahead and install the Ubuntu version of OpenVINO toolkit.

# tar xvf l_openvino_toolkit_<VERSION>.tgz
# cd l_openvino_toolkit_p_2019.1.094
# ./install_openvino_dependencies.sh
# ./install_GUI.sh

The final part of this installation process makes use of a graphical installer; so long as X11 forwarding is working this should pop up on your local desktop.

The installation GUI.

After installation is completed we can go ahead and test that everything has installed correctly by running an inferencing demo. Before we do that we first need to install the prerequisites for the demo.

# cd /opt/intel/openvino/deployment_tools/model_optimizer/install_prerequisites
# ./install_prerequisites.sh

Since we don’t have a Neural Compute Stick or other hardware attached to our cloud instance we’ll need to use the CPU rather than the MYRIAD device we use when running inferencing on our Raspberry Pi.

# cd ../../demo/
# ./demo_squeezenet_download_convert_run.sh -d CPU

If everything seems to work we’re ready to convert our TensorFlow models to OpenVINO IR format. Unfortunately doing this isn’t quite as straightforward as you’d expect, mostly down to the range of layers that OpenVINO supports.

It turns out that converting models originally built in TensorFlow is hard, and the instructions don’t really cover how to convert anything except the most basic of models. However let’s start out by grabbing the MobileNet v2 SSD model trained on the COCO dataset and downloading it to our cloud instance.

# cd ~
# wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
# tar -zxvf ssd_mobilenet_v2_coco_2018_03_29.tar.gz

From here we can use the OpenVINO model optimizer, and some fudging, to convert our MobileNet model to OpenVINO IR format. If you’re converting a pre-trained TensorFlow model and you’re unsure of the input shape, you can make use of TensorBoard to load the model and examine it to determine the shape.

Fortunately the MobileNet v2 SSD model is one of the supported frozen topologies for conversion so we can go ahead and convert it as below.

# python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model ~/ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb --input_shape [1,300,300,3] --data_type FP16 --tensorflow_use_custom_operations_config /opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/ssd_support.json --tensorflow_object_detection_api_pipeline_config ~/ssd_mobilenet_v2_coco_2018_03_29/pipeline.config --reverse_input_channels --output_dir ~/
   .
   .
   .
[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: /root/frozen_inference_graph.xml
[ SUCCESS ] BIN file: /root/frozen_inference_graph.bin
[ SUCCESS ] Total execution time: 91.58 seconds.
#

⚠️Warning Before running the TensorFlow benchmarking script that includes optimisation for TensorRT with the MobileNet v2 SSD model on the Jetson Nano you should remove the batch_norm_trainable line from the pipeline.config file in the model directory. This is now deprecated and isn’t supported.

We also want to convert the MobileNet v1 SSD 0.75 depth model, again trained on the COCO dataset, which should run somewhat faster on our hardware albeit with reduced accuracy. So go ahead and grab the model and download it to our cloud instance.

# cd ~
# wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz
# tar -zxvf ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03.tar.gz

This time, instead of pointing the model_optimizer towards the frozen model file, we’re going to point it instead towards the meta graph. We’re also being forced to specify the output nodes. In the same way that converting models from TensorFlow to TensorFlow Lite depends on the model and how it has been trained, converting to OpenVINO IR format isn’t entirely formulaic.

# python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_meta_graph ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03/model.ckpt.meta --input_shape [1,300,300,3] --data_type FP16 --tensorflow_use_custom_operations_config /opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/ssd_support.json --tensorflow_object_detection_api_pipeline_config ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03/pipeline.config --reverse_input_channels --output_dir ~/ --output="detection_boxes,detection_classes,detection_scores,num_detections"
   .
   .
   .
[ SUCCESS ] Generated IR model.
[ SUCCESS ] XML file: /root/model.ckpt.xml
[ SUCCESS ] BIN file: /root/model.ckpt.bin
[ SUCCESS ] Total execution time: 46.14 seconds.
#

ℹ️ Information The exact command line incantations needed to convert a model from TensorFlow to OpenVINO IR format are not always obvious. The best place to get help on this topic is the Computer Vision forum in the Intel Developer Zone.

After conversion we will have the .bin and .xml model files necessary to use our TensorFlow model with OpenVINO on the Raspberry Pi with the Intel Neural Compute Stick 2.

⚠️Warning Make sure you’re using the same release of the OpenVINO toolkit on the Raspberry Pi as you are in the cloud. Here we’re using 2019.1.0 both in the cloud, and on the Raspberry Pi. If you have a version mis-match you may see “cannot parse future versions” errors when trying to load converted networks.

You can now copy these model files back from the cloud to your Raspberry Pi. Login to the Raspberry Pi with the Neural Compute Stick 2 and grab them from your cloud instance, along with the associated labels file.

$ cd ~
$ scp "root@XXX.XXX.XXX.XXX:frozen_inference_graph.*" .
$ wget https://gist.githubusercontent.com/aallan/fbdf008cffd1e08a619ad11a02b74fa8/raw/4183a4fd800c2d2d6211fd5129bf88d709072564/coco_labels.txt

We’re done, at least for these two models.
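
As a quick sanity check on the Raspberry Pi, the converted IR files can be loaded through the OpenCV DNN module included in the Raspberry Pi build of the OpenVINO toolkit, targeting the Neural Compute Stick as the MYRIAD device. This is just a minimal sketch, not the benchmark script itself.

import cv2

# Load the OpenVINO IR files and target the Neural Compute Stick.
net = cv2.dnn.readNet('frozen_inference_graph.xml', 'frozen_inference_graph.bin')
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

frame = cv2.imread('fruit.jpg')
blob = cv2.dnn.blobFromImage(frame, size=(300, 300), ddepth=cv2.CV_8U)
net.setInput(blob)
detections = net.forward()

# Each detection row is [image_id, label, confidence, x_min, y_min, x_max, y_max].
for detection in detections.reshape(-1, 7):
    if detection[2] > 0.5:
        print('class %d with confidence %.2f' % (int(detection[1]), detection[2]))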

After you have converted all your models you can close down the instance and destroy it. This will permanently remove it from your account and ensure you aren’t billed for any further usage.

⚠️Warning If you want a more permanent installation you should look at the instructions to create a Docker image, which you can then deploy to Digital Ocean, as the changes we’ve made will be lost when we spin the instance down.

We can write a fairly short piece of code that we can run across three of our platforms, the Raspberry Pi, the MacBook Pro, and the NVIDIA Jetson Nano, as these three platforms can run our TensorFlow models off the shelf.

However, we’ll need additional platform-specific versions of the code for both the Coral EdgeTPU-based hardware and the Intel Movidius-based hardware. Finally, a fourth version is needed for the NVIDIA Jetson Nano, which will take our vanilla TensorFlow model and optimise it using TensorRT for inferencing.

$ ./benchmark_tf.py --model ssd_mobilenet_v2/tf_for_rpi_and_nvidia_and_macbook/frozen_inference_graph.pb --label ssd_mobilenet_v2/tf_for_rpi_and_nvidia_and_macbook/coco_labels.txt --input fruit.jpg --output out.jpg --runs 10000

When timing TensorFlow execution there is considerable overhead the first time an inferencing model is loaded into the TensorFlow session. Subsequent inferencing times are shorter and more consistent. If you take a look at the benchmarking code you’ll see we’ve taken account of this by discarding the first run.

Notably, while the ‘first run’ time can double with the Edge TPU and Movidius hardware, it is considerably longer on the CPU and GPU-based machines. The first run on CPU or GPU-based hardware can be significantly extended, into tens of seconds, with the NVIDIA Jetson Nano seeming particularly vulnerable to this problem, showing first inferencing run times extending from thirty seconds through to hundreds of seconds depending on the model used.

ℹ️ Information In the resources that go along with this benchmarking post I’ve included pre-optimised TensorRT models. These can be passed directly to the vanilla TensorFlow script rather than running the (slower) script with TensorRT support. This should give the same result as passing the vanilla TensorFlow models to the TensorRT optimised script which will carry out this conversion.

Putting these platforms on an even footing and directly comparing them is actually not that trivial. While they’re intended for the same task, to help run machine learning inferencing faster, the differences in architecture and toolkits make direct comparison difficult. Model conversion is an especially difficult problem, with both Intel’s OpenVINO IR format and quantisation for Google’s TensorFlow Lite proving tricky depending on your model.

If you’re interested in getting started with any of the accelerator hardware I’ve used in this benchmark I’ve put together walkthroughs for the Google, Intel, and NVIDIA hardware. This, combined with the resources from this post, should be enough to get you started.

This post is sponsored by Coral from Google.

Written by Alasdair Allan

Scientist, Author, Hacker, Maker, and Journalist. Currently freelance, building, breaking, and writing. For hire. You can reach me at 📫 alasdair@babilim.co.uk.