Recognising Beer with TensorFlow

One of the demonstrations we have in Accenture Labs showcases a screen with a camera, to which you can show a beer. The screen then recognises the beer and displays information about it. It looks a little like this:

It actually looks exactly like this, because this is a screenshot of it running live with the extremely attractive dev team.

Previously, our beer recognition algorithm used a SURF classifier from OpenCV (a robust library for all kinds of computer vision use cases). This implementation had some issues — the performance of processing the entire frame was poor and the beer had to be held right up to the camera to be recognised.

There were plenty of ways to solve these problems (including but not limited to cropping the area of interest and spending time to properly understand the parameters SURF provides), but instead we decided to try something completely different and jump on the TensorFlow train.

Google’s TensorFlow framework made a bit of noise when it was first announced. Google made it seem as if we were getting this amazing look inside the machine; that mere mortals were now able to use tools that Google engineers had previously developed only for internal use.

The truth is that TensorFlow is a pretty good tool for running neural networks, which is simultaneously more or less exciting than the hype depending on your perspective.

One of the nice things that TensorFlow does ship with is Google’s high quality (and pre-trained) Inception-v3 model.

Inception-v3 looks like this:

It’s okay, I have no idea what this means either.

This is a good place for a pretty serious caveat: while I actually have some experience in many of the topics I cover on this blog (mobile development, location services, natural language processing) I know next to nothing about neural networks and deep learning.

The good news is, thanks to the examples provided in TensorFlow, I didn’t have to. And now, you don’t have to either.

Retraining Inception-v3 to Recognise Beer

This guide is by and large a retread of Google’s own material on the subject, with some personal commentary and notes about my experiences.

I initially thought I was going to train the model from scratch, so I pulled out one of our Mac Pro workstations and set to work. One and a half weeks later, the training was still running. Fortunately, by then I had already abandoned that plan.

There are actually some very smart people in Accenture Labs who understand this subject quite well (unfortunately I’m still trying to encourage them to write blogs). Our Systems & Platforms research group has a rig packed with Nvidia’s Titan GPUs that they use exclusively for neural network training. Because I enjoy fumbling through new technologies so much, though, I ignored the large body of research we already had in this area and set out to learn everything the hard way.

It turns out that training a model like this requires far more maths than the Mac Pro’s CPU could get through in any reasonable time. CPUs are super flexible, with lots of neat instructions, but typically offer only around four to sixteen cores. GPUs, on the other hand, excel at doing floating-point calculations on thousands of cores simultaneously.

Furthermore, neural networks just take a really long time to train, and even with a GPU rig (which I did not borrow) I was still looking at potentially weeks of training time. That’s not agile.

If academic papers are your thing, read this one. Simply speaking, instead of training an entire neural network it is possible to retrain just the final layer and still get pretty good results. This can be done on a standard laptop, without a GPU (I ran it in a Docker container on my MacBook Pro), in about half an hour.

Step one is to acquire and structure the data. Over an hour or so, Allison and I took around 150 photos of Lagunitas IPA and Crazy Mountain Pale Ale in various lighting conditions, at different levels of zoom, held by people, not held by people…

My Photos app looks like this. And you can keep scrolling, for quite a while, in both directions, for more of the same.

Once we had all the data in a central location, I docker cped it over to my TensorFlow container in the following directory structure.
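The screenshot of the structure isn’t reproduced here, but the retrain script’s convention is one sub-directory per class, named with the label, each containing that class’s JPEG photos. An illustrative layout (the file names are made up):

```
/tmp/beer-retrain/
├── lagunitas_ipa/
│   ├── IMG_0001.jpg
│   ├── IMG_0002.jpg
│   └── ...
└── crazy_mountain_pale_ale/
    ├── IMG_0101.jpg
    └── ...
```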

Then it was just a matter of using Google’s provided example code:

bazel build tensorflow/examples/image_retraining:retrain
bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir /tmp/beer-retrain

Our initial results were lacklustre. We found that, because all the images were taken in the Labs environment, the model was latching onto features of our décor that were over-represented in one set or the other. I fixed this in a rather hacky way: by creating a new class called nothing_interesting and filling it with garbage photos of the Lab, containing no beer. For good measure, I chucked in the flowers data from Google’s tutorial to provide more negative examples. The final training data directory looked like this:
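Again, the screenshot isn’t reproduced here; illustratively, the final layout gained one extra class (exactly where the flower photos ended up is my guess):

```
/tmp/beer-retrain/
├── lagunitas_ipa/
├── crazy_mountain_pale_ale/
└── nothing_interesting/    <- Lab photos with no beer, plus the flower images
```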

We also downloaded images from Google Images to help provide some more variety.

After retraining, the model performed incredibly well — far better than our existing SURF classifier. The retrained model takes the form of two files: a Protocol Buffer serialisation of the model graph, and a simple ordered text file providing the human-readable names of the class labels (e.g. lagunitas_ipa, crazy_mountain_pale_ale). It can easily be transported between machines, and is small enough to check in to source control.
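As a concrete sketch of what consuming those two files looks like, the snippet below loads the graph and labels and classifies a single image. It uses the TensorFlow 1.x-era API that was current at the time; the file names (retrained_graph.pb, retrained_labels.txt) and node names follow the retrain script’s defaults, and the helper names are my own.

```python
def load_labels(path):
    # The labels file is one human-readable class name per line,
    # in the same order as the model's output scores.
    with open(path) as f:
        return [line.strip() for line in f]


def top_predictions(scores, labels, k=3):
    # Pair each score with its label and return the k highest-scoring.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def classify(image_path,
             graph_path="retrained_graph.pb",
             labels_path="retrained_labels.txt"):
    # Imported here so the pure helpers above work without TensorFlow.
    import tensorflow as tf  # TF 1.x-era API, as used at the time

    labels = load_labels(labels_path)
    with tf.gfile.FastGFile(graph_path, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")
    with tf.Session() as sess:
        image_data = tf.gfile.FastGFile(image_path, "rb").read()
        # "final_result" is the output node the retrain script adds;
        # "DecodeJpeg/contents" accepts raw JPEG bytes.
        scores = sess.run("final_result:0",
                          {"DecodeJpeg/contents:0": image_data})
    return top_predictions(scores[0], labels)
```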

Now that we had a trained model, it was time to put it to use.

Integrating our TensorFlow Model into the Demo

Our demonstration is a native Windows application written in C#. TensorFlow is a Python library that, at the time of writing, only really runs on Unix-like systems. We wanted the demo to run locally, with no network latency.

What to do?

One of the coolest new Windows features (in the Anniversary Update for Windows 10) is the ability to run a full-fledged Ubuntu bash, resplendent with the ability to apt-get Linux packages and have them run more-or-less seamlessly without so much as a virtual machine. 2016 is truly the year of Linux on the desktop.

Using Flask, it took very little code to build a really simple API that accepted a binary image upload and returned the recognised objects (as JSON) from the TensorFlow model. The Windows application had already been written to account for (relatively speaking) long-running SURF classification, so it was simply a matter of extracting our recognition interface and implementing a version that called the newly developed REST API.

The entire code of the Python web service is reproduced below — it really was this easy:
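The original code isn’t embedded in this copy of the post, so what follows is a minimal reconstruction rather than the exact service. The endpoint path and response shape are my assumptions; the classifier is passed in as a plain function so the TensorFlow-specific loading stays out of the web layer.

```python
from flask import Flask, request, jsonify


def make_app(classify):
    # `classify` takes raw image bytes and returns (label, score) pairs,
    # e.g. a function wrapping the retrained TensorFlow model.
    app = Flask(__name__)

    @app.route("/classify", methods=["POST"])
    def classify_route():
        # The body of the POST is the raw binary image.
        results = classify(request.get_data())
        return jsonify({"results": [{"label": label, "score": float(score)}
                                    for label, score in results]})

    return app


if __name__ == "__main__":
    # Wire in the real TensorFlow-backed classifier here;
    # the lambda is a stand-in for illustration only.
    app = make_app(lambda image_bytes: [("lagunitas_ipa", 0.97)])
    app.run(host="0.0.0.0", port=5000)
```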

If you want to know more, I highly recommend reading TensorFlow’s original documentation on this subject. Tools like TensorBoard, which can help explain what is going on under the hood of these models, are covered there and not here (where I’ve kept to my usual get-your-feet-wet level of technical detail).

There’s a great rabbit hole of material to be explored here. Deep learning is a fantastically useful tool. While it is oversold regularly in the tech press, it can be really fun to play around with and is worth learning more about. I’d love to hear about other people’s experiences with TensorFlow and neural networks.

Today, Friday the 23rd of September, is my last day with Accenture. Working for Accenture has been a huge part of my identity on this blog and in life, and I am not yet sure how the content will change as a result.

I do plan to continue writing and experimenting with new technology in my new role as a Partner Engineer at Facebook.

So really, I am just getting started.

I guess what I am trying to say is: don’t unfollow me, please.