Bring Magic To Your Mobile App With Deep Learning

Teaching Your App To Detect Traffic Lights From 18,000 Images Is The First Step In Building Your Own Self-Driving Car

Machine learning is one of the fastest-growing and most exciting fields out there, and deep learning represents its true bleeding edge. I have been fascinated by this field since my university days; however, back in 2009, deep learning software, hardware and data were not really within reach. Since then, things have changed.

In this short tutorial, I will walk you through some fundamental steps in deep learning and demonstrate how you can start experimenting with it yourself. We will create an app that will recognize traffic lights using live camera data.

Before we go ahead, I want to give an honest disclaimer. This tutorial sometimes skips concepts and cuts corners to make it a bit less intimidating and to provide fast access to real-time testing. In addition, we will only review deep learning in the context of image classification to keep things straightforward.

The Traffic Light Detection Application We Will Build Running On My iPhone 7

Introduction

I assume that the term Deep Learning is not new to you. You are probably aware of the many uses of machine learning in your everyday products, such as Facebook’s face detection, Apple’s Siri, Mobileye’s car collision detection and more. However, with the recent release of open-source deep learning frameworks, not only the giants are able to launch these types of products; startups are now able to put up a fight as well. Whether it’s Clarifai providing your application with vision capabilities, or AIDoc challenging the medical radiology industry, the landscape is changing.

Deep Learning

What is deep learning? What are neural networks? How does it work? I can go on and on here. Instead, I find that the following video by Andrew Ng is a great introduction to deep learning. Although a bit long, Andrew explains it with great examples and insights.

If you don’t have the time to go over it, the main idea is as follows: in the past, in order for our computers to perform various tasks, we developers had to hand-craft algorithms for each and every small problem. The power of machine learning is to provide learning capabilities using the same algorithm: it finds out by itself what is important about the problem and tries to solve it on its own. As you will see later, apart from providing data to the algorithm, we barely change anything to make it learn a new concept: traffic lights.

The Training Process

Training our deep learning neural network is the actual learning step of the algorithm. We provide a dataset of classified images to the algorithm and expect it to learn how to classify new images that were not a part of the dataset used to train it.

We provide the bulk of our available data, called the Training set, to the training algorithm. However, just like humans, the algorithm needs feedback during the training process to see how well it is doing. To make this happen, we hold out a separate set of samples that is used only to measure progress during learning. We call this dataset the Validation set.

After completing the training, we would like to estimate how well we perform on input data that was not used during training at all. You guessed it: we will need a third dataset, called the Test set, to figure out our accuracy.

To sum up, we will need three different datasets: Training, Validation, and Test.
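As a concrete sketch, here is one simple way to carve a pool of labeled samples into the three sets. The 80/10/10 ratio and the fixed seed are arbitrary choices for illustration, not values prescribed by this tutorial:

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the samples and split them into training/validation/test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = samples[:n_val]
    test = samples[n_val:n_val + n_test]
    train = samples[n_val + n_test:]
    return train, val, test

# With a dataset the size of the one we use below (18,659 images):
train, val, test = split_dataset(range(18659))
print(len(train), len(val), len(test))  # 14929 1865 1865
```

DIGITS performs a split like this for you when you give it the validation/test percentages, so you won't need to write it yourself.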

Transfer Learning

In practice, we don’t usually train our deep learning networks from scratch, because it is relatively rare to have a dataset of the size required for complex tasks such as image classification.

Detecting if an image contains a face with high accuracy requires a dataset of millions of samples

Instead, it is common to pre-train a network on a very large dataset and then use it as an initialization. In our case, we will use a pre-trained network that was trained with a dataset of over 1 million images and then use it to learn a different classification problem for which we only have ~20,000 images.

This trick actually works because the lower-level parts of the learning task, such as detecting edges, colors or even different shapes, are shared between classification problems.

The Mobile Opportunity

If you have paid attention so far, you understand that data is the major player in the deep learning game. Without enough quality data, our algorithms will not be able to generalize the given problem and will produce poor results in the real world.

Here lies a huge opportunity for mobile developers. With over 2 billion (!) devices worldwide that constantly capture data of different kinds, it is possible to build applications that collect high-quality data, labeled or not, that can be used for training and learning. Many very successful startups were built around just this idea.

Deep Learning With Caffe

Caffe

Caffe is a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. It is widely used by Computer Vision researchers around the world. Although it was mainly built for Computer Vision, it can also be used for many other deep learning tasks.

DIGITS

NVIDIA’s DIGITS simplifies common deep learning tasks with Caffe through an intuitive web interface, including:

  • Managing datasets
  • Designing and training neural networks on multi-GPU systems
  • Monitoring performance in real time with advanced visualizations

DIGITS is completely interactive, so developers can focus on designing and training networks rather than programming and debugging.

Required Hardware

Training a deep neural network is usually a compute-intensive operation. Although you can do it on a standard CPU-only machine, realistically you will need a powerful machine with a solid GPU, such as one from NVIDIA’s GTX series (approx. $500-$700).

If you don’t have the resources to acquire such a machine, you can always rent one from Amazon. You can use the g2.2xlarge instance, and with services such as Spotinst you can get it for approx. $0.2 per running hour. Contact me if you want more help with that.


Building a Traffic-Light Detector App

Let’s start building the light detection app as seen in the video above.
As always, no need to worry: all the code is published on GitHub.

You should have both Caffe and DIGITS installed to follow this tutorial. We will be using the DIGITS web interface, so you don’t need to learn any language or framework throughout this tutorial.

Using Caffe & DIGITS, we will perform transfer learning on a pre-trained network and teach it the new concept of red/green traffic lights. Next, we will deploy the trained network to our simple mobile app.

Traffic Light Data

Many startups and initiatives publish extremely useful labeled training sets to the public. Some create challenges where people can compete to win a grand prize (Udacity, Kaggle).

For the purpose of this simple application, I will be using Nexar’s Challenge #1 dataset. The dataset contains 18,659 labeled traffic light images; each image was manually tagged as containing a red light, a green light, or no traffic light at all. In this simple application, we choose to learn only whether a traffic light exists in an image, not its location or other interesting insights.

Samples from Nexar’s Challenge #1 Dataset | The set is split into three categories: Background, Red Light, Green Light

Training

For this application, we will perform transfer learning from GoogLeNet. Luckily, DIGITS already includes it as one of its standard networks. We only have to download the pre-trained caffemodel file and load it using DIGITS.

Fortunately, I have access to a really great machine that is used for VR experiments (but that is for another post) with a GTX 1070 GPU running Ubuntu 14.04. Watching the training process can be hypnotic, as you see the accuracy climb.

Creating A Dataset With DIGITS

The first step is to create the training/validation/test datasets. DIGITS offers a fast approach: you provide all your samples split into directories according to their labels, together with the percentages you would like to keep for validation and test. In our case, we point DIGITS to a directory with 3 sub-directories: red, green and background.
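If your labels arrive as a CSV file rather than pre-sorted folders, a small script can arrange them into the layout DIGITS expects. This is a hedged sketch: the `filename,label` CSV format here is an assumption for illustration, so adapt it to the actual label format your dataset ships with.

```python
import csv
import os
import shutil

def sort_into_label_dirs(labels_csv, src_dir, dst_dir):
    """Copy each image into dst_dir/<label>/ so DIGITS can ingest the tree.

    Assumes labels_csv rows look like: "img_0001.jpg,red"
    with labels such as red / green / background.
    """
    with open(labels_csv) as f:
        for filename, label in csv.reader(f):
            label_dir = os.path.join(dst_dir, label)
            os.makedirs(label_dir, exist_ok=True)
            shutil.copy(os.path.join(src_dir, filename),
                        os.path.join(label_dir, filename))
```

After running it, pointing DIGITS at `dst_dir` is all that's needed; DIGITS infers the class names from the sub-directory names.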

Choose GoogLeNet and then Customize to start editing the default network

The next step is to create a model. DIGITS already comes equipped with our desired network, GoogLeNet, which was trained to classify images into 1,000 different classes. In our case, we want to train it on 3 new classes, so we will have to customize it a bit. No worries; these are simple text modifications.

Customizing GoogLeNet to classify 3 new classes instead of the previous 1000

The modification we have to make is very simple: find and replace all occurrences of loss1/classifier, loss2/classifier and loss3/classifier to add the suffix ‘/retrain’, such that loss{x}/classifier becomes loss{x}/classifier/retrain. We do this to force our net to re-learn this part of the network, as it will be completely different from the previous setup, where we had 1000 different classes.
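To make this concrete, here is a sketch of what the relevant fragment of the network definition looks like before and after the edit. The layer bodies are abbreviated with "..." here, and adjusting num_output from 1000 to our 3 classes is part of the same customization:

```
# Before (fragment of the GoogLeNet prototxt):
layer {
  name: "loss3/classifier"
  type: "InnerProduct"
  ...
  inner_product_param { num_output: 1000 }
}

# After renaming (and setting the new class count):
layer {
  name: "loss3/classifier/retrain"
  type: "InnerProduct"
  ...
  inner_product_param { num_output: 3 }
}
```

Because the renamed layer no longer matches any layer in the pre-trained weights file, Caffe initializes it from scratch instead of copying the old 1000-class weights, which is exactly the behavior we want.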

To perform transfer learning, we will load the pre-learned network information (called weights) by simply pointing DIGITS to the downloaded bvlc_googlenet.caffemodel pre-trained Caffe model.

Congratulations. You are now ready to start the training process!

With the default training options, DIGITS transfer learning produces a network with 93.1% accuracy on its validation set! Isn’t it amazing? We hardly did anything :)

After the training is complete, we can download our model using DIGITS and get the data we need to deploy it in our mobile application.

Building the Application

Fortunately, I didn’t have to work too hard to get Caffe running on iOS, as some prior work was done by Aleph7 and noradaiko. We will use the work by the latter as the base for our project; however, we will tweak it a bit so it works on the live video stream from our camera.

The code is very self-explanatory, so let’s review the most important parts of our ViewController.mm file:

We create a new classifier by providing it with the files downloaded from DIGITS, that is the model.caffemodel, mean.binaryproto, labels.txt and deploy.prototxt.

We need to feed every frame to the classifier. To do so, we convert our UIImage to an OpenCV matrix (line 2), resize it to fit our net, and convert its color order from RGBA to BGR, since Caffe expects the input in this format. Notice that there is a magic number “3” when calling Classify; this should be the number of classes in your dataset (in our case: Green, Red and Background).
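In case the channel shuffling is unfamiliar, here is a rough numpy sketch of what the RGBA-to-BGR step amounts to. This is not the actual Objective-C++ code from the project, just the same idea expressed in Python: pick the channels in reverse order and drop alpha.

```python
import numpy as np

def rgba_to_bgr(frame):
    """Reorder an RGBA frame to BGR and drop the alpha channel,
    which is the channel layout Caffe expects.
    (Resizing to the net's input size, 224x224 for GoogLeNet, is a separate step.)"""
    return frame[..., [2, 1, 0]]

rgba = np.zeros((4, 4, 4), dtype=np.uint8)
rgba[..., 0] = 255          # a pure-red RGBA frame
bgr = rgba_to_bgr(rgba)
print(bgr[0, 0])            # red now lands in the last (R) channel of BGR
```

In the app itself, OpenCV's cvtColor does this conversion on the cv::Mat built from the UIImage.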

Next, we iterate over the predictions and capture the one with the highest probability. If it passes a threshold value (currently set to 60% confidence), we update our UI with the correct traffic light image.
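The selection logic boils down to an argmax plus a confidence gate. Here is a minimal Python sketch of the same idea (the app does this in Objective-C++; the function name and label order here are illustrative):

```python
def pick_prediction(probs, labels, threshold=0.6):
    """Return the most confident label, or None if nothing clears the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else None

labels = ["background", "green", "red"]
print(pick_prediction([0.05, 0.85, 0.10], labels))  # green
print(pick_prediction([0.40, 0.35, 0.25], labels))  # None: no class is confident enough
```

The threshold keeps the UI from flickering on uncertain frames; 60% is simply the value the project currently uses, and you can tune it for your own data.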

Ready to roll with your own data? Awesome!
All you have to do is change the samples at the beginning of the training process and update the number of classes. Everything else stays the same.

Here is another take of the app we just reviewed:

The Traffic Light Detection Application We Will Build Running On My iPhone 7

Can I Use It In My Production App?

Well, no. At least, I wouldn’t recommend it at the moment. The current port is heavy and cumbersome to integrate.

You should use this project to experiment and test your algorithms outside of your desktop environment. It’s a great way to roll out POC projects without coding anything. You can also use it to build quick applications that collect new data to feed into your training process.

Once you get the hang of it, you can easily take the next step and move on to learn TensorFlow or other home-brewed libraries such as BrainCore, which require a somewhat deeper understanding of deep learning.

If you liked what you read and would like to keep these coming, please tap ♥ below. It really fuels the next post!

You can also follow me on Twitter