Day #4: AI steering — (Re)training an image classifier with transfer learning.

Vivian Allen
Team snAIl
7 min read · May 3, 2018

We are a small team of student developers on a 16-week coding bootcamp at Makers Academy. This is part of a series of posts on an ongoing two-week project to teach ourselves to apply machine learning to a basic self-driving car. If you want to see sn-AI-l and the other student projects at Makers, pop along to the demo day on Friday 11th May.

The final projects on the bootcamp course at Makers are intense, two-week development sprints in which we aim to apply everything we have learned in the previous few months to a self-directed project. Choices are fairly open, and we are encouraged to pick something we wish to learn more about, rather than something we can already do.

Our small team formed around a shared interest in the applications of machine learning (ML). One of us had already built a small, Raspberry Pi-powered model car (if you want to build one yourself, see our hardware blog for details). Attempting to develop an ML-based control system that could learn to navigate our car around a simple track therefore seemed like an interesting challenge.

The control system we are building is, in principle, pretty straightforward. The tracks the car has to navigate are just lines of electrical tape stuck onto the wooden floor of the loft at Makers. We have a cheap webcam sellotaped to the top of our car that can see the track immediately in front of it. The car moves in discrete steps, each of which is one of the following (a rough code sketch of these commands comes after the list):

  • a) Move forwards a step.
  • b) Pivot a small amount to the right.
  • c) Pivot a small amount to the left.
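
To make that concrete, here’s a rough sketch of what those three commands might look like in Python. The motors object and its methods are hypothetical stand-ins for whatever motor driver your Raspberry Pi build actually uses, and the step duration is just an illustrative number.

```python
import time

STEP_DURATION = 0.3  # seconds per discrete step (illustrative value)

def forward(motors):
    """Drive both wheels forwards for one step."""
    motors.left(1.0)
    motors.right(1.0)
    time.sleep(STEP_DURATION)
    motors.stop()

def pivot_left(motors):
    """Pivot on the spot by driving the wheels in opposite directions."""
    motors.left(-1.0)
    motors.right(1.0)
    time.sleep(STEP_DURATION)
    motors.stop()

def pivot_right(motors):
    """Mirror image of pivot_left."""
    motors.left(1.0)
    motors.right(-1.0)
    time.sleep(STEP_DURATION)
    motors.stop()
```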

This is a very basic system, but we reasoned that it would allow the car to navigate most situations, albeit very slowly. It doesn’t read well in the Medium font, but we called it snAIl (sn-AI-l looks better), or AI-snail. Because it’s really slow. This is what passes for humour when you’ve been arguing over a whiteboard about the correct order of operations in control system architecture for three days.

sn-AI-l. The flamboyant rubber-band rims were added for grip when it stopped turning properly on the shiny wooden floor. Turns out it was just because half the batteries were dead.

Anyway, before each step the webcam takes a photo of the track ahead of the car. This is where the ML comes in — we feed that photo through a deep-learning image classifier. Image classifiers are more typically used to distinguish between pictures of different objects: ‘cat’ or ‘dog’, or, to use the famous example from HBO’s Silicon Valley, ‘hotdog’ or ‘not hotdog’. Our image classifier will look at the picture of the track and decide whether it looks like the car should carry on straight forwards, pivot to the left, or pivot to the right.

The training track.
How sn-AI-l sees the track. The correct manoeuvre based on this photo would be ‘move forwards’.
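
In code, a single movement step boils down to something like the sketch below: grab a frame from the webcam, run it through the classifier, and dispatch the most probable action. Here classify and the actions dictionary are placeholders for our actual classifier wrapper and the motor commands sketched above; the webcam capture uses OpenCV.

```python
import cv2
import numpy as np

LABELS = ['forward', 'left', 'right']

def take_step(camera, classify, actions):
    """Run one discrete movement step: photo -> prediction -> motor command."""
    ok, frame = camera.read()              # grab one webcam frame
    if not ok:
        raise RuntimeError('could not read from webcam')
    probs = classify(frame)                # e.g. array([0.7, 0.2, 0.1]) over LABELS
    label = LABELS[int(np.argmax(probs))]  # pick the most likely class
    actions[label]()                       # e.g. {'forward': forward, 'left': pivot_left, ...}

camera = cv2.VideoCapture(0)               # the cheap webcam sellotaped to the car
```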

BUT. In order for the image classifier to be able to look at a picture of a track and decide whether the next move should be right, left or forwards, it needs to be trained. Training image classifiers is notoriously labour-intensive. In order for a machine to learn to associate an image with a given label or ‘class’ (in our case, ‘left turn’), you first need to provide it with (minimally) thousands of images that are clearly labelled. In our case, this would be thousands of ‘left turn’ images and thousands of other images that are clearly labelled ‘right turn’ and ‘forward’. Also, unless you have access to a powerful computer cluster, you have to allow the algorithm to chew through (minimally) tens of hours of CPU time to teach itself to reliably tell the difference between ‘left turn’ and ‘right turn’.

We can generate a labelled set of images for ‘left’, ‘right’, and ‘forwards’ by manually telling the car what to do for each movement step, and then saving each photo with a label for the resulting move attached. But we only have two weeks, and we could only generate so many pictures even if we spent the entire time driving the car around.
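
The data collection itself is simple enough: every time we send the car a manual command, we save the current webcam frame into a folder named after that command, which gives us the folder-per-class layout that the TensorFlow retraining example expects. A rough sketch (the key bindings and paths are illustrative):

```python
import os
import time
import cv2

KEY_TO_LABEL = {'w': 'forward', 'a': 'left', 'd': 'right'}

def record_example(camera, key, out_dir='training_images'):
    """Save the current webcam frame into a subfolder named after the command."""
    label = KEY_TO_LABEL[key]
    ok, frame = camera.read()
    if not ok:
        return
    folder = os.path.join(out_dir, label)
    os.makedirs(folder, exist_ok=True)
    filename = os.path.join(folder, '{}.jpg'.format(int(time.time() * 1000)))
    cv2.imwrite(filename, frame)           # e.g. training_images/left/1525336800123.jpg
```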

Also, a large portion of that two weeks may be (and, so far, actually has been) eaten up teaching ourselves to do test-driven development (TDD) in Python, and then building a control system for the car: something that can talk to the webcam sellotaped to the top of it, relay the image to our image classifier, and then translate the classifier’s output into control signals for the car’s motors, etc.

So we actually have considerably less than two weeks to spend on the classifier. How can we train a working ML image classifier in less than two weeks, on our laptops, with a relatively small dataset of images? I think the short answer is: we can’t. Or at least, we couldn’t, were it not for transfer learning.

Transfer learning is the repurposing of a machine-learning model from classifying (or predicting) one set of things to another. This means that you can take a model someone else has spent hours and hours training with thousands of painstakingly labelled images, one that originally classified, say, handwritten letters or something like that, and retrain it to recognise something else (turning left, for example).

This is possible because of the architecture of deep-learning models. A deep learning image classifier is (in my understanding) made of an input layer (that loads images) and an output translator (that returns a classification prediction) sandwiching a number of layers of artificial neurons: the ‘neural net’. These ‘neurons’ and the connections between them comprise the mathematical guts of the machine, and I am not going to pretend that I understand them well at all. As far as I can follow, the neural net is in some way similar to a linked system of regression equations, in that it is a collection of things that deal with the strength of the association between two signals. These components are arranged and connected in such a way as to feed into each other and strengthen their ability to detect and predict that association.

The classifier machine’s job is to translate a complex image file into a smallish set of meaningful numbers (multidimensional vectors, or ‘tensors’). These numbers can then be interpreted in terms of the probability that the original image belongs in one of two or more classes (i.e. the likelihood that the image is either a cat or a fish, or a right turn or a left turn, etc). The individual layers represent, in some way, successive conversion (or abstraction) of the input image into the set of numbers that the final output layer uses to classify the image.
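
As a toy illustration of that final translation step, here is roughly how a vector of raw scores from the last layer gets turned into class probabilities using a softmax function (the numbers are made up):

```python
import numpy as np

def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

# Made-up raw scores for ['forward', 'left', 'right']
logits = np.array([2.1, 0.4, -0.3])
print(softmax(logits))  # -> roughly [0.79, 0.14, 0.07]
```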

The higher layers of abstraction in the model will relate to things more specific to the images the classifier has been trained to detect. If it’s a cat detector, for instance, they might in some way contain an abstraction of whether or not the image has characteristic cat-shaped ears. If it’s a hotdog classifier, some of the higher layers might relate to mustard detection.

However, most of the lower layers of the trained model will relate to basic image-recognition type stuff: detection of edges, detection of corners, parsing colour or brightness gradients and patterns, etc. As these jobs are common to pretty much all types of image recognition, these layers of the model are, it turns out, pretty similar between different models.

From our perspective, this means that we can just rip the top off an existing classifier and, within a few hours, retrain it to reliably recognise our ‘left-right-forward’ dataset. So that’s what we did. Adapting scripts and methods taken from the excellent tutorials on the TensorFlow website, we retrained Google’s (deeply impressive) Inception V3 image classifier to make manoeuvring decisions based on the images coming from our car’s webcam.
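
We used the tutorial scripts more or less as-is, so the snippet below is not the exact code we ran; it’s just the same idea in miniature, expressed with the Keras API: load Inception V3 without its final classification layer, freeze the pretrained layers, and train a small new head on our three classes.

```python
import tensorflow as tf

# Inception V3 pretrained on ImageNet, with its final classification layer chopped off.
base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights='imagenet',
                                         pooling='avg',
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the pretrained layers: keep the generic image features

# Bolt a small new head on top that outputs our three classes.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation='softmax'),  # forward / left / right
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# 'training_images/' holds one subfolder per label, as in the data-collection sketch above.
train_data = tf.keras.preprocessing.image.ImageDataGenerator(
    preprocessing_function=tf.keras.applications.inception_v3.preprocess_input
).flow_from_directory('training_images', target_size=(299, 299), batch_size=32)

model.fit(train_data, epochs=5)  # only the new head's weights get updated
```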

Today, we took turns to manually drive our car round a simple track for a couple of hours. In total, we gave the car 1,975 ‘drive forward’ instructions, 928 ‘pivot left’ instructions, and 988 ‘pivot right’ instructions, each with an associated webcam image. We used this data to retrain the Inception classifier, which took less than an hour(!) on a MacBook Pro.

Giacomo taking his turn in the driving seat.

Based on this, the retrained model was able to correctly predict whether a separate set of unlabelled images (taken after the main set) represented left, right or forwards (we had a list of what each image was, but the machine wasn’t allowed to see it). The results were comfortably below the 99% accuracy a well-trained classifier like the original Inception model can achieve, but they were still pretty impressive for a first attempt. The model assigned probabilities between ~55% and ~85% that the images belonged in the correct category. More importantly, for almost all of the test images it assigned the highest probability to the correct category (‘first choice accuracy’). This means that even if it’s not totally sure about it, it’s still making the right choice most of the time, which is totally good enough for our purposes.
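
‘First choice accuracy’ here just means top-1 accuracy: the fraction of test images for which the most probable class is also the correct one. A small sketch of how you might compute it (the numbers below are made up for illustration):

```python
import numpy as np

def first_choice_accuracy(probabilities, true_labels):
    """Fraction of test images whose highest-probability class is the correct one."""
    predicted = np.argmax(probabilities, axis=1)  # most likely class index per image
    return float(np.mean(predicted == np.asarray(true_labels)))

# Three test images, classes ordered ['forward', 'left', 'right']
probs = np.array([[0.62, 0.25, 0.13],   # predicted forward
                  [0.20, 0.71, 0.09],   # predicted left
                  [0.55, 0.10, 0.35]])  # predicted forward
print(first_choice_accuracy(probs, [0, 1, 2]))  # -> 0.666... (2 of 3 correct)
```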

A ‘move forwards’ and a ‘turn right’ image that our classifier was able to label correctly.

Next up, we will integrate the retrained ML model into the control system for the car, and see how well that accuracy translates into effective navigation!
