Building Our Own Deep Learning Image Recognition Technology

Published in carsales-dev, Jun 7, 2019

A Deep Neural Network is a very powerful Machine Learning (ML) model which generally performs better than simpler ML models, as it has more capacity to learn complex relationships between input features. However, it’s not easy to train one. One of the primary issues is the need for a large training set; without one, your model will overfit.

Building our first Deep Neural Network

The first Deep Neural Network we built at Carsales was Cyclops 1.0, around three years ago. It’s an image classification model which can categorise a photo of a car into 27 categories: boot, passenger seat, dashboard, full front, full rear, side mirror, etc. The aim was to automatically classify the photos of cars our photography teams took at dealerships. Prior to Cyclops, each photographer needed to manually classify 500 photos a day, costing the business approximately $250k annually.

Photographers needed to classify the angle of the car in the photos they took

We implemented Cyclops as a 48-layer CNN (Convolutional Neural Network). CNNs are the most popular model architecture for image inputs due to their ability to extract 2D spatial relationships. Why can’t we just unroll the RGB pixel values of the image into a 1D array and pass it to a non-CNN model, say a simple linear model such as logistic regression? Because you then lose the information that two pixels sit directly above each other: after unrolling, they are no longer next to each other in the array. A CNN preserves this relationship.
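To make the unrolling argument concrete, here is a minimal NumPy sketch. The 4 x 6 image size and the variable names are made up for illustration; they are not from Cyclops.

```python
import numpy as np

height, width = 4, 6  # hypothetical tiny single-channel image

# Fill the image with 0..23 so every pixel value equals its flat index.
image = np.arange(height * width).reshape(height, width)
flat = image.reshape(-1)  # row-major unroll into a 1D array


def flat_index(row, col, width):
    """Position of pixel (row, col) after a row-major unroll."""
    return row * width + col


pixel = flat_index(1, 2, width)
right_neighbour = flat_index(1, 3, width)
below_neighbour = flat_index(2, 2, width)

horizontal_gap = right_neighbour - pixel  # still adjacent
vertical_gap = below_neighbour - pixel    # now a full row apart

print(horizontal_gap, vertical_gap)
```

The pixel to the right stays one step away, but the pixel directly below ends up `width` steps away. On a real photo that is hundreds or thousands of positions, so a model working on the flat array has no built-in notion that the two pixels are neighbours.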

Another benefit of a CNN is that it is largely translation invariant (insensitive to small displacements of the object), which is crucial in image classification. For example, we want to classify both images below as a mouse, even though the second mouse is shifted slightly to the right. A CNN handles this naturally, whilst models that tie each weight to a fixed pixel position do not.
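Here is a small sketch of where this behaviour comes from, assuming a single hand-made 3 x 3 filter and global max pooling. It is a toy, not the Cyclops architecture. Strictly speaking the convolution itself is shift equivariant (the response moves with the object), and it is the pooling on top that makes the final score insensitive to the shift.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding), as a conv layer does."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


def place(pattern, top, left, size=10):
    """Draw the pattern on an empty size x size canvas."""
    canvas = np.zeros((size, size))
    canvas[top:top + pattern.shape[0], left:left + pattern.shape[1]] = pattern
    return canvas


# Hypothetical "object": an X shape. The filter is tuned to the same shape.
pattern = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0]])
kernel = pattern.copy()

original = place(pattern, top=3, left=2)
shifted = place(pattern, top=3, left=4)  # same object, 2 pixels to the right

response_original = correlate2d_valid(original, kernel)
response_shifted = correlate2d_valid(shifted, kernel)

# Global max pooling: keep only the strongest response, wherever it is.
score_original = response_original.max()
score_shifted = response_shifted.max()

# The peak moves with the object (equivariance)...
peak_original = tuple(int(i) for i in np.unravel_index(response_original.argmax(), response_original.shape))
peak_shifted = tuple(int(i) for i in np.unravel_index(response_shifted.argmax(), response_shifted.shape))

# ...whereas a model with one weight per fixed pixel position is thrown off.
dense_template = original
dense_score_original = np.sum(original * dense_template)
dense_score_shifted = np.sum(shifted * dense_template)

print(score_original, score_shifted, peak_original, peak_shifted)
print(dense_score_original, dense_score_shifted)
```

The pooled convolutional score is identical for both images, while the position-bound template scores the shifted image much lower.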

Two images of a mouse; in the right image the mouse is shifted slightly to the right

Overcoming training set limitation

Training a Deep Neural Network as complex as Cyclops 1.0 from scratch requires a lot of images to avoid overfitting (when the model memorises the training set but performs poorly on new data). We would need at least 1,000,000 images spread across 27 categories, or roughly 37,000 images per category. This is not practical for us, as some of our categories only have around 1,000 images at most. These are the photo angles our photographers rarely shoot but we still want to support. To keep the classes balanced, we would have to limit the training set to 1,000 images per category, and with 27 categories we would end up with only 27,000 images. Pretty far from 1,000,000 images, huh?
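The arithmetic behind that gap, as a quick sanity check. All figures are the ones quoted above; only the variable names are mine.

```python
categories = 27
images_needed_from_scratch = 1_000_000
images_in_smallest_category = 1_000

# Images per category if we trained from scratch on a balanced set.
per_category_needed = images_needed_from_scratch / categories

# Balanced classes mean every category is capped at the size of the rarest one.
balanced_total = images_in_smallest_category * categories

# How many times short of the target the balanced set is.
shortfall_factor = images_needed_from_scratch / balanced_total

print(round(per_category_needed), balanced_total, round(shortfall_factor))
```

So a balanced set built from our own photos would be about 37 times smaller than what training from scratch calls for.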

We overcame this issue with a technique called Transfer Learning. We did not train our CNN model from scratch. Instead, we used a model which was pre-trained on the ImageNet dataset (images of birds, dogs, cars, buildings, humans, fruits, etc.) and then continued training on our own data set from there. This way we reduced the training set required by 100x, from 1,000,000 down to 10,000 images, which is only about 370 images per category when distributed across 27 categories. Yay! Why does transfer learning work? Well, if the model has been trained to recognise all kinds of objects, it has already learned the concept of simple shapes like rectangles, circles, squares, etc. So it makes sense that we can reuse this knowledge to learn to recognise car features (which also consist of the same simple shapes). Very neat, huh?
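The mechanics can be sketched in a few lines of NumPy. This is a toy under loud assumptions: `base_weights` stands in for the pre-trained layers and is just random here (in practice it would be loaded from an ImageNet checkpoint), the data is synthetic, and the base is kept fully frozen while only a new output layer is trained. Whether Cyclops froze its base or fine-tuned all layers is not something this sketch claims.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: stand-in for layers pre-trained on a large dataset.
# Random here; a real setup would load trained weights instead.
base_weights = rng.normal(size=(64, 16)) / 8.0
base_weights_before = base_weights.copy()


def extract_features(images):
    """Frozen base: we only run it forward and never update base_weights."""
    return np.tanh(images @ base_weights)


# Hypothetical small target data set: 40 flattened 8x8 images, 2 classes.
images = rng.normal(size=(40, 64))
labels = (images[:, 0] > 0).astype(float)


def head_loss(head, bias, features, labels):
    """Mean binary cross-entropy of a logistic output layer."""
    logits = features @ head + bias
    return np.mean(np.logaddexp(0.0, logits) - labels * logits)


features = extract_features(images)  # computed once, since the base is frozen

# New output layer ("head"), trained from scratch on the small data set.
head = np.zeros(features.shape[1])
bias = 0.0
loss_before = head_loss(head, bias, features, labels)

learning_rate = 0.1
for _ in range(200):
    probabilities = 1.0 / (1.0 + np.exp(-(features @ head + bias)))
    error = probabilities - labels
    head -= learning_rate * features.T @ error / len(labels)
    bias -= learning_rate * error.mean()

loss_after = head_loss(head, bias, features, labels)
print(loss_before, loss_after)
```

Only the 17 head parameters are learned here, which is why far fewer examples (and far less compute) are needed than when every layer has to be learned from nothing.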

Not only did it bring down the size of the training set required, it also reduced the training time by 100x. Cyclops 1.0 was trained on my old MacBook Pro in 5 hours.

Flip invariance

Just like any TV series, no story is complete without some drama! Transfer learning is awesome. However, we faced a major issue: flip invariance. When you pass the two images of a mouse below to an image classification model, it will tell you that they are both mice, regardless of the direction the mouse faces (horizontal flip). This is not an issue for most businesses, but it’s a big issue for us. We need to classify the two images of cars below differently: the left one is a Side Driver and the right one is a Side Passenger. Naively training the model to differentiate the two classes only gave us 65% accuracy, which is pretty bad given that choosing randomly between the two already gets you 50%.

The top images can both be classified as a mouse. The bottom images need to be classified differently, as Side Driver and Side Passenger

Our model was trained with transfer learning, so it inherited this flip invariance behaviour. We saw only two obvious solutions. The first is to train without transfer learning, which is not practical for us. The second is to perform brain surgery on the network and edit the weights responsible for ignoring direction. It’s similar to performing brain surgery to edit millions of neurons to remove a nicotine addiction. We are not aware of anyone having done this, which suggests it is not a practical option.

However, there is a work-around! I believe there is always a work-around to any problem. Why don’t we simply train our model with only the left half of each photo? This way the model never sees a complete photo of a car, and because some of the objects found at the front of the car (e.g. headlight, Ferrari logo and side mirror) do not exist at the back, whether they appear in the left half tells the model which side it is looking at. And it worked! Our accuracy went up to 97.2%.
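A toy illustration of why the crop helps, assuming a 4 x 8 grayscale "side view" with a single bright pixel as the headlight. The image, the sizes and the helper name `left_half` are all made up for this sketch.

```python
import numpy as np


def left_half(image):
    """Keep only the left half of an image (columns 0 to width // 2 - 1)."""
    return image[:, : image.shape[1] // 2]


# Hypothetical side view: the headlight sits near the left edge.
side_view = np.zeros((4, 8))
side_view[2, 1] = 1.0

# The opposite side of the car looks like a mirror image:
# the headlight is now near the right edge.
mirrored_view = np.fliplr(side_view)

crop_a = left_half(side_view)
crop_b = left_half(mirrored_view)

headlight_in_a = bool(crop_a.max() > 0)
headlight_in_b = bool(crop_b.max() > 0)

# The full images are exact mirrors of each other, the crops are not.
full_images_are_mirrors = np.array_equal(np.fliplr(mirrored_view), side_view)
crops_are_mirrors = np.array_equal(np.fliplr(crop_b), crop_a)

print(headlight_in_a, headlight_in_b, full_images_are_mirrors, crops_are_mirrors)
```

Before cropping, the two views differ only by a horizontal flip, which is exactly what a flip-invariant model ignores. After cropping, one input contains the front of the car and the other contains the back, so the two classes differ in content rather than in orientation. The same crop would of course have to be applied at prediction time as well.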

Training Cyclops with half of the car photos

Where are we now?

Cyclops 1.0 is now being used to classify photos from various sources (Private Sellers and Dealers), not just from our photographers. It processes around 100,000 photos a day and drives many important products. A great example is our Private Photo Advisory feature, which reminds private sellers if they forgot to upload important photos such as the boot or a passenger seat.

We have also enhanced the Cyclops technology further to recognise car make, model and body down to trim level, better than any human. But that’s a story for another day!

At carsales we are always looking for fun and talented people to work with. If you liked what you read and are interested in joining, please check out what positions we have available in our careers section.

Written by Agustinus Nalwan