Practical Lessons Learned While Implementing an Image Classifier

A couple of months ago, we, the Developer Sessions team, started a project for the blind community of Nepal: cash (monetary note) recognition for the visually impaired.

Here is the Intel Software Blog article I wrote about this project.

In Nepal, monetary notes (cash) are not accessible to visually impaired individuals: the notes carry no special markings that tell them what they are holding. Some have learned to recognize the notes after years of experience, while others still have to ask for help to know the value of the note they are carrying.

We tried to solve this problem by creating a smartphone app. Since most blind individuals now use smartphones with the help of built-in accessibility tools, we thought an app would be a good way to address the problem.

The idea is that the individual holds the note in one hand and scans it with the smartphone camera, using our app, with the other. He or she presses and holds the screen to capture an image. A deep learning model then classifies the captured image by its value, and the app plays audio announcing that value so the individual can recognize the note.

We divided the project development into two parts:

  1. Prototype phase with only 2 categories
  2. Final product with all categories and a working smartphone app

In this post, I am going to share some of the lessons I learned while completing the first part.

Building the Prototype (proof of concept)

For the prototype we decided to train on only two categories: Rs.10 notes and Rs.20 notes. We expected that successfully training on these two would show us what kind of data we would need next and which architecture would suit it.

Going in, we were clear that we would use transfer learning, but we were unsure how much data, and what kind, would be needed to get good results.

So the whole prototype development phase revolved around tweaking, adding and playing with new data and transfer learning code.

This approach went through different stages in itself. For better understanding I am going to classify them into 4 parts:

1. Less data and simple transfer learning code

We started by capturing images of Rs.10 and Rs.20 notes with our own smartphone cameras for the training and validation sets. The images were simple: photos of the notes lying on the floor, on top of different objects, held in a hand, and so on.

Initially, the training set had around 200 images per category and the validation set around 50 per category. Using Keras, I took the VGG16 model with ImageNet weights and applied plain transfer learning to it with the RMSprop optimizer. I ran it for 50 epochs and the result looked amazing: 98.6% accuracy on training and 97.5% on validation. I was shocked and amazed.
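
A minimal sketch of this kind of setup, using the Keras API bundled with TensorFlow, could look like the following. It is not the project's actual code: the helper name `build_transfer_model`, the `Flatten` layer, the 256-unit head, the 224x224 input size and the learning rate are all assumptions made for illustration. Only VGG16, the ImageNet weights, RMSprop and the two classes come from the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_transfer_model(weights="imagenet", input_shape=(224, 224, 3)):
    """Frozen VGG16 base plus a small, newly initialised classification head."""
    base = keras.applications.VGG16(
        weights=weights, include_top=False, input_shape=input_shape
    )
    base.trainable = False  # transfer learning only: conv weights stay fixed

    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = layers.Flatten()(x)                              # assumed head layout
    x = layers.Dense(256, activation="relu")(x)          # assumed head size
    outputs = layers.Dense(2, activation="softmax")(x)   # Rs.10 vs Rs.20

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),  # assumed rate
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use. Training would then be a `model.fit(...)` call for 50 epochs on the note images.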

The excitement quickly evaporated when I tested the model on fresh photos taken with my smartphone. It performed terribly. On 100 new images with different backgrounds, orientations and perspectives, it was only 40% accurate in each category.

I think the common intuition at this point would be to add more data, which is what I did next. Because the model was also slightly overfitting, I decided to add a Dropout layer as well. This leads us to the second phase.

2. Bigger data with slightly modified code

We increased the data to 600 training images and 120 validation images per category. The images we added were simpler than before: mostly full-size notes placed on the floor, with proper lighting and in different orientations. We also added data augmentation on the training set and applied only scaling to the validation set.
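
The difference between the two input pipelines can be sketched with Keras preprocessing layers. The specific transformations and their ranges (`RandomRotation(0.1)`, `RandomZoom(0.2)`) are assumptions, since this post does not list which augmentations were used, and "scaling" is read here as rescaling pixel values to [0, 1]. The point is only that random transformations are applied to training images, while validation images are just rescaled.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Training images: rescale to [0, 1], then apply random transformations.
train_pipeline = keras.Sequential([
    layers.Rescaling(1.0 / 255),
    layers.RandomRotation(0.1),  # up to +/- 10% of a full turn (assumed range)
    layers.RandomZoom(0.2),      # zoom in or out by up to 20% (assumed range)
])

# Validation images: rescaling only, so scores reflect unmodified photos.
val_pipeline = keras.Sequential([layers.Rescaling(1.0 / 255)])

# Stand-in batch of 4 white 224x224 RGB images instead of real note photos.
batch = np.full((4, 224, 224, 3), 255.0, dtype="float32")
augmented = train_pipeline(batch, training=True)  # random ops are active
rescaled = val_pipeline(batch)                    # deterministic
```

Horizontal or vertical flips are deliberately left out of this sketch, because a mirrored note would show mirrored digits that never occur in real photos.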

The intuition behind adding simpler data was that if we trained the model on it, the model might generalize and recognize images that show only a portion of a note. The idea was to train on full-size, clear images and use the learned representation to classify difficult images, such as images where only 50% of a note is visible, notes held in a hand, or images taken in low light.

After training the model for 20 epochs with the above architecture, a learning rate as low as 10^-5 with the Adam optimizer, and transfer learning only (no fine-tuning of the deeper layers), we got around 94% training and validation accuracy.

But when we tested the model with images like the ones below, the result was unsatisfying.

The problem was that some portions of the notes in each category have similar features. Look at the first Rs.10 image above: Rs.20 notes have a similar design on a white background, and the only difference is the color. Look at the third Rs.20 image above: Rs.10 notes also carry a similar image of a mountain, in a different color.

Surprisingly, our model was not treating the color of the notes as the dominant factor, which from a human point of view should be the first thing to consider when telling them apart. We thought this confusion was why our model was not performing well.

So we added more data, bringing the training set to 1000 images and the validation set to 200 images per category. This time we added more images like the ones above, the kind that had confused our model.

We used the same model architecture and ran the training again. Fortunately the model's performance improved, but training accuracy stayed at 94–95%, and the model still failed to classify images with different orientations and perspectives, like the ones below:

At this point we were unsure whether simply adding data would help, and we wanted to try out and experiment with new architectures. This leads us to the third phase.

3. More added data and model experimentations

It was obvious that the data we had did not help our model generalize well to the test set. We had to increase the dataset size, in the hope that the model would perform better after training on a sufficiently large dataset.

So, after asking experts online and reading through various blog posts, we increased our dataset to 2000 images per category. At this point it contained all kinds of data: full-size notes facing the camera, different orientations, different perspectives, notes held in a hand, notes placed on objects, different lighting conditions, and so on. We tried to include every variation we could.

Right after increasing the dataset, we first trained the same model architecture on it to see how it would perform. The accuracy remained about the same, and at one point training accuracy even dropped to 93%.

We knew something was wrong with the model. So I first tried changing the pre-trained model from VGG16 to ResNet50, keeping the hyperparameters the same. I used transfer learning followed by fine-tuning of all the layers of ResNet50.

For reasons I could not determine, the model was terrible: it could not recognize Rs.20 notes at all.

Out of frustration I started looking for alternatives and different variations of transfer learning architectures. Almost every blog post you can find uses transfer learning with only one Dense layer and a small number of neurons, such as 256. When I looked at my model, I had 2 Dense layers with 1024 neurons each. Out of curiosity, I switched back to VGG16, used only one Dense layer with 1024 neurons, and even removed the Dropout layers, as my model was slightly underfitting.

To my surprise, with the hyperparameters unchanged, the training and validation accuracy was around 98% with transfer learning alone. I had originally used two Dense layers to increase the model size, back when the dataset was small. Later, when the dataset grew, I forgot to revisit the model, and the architecture stayed the same.
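
Some rough arithmetic shows how large these heads are. The sketch below assumes a 224x224 input, for which the VGG16 convolutional base outputs a 7x7x512 feature map, and assumes that map is flattened before the Dense layers. The post does not say how the head was attached, so treat the numbers as an illustration rather than the project's real parameter counts.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


FLATTENED_FEATURES = 7 * 7 * 512  # VGG16 conv output at 224x224 (assumed size)
NUM_CLASSES = 2                   # Rs.10 and Rs.20

# One Dense(256) layer, as in many blog posts.
small_head = dense_params(FLATTENED_FEATURES, 256) + dense_params(256, NUM_CLASSES)

# One Dense(1024) layer, the simplified head.
one_layer_head = (
    dense_params(FLATTENED_FEATURES, 1024) + dense_params(1024, NUM_CLASSES)
)

# Two Dense(1024) layers, the original head.
two_layer_head = (
    dense_params(FLATTENED_FEATURES, 1024)
    + dense_params(1024, 1024)
    + dense_params(1024, NUM_CLASSES)
)

print(small_head)      # 6423298
print(one_layer_head)  # 25693186
print(two_layer_head)  # 26742786
```

Under these assumptions nearly all trainable parameters sit in the first Dense layer, and even the smallest head has millions of weights to fit from a few thousand images.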

When I then fine-tuned the same VGG16 model, the accuracy went up to 99%.

Then I realized how misleading the training and validation accuracy scores are. On the test set, the model was only 65% accurate on Rs.10 notes and around 90% accurate on Rs.20 notes.

At this point I thought adding more data would not help much, so I tried slightly different approaches, which leads us to the fourth phase.
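
One way to see this gap is to score each class separately on photos the model never saw during training, instead of relying on a single overall number. The sketch below uses plain Python. The function name `per_class_accuracy` and the tiny label lists are made up for illustration and are not results from the project.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


# Made-up labels purely for illustration, not results from the project.
y_true = ["rs10", "rs10", "rs10", "rs10", "rs20", "rs20", "rs20", "rs20"]
y_pred = ["rs10", "rs20", "rs10", "rs20", "rs20", "rs20", "rs20", "rs10"]

print(per_class_accuracy(y_true, y_pred))  # {'rs10': 0.5, 'rs20': 0.75}
```

In this toy example the overall accuracy is 5 out of 8, which hides the fact that one class is right only half of the time.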

4. Same data, different architecture

A slightly better pre-trained model, similar to VGG16, is VGG19. It has similar conv blocks but more layers, and it is very easy to use for transfer learning.

In my previous training steps, I had used a learning rate of 10^-5 for transfer learning and 10^-7 for fine-tuning. Keeping the rest of the architecture and the hyperparameters the same, I switched the pre-trained model to VGG19 and began training.
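
The two-stage schedule described here can be sketched as follows, using the Keras API bundled with TensorFlow. The helper names, the `Flatten` layer, the 224x224 input size and the use of Adam for both stages are assumptions. Only VGG19, the single 1024-unit Dense layer and the two learning rates (10^-5, then 10^-7) come from the text.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_vgg19_classifier(weights="imagenet", input_shape=(224, 224, 3)):
    """Return (model, base): a VGG19 convolutional base with a Dense head."""
    base = keras.applications.VGG19(
        weights=weights, include_top=False, input_shape=input_shape
    )
    inputs = keras.Input(shape=input_shape)
    x = base(inputs)
    x = layers.Flatten()(x)                              # assumed head layout
    x = layers.Dense(1024, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)   # Rs.10 vs Rs.20
    return keras.Model(inputs, outputs), base


def compile_stage(model, base, fine_tune):
    """Stage 1 (fine_tune=False): train only the head at 1e-5.
    Stage 2 (fine_tune=True): unfreeze the base and continue at 1e-7."""
    base.trainable = fine_tune
    learning_rate = 1e-7 if fine_tune else 1e-5
    model.compile(  # recompile so the changed trainable flag takes effect
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

A run would call `compile_stage(model, base, fine_tune=False)` followed by `model.fit(...)`, then `compile_stage(model, base, fine_tune=True)` and a second `model.fit(...)` on the same data.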

The training and validation accuracy after fine-tuning went up to 99%.

This score remained as misleading as before, but performance on the test set was much better.

On the test set, the model reached around 80% accuracy on Rs.10 notes and 92% on Rs.20 notes. When the same model was tested on the validation set used during training, Rs.10 accuracy was around 90% and Rs.20 accuracy around 96%.

After months of researching and training with different data, we agreed to release this version of the model as our first prototype.

Visualizing and Understanding the model

I used the Keras library to create and train the model. For visualizing activations in Keras there is a great package called keras-vis, which I used to see and understand what the network responds to in the images we tested.

Fig 1: Rs.10
Fig 1: Rs.10 Activations

As you can see, when the model was tested with different variations of Rs.10 notes, different portions of the images light up, showing us exactly what the model recognizes as a Rs.10 note. In the first image and its activation, the model is fairly confident that the circular portion belongs to a Rs.10 note, as it lights up more than the rest of the image. Similarly, in the second image and its activation, the top-left portion is brighter, showing that the model associates that portion more strongly with Rs.10.

Fig 2: Real Rs.10
Fig 2: Rs. 10 Activations

As you can see, there are still some obvious mistakes here: for the first Rs.10 image above, the network is still unsure whether it is Rs.10 or not. The same goes for the third image.

Let's look at Rs.20 activations as well.

Fig 3: Rs.20
Fig 3: Rs.20 Activations

In the second image above, the activations are less bright because the image itself is blurred. In the fourth image, the text showing ‘20’ in Nepali is clearly brighter, indicating that the network relies more on that part.

Fig 4: Rs.20
Fig 4: Rs.20 Activations

Thoughts after developing the prototype

While working on this project I learned a lot and gathered practical knowledge about dealing with this kind of data. Here are some insights that I think are worth sharing and that I hope will be helpful to others as well.

  1. Never trust training and validation accuracy scores alone; always evaluate on separate test data after training.
  2. Start with a small model architecture and adjust it gradually as you increase or decrease the data.
  3. Don’t be impatient like me; use standard tests and procedures to find better hyperparameters.
  4. Train first on a moderately sized dataset, try to make sense of the predictions, and increase or decrease the dataset accordingly.
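
As one example of a standard procedure for the third point, a plain grid search tries every combination of a few candidate values and keeps the best-scoring one. The sketch below is generic Python. The names `grid_search` and `toy_evaluate`, and the candidate values, are hypothetical. In practice `evaluate` would train a model with the given settings and return its validation accuracy, with the final check still done on separate test data.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Score every combination in `grid` and return (best_score, best_config)."""
    names = list(grid)
    best_score, best_config = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # e.g. validation accuracy after training
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


def toy_evaluate(config):
    """Stand-in for 'train with this config and return validation accuracy'."""
    return -abs(config["dense_units"] - 512) - config["dropout"]


search_space = {"dense_units": [256, 512, 1024], "dropout": [0.0, 0.5]}
best_score, best_config = grid_search(toy_evaluate, search_space)
print(best_config)  # {'dense_units': 512, 'dropout': 0.0}
```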

I think the web already has a great deal of resources available. My suggestion would be to always start with a model that others have trained and tested, and to tweak and modify it only if it does not work.

The prototype we have developed will be open-sourced along with the data here.

If you find that anything I have written here is wrong, please let me know; I am more than happy to correct my mistakes. I welcome any opinions, suggestions and advice. As this is a public, community-based, open-source project, we would be very glad if you could contribute to it in any way you can.

After this prototype, our next step will be to extend the model to all the available cash note categories. We will keep identifying the errors and wrong predictions of our model and keep improving it by adding more data or tweaking the network in whatever way proves more effective. We encourage you to contribute and help us make it a much better model than it is now.

We will also work on our app. The current prototype of the app is built with React Native and uses a remote server to classify the images. For the final product we will try to make it work offline, as network availability is still very rare in public places in Nepal.

I would like to thank all my team members and the community who supported us in this project. I would also like to thank Intel Software for providing support and Intel DevCloud access, which allowed us to train our model more rapidly. As we take this project towards its final version, we hope for your continued support, suggestions and advice to improve it.