Train A Strong Classifier with Small Dataset, From Scratch? ImageNet Weights? Or AutoML? — Part 1

daisukelab
7 min read · Aug 11, 2018


Thanks to advances in deep learning, image classifiers are now commodity tools. Many tutorials and blog posts explain how easily we can build one, and recently we can even use automatic model creation tools like AutoML. So what motivation is left for building our own?

As many beginners have experienced, building a classifier by following a tutorial is simple and easy. But in many cases we want it to work with a small amount of data, and that is when we notice it doesn't perform well for our own purpose.

I also tried Google's Cloud AutoML, hoping it could learn well from a small dataset. But I found I could not be sure it works properly, mainly because it does not expose its internal activations. It also seems to output a result only when it is highly confident.

In this blog post, I will show how we can train an image classifier with a very small dataset and still be confident in the model:

  • What is needed to train with a small dataset
  • Why we need an ImageNet (or similarly strong) pre-trained model
  • Why we need strong augmentation
  • (Comparison with AutoML → this will follow in part 2)

Throughout this post, the following are used:

  • Keras as the ML framework (with TensorFlow as the backend)
  • VGG16 as the model architecture, imported directly from Keras
  • Accuracy as the evaluation metric, for simplicity

1. Making a very small dataset

We can find many datasets with tons of images, but I needed a small yet practical one, so I created a new dataset and released it on GitHub. Here are the dataset and all the code used in this post:

It has 5 sets of images of geek books, plus some background images, giving 6 classes to train. Each book class has 8 or 9 images, and the background class has 29 images, for 73 training images in total.

Example images from each class

For testing, we have two different small sets. The first contains normal images from the same distribution as the training set; it has 6 images. As you can see, the book band is moved up or down to create small differences from the training set.

Easy test samples, only these 6 images

The other is a 'difficult' set captured from a slightly different distribution: unseen objects, multiple books, and a new book. This difficult set has 20 images.

Part of the difficult test samples. Many images contain multiple books, difficult angles, or even occlusions.

One last set is for training AutoML; this will be explained later.

2. Train from scratch, why not?

Let’s start by training a model from scratch.

We train it without any special techniques and no augmentation at all; a sketch of this setup follows.
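Here is a minimal sketch of what the from-scratch setup could look like in Keras. The input size, head, and optimizer here are illustrative assumptions, not necessarily the exact settings in the repository.

```python
# A minimal sketch (assumed settings): VGG16 with weights=None, so every
# filter starts from random initialization and must be learned from the
# 73 training images alone.
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

x = Flatten()(base.output)
out = Dense(6, activation='softmax')(x)  # 5 book classes + background

model = Model(base.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```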

Trained from scratch, without augmentations

This is awful: a totally useless classifier.
It is clear that such a small amount of data, as it is, cannot train a model.

Let’s try augmentation.

3. We need augmentation

A small amount of data can be virtually increased by augmentation techniques, which are widely used even when the dataset is big.
Let's see how results improve with augmentation. We test four levels of augmentation use:

  1. No augmentation, as done above, with eyes closed and fingers crossed
  2. Usual augmentations: horizontal/vertical flips, rotation, zoom, and random erasing (see the sketch after this list)
  3. Using mixup
  4. Using all of the above
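For the geometric part of the usual augmentations, Keras' built-in ImageDataGenerator is enough. This is a minimal sketch with illustrative parameter values; random erasing can be plugged in through the `preprocessing_function` argument once it is defined (see the sketch further below).

```python
# A sketch of the 'usual' geometric augmentations with Keras'
# ImageDataGenerator; the parameter values are illustrative.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    rotation_range=20,  # degrees
    zoom_range=0.2,
)
# batches = datagen.flow(X_train, y_train, batch_size=16)
```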

Mixup [1] is a powerful technique that blends two training examples and their labels. Refer to the paper for details; a sketch of the idea follows.
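A minimal sketch of mixup on a batch, following the paper's formulation: x̃ = λx_i + (1−λ)x_j and ỹ = λy_i + (1−λ)y_j, with λ drawn from Beta(α, α). The function name and the α value here are illustrative, not the repository's exact implementation.

```python
import numpy as np

def mixup_batch(X, y, alpha=0.2):
    """Blend each sample with another randomly chosen sample in the batch."""
    lam = np.random.beta(alpha, alpha)          # mixing ratio from Beta(a, a)
    index = np.random.permutation(len(X))       # partner for each sample
    X_mixed = lam * X + (1.0 - lam) * X[index]
    y_mixed = lam * y + (1.0 - lam) * y[index]  # y must be one-hot encoded
    return X_mixed, y_mixed
```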

Random erasing [3] and cutout [2] are techniques that randomly fill a rectangular part of the image; these also work well. Refer to the random erasing and cutout papers for details.
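A minimal sketch of random erasing in the spirit of the paper: the area and aspect-ratio ranges follow the paper, but the implementation details here are illustrative.

```python
import numpy as np

def random_erase(img, area_range=(0.02, 0.4), aspect_range=(0.3, 3.3)):
    """Overwrite one random rectangle of an (H, W, C) image with noise."""
    h, w = img.shape[:2]
    for _ in range(100):  # retry until a rectangle fits inside the image
        area = np.random.uniform(*area_range) * h * w
        aspect = np.random.uniform(*aspect_range)
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:
            top = np.random.randint(0, h - eh)
            left = np.random.randint(0, w - ew)
            img[top:top + eh, left:left + ew] = \
                np.random.uniform(0, 255, (eh, ew) + img.shape[2:])
            return img
    return img
```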

Here are the results:

Trained from scratch, with augmentations
  • The usual augmentations work.
  • Adding mixup improves results further.
  • Mixup alone, however, is not enough to rescue performance.

With all augmentations, the model performs perfectly on the easy test set, whose images are similar to the training set. But it suffers on the difficult, practical test data.

This classifier might work acceptably for hobby use, but not for real business. A 65% accuracy on the practical test is insufficient; it fails 35% of the time, like this:

The left result looks OK, but the center and right results are awful. This kind of result won't be accepted in business use.

Quality assurance check — CAM

By the way, if we want to adopt a new technology in business, we need to make sure it works 'properly' as part of quality assurance. Proving properness requires explaining why it works, but with deep learning it is fundamentally hard to explain how predictions are made.

One help is to visualize what was activated to predict a class, and CAM (Class Activation Map) is such a technique, explained well in the Keras book. A sketch of the recipe follows.
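This is a sketch of the gradient-based variant (Grad-CAM) along the lines of the Keras book's recipe, written for the Keras 2.x / TensorFlow 1.x backend used at the time. It assumes `model` is the trained VGG16-based classifier and `x` is a single preprocessed image batch of shape (1, 224, 224, 3); 'block5_conv3' is VGG16's last convolutional layer.

```python
import numpy as np
from keras import backend as K

def grad_cam(model, x, class_idx, layer_name='block5_conv3'):
    """Heatmap of where the model looked when predicting class_idx."""
    class_output = model.output[:, class_idx]
    conv_layer = model.get_layer(layer_name)
    grads = K.gradients(class_output, conv_layer.output)[0]
    pooled_grads = K.mean(grads, axis=(0, 1, 2))   # one weight per channel
    iterate = K.function([model.input],
                         [pooled_grads, conv_layer.output[0]])
    pooled_value, conv_value = iterate([x])
    for i in range(conv_value.shape[-1]):          # weight each channel
        conv_value[:, :, i] *= pooled_value[i]
    heatmap = np.maximum(np.mean(conv_value, axis=-1), 0)
    return heatmap / (heatmap.max() + 1e-8)        # normalize to [0, 1]
```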

Let's see how it looks with the best from-scratch model.

Left: activation map, Center: superimposed, Right: test image

This looks OK… no, it mostly looks at the rightmost side of the image. The most activated part is on another book (the red 'nlp' book).

Left: activation map, Center: superimposed, Right: test image

This is clearly bad: the model mostly looks at the upper-left area, which has nothing to do with the 'geeks' book.

Left: activation map, Center: superimposed, Right: test image

The next example is awful: the activation is on the 'technium' book, but the result is 'background'. Mr. Kevin Kelly, I'm sorry.

These examples show that the model is not predicting reasonably; it is not finding the book but memorizing how the training examples looked.

Training a model from scratch with a small dataset is almost impossible, even with strong augmentation.

4. Doing the normal thing: transfer learning with ImageNet pre-trained weights

Using ImageNet pre-trained weights for transfer learning is what we normally do with small datasets. A CNN performs well if it has good feature representations in its convolutional layers, and good convolutional layers can be obtained by training with a huge amount of data like ImageNet.

To use ImageNet weights, just set the `weights` parameter in Keras:

base_model = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
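For completeness, here is a minimal sketch of one common transfer-learning recipe built around that line. Freezing the convolutional base and using a 256-unit head are illustrative choices, not necessarily the repository's exact settings.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False          # reuse the ImageNet features as-is

x = Flatten()(base_model.output)
x = Dense(256, activation='relu')(x)
out = Dense(6, activation='softmax')(x)  # 5 book classes + background

model = Model(base_model.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```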

Here are the results when training from the pre-trained weights, again without any augmentation (yes, this is not the normal way…):

ImageNet based model, trained without augmentation

This looks good on the easy test.
(And it is likely what we see in many typical tutorials!)

But again, it doesn't work on the practical difficult test: a disappointing 35% accuracy.

Let's move on to using augmentation.

5. We still need strong augmentation, even with transfer learning

Tested the same way as above, here are the results.

ImageNet based model, trained with augmentations

We need the usual augmentation techniques, and a critical +10% can be gained by adding mixup.

Let's see why it is so critical. Here we check the CAM results for the 3rd and 4th models above.

ImageNet based model, trained with usual augmentation + mixup
ImageNet based model, trained with usual augmentation only

The first result shows activation correctly on the predicted book, and it seems the model used the center picture area to classify.

But the second result, from the model trained without mixup, shows that the model is distracted. It mainly looks at the edge of the image, which is just background. The model is not classifying by the appearance of the book, and that can be reason enough to drop this model or retrain it.

ImageNet based model, trained with usual augmentation + mixup
ImageNet based model, trained with usual augmentation only

These examples are interesting. Both models capture the lady drawn on the book and use her to classify, which is very good. The second result even captures the entire figure, from head to foot. This could be because we used an ImageNet pre-trained model, and ImageNet contains many images of people.

Summary

So far we have seen results from models trained under different conditions.

We trained with a small dataset here, but if we want to be sure of building a strong model, even with a bigger dataset, the following applies:

  • We need ImageNet (or similar) pre-trained weights for their generalization potential.
  • We need as much augmentation as possible to push generalization during training.

All the code behind the results in this post is included in the following GitHub repository. The code is organized in Jupyter notebooks; please visit and try it yourself.

Thanks for reading. This will continue in part 2, which will show a comparison with AutoML results.

References

[1] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," arXiv:1710.09412, 2017.
[2] Terrance DeVries, Graham W. Taylor, "Improved Regularization of Convolutional Neural Networks with Cutout," arXiv:1708.04552, 2017.
[3] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, Yi Yang, "Random Erasing Data Augmentation," arXiv:1708.04896, 2017.