Saksham Vikram
Jan 14, 2018 · 5 min read

This post follows this repo.


The Stanford Dogs dataset has around 20k images belonging to 120 classes, and each image has an annotation associated with it. First thought: that works out to roughly 180 images per breed available for training, which is very little by the standards of the data usually required to train a Convolutional Neural Network (CNN) classifier.


First I trained a CNN from scratch, but the accuracy was not acceptable given the little data per class. The notebook illustrating the experiment can be found here.

Since the amount of data we have is a constraint, we will use transfer learning, a technique that lets you reuse pretrained models on your own dataset. I have used the VGG16 and VGG16BN (VGG16 with Batch Normalisation) models. VGG16 is a deep CNN trained on the ImageNet dataset, which has around 1,000 synsets.


As described very well in the paper Visualizing and Understanding Convolutional Networks, the bottom layers of a convolutional neural net activate only on primitive features (colour, texture, shape, …), so these features transfer well to other applications. Here we replace the top layers (the fully connected layers and the softmax layer) and freeze the remaining layers so that they are non-trainable. We will also make use of synthetic image generation to account for the randomness of the images.
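The freeze-and-replace step described above can be sketched roughly as follows. This is a minimal sketch in tf.keras (the original post used standalone Keras); the head sizes are illustrative, and `weights=None` is used only so the snippet runs without the large ImageNet download — pass `weights='imagenet'` for actual transfer learning.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load VGG16 without its fully connected top.
# In practice: weights='imagenet' (downloads the pretrained weights).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze every convolutional layer

# New classifier head with a 120-way softmax, one unit per breed.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(120, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

With `base.trainable = False`, only the new head's weights are updated during training, which is exactly why the frozen layers' activations can later be precomputed once.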

Setting up the dataset

Download the dataset, extract it, and crop the images with the help of the annotations provided. Now we split the dataset into training, validation and test sets. This should be done carefully, ensuring there is no class imbalance across the chunks. The dataset can be converted into the TFRecords format, as this allows faster input and output operations. I won’t be explaining that here; see this page for further references. You can use this notebook for setting up the dataset, which includes everything from downloading to making the train, validation and test chunks.
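A per-class (stratified) split like the one described above can be sketched as below. `images_by_class`, the fractions, and the seed are assumptions for illustration — the notebook linked above is the authoritative version.

```python
import random

def stratified_split(images_by_class, val_frac=0.1, test_frac=0.1, seed=42):
    """Split each breed's images separately so no chunk is class-imbalanced.

    images_by_class: dict mapping breed name -> list of image file paths.
    Returns three lists of (path, breed) pairs: train, validation, test.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for breed, paths in images_by_class.items():
        paths = list(paths)
        rng.shuffle(paths)                       # shuffle within the class
        n = len(paths)
        n_val, n_test = int(n * val_frac), int(n * test_frac)
        val += [(p, breed) for p in paths[:n_val]]
        test += [(p, breed) for p in paths[n_val:n_val + n_test]]
        train += [(p, breed) for p in paths[n_val + n_test:]]
    return train, val, test
```

Because the split is done inside each class, every breed contributes the same fraction of its images to each chunk.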


First we use a CNN and train it over the training data with the default parameters and the Adam optimizer.

After 25 epochs:

Training Accuracy: 94.07%

Test Accuracy: 51.07%
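The from-scratch baseline above can be sketched like this. The exact layer sizes are not given in the post, so the ones here are illustrative, not the original architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A plain CNN trained from scratch: a few Conv/MaxPool blocks
# feeding a 120-way softmax classifier.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(120, activation='softmax'),
])
# Default parameters, Adam optimizer — as in the experiment above.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```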

Next we use the Keras pretrained VGG16 model and replace the top with fully connected layers and a softmax layer of 120 units, since we have 120 classes. Make sure you preprocess the input images exactly the way it is done in the VGG16 paper. Since the bottom layers are frozen, we can avoid unnecessary computation by passing the input images through once and saving the bottleneck features (the output of the last convolutional layer). These bottleneck features are then fed into the top model, and the network is trained. The best hyperparameters after tuning were:



After 50 epochs:

Training Accuracy: 97.8%

Test Accuracy: 40.23%
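The bottleneck-feature trick used above can be sketched as follows. `x_train` here is a random placeholder for the preprocessed training images (in practice, run them through `vgg16.preprocess_input`, which applies the same mean subtraction as the VGG16 paper), and `weights=None` stands in for `weights='imagenet'` so the snippet runs without the download.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen convolutional base (use weights='imagenet' in practice).
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# Pass the images through the base ONCE and save the activations of the
# last convolutional block — the "bottleneck features".
x_train = np.random.rand(4, 224, 224, 3).astype('float32')  # placeholder batch
bottleneck = base.predict(x_train, verbose=0)
np.save('bottleneck_train.npy', bottleneck)

# Only this small top model is trained, directly on the saved features,
# so the expensive base is never re-run each epoch.
top = models.Sequential([
    layers.Input(shape=bottleneck.shape[1:]),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(120, activation='softmax'),
])
top.compile(optimizer='adam', loss='categorical_crossentropy')
```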

This clearly shows the presence of large variance, i.e. overfitting, which can be mitigated with Batch Normalization, an L2 penalty, or Dropout. Next we make use of Dropout and Batch Normalization via the VGG16BN model. These regularization methods gave us a 4% increase in test accuracy, but our model still overfits the training data by a huge amount. Now it’s time to deploy our image data generator. You can read about this in depth here.

Learning Rate: 0.0001

After 20 epochs:

Training Accuracy: 88.23%

Test Accuracy: 76.53%
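The synthetic image generation step can be sketched with Keras’ `ImageDataGenerator`, which applies random rotations, shifts, zooms and flips on the fly during training. The specific ranges below are illustrative assumptions, not the post’s tuned values, and the random arrays stand in for real images and one-hot labels.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random augmentations applied independently to every batch.
datagen = ImageDataGenerator(
    rotation_range=30,       # rotate up to +/- 30 degrees
    width_shift_range=0.1,   # shift horizontally up to 10% of width
    height_shift_range=0.1,  # shift vertically up to 10% of height
    zoom_range=0.2,          # zoom in/out up to 20%
    horizontal_flip=True,    # mirror images at random
)

x = np.random.rand(8, 224, 224, 3).astype('float32')   # placeholder images
y = np.eye(120)[np.random.randint(0, 120, size=8)]     # placeholder one-hot labels

# datagen.flow yields endlessly varied batches; model.fit can consume it directly.
batch_x, batch_y = next(datagen.flow(x, y, batch_size=8))
```

Because each epoch sees a slightly different version of every image, the effective dataset size grows and the gap between training and test accuracy shrinks, which is exactly the effect in the numbers above.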


We trained on images cropped to the relevant region using the annotation files in the Stanford dataset. So when making a prediction on a random image, we can use an object detection algorithm like YOLO to locate the bounding box of the dog in the picture and then feed the cropped image to our model.

Caution: the effect of the object detection step on the prediction accuracy of the model depends on the accuracy of YOLO.

To test the accuracy of YOLO, we can compare the annotations in the dataset images with the bounding boxes produced by the YOLO algorithm.
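One standard way to score a predicted box against an annotation box is Intersection over Union (IoU); a sketch, assuming both boxes come as `(xmin, ymin, xmax, ymax)` tuples:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Averaging this score over the dataset (or thresholding it, say at 0.5) gives a direct measure of how well YOLO’s boxes match the ground-truth annotations.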


A notebook depicting the implementation of YOLO with non-max suppression can be found in the GitHub repo.


Aah! Let’s see where our model is going wrong. We can do this by plotting a confusion matrix like this:

(Confusion matrix over all 120 breeds)

This doesn’t look very clear, so it’s difficult to visualise anything. Let’s look at the top 30 misclassified pairs of breeds.

Top 30 misclassified pairs of breeds.
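A ranking like the one in the figure can be extracted from the confusion matrix by summing the off-diagonal counts for each unordered pair of classes; a sketch, where `cm[i, j]` counts images of class `i` predicted as class `j`:

```python
import numpy as np

def top_confused_pairs(cm, class_names, k=30):
    """Return the k most-confused unordered pairs of classes as
    (name_i, name_j, total confusions in both directions)."""
    cm = np.asarray(cm)
    n = cm.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            count = cm[i, j] + cm[j, i]  # i mistaken for j, plus j for i
            if count > 0:
                pairs.append((class_names[i], class_names[j], int(count)))
    pairs.sort(key=lambda p: -p[2])      # most confusions first
    return pairs[:k]
```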

As can be seen, the pair ‘Silky Terrier / Yorkshire Terrier’ is the absolute leader in misclassifications, which makes sense if we look at what these two breeds of dogs look like:

Silky Terrier (image taken from Google)
Yorkshire Terrier(Image taken from Google)

This looks like a case where the optimal Bayes error rate, approximated here by human-level performance, is itself substantial: even humans confuse these two breeds. For more details see this article.


We have seen how to train a decent model with even a modicum of data with the help of transfer learning.

Future Work:

Try out different pretrained models like Inception V3 or ResNet.
