Dogs Breeds Classification with Keras

Ilia Zaitsev
Apr 9, 2018


Modern deep learning architectures show good results in many fields of artificial intelligence, and image classification is one of them. In this post, I am going to see whether one can classify images accurately by applying out-of-the-box, ImageNet pre-trained deep models from the Keras Python package.

Note: The full notebook with the dataset analysis and model training scripts can be found here.

Analysed Dataset

The analyzed dataset comes from the Dog Breed Identification competition hosted on Kaggle. It contains approximately 10,000 labeled samples belonging to 120 classes, built from pictures in the ImageNet dataset, plus about the same amount of test data.

A sample of training images and their labels

The images differ in resolution and zoom, were taken in various lighting conditions, and some of them depict more than one dog. The following diagram shows the number of samples per dog breed:

Distribution of dog breeds in the dataset (59 samples per class on average)

Classes are more or less balanced (i.e., there are no classes with thousands of samples next to classes with just a few), as the histogram shows, with 59 samples per class on average.
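A balance check like this takes only a few lines. The sketch below assumes the labels are available as a plain list of breed names; the `labels` list is toy data made up for illustration, not the competition file.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts, the mean count, and the largest/smallest ratio."""
    counts = Counter(labels)
    mean = sum(counts.values()) / len(counts)
    ratio = max(counts.values()) / min(counts.values())
    return counts, mean, ratio


# Toy labels for illustration only; real labels would come from the dataset.
labels = ["beagle", "pug", "beagle", "boxer", "pug", "beagle"]
counts, mean, ratio = class_balance(labels)
print(counts.most_common(), mean, ratio)
```

A ratio close to 1 means the classes are nearly the same size, while a ratio in the hundreds would signal the kind of imbalance that needs special handling.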

Bottleneck Features

One of the most straightforward ways to apply pre-trained deep learning models is to use them as feature extractors. Before modern neural architectures were developed, image features were extracted using manually crafted filters (like the Sobel filter). Nowadays, features can be derived automatically from the data.

Note that all models and feature extractors mentioned in this post were run on a single 1080Ti GPU. Extracting features from a relatively large dataset (thousands of images) could take several hours on a CPU.

Schematically, a deep learning classifier can be represented as a sequence of blocks that transform the image representation from raw pixels into more and more abstract features like edges, contours, and so on.

As the picture shows, a deep learning model without its top layers generates a set of high-level abstract features for each input image. These high-level representations are called bottleneck features, i.e., the activations of a hidden layer taken right before they are fed into the dense layers. These abstract features, inferred automatically by the deep model, can then be used with any “classical” machine learning algorithm to predict targets.

Therefore, to extract features from images, one needs to restore a deep network with pre-trained weights but without its top layers, and run “predictions” on the images.
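As a rough illustration of that recipe (not the notebook's actual code), here is a minimal sketch. It assumes InceptionV3 as the base network and `pooling="avg"` to turn the last feature map into a flat vector; both are my choices for the example. `build_extractor`, `extract_features`, and `FakeModel` are hypothetical names.

```python
def build_extractor():
    """Create InceptionV3 with ImageNet weights but without its top layers.

    Keras is imported lazily so the rest of the sketch runs without it.
    pooling="avg" is an assumption: it yields one flat vector per image.
    """
    from keras.applications.inception_v3 import InceptionV3
    return InceptionV3(weights="imagenet", include_top=False, pooling="avg")


def extract_features(model, batches):
    """Run "predictions" batch by batch, collecting one vector per image."""
    features = []
    for batch in batches:
        features.extend(model.predict(batch))
    return features


class FakeModel:
    """Stand-in with the same predict() interface, used only to demo the loop."""

    def predict(self, batch):
        return [[float(len(batch))] for _ in batch]


# Two fake batches (2 images, then 1 image) instead of real image arrays.
demo = extract_features(FakeModel(), [[0, 0], [0]])
print(demo)
```

With a real model, each batch would be an array of images already resized and passed through the network's own `preprocess_input` function.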

The following snippet can be used to extract image features (as mentioned previously, the full implementation can be found here):

Line 15 creates a Keras model without top layers but with preloaded ImageNet weights. Then, lines 22–25 iterate through all available images and convert them into bottleneck features, which are saved at line 29. Note that we do not load all available images into memory at once; instead, we create a generator that reads files from disk in chunks. (I discuss this point a bit more thoroughly in this post.)

Then we can extract image features as the snippet below shows:
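The chunked reading can be sketched with a plain Python generator. This is only an illustration of the idea: `iter_chunks` and the file names are hypothetical, and the actual image loading is left out.

```python
def iter_chunks(paths, chunk_size):
    """Yield successive lists of at most `chunk_size` file paths."""
    for start in range(0, len(paths), chunk_size):
        yield paths[start:start + chunk_size]


# Hypothetical file names; a real run would list the image folder instead.
paths = ["img_%d.jpg" % i for i in range(5)]
chunks = list(iter_chunks(paths, 2))
print(chunks)
```

Only one chunk of images has to be decoded and held in memory at a time, so the peak memory use depends on `chunk_size` rather than on the dataset size.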

from keras.applications import inception_v3

extractor = FeatureExtractor(
extractor(folder_name, output_file)

Bootstrapped SGD

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions. In our case, we use a logistic regression classifier to predict dog breeds.

To make the training process more stable and repeatable, we’re going to extend SGD with bagging, an approach that trains an ensemble of SGD classifiers on different subsets of the training data and gives a final prediction by averaging the responses of the separate estimators. The following picture shows the idea schematically. (Note that in our case we are not only splitting the original dataset into subsets but also taking different subsets of features to train each classifier):

Bagging with SGD classifier

This approach gives more stable and reproducible results, because the accuracy of a single SGD instance can have high variance, i.e., change a lot from one training run to another.

Also, before the extracted bottleneck features were fed into the classifier’s training method, a variance threshold transformer was applied to filter out features whose values are too close to zero, because the feature vectors extracted by the networks seem to be quite sparse.
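To make the mechanics concrete, here is a pure-Python sketch of the two pieces bagging adds around a base classifier: picking a random subset of samples and features for each estimator, and averaging the predicted class probabilities. Sampling rows with replacement and the fractions used are assumptions for the example, and `sample_subset` and `average_predictions` are hypothetical helper names.

```python
import random


def sample_subset(n_samples, n_features, sample_frac, feature_frac, rng):
    """Pick row indices (with replacement) and column indices (without)."""
    n_rows = int(n_samples * sample_frac)
    rows = [rng.randrange(n_samples) for _ in range(n_rows)]
    n_cols = int(n_features * feature_frac)
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols


def average_predictions(per_estimator_probs):
    """Average the class-probability vectors of several estimators."""
    n = len(per_estimator_probs)
    return [sum(column) / n for column in zip(*per_estimator_probs)]


rng = random.Random(0)
rows, cols = sample_subset(10, 8, 0.5, 0.5, rng)

# Three estimators disagree on a two-class sample; the ensemble averages them.
avg = average_predictions([[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]])
print(rows, cols, avg)
```

Each estimator sees a different view of the data, so their individual errors partly cancel out in the average.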
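The filtering step can be illustrated without any library. The sketch below drops every feature column whose variance across samples does not exceed a threshold. The threshold value and the toy matrix are made up for the example, and `variance_filter` is a hypothetical name.

```python
from statistics import pvariance


def variance_filter(rows, threshold=0.0):
    """Keep only the columns whose variance across rows exceeds `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    filtered = [[row[i] for i in keep] for row in rows]
    return filtered, keep


# Toy feature matrix: column 0 is always zero, column 2 is almost always zero.
rows = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.0, 5.0, 0.1]]
filtered, keep = variance_filter(rows, threshold=0.01)
print(keep, filtered)
```

scikit-learn ships a `VarianceThreshold` transformer that behaves this way and is the natural choice for real feature matrices.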

The following snippets were used to train the SGD ensemble and to compute prediction metrics:

Line 4 creates a single SGD classifier instance (you can read more about its configuration in the scikit-learn package documentation) with a couple of regularization parameters and permission to use all available CPUs. Line 10 creates an ensemble of SGD classifiers. Lines 15–20 train the classifier and compute several performance metrics.

The following architectures were chosen to extract features which were used to train SGD classifiers:

  1. InceptionV3
  2. InceptionResNetV2
  3. Xception

Each classifier was trained on 9200 samples and validated on 1022 images. The table below shows prediction results achieved on training and validation subsets:

                    Training            Validation
------------------- ---------- -------- ---------- --------
network             accuracy   loss     accuracy   loss
------------------- ---------- -------- ---------- --------
InceptionV3         94.02%     0.3171   88.55%     0.4714
Xception            95.40%     0.2847   90.80%     0.4103
InceptionResNetV2   94.53%     0.1989   92.47%     0.3027

Not bad! These results don’t put you at the top of the leaderboard, but 92.47% accuracy on a dataset with 120 classes is a good result, considering how quickly it was achieved using modern deep learning frameworks and architectures.

Pre-trained Models Fine-Tuning

Training an ensemble of SGD classifiers on bottleneck features has shown that these features allow good prediction results. However, could we improve the classifier’s accuracy by fine-tuning the original models, re-training their top layers from scratch? Also, can we somehow preprocess our training set to make the model more robust to overfitting and to improve its generalization capability?

The purpose of the fine-tuning process is to adjust a pre-trained model to your data. In most cases, the model you’re going to re-use was trained on a dataset with a different number of classes, so you need to replace (at least) the top classification layer. The other layers can be “locked”, or frozen, i.e., their weights do not receive updates while the new top layer is trained.

Training a new top layer is not too different from the previous approach of using the network as a feature extractor, except that we use a different type of classifier (a one-layer feed-forward network). The main practical difference is that in this case we can apply data augmentation techniques. Each fine-tuned network was trained on slightly modified copies of images from the original dataset (a bit rotated, zoomed in, and so forth). With the previous approach, we would have to store all augmented images somewhere before showing them to the training algorithm.

As mentioned previously, please note that fine-tuning a neural network can take several hours, or even dozens of hours, on relatively large datasets or datasets with high-resolution images.

As for the implementation, Keras has a generator that yields augmented images for the models being trained. There is also an impressive library called Augmentor, which has a rich list of image augmentation operators.
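In Keras terms, freezing boils down to setting `trainable = False` on the layers you want to keep intact. The sketch below is an illustration under my own assumptions (InceptionV3 as the base, average pooling, a single softmax layer on top), not the notebook's code. `freeze` and `build_finetune_model` are hypothetical names.

```python
from types import SimpleNamespace


def freeze(layers):
    """Mark every given layer as non-trainable so its weights stay fixed."""
    for layer in layers:
        layer.trainable = False


def build_finetune_model(n_classes=120):
    """Frozen InceptionV3 base plus a new softmax layer for `n_classes`."""
    # Keras is imported lazily so the demo below runs without it.
    from keras.applications.inception_v3 import InceptionV3
    from keras.layers import Dense
    from keras.models import Model

    base = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
    freeze(base.layers)
    outputs = Dense(n_classes, activation="softmax")(base.output)
    return Model(inputs=base.input, outputs=outputs)


# Demo on stand-in objects that only have a `trainable` attribute.
demo_layers = [SimpleNamespace(trainable=True) for _ in range(3)]
freeze(demo_layers)
print([layer.trainable for layer in demo_layers])
```

Only the new `Dense` layer is trainable in the returned model, so each epoch updates a small fraction of the network's weights.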
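For reference, this is roughly how the Keras generator is wired up. The parameter values below are illustrative guesses rather than the ones used for the results in this post, and the sketch assumes a Keras version that still ships `ImageDataGenerator` and a folder with one sub-folder per class. `AUGMENTATION` and `make_augmented_flow` are hypothetical names.

```python
# Illustrative values only; they are assumptions, not the settings of this post.
AUGMENTATION = {
    "rotation_range": 15,     # rotate by up to 15 degrees
    "zoom_range": 0.2,        # zoom in or out by up to 20%
    "horizontal_flip": True,  # mirror images at random
}


def make_augmented_flow(folder, batch_size=128, target_size=(299, 299)):
    """Yield batches of randomly transformed images read from `folder`."""
    # Imported lazily so the configuration above can be inspected without Keras.
    from keras.preprocessing.image import ImageDataGenerator

    generator = ImageDataGenerator(**AUGMENTATION)
    return generator.flow_from_directory(
        folder,
        target_size=target_size,
        batch_size=batch_size,
        class_mode="categorical",
    )


print(sorted(AUGMENTATION))
```

The transformations are applied on the fly each time a batch is requested, so no augmented copies need to be written to disk.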

The following models were chosen to benchmark fine-tuning (almost the same as for SGD training, with one more architecture added):

  1. ResNet50
  2. InceptionV3
  3. Xception
  4. InceptionResNetV2

Each model was trained for up to 100 epochs with early stopping and 128 samples per batch, using the same optimizer, Stochastic Gradient Descent with Nesterov momentum enabled:

from keras.optimizers import SGD
sgd = SGD(lr=0.001, momentum=0.99, nesterov=True)
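Early stopping itself is a simple rule: stop once the validation loss has not improved for a given number of epochs in a row. The pure-Python sketch below shows that rule on made-up loss values. The patience of 3 is an assumption, and `early_stopping_epoch` is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops after `patience` consecutive epochs without a new best
    validation loss; otherwise it runs through the last epoch.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: the best value so far is reached at epoch 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
stop = early_stopping_epoch(losses, patience=3)
print(stop)
```

In Keras the same behaviour is available through the `keras.callbacks.EarlyStopping` callback with `monitor="val_loss"` and a `patience` argument.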

The following augmentation parameters were chosen:

from keras.preprocessing.image import ImageDataGenerator
transformer = ImageDataGenerator(

And here is what we got with these models:

Network             Val. acc   Val. loss   Public Score
------------------- ---------- ----------- --------------
ResNet50            73.39%     0.9239      0.940900
InceptionV3         88.26%     0.3446      0.343280
Xception            90.31%     0.3132      0.341680
InceptionResNetV2   91.59%     0.2561      0.280520

Here the Public Score column shows the loss value reported after submitting classification results to Kaggle.

Well, this is not a significant improvement over the previous results, but we have fine-tuned only a single fully-connected layer, which is not much different from the “shallow” classifiers we trained on bottleneck features. Nevertheless, data augmentation and a single-layer perceptron on top of the pre-trained InceptionResNetV2 network have shown the best result among all classifiers trained during this analysis.
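For readers who want to reproduce such numbers locally, here is a small sketch of the two metrics, assuming the reported loss is the multi-class log loss. The predictions are toy values, and `multiclass_log_loss` and `accuracy` are hypothetical helper names.

```python
import math


def multiclass_log_loss(true_labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the correct class."""
    total = 0.0
    for label, row in zip(true_labels, probs):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)


def accuracy(true_labels, probs):
    """Share of samples whose most probable class is the correct one."""
    hits = sum(
        1 for label, row in zip(true_labels, probs)
        if row.index(max(row)) == label
    )
    return hits / len(true_labels)


# Toy predictions for three samples and two classes.
y_true = [0, 1, 1]
y_prob = [[0.8, 0.2], [0.25, 0.75], [0.6, 0.4]]
loss = multiclass_log_loss(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), round(acc, 4))
```

Unlike accuracy, the log loss also penalizes correct but unconfident predictions, which is why two models with similar accuracy can get quite different scores.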

Example: Predicted Breeds

We’ve talked about loss and accuracy throughout the post. Let’s see what actual predictions look like, using a few images not present in any of the data subsets:

Running the model on brand new dog images (with one “spy” among them)

Finally, we’re getting (a probabilistic) answer to the question asked in the post’s heading picture. The picture most likely shows an American Staffordshire Terrier.

Note that the model is not sure at all about the breed of the cat in the bottom right picture. It means that we could use our model as a kind of “dog detector” (if we’re going to detect dogs similar to the ones from the training dataset, of course).
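One simple way to turn that observation into a detector is to threshold the highest predicted probability. The sketch below is only an illustration: the threshold of 0.5 is an arbitrary assumption that would need tuning on real data, and `detect_dog` is a hypothetical name.

```python
def detect_dog(probabilities, breeds, threshold=0.5):
    """Return the top breed if the model is confident enough, else None."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None  # no breed stands out: probably not a (known) dog
    return breeds[best]


breeds = ["beagle", "pug", "boxer"]
confident = detect_dog([0.1, 0.8, 0.1], breeds)
unsure = detect_dog([0.4, 0.3, 0.3], breeds)
print(confident, unsure)
```

A flat probability distribution like the second one is what the model produces for the cat picture, so it falls below the threshold and is rejected.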


Out-of-the-box models (as feature extractors or with fine-tuned top layers and data augmentation) have shown quite good results in classifying images from the dog breeds dataset while requiring minimal effort to train and use.

I believe that one could get much better results with the networks mentioned above by adding more top layers, applying regularization techniques, trying various optimization algorithms, or “unfreezing” more hidden layers.

Nevertheless, I would say that it is a good idea to start with a simple baseline model like the one shown in this post to set a “lower bound” on accuracy/loss metrics before trying more sophisticated solutions.

One of the drawbacks of this analysis is that the selected dataset, as was said previously, was built by taking canine class images from ImageNet. It means that we are running our networks on data they have probably seen already. In the next post, I am going to pick a more interesting dataset to see how far we can go with modern libraries and pre-trained models.

Interested in Python language? Can’t live without Machine Learning? Have read everything else on the Internet?

Then probably you would be interested in my blog where I am talking about various programming topics and provide links to textbooks and guides I’ve found interesting.


