Dog Breed Classifier using CNNs

Ankit Kumar Saini

Published in Nerd For Tech, Apr 20, 2021

The capstone project for Udacity’s Data Scientist Nanodegree Program

Project Overview

The goal of the project is to create an end-to-end deep learning pipeline that classifies images of dogs according to their breed. The pipeline accepts any user-supplied image as input and predicts whether a dog or a human is present in it. If a dog is detected, it provides an estimate of the dog’s breed. If a human is detected, it provides an estimate of the dog breed that the person most resembles.

Problem Statement

In this project, I am given RGB images of humans and dogs and asked to design and implement an algorithm that can detect humans (human detector) or dogs (dog detector) in the images. After detecting a human or a dog, the algorithm must predict the dog’s breed (if a dog is detected) or the most closely resembling dog breed (if a human is detected). If neither is detected, the algorithm should ask the user to supply another image containing either a dog or a human.

Metrics

To evaluate the performance of my algorithm, I used classification accuracy as the performance metric. All three models (the human detector, the dog detector, and the dog breed classifier; more on these later) were evaluated by the accuracy they obtained in classifying the images.

Accuracy is a reasonable choice of performance metric for these models. The human detector is evaluated on 100 images of humans and 100 images of dogs (balanced data). Similarly, the dog detector is evaluated on 100 images each of humans and dogs. The dog breed classifier is validated on the validation set to estimate its generalization performance and then finally evaluated on the test set.

Other performance metrics that could have been used are F1-score, AUC-ROC curve, and confusion matrix.
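As a rough illustration of these metrics, the sketch below computes accuracy and a confusion matrix with NumPy. The function names and the toy labels are made up for this example and are not taken from the project code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def confusion_matrix(y_true, y_pred, num_classes):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Toy labels for illustration only: 0 = no face, 1 = face.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, num_classes=2)
```

On a balanced evaluation set such as the 100 + 100 images used for the detectors, accuracy alone is informative; the confusion matrix additionally shows which kind of mistake (missed detection or false alarm) dominates.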

Data Exploration and Visualization

Two different datasets are used in this project. The first dataset is Labeled Faces in the Wild, containing 13,233 human images. All images are 250x250 pixels. The majority of the images contain a single human face, but some also include an additional partial face. This shouldn’t be a problem for the human detector, because face detector models are robust to small amounts of noise in the image.

The second dataset is the dog breed dataset, containing 8351 dog images across 133 dog breed categories. The dataset is not perfectly balanced. The mean number of training images per class is around 50, but a few classes have fewer than 30 images while some have more than 70. This small imbalance could pose a problem when training the dog breed classifier. It could be addressed by over-sampling the minority classes, under-sampling the majority classes, or applying data augmentation.
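One way to over-sample the minority classes is to draw extra training indices, with replacement, until every class is as large as the biggest one. The sketch below shows the idea with NumPy; `oversample_indices` and the toy label array are hypothetical and not part of the project.

```python
import numpy as np

def oversample_indices(labels, seed=0):
    """Return sample indices in which every class appears as often as the largest class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(labels == cls)
        # Keep every original sample, then draw the missing ones with replacement.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        picked.append(np.concatenate([cls_idx, extra]))
    return np.concatenate(picked)

# Toy labels: class 0 has 4 samples, class 1 has 2, class 2 has 1.
labels = np.array([0, 0, 0, 0, 1, 1, 2])
balanced = oversample_indices(labels)
balanced_counts = np.bincount(labels[balanced])
```

Over-sampling of this kind should only be applied to the training split, so that the validation and test sets keep their original class distribution.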

A few random samples of dog breeds from the training data are shown below.

Project Roadmap

The entire project has been divided into 9 sections (Steps 0–8).

Step 0: Import and Preprocess Datasets
Step 1: Implement Human Detector
Step 2: Implement Dog Detector
Step 3: CNN to Classify Dog Breeds (from Scratch)
Step 4: CNN to Classify Dog Breeds (using Transfer Learning)
Step 5: Write the Algorithm to predict dog breeds
Step 6: Test the Algorithm
Step 7: Conclusion
Step 8: Tips to improve the performance

Step 0: Import and Preprocess Datasets

All CNN models in Keras require a 4D array/tensor as input with shape (batch_size, image_height, image_width, num_channels). To train a CNN in batches, every image must have the same shape. Therefore the input images for the dog detector and the dog breed classifier need to be resized to a common shape.

Getting the 4D tensor ready for a pre-trained CNN model in Keras requires some additional processing. For the models with Caffe-style preprocessing (such as VGG16, VGG19, and ResNet50), the RGB image is first converted to BGR by reordering the channels. Then the mean pixel, calculated from all pixels in all images in ImageNet and expressed in BGR order as [103.939, 116.779, 123.68], is subtracted from every pixel in each image. Other models, such as InceptionV3 and Xception, instead scale the pixel values to the range [-1, 1]. In each case the step is handled by the preprocess_input function that Keras provides for that model.
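To make the tensor shape and the mean-subtraction step concrete, here is a NumPy-only sketch of what Caffe-style preprocessing does to a batch of images. In practice the Keras preprocess_input function would be used instead; `to_batch` and `caffe_style_preprocess` are illustrative names, and the dummy images are assumed to be already resized to 224x224.

```python
import numpy as np

# ImageNet channel means in B, G, R order, as used by Caffe-style preprocessing.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def to_batch(images):
    """Stack equally sized (H, W, 3) images into one (N, H, W, 3) tensor."""
    return np.stack([np.asarray(img, dtype=np.float32) for img in images], axis=0)

def caffe_style_preprocess(batch_rgb):
    """Reorder the channels from RGB to BGR and subtract the ImageNet mean pixel."""
    batch_bgr = batch_rgb[..., ::-1]
    return batch_bgr - IMAGENET_MEAN_BGR

# Two dummy 224x224 RGB images filled with the constant colour (R=10, G=20, B=30).
images = [np.full((224, 224, 3), (10, 20, 30), dtype=np.uint8) for _ in range(2)]
batch = to_batch(images)
processed = caffe_style_preprocess(batch)
```

After preprocessing, the first channel of every pixel holds the blue value minus 103.939, the second the green value minus 116.779, and the third the red value minus 123.68.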

The images fed into the human detector model need to be converted from color to grayscale.

Step 1: Implement Human Detector

I used the pre-trained Haar cascade face detector from the OpenCV library to determine whether a human is present in the image. The code to detect human faces with a Haar cascade is short and straightforward. First, create an instance of the classifier with pre-trained weights, read the image into memory (OpenCV loads images in BGR channel order), and convert it to grayscale. Then call the detectMultiScale function, which runs the classifier on the grayscale image and returns the bounding-box coordinates of each detected face. The code snippet and detection results are shown below.

Step 2: Implement Dog Detector

To detect dogs in the images, I used a pre-trained ResNet50 model. This model has been trained on ImageNet, a very large and popular dataset used for image classification and other vision tasks. It takes an input image as a 4D tensor and predicts which of the 1000 ImageNet categories the object in the image belongs to; several of these categories are dog breeds. ResNet50 can therefore be used as a dog detector: if the predicted class falls into one of the dog breed categories, a dog is present in the image.
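The check itself is small. In the ImageNet class index, the dog breeds occupy the contiguous range 151 to 268 (inclusive), so the detector only has to test whether the top prediction falls in that range. The sketch below keeps the model as a parameter so the logic can be read on its own; `is_dog` and `dog_detector` are illustrative names.

```python
import numpy as np

# ImageNet class indices 151 to 268 (inclusive) are dog breeds.
FIRST_DOG_INDEX, LAST_DOG_INDEX = 151, 268

def is_dog(class_probs):
    """Return True if the most likely ImageNet class is a dog breed."""
    top_class = int(np.argmax(class_probs))
    return FIRST_DOG_INDEX <= top_class <= LAST_DOG_INDEX

def dog_detector(model, tensor):
    """Run a 1000-class ImageNet model on a preprocessed (1, H, W, 3) tensor."""
    return is_dog(model.predict(tensor)[0])
```

With Keras, `model` could be `ResNet50(weights="imagenet")` from `tensorflow.keras.applications.resnet50`, with the input tensor first passed through that module's `preprocess_input`.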

The detection of human and dog in the images is crucial for the project as I am going to use these models along with the dog breed classifier.

Step 3: CNN to Classify Dog Breeds (from Scratch)

Before jumping to transfer learning, I built a CNN model from scratch. The model is simple, neither too deep nor too shallow. It has five blocks, each consisting of a Conv2D layer followed by a MaxPooling2D layer. I added a Dropout layer after every two blocks of Conv2D and MaxPooling2D layers to reduce overfitting. The code to create a Sequential model in Keras along with its summary is shown below.
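A sketch of an architecture matching this description is given below. Only the overall layout (five Conv2D + MaxPooling2D blocks, Dropout after every second block) comes from the text; the filter counts, kernel size, dropout rate, input size, and the GlobalAveragePooling2D + Dense head are assumptions chosen for illustration.

```python
from tensorflow.keras import layers, models

def build_scratch_cnn(input_shape=(224, 224, 3), num_classes=133):
    """Five Conv2D + MaxPooling2D blocks, with Dropout after blocks 2 and 4."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Filter counts and dropout rate are illustrative guesses.
    for block, filters in enumerate([16, 32, 64, 128, 256], start=1):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
        if block in (2, 4):
            model.add(layers.Dropout(0.3))
    # Assumed classification head: one probability per dog breed.
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

model = build_scratch_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Calling `model.summary()` prints the layer-by-layer output shapes and parameter counts.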

I trained the model for 6 epochs with a batch size of 20 without any data augmentation. The model didn’t perform well and achieved only 6% accuracy on the test data.

Step 4: CNN to Classify Dog Breeds (using Transfer Learning)

What is Transfer Learning? Transfer learning is a machine learning method in which a model trained for one task is used as the starting point for a model on a similar task. It is a very popular technique in deep learning for getting good performance with less computation, even on small datasets.

I used five different models with pre-trained weights to classify dog breeds. The models include VGG16, VGG19, InceptionV3, ResNet50, and Xception. All these models are trained on the ImageNet dataset. The images in the dog breed dataset are similar to the images in the ImageNet dataset and hence transfer learning would be suitable for this task. These pre-trained models can act as very good feature extractors from images and can be used in this breed classification task by transferring what the models have already learned.

Steps to build a classification model using a pre-trained model

1. Use all the layers of the pre-trained model except the classification head as a feature extractor.
2. Freeze the weights of the pre-trained model to prevent them from degrading during training.
3. Add a few new trainable layers on top of the frozen layers. The first layer could be a GlobalAveragePooling2D layer or a Flatten layer. You can also add a Dropout layer to reduce overfitting and a BatchNormalization layer for smoother training.
4. Build the new model. You can use the Sequential or Functional API from Keras.

I used a GlobalAveragePooling2D layer followed by a Dense classification layer with softmax activation that calculates the probability of each dog breed, on top of the pre-trained model.
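The steps above can be sketched with the Keras Functional API as follows. This is one possible way to wire the layers, not necessarily how the project did it; the 224x224 input size and the function name `build_transfer_model` are assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

def build_transfer_model(num_classes=133, weights="imagenet", input_shape=(224, 224, 3)):
    """Frozen Xception feature extractor with a small trainable softmax head."""
    # include_top=False drops the original 1000-class classification head.
    base = Xception(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze all pre-trained layers
    inputs = layers.Input(shape=input_shape)
    x = base(inputs, training=False)  # keep BatchNormalization in inference mode
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use. Input images should go through `tensorflow.keras.applications.xception.preprocess_input` before being fed to the model, and only the Dense layer's kernel and bias are updated during training.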

Evaluation of models

When trained without data augmentation, all the models (VGG16, VGG19, InceptionV3, ResNet50, Xception) overfit the training dataset. The training loss kept decreasing with every epoch, while the validation loss decreased very slowly (it appeared to saturate) and stayed much higher than the training loss. Similarly, the training accuracy was much higher than the validation accuracy. The table below summarizes the results of all the models on the training and validation datasets.

Of all the models trained, the Xception model performed the best on the validation dataset. It achieved an accuracy of 81.32% on the validation data while the accuracy of other models was below 70% on the validation data. The graph below shows the training and validation loss and accuracy curves for the Xception model without data augmentation.

Refinement: Training Xception model with Data Augmentation

To improve the performance of the Xception model, I applied data augmentation (random rotation, flipping, shearing, and zooming) to the images using the Keras ImageDataGenerator. This reduced the overfitting, and the accuracy of the model reached 81.7% on the validation dataset. The code for data augmentation and the loss and accuracy plots on training and validation data are shown below.
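A possible ImageDataGenerator configuration for these four operations is sketched below. The specific ranges are illustrative guesses rather than the values used in the project. Note that recent TensorFlow releases mark ImageDataGenerator as deprecated in favour of preprocessing layers, although it remains importable from tensorflow.keras.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation ranges are illustrative guesses, not the project's values.
train_datagen = ImageDataGenerator(
    rotation_range=20,      # random rotations of up to 20 degrees
    shear_range=0.2,        # random shearing
    zoom_range=0.2,         # random zoom in and out by up to 20%
    horizontal_flip=True,   # random left-right flips
)

# Validation and test images are left untouched.
valid_datagen = ImageDataGenerator()
```

A generator created with `train_datagen.flow(train_tensors, train_targets, batch_size=20)` can then be passed to `model.fit`, where `train_tensors` and `train_targets` stand for the training arrays.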

To further improve the performance, I unfroze the last two blocks of the Xception model to fine-tune them. As a result, the accuracy of the model reached 83% on the validation dataset and 83.5% on the test dataset. The model predicted the majority of classes without confusing them with other classes. However, there are a few classes in the test data that the model does not classify correctly. The confusion matrix generated by the model on the test data shows these results.
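Unfreezing part of a pre-trained model can be sketched as a helper that makes only selected layers trainable, chosen here by layer-name prefix. The helper name and the tiny stand-in model are hypothetical. For the Keras Xception model, the prefixes "block13" and "block14" would select the named layers of the last two blocks, though which layers the project actually unfroze is an assumption.

```python
from tensorflow.keras import layers, models

def unfreeze_by_prefix(base_model, prefixes):
    """Make only layers whose names start with one of `prefixes` trainable."""
    base_model.trainable = True  # the parent must be trainable for its layers to train
    for layer in base_model.layers:
        layer.trainable = layer.name.startswith(tuple(prefixes))
    return base_model

# Tiny stand-in model so the sketch runs without downloading Xception weights.
demo = models.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(4, name="block1_dense"),
    layers.Dense(2, name="block2_dense"),
])
unfreeze_by_prefix(demo, prefixes=("block2",))
```

After changing which layers are trainable, the model needs to be compiled again, and a small learning rate is usually chosen so that the pre-trained weights change only gradually.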

Step 5: Write the Algorithm to predict dog breeds

The final algorithm uses the human detector (Haar cascade) and the dog detector (ResNet50) to determine whether a human or a dog is present in the image. Once the presence of a human or a dog is confirmed, the image is passed to the dog breed classifier (Xception) to determine which breed the human or dog most resembles. The predicted breed is printed along with whether a human or a dog was detected. If the image contains neither, a new image with a human or a dog is requested.
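The control flow can be sketched as a small function that receives the two detectors and the breed classifier as callables. The function name, the message wording, and the choice to run the dog check before the human check are assumptions made for this illustration.

```python
def describe_image(img, detect_dog, detect_human, predict_breed):
    """Combine the two detectors and the breed classifier into one message."""
    # Assumption: the dog check runs first, so a dog wins if both detectors fire.
    if detect_dog(img):
        return f"Dog detected. Predicted breed: {predict_breed(img)}"
    if detect_human(img):
        return f"Human detected. Most similar breed: {predict_breed(img)}"
    return "Neither a dog nor a human was detected. Please provide another image."

# Stand-in callables so the control flow can be tried without any trained model.
message = describe_image(
    "photo.jpg",
    detect_dog=lambda img: False,
    detect_human=lambda img: True,
    predict_breed=lambda img: "Beagle",
)
```

Passing the models in as arguments keeps the decision logic separate from model loading, which makes it easy to try with stand-ins like the lambdas above.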

Step 6: Test the Algorithm

The results of the algorithm are shown below:

1. Algorithm results on human and dog images.

2. Algorithm results on images without human or dog.

Step 7: Conclusion

This project serves as a good starting point for entering the domain of deep learning. Data exploration and visualization are extremely important before training any machine learning model, as they help in choosing a suitable performance metric for evaluating the model. CNN models in Keras need image data in the form of a 4D tensor. All images need to be resized to the same shape to train the CNN models in batches.

Building CNN models from scratch is extremely simple in Keras. But training CNN models from scratch is computationally expensive and time-consuming. There are many pre-trained models available in Keras (trained on ImageNet dataset) that can be used for transfer learning.

The most interesting thing to note is the power of transfer learning to achieve good results with little computation. It works well when the new task is similar to the task on which the pre-trained model weights were optimized.

Step 8: Tips to improve the performance

Get more images per class
Make the dataset balanced
Use image augmentation methods such as CutOut, MixUp, and CutMix
Use VAEs/GANs to generate artificial data
Use activation maps to interpret the model predictions
Use deep learning-based approaches to detect human faces (MTCNN)

Hope you enjoyed reading this post.

You can find all the code in my GitHub Repo

LinkedIn handle: ankit-kumar-saini