The Dog Whisperer

Wesley Wang
7 min read · Jun 29, 2020


Someone please tell me what breed this is…

The Challenge

The problem at hand was a simple one, but it had long troubled humanity: what kind of dog is that?

The goal was to harness the power of deep learning and develop an algorithm capable of not only recognizing our canine friends in images but also identifying their breed correctly. This algorithm laid the foundation for a mobile or web app that could accept a user-supplied image and provide a result instantly. With that, humanity would no longer mistake Whippets for Greyhounds.

In addition, the mobile or web app would also be able to detect human faces and determine the dog breeds most similar to those particular faces. One could finally know, once and for all, whether Samuel L. Jackson really looked like a Pug.

Visualization & Exploration

The dataset for training, validating, and testing was kindly arranged by Udacity as part of the Data Science Nanodegree Capstone project. The set consisted of 8,351 dog images covering a total of 133 different breeds. A separate set of 13,233 human images was also provided.

In the dog image dataset, the average number of images per breed was 133, with a standard deviation of 15. Since no statistical outliers were observed, the dataset did not require further rebalancing of the sample sizes of individual breeds. A single image was found to be blank during this process and was consequently removed from the dataset.

Image counts based on breeds

The average image resolution across the 133 different breeds was roughly 400k pixels with a standard deviation of 170k. Despite the relatively wide standard deviation and 4 statistical outliers in the group, their impact on learning quality was likely minimal, especially when compared to that of an uneven distribution of image counts.

Resolutions based on breeds

An incremental approach would be taken: the individual components of the algorithm would be experimented with and analyzed before being aggregated into a functioning whole.

Metrics

Given the composition of the image data as well as its structure, it was deemed appropriate and achievable to set the metrics as follows:

  • Recognize a dog or human in an image and identify the respective breed with a 75.0% accuracy rate. 75.0% was a reasonable compromise between usability and the presence of mixed-breed dogs.
  • Complete the above in under 30 seconds, considering future deployment as a mobile app.

Methods

Human Face Detection

A simple and relatively reliable tool for human recognition was OpenCV’s Haar feature-based cascade classifier. The image of interest was first converted to grayscale before being processed by OpenCV. The result was a set of coordinates and dimensions for the bounding box that highlighted the detected human face.
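
In practice, the detection step could be sketched roughly as follows; the cascade file and the face_detector helper are illustrative choices rather than the exact code used in the project:

```python
import cv2

# Load a pre-trained Haar cascade for frontal faces.
# cv2.data.haarcascades points to the cascades bundled with opencv-python.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_detector(img_path):
    """Return True if at least one human face is detected in the image."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # grayscale first
    faces = face_cascade.detectMultiScale(gray)   # one (x, y, w, h) per face
    return len(faces) > 0
```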

A random sample of 100 human images was fed to the classifier, and OpenCV was able to detect a face in all 100 cases. However, when a random sample of 100 dog images was analyzed, OpenCV mistook 11 of the 100 for humans.
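
A spot check along those lines might look like the sketch below; the directory layout and variable names are assumptions, and the sample simply draws 100 random paths from each set:

```python
from glob import glob
import numpy as np

# Assumed directory layout for the two image sets (placeholder paths).
human_files = np.array(sorted(glob("lfw/*/*.jpg")))
dog_files = np.array(sorted(glob("dogImages/*/*/*.jpg")))

human_sample = np.random.choice(human_files, 100, replace=False)
dog_sample = np.random.choice(dog_files, 100, replace=False)

human_rate = np.mean([face_detector(p) for p in human_sample])
dog_rate = np.mean([face_detector(p) for p in dog_sample])
print(f"Faces found in human images: {human_rate:.0%}")
print(f"Faces mistakenly found in dog images: {dog_rate:.0%}")
```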

Dog Detection

To better differentiate dogs from humans, the pre-trained ResNet-50 model with weights from ImageNet was adopted. Equipped with a neural network of 50 layers, ResNet-50 proved to be an effective tool for separating dogs from humans. When given the same test of 100 dog and 100 human images, ResNet-50 made the correct distinction in all cases.
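
A dog detector built on top of ResNet-50 could be sketched as follows; it leans on the fact that, in the 1,000-class ImageNet labelling, indices 151 through 268 correspond to dog breeds (the helper names are illustrative):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# ImageNet-pretrained ResNet-50 used purely as a dog/no-dog classifier.
resnet50_model = ResNet50(weights="imagenet")

def path_to_tensor(img_path):
    """Load an image and reshape it into a (1, 224, 224, 3) tensor."""
    img = image.load_img(img_path, target_size=(224, 224))
    return np.expand_dims(image.img_to_array(img), axis=0)

def dog_detector(img_path):
    """Return True if ResNet-50's top prediction falls in the dog classes."""
    x = preprocess_input(path_to_tensor(img_path))
    predicted_class = np.argmax(resnet50_model.predict(x))
    return 151 <= predicted_class <= 268
```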

Dog Breed Identification

ResNet-50 was not only an effective tool for canine recognition, but it was also highly accurate and efficient at identifying breeds. However, a convolutional neural network (CNN) built from scratch to compete with ResNet-50 was more fitting for learning purposes and the scope of this challenge.

CNNs were inspired by the biological visual cortex, where individual neurons respond to stimuli only in a restricted region of the visual field. Instead of reacting to electrical or chemical signals, neurons in a neural network consume tensors that contain information (coordinates and colors) about image pixels. This process can nevertheless be quite demanding in terms of computational resources, so CNNs introduce convolution layers that divide the image into smaller, more manageable patches. The entire operation resembles using a magnifying glass to examine a painting bit by bit and piecing together all the information to construct a bigger picture.

An illustration of a convolutional neural network

The basic architecture of the CNN in the first attempt to identify dog breeds contained 8 individual layers arranged in 3 repeated convolution-and-pooling blocks. Three activation layers set to ReLU (rectified linear unit) were embedded in the convolution layers to create non-linearity and improve computational efficiency. The final output was an array of 133 different breeds, which aligned with the initial parameters.

CNN layers from the first attempt
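
One plausible way to express that architecture in Keras is sketched below; the filter counts and kernel sizes are assumptions rather than the exact values used in the first attempt:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense)

# Three convolution/pooling blocks with ReLU activations,
# pooled into a 133-way softmax (one node per breed).
model = Sequential([
    Conv2D(16, kernel_size=2, activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D(pool_size=2),
    Conv2D(32, kernel_size=2, activation="relu"),
    MaxPooling2D(pool_size=2),
    Conv2D(64, kernel_size=2, activation="relu"),
    MaxPooling2D(pool_size=2),
    GlobalAveragePooling2D(),
    Dense(133, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```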

After training and validating the CNN with 5 epochs, the model was able to identify dog breeds with only a 2.6% accuracy rate. Obviously, more polishing and tuning would be required to enhance the CNN’s reliability.

Convolutional Neural Network + Transfer Learning

In the previous sections, both OpenCV and ResNet-50 proved reliable for the tasks assigned: recognizing humans and dogs, respectively. To further improve the capacity of the CNN, however, transfer learning would be adopted in place of training from scratch.

Transfer learning overcame the shortcomings of isolated learning and allowed knowledge acquired from prior tasks to be reused on future, related problems. For instance, a model trained to track player movements on a basketball court could also be extended to follow dancers’ motion in similar settings.

CNN layers with transfer learning

Keras offered an extensive list of deep learning models with pre-trained weights, and in this case InceptionV3 was selected as the transfer learning model of choice. As seen above, the architecture was much more concise, having only 2 layers. The runtime was significantly shortened, even though the number of epochs was increased to 20 during training and validation. As a result, the new CNN with transfer learning achieved a much improved 80.0% accuracy rate.
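
That two-layer head could be sketched as follows, assuming InceptionV3 bottleneck features of shape (5, 5, 2048) have already been extracted for each image; the variable names are placeholders:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

# Only the small classification head is trained; InceptionV3 itself
# stays frozen and is used as a fixed feature extractor upstream.
transfer_model = Sequential([
    GlobalAveragePooling2D(input_shape=(5, 5, 2048)),
    Dense(133, activation="softmax"),
])
transfer_model.compile(optimizer="rmsprop",
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])

# transfer_model.fit(train_bottleneck, train_targets,
#                    validation_data=(valid_bottleneck, valid_targets),
#                    epochs=20, batch_size=20)
```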

Putting Everything Together

To sum up all the components discussed so far: an image of interest would first go through ResNet-50. If a dog was detected, the transfer learning CNN would determine the breed. If no dog was detected, OpenCV would then attempt to recognize a human in the image. If OpenCV found a human, the transfer learning CNN would determine the dog breed most similar to that particular human. If OpenCV was unable to detect any human, the algorithm would notify the user that neither a dog nor a human could be found in the image.
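
The combined control flow could be sketched like this; dog_detector, face_detector, and predict_breed are placeholder names for the components described above:

```python
def classify_image(img_path):
    """Run the full pipeline on a single image and return a message."""
    if dog_detector(img_path):
        return f"Dog detected! Predicted breed: {predict_breed(img_path)}"
    if face_detector(img_path):
        return f"Human detected! Most similar breed: {predict_breed(img_path)}"
    return "Neither a dog nor a human could be found in the image."
```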

Results

The algorithm was able to meet and exceed the two main evaluation metrics established previously, with an 80.0% accuracy rate and an average completion time of under 30 seconds (for the first input in a series of multiple test images). Execution time nonetheless decayed noticeably as slightly less computational resources became available after each test sample. Considering that most humans struggle with identifying dog breeds, the algorithm at times might appear even more powerful than its 80.0% accuracy rate suggests; it would most likely take an expert in the field to challenge it.

However, when it came to identifying the dog breeds that best represented the humans in the images, the algorithm seemed to favor certain breeds. During random trials, one might expect Dachshund or Cocker Spaniel to be the prediction more often than usual. Such bias could potentially be caused by a close resemblance in features between humans and those two breeds that is not apparent to the naked eye.

When given images with no dog or human, the algorithm performed well in most cases but failed to differentiate real humans from human-like objects, such as statues and portraits. Overall, for the purpose of this challenge, the algorithm proved sufficient for identifying dog breeds.

Reflection & Improvements

The model was trained on a sizable pool of images and integrated with a pre-trained model for better efficiency and accuracy. A few modifications to the two processes above could potentially lead to better performance:

  • Intuitively, one could train with a larger pool of images to provide more data for reference. The quality of the training images could also be improved, at the cost of computational resources.
  • The training images could be augmented through random transformations so the model would never see the exact same image twice (see the sketch after this list). As a result, the model could avoid overfitting and generalize better.
  • Fine-tuning hyperparameters like batch size, validation settings, and weights could sharpen the response of the algorithm.
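
As an illustration of the augmentation idea above, Keras’ ImageDataGenerator could apply random transformations on the fly; the parameter values are examples rather than tuned settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transformations so the model rarely sees the exact same image twice.
augmenter = ImageDataGenerator(
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # mirror images left/right
    zoom_range=0.1)          # random zooms

# train_generator = augmenter.flow_from_directory(
#     "dogImages/train", target_size=(224, 224), batch_size=20)
```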

Another critical aspect of the CNN was striking a delicate balance between accuracy and efficiency. With the assistance of GPUs, one might be able to train and deploy a model relatively quickly. Nonetheless, one would need to take into account the fact that GPUs would not always be accessible to a mobile or web app.
