How to Train Your ResNet: The Jindo Dog

An implementation of Transfer Learning, Grad-CAM and other techniques for image classification with a ResNet-50 convolutional neural network

Thierry Laplanche
Analytics Vidhya
10 min read · Feb 11, 2020


“Little Forest”, Yim Soon-rye, 2018.

In this article, we talk about convolutional neural networks, transfer learning, bottleneck features, Grad-CAM and t-distributed Stochastic Neighbor Embedding. A lot of fancy words to sound smart. In layman’s terms, we will experiment with an image recognition algorithm and try to have it recognize something it was not trained to recognize: Jindo dogs. We will also look at a technique to help understand the algorithm’s decision, and another one to group similar images together.

The code and visualizations for this experiment are available here: Collecting images / Pretrained model / Transfer learning and Grad-CAM / Bottleneck features and similarity search / Deployment to Google AI Platform

A. Introducing ResNet-50 and Jindo dogs

Researchers have spent vast amounts of computational power training deep neural networks that classify images into categories. Many were trained on a subset of ImageNet (a huge database of 14 million images manually labeled with more than 21,000 categories) as part of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which ran from 2010 to 2017 and saw substantial improvements every year. The implementations differ in architecture and performance, but all are “convolutional neural networks” (CNNs), a type of deep learning network that excels at detecting features in images.

Luckily, these pre-trained models have been made available for anyone to experiment with, saving us the time and expense of training from scratch. Among them, ResNet-50 (a “Residual Network” with 50 layers) offers a good compromise between speed and accuracy.

As with other models developed for the ImageNet challenge, ResNet-50 was trained on more than 1 million images distributed over 1,000 categories. These include around 120 breeds of dogs, from the Affenpinscher to the Weimaraner (two German breeds), still far from the 344 breeds recognized by the Fédération Cynologique Internationale (FCI).

Among the breeds that are missing from the ImageNet database: the Jindo dog, registered under no. 334 by the FCI, a spitz-type hunting dog originating from the island of the same name, off the southwestern coast of the Korean peninsula. According to Royal Canin, the Jindo is bold, brave, alert, attentive, exceptionally loyal, and “not easily tempted”.

So what does ResNet-50 see, when it sees a Jindo? Let’s try with a few pictures.
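
Here is a minimal sketch of this baseline test in Keras: it prints the stock model’s top-2 ImageNet predictions for one picture (the filename jindo.jpg is an illustrative placeholder, not the actual dataset):

```python
# Baseline: top-2 ImageNet predictions from the stock ResNet-50.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")  # full model, with its 1,000-way classifier

img = image.load_img("jindo.jpg", target_size=(224, 224))  # illustrative path
x = preprocess_input(image.img_to_array(img)[np.newaxis, ...])

# Prints a list like [('n02109961', 'Eskimo_dog', ...), ('n02115641', 'dingo', ...)]
print(decode_predictions(model.predict(x), top=2)[0])
```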

ResNet-50 top-2 predictions on pictures of Jindo dogs

Obviously, it doesn’t see a Jindo (yet). Out of 115 pictures tested, 24 were classified as “Eskimo dog”, 20 as “Dingo”, and 13 as “Kuvasz”, a Hungarian breed. Looking at those breeds, one can indeed see some resemblance.

B. Transfer learning: ResNet-50 meets Jindo dogs

What if there was a way to teach the ResNet-50 model to recognize Jindo dogs (or any other breed it was not trained on) without retraining the entire model, which would require downloading the 1+ million images and days or months of training?

That is where transfer learning comes in. Transfer learning is a method to leverage the generic knowledge acquired by a model trained on massive datasets to develop a more task-specific knowledge using only a limited amount of new data. While generic knowledge is used to detect low-level features in an image, such as edges and curves, task-specific knowledge enables the model to recognize higher-level features, such as the shape of an ear, a nose or a muzzle, and, ultimately, distinguish one dog breed from another.

Generic knowledge is formed in the lower layers of the CNN (near the input image), while specific knowledge grows in the upper layers, near the classifier (the output), as illustrated by the following figure taken from the book Practical Deep Learning for Cloud, Mobile, and Edge by Anirudh Koul, Siddha Ganju, and Meher Kasam (O’Reilly, 2019).

Overview of a CNN (read the book here)

As mentioned earlier, the ResNet-50 model can identify a total of 1,000 categories. This is reflected in the model’s last and topmost layer (the output layer), which is made up of exactly 1,000 nodes (or neurons), each outputting the probability that the image belongs to a particular category.

Now, how do we teach ResNet-50 a new category, such as the Jindo dog? While it would be handy to simply add a 1,001st neuron to that layer and feed the network a few pictures of the new category, that is not possible in practice. The classifier part needs to be entirely retrained on a small set of data representing each class. (For better results, if we have enough images for each class, we can retrain some of the convolutional layers as well. That’s what we did.)

To summarize, we have to:

  • recreate the top layer (classifier) with one additional neuron for the new category,
  • retrain the upper layers with pictures not only of Jindo dogs but also of other breeds, preferably in similar quantities to avoid class imbalance (a minimal code sketch follows).
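
As a sketch, the setup might look like this in Keras (assuming a TensorFlow/Keras workflow). The class count of 134 matches the dataset described below; the size of the dense layer, the dropout rate, and the learning rate are illustrative assumptions, not necessarily the exact configuration used here:

```python
# Transfer learning: reuse ResNet-50's convolutional base, replace the classifier.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 134  # 133 breeds from the Udacity dataset + the Jindo

# ResNet-50 pretrained on ImageNet, without its 1,000-way classifier
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))

# Freeze everything except the topmost convolutional layers
for layer in base.layers[:-8]:
    layer.trainable = False

# New classifier head, with one neuron per category
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(base.input, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```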

To facilitate our task, we downloaded a dataset from Udacity’s GitHub repository. It contains 8,351 images of 133 breeds. We also collected pictures of Jindo dogs on Google Images and Naver Images. The training process (retraining the top 8 layers of the ResNet-50 model, plus the new classifier part) took just a little over an hour on a machine equipped with a single GPU.

Let’s now put our fine-tuned model to the test, using the same set of images that we tried earlier. The following pictures were not used during training, so the model sees them for the first time:

Fine-tuned ResNet-50 top-2 predictions on pictures of Jindo dogs

Out of 10 test pictures, 7 were correctly classified as Jindo, 1 as Akita, 1 as Norwegian Buhund and 1 as Canaan, all from the ‘spitz’ family of dogs. Jindo is even the second guess for 2 of the 3 misclassified pictures. Not a bad result!

Of course, the model works for other breeds as well:

And for out-of-category breeds:

C. Grad-CAM: Whatcha lookin’ at?

A cool experiment is to visualize which parts of an image had the strongest influence on the model’s decision to classify it in a certain category. This is one of the techniques of “Explainable AI”, a field of AI that aims to explain the decisions made by “intelligent” systems. As more and more AI use cases affect people’s lives (from adjusting insurance premiums to processing job applications), businesses that rely on those methods are being held accountable for their decisions.

In the field of computer vision, one way of looking into the “black box” of neural networks is with Grad-CAM, a technique which provides visual explanations of CNN-type model predictions by highlighting important regions in the image.

Let’s say you want to train a model to recognize Siberian Huskies, and most of your training pictures have snow in the background. There is a risk that the model learns the equation “snow = Siberian Husky” and classifies any dog picture with snow in the background as a Siberian Husky. Computing the Grad-CAM lets us make sure that the image was classified for the right reasons.

In the pictures above, Grad-CAM is applied to the last convolutional layer of the network. The first example shows that the model is paying attention to the dog itself rather than other objects in the scene. In the second example, it is interesting to see that the model is primarily looking at the region around the eyes. In other cases, it will be the muzzle or the ears.
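
For the curious, here is a minimal Grad-CAM sketch in TensorFlow/Keras. It assumes a model built with the functional API, like the transfer-learning sketch above, so that ResNet-50’s last convolutional block (named "conv5_block3_out" in Keras) can be reached by name; this is the standard Grad-CAM recipe, not necessarily the exact code behind the figures:

```python
# Grad-CAM: weight the last conv layer's feature maps by the gradient
# of the class score, then average, apply ReLU, and normalize to a heatmap.
import numpy as np
import tensorflow as tf

def grad_cam(model, img, layer_name="conv5_block3_out", class_idx=None):
    # Map the input image to (last conv activations, class predictions)
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(img[np.newaxis, ...])
        if class_idx is None:
            class_idx = int(tf.argmax(preds[0]))  # default: top predicted class
        class_score = preds[:, class_idx]
    # How much each feature map influences the class score
    grads = tape.gradient(class_score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # one weight per channel
    # Weighted sum of the 7x7 feature maps, keeping positive influence only
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # 7x7 heatmap in [0, 1]
```

The 7×7 heatmap is then upscaled to the input image size and overlaid on it, as in the pictures above.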

D. Bottleneck features: ResNet-50 strips off

The role of a convolutional layer in a CNN is to extract features from images: low-level features, such as edges and curves, in the lower layers (near the input), and high-level features, such as the shape of an object, in the higher layers (near the output).

If we take a look under the hood by stripping the model of its topmost classification layer and extracting the output of the last convolutional layer, we get an abstract representation of the image as a vector of numbers (non-negative floating-point values, since they come out of ReLU activations). For instance, the last convolutional layer of the ResNet-50 model outputs a matrix of size 7x7x2048 for each image, which is then reduced to a 2048-dimensional vector by averaging each 7x7 slice via a global average pooling operation. This is the feature vector.
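
As a sketch, extracting the feature vector with the stock Keras ResNet-50 could look like this (pooling="avg" performs the global average pooling step described above):

```python
# Strip the classifier: include_top=False + pooling="avg" yields a
# 2048-dimensional feature vector per image.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image

extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def feature_vector(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(image.img_to_array(img)[np.newaxis, ...])
    return extractor.predict(x)[0]  # shape: (2048,)
```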

1. Looking for similar pictures

The feature vector (also called bottleneck features because of its reduced dimensionality) can be compared with that of other images to find similar pictures. One way to achieve this is by calculating the Euclidean distance between vectors. Similar pictures will have their feature vectors ‘closer’ to each other. In Python, we can use the nearest neighbors algorithm from the scikit-learn library.
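
A minimal sketch, assuming features is a NumPy array of shape (n_images, 2048) built with the extractor above:

```python
# Index the feature vectors, then query the 3 closest neighbors of one image.
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=4, metric="euclidean").fit(features)
distances, indices = knn.kneighbors(features[42:43])  # 42: arbitrary query image
# indices[0][0] is the query image itself; indices[0][1:] are its 3 closest
# neighbors, with their Euclidean distances in distances[0][1:]
```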

Computing the 2048-dimensional vectors for the 6,787 images in our dataset took less than a minute. Finding the ‘n closest neighbors’ of a vector is even faster, provided that we keep the value of ’n’ low. Let’s see the result with 3 randomly selected pictures, along with their 3 closest neighbors and the Euclidean distance to the reference picture. Note that similar-looking dog pictures are most of the time, but not always, pictures of the same breed.

Feature vectors of similar-looking pictures have a short Euclidean distance

2. Visualizing clusters of images with t-SNE

What breeds resemble others? Now that we are able to evaluate the level of similarity between images, let’s try to find clusters of similar images. But how to visualize a 2048-dimensional space on a 2D screen? Thanks to another algorithm with yet another outlandish name: t-distributed Stochastic Neighbor Embedding.

According to Wikipedia, “t-SNE is a machine learning algorithm for visualization of high-dimensional data in a low-dimensional space of two or three dimensions. It models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability”.

In our case, the high-dimensional objects are the feature vectors with 2,048 dimensions each. Once again, scikit-learn comes in handy with its t-SNE implementation. Computing t-SNE on 6,787 images took a few minutes, and plotting the results produces the beautiful Pointillist monochrome painting shown below.
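
(In code, the reduction and the scatter plot boil down to a few lines; the perplexity value is an illustrative choice.)

```python
# Project the 2048-d feature vectors onto 2 dimensions with t-SNE.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], s=4)  # one dot per image
plt.show()
```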

To understand what the different circles represent, we can plot some of the corresponding images. To achieve that, we borrowed Python code from the book Practical Deep Learning for Cloud, Mobile, and Edge.

After respacing the images, we get a much clearer picture:

Bichon Frisés and Bulldogs are at opposite ends, which is a sign that the experiment is a complete success. And as we could expect, our Jindo dogs hang out with more reputable folks.

E. A few metrics on the fine-tuned ResNet-50

1. Prediction accuracy

Training was performed on 80% of the dataset, leaving 20% for evaluation (validation and test). Data augmentation was used to expand the size of the training set by automatically generating modified versions of the images (applying transformations such as rotation, shift, zoom, etc.). Validation data was used to evaluate different algorithm parameters (hyperparameters), and test data to provide a final assessment of the model.
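
As a sketch, the augmentation step with Keras’ ImageDataGenerator might look like this (the transformation ranges and directory layout are illustrative assumptions):

```python
# Generate randomly transformed variants of the training images on the fly.
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.2,   # horizontal shifts
    height_shift_range=0.2,  # vertical shifts
    zoom_range=0.2,          # random zooms
    horizontal_flip=True,    # mirror images left-right
).flow_from_directory("data/train", target_size=(224, 224), batch_size=32)
```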

Tweaking the hyperparameters (dropout rate, learning rate, number of epochs, number of neurons in the fully connected layer) led to a marginal improvement in accuracy.

  • Accuracy: ratio of correctly predicted samples to the total number of samples

These figures are rather satisfying given the small number of images we used for transfer learning. Stopping the training a few epochs earlier would have reduced the “overfitting” effect illustrated by the difference between training and test accuracy. Note that with 134 different categories, a random guess would achieve less than 1% accuracy.

2. Confusion matrix

The confusion matrix (truncated here to show only the last 14 classes) allows easy identification of correct matches between the model’s prediction and the actual class (the diagonal of the matrix) and of incorrect predictions (all the other values). We can verify that 7 Jindo dog pictures out of 10 were correctly classified (the other 3 are not visible on the truncated matrix).

3. Classification report

Another useful tool to evaluate the performance of a classification model is the classification report, which shows the main classification metrics (precision, recall, F1 score) for the entire dataset and for each class.

  • Precision: proportion of samples predicted as a class that actually belong to that class
  • Recall: proportion of samples from a class that were correctly predicted as that class
  • F1-score: harmonic mean of precision and recall
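
Both the confusion matrix and the classification report come straight from scikit-learn; a minimal sketch, assuming y_true and y_pred hold the test-set labels and predictions, and class_names the 134 breed names:

```python
# Per-class precision, recall and F1 score, plus the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))
```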

Regarding the “Korean Jindo” class, we got a precision of 78% and a recall of 70%, which means:

  • 78% of the dogs that were classified as Jindo are actually Jindo dogs; the remaining 22% were other breeds mistaken for Jindos (presumably 7 correct out of 9 Jindo predictions, since 7/9 ≈ 78%).
  • 70% of all Jindo dogs were correctly classified (that is, 7 pictures out of 10, since we had 10 pictures of Jindo dogs in the test set).