Ch 4. Transfer Learning with ResNet50 Part II — Model Analysis to Unexpected Riddle

Thinking about the Procedure >> Following the Procedure

Lucrece (Jahyun) Shin
Sep 21, 2021

When solving a machine learning problem, a common practice is to first search for a widely-accepted procedure (if any) for that type of problem. While following the procedure, it’s important to constantly review whether things are on the right track and to analyze the results both mathematically and intuitively. In other words, it’s important to keep “thinking” rather than merely “following”. In this post, I log my thought process while analyzing a model trained for binary classification in image recognition.

In Chapter 3.1, I talked about fine-tuning ResNet50 for the gun vs. knife binary classification task. These are some critical-thinking questions I brought up at the end of Chapter 3.1:

  • Do the model results that we observe accurately reflect what the model is actually “thinking”? (Does the simple confusion matrix showing 100% recall for knife mean that the model accurately picked up the shape of a knife? Does the model acknowledge a knife’s sharp tip, thin blade, and blunt handle while classifying each knife image into the knife class?)
  • Could we be prone to making human-centric assumptions about the model, overlooking the fact that this mathematical algorithm might perceive images far differently from how we perceive them? (If the model learned the shape of a knife, does it know that a knife is “sharp”? Does the model even know what it means to be “sharp”? Maybe it looks at the angle between the two lines composing the sharp tip? A witch hat also has a sharp tip, but not quite in the same sense as a knife. If the model did learn that a sharp tip is a prominent feature of a knife, would it classify an image of a witch hat as a knife?)

Confused? Let me briefly explain what’s going on.

History

For my master’s research project at the University of Toronto, I was given airport Xray baggage scan images containing guns and knives, with the goal of developing a model that automatically detects guns and knives in baggage. Given only a small number of Xray images, I planned to use Domain Adaptation by (1) using a large number of normal (non-Xray) images of guns and knives from the internet to train a model and (2) adapting the model to perform well on the Xray images. In Chapter 2, I discussed how I collected thousands of images of guns and knives from Google, which will be referred to as “web images”.

Web images and Xray images containing knife and gun

Following the Procedure (F)

F1. Gun vs. Knife Binary Classification

Since I collected web images of two objects, gun and knife, without much thought I planned to train a gun vs. knife binary classification model. It was almost a “reflex” decision I made given that I had an image dataset containing two classes, like the case for cat vs. dog classification.

F2. Setting Up and Training the Model

The process from setting up dataloaders to fine-tuning ResNet50 on the web image dataset in PyTorch is shown in the Chapter 3.1 post and my Colab notebook. Only web images (no Xray images) of gun and knife were used for training ResNet50.
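For reference, here is a minimal sketch of what that setup can look like in PyTorch. The folder path, hyperparameters, and single-epoch loop below are illustrative assumptions, not the exact code from the notebook:

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing, since ResNet50 is pre-trained on ImageNet
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical folder layout: web_images/train/{gun,knife}/*.jpg
train_set = datasets.ImageFolder("web_images/train", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Load an ImageNet-pre-trained ResNet50 and swap the 1000-way head for 2 classes
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, labels in train_loader:          # one pass over the web images
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # logits -> cross-entropy loss
    loss.backward()
    optimizer.step()
```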

F3. Testing the Model on Web Images

Here’s the confusion matrix of the fine-tuned ResNet50 tested on web images test set:

With 98% and 100% recall for gun and knife respectively, the model seems to distinguish between them quite well. Looking at this confusion matrix, I thought: “Oh wow! 🎉 The model must have learned the shapes of gun and knife pretty well.”
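For reference, per-class recall like this can be read off a confusion matrix computed with scikit-learn. A rough sketch, assuming a test_loader over the web-image test set and the class order gun = 0, knife = 1 (both assumptions of mine):

```python
import torch
from sklearn.metrics import confusion_matrix

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:       # web-image test set (assumed name)
        preds = model(images).argmax(dim=1)  # predicted class = largest logit
        all_preds.extend(preds.tolist())
        all_labels.extend(labels.tolist())

cm = confusion_matrix(all_labels, all_preds)        # rows = true class, cols = predicted
recall_per_class = cm.diagonal() / cm.sum(axis=1)   # [gun recall, knife recall]
print(cm)
print(recall_per_class)
```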

F4. Testing the Model on Xray Images

I also plotted a confusion matrix of the fine-tuned ResNet50 tested on Xray images to see how a model trained only on the source domain (web) performs on the target domain (Xray):

Looking at the 61.5% gun recall and 100% knife recall, a few thoughts crossed my mind:

  • “The model must have learned the shape of knife really well.”
  • “Gun recall dropped sharply from 98% for web images to 59% for Xray images. Why? When I visually compared correctly and incorrectly classified guns, I couldn’t spot much difference.”
  • “This result is a bit surprising because I felt like guns have a more unique shape than knives.”

Thinking about the Procedure (T)

After training and testing my first model, I did a quick project-scope sanity check to see if I was on the right track:

“I was given the task of building a smart algorithm for airport security that detects baggage containing a gun or knife. So far I’ve trained a gun vs. knife binary classification model. But in general, most baggage at the airport should not contain a gun or knife. When the binary classification model sees a safe bag, is it able to predict that the bag does not contain a gun or knife? Since I think the model has learned the shape of knife well, given the 100% knife recall for both web and Xray images, it should be able to tell that a knife is absent when it doesn’t see the knife shape. I could check this by testing the model with images that do not contain a gun or knife.”

T1. Quick Review of Model Logits and Softmax Function

An image recognition model like ResNet50 predicts N probabilities, one for the input image belonging to each of the N classes, from which we select the class with the largest probability. These probabilities are calculated by passing the model outputs (also called logits) through the Softmax function. The following image illustrates the process:

image source : https://developersbreach.com/convolution-neural-network-deep-learning/

The higher the softmax probability, the more “confident” the model is in classifying the image into that class. With this in mind, I predicted that if the model were given an image that contains neither a gun nor a knife, it would output similarly low logits for both classes; since softmax depends only on the difference between the logits, this would result in roughly even probabilities of around 50% for both the gun and knife classes.
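As a tiny numeric illustration (the logit values here are made up), softmax exponentiates the logits and normalizes them, so only the difference between them matters:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.3, -1.1]])   # hypothetical [gun, knife] logits for one image
probs = F.softmax(logits, dim=1)       # exp(z_i) / sum_j exp(z_j)
print(probs)                           # ~[[0.968, 0.032]] -> predicted class: gun

# Equal logits give 50/50 regardless of how low they are:
print(F.softmax(torch.tensor([[-5.0, -5.0]]), dim=1))  # [[0.5, 0.5]]
```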

T2. Testing the Model on No-class Web Images

And I was struck by a very surprising (daunting) result!!!

Class probabilities for images that do not contain gun or knife

The model was nearly 100% confident in classifying a piece of luggage and a basketball as knife! Even when given a blank white image, it was 70% confident that it was a knife. How can this be?!
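Here is a rough sketch of how such a check can be run, assuming the fine-tuned model and the same preprocess transform used for training; the file name is purely illustrative:

```python
import torch
import torch.nn.functional as F
from PIL import Image

def class_probabilities(path, model, preprocess):
    """Return softmax probabilities over [gun, knife] (order assumed) for one image."""
    image = Image.open(path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)   # shape: (1, 3, 224, 224)
    with torch.no_grad():
        logits = model(batch)
    return F.softmax(logits, dim=1).squeeze(0)

print(class_probabilities("basketball.jpg", model, preprocess))
# Softmax over two classes must sum to 1, so the model has no way to express
# "neither gun nor knife" -- it can only split 100% between the two classes.
```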

T3. Testing the Model on No-class (Benign) Xray Images

In Chapter 1: Data Pre-processing for Xray images, I mentioned that the original Xray dataset given by the airport included baggage containing other objects such as hard drives, USBs, and phones.

Examples of benign Xray images

I decided to call them “benign” (i.e. safe) Xray images and use them to test how the model performs on images not containing a gun or knife. Here is the resulting confusion matrix for 300 benign Xray images:

Although 93.7% of benign images were classified as knife, since we take the maximum of the softmax probabilities, even a 49%/51% split between gun and knife still classifies the image as knife. So to get more accurate insight into the model’s behaviour, we can look at the softmax probabilities for each class, which represent the model’s confidence in classifying the images into that class. I plotted two histograms using seaborn.distplot, showing the distribution of softmax probabilities for each class:

Histograms of softmax probabilities for gun (left) and knife (right) class

The left histogram for gun shows that the probability of the model classifying benign images as gun is on the lower side, mostly concentrated below 25%. Ideally it should still be pushed down towards zero to prevent false alarms. In contrast, the right histogram for knife shows a left-skewed distribution of softmax probabilities, illustrating that the model is quite confident in predicting most benign Xray images as knife. This is alarming, since most baggage at the airport will not contain a knife. A scanner with such a high false alarm rate would be very inefficient and cause heavy congestion at the baggage check.
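For reference, a sketch of how such histograms can be produced. seaborn’s distplot has since been deprecated in favour of histplot, so histplot is used here, and benign_probs is assumed to be an (N, 2) NumPy array of softmax probabilities for the benign set, with columns ordered [gun, knife]:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# benign_probs: (N, 2) array of softmax probabilities, columns = [gun, knife] (assumed)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

sns.histplot(benign_probs[:, 0], bins=20, ax=axes[0])
axes[0].set(title="P(gun) on benign Xray images", xlabel="softmax probability")

sns.histplot(benign_probs[:, 1], bins=20, ax=axes[1])
axes[1].set(title="P(knife) on benign Xray images", xlabel="softmax probability")

plt.tight_layout()
plt.show()
```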

T4. Project Riddle #1

Looking at the results from T2 and T3, tumultuous waves of thoughts propagated in my head 🌊 :

“Why? Why would the model so confidently classify an image of 🏀 or 🧳 as knife? Why would it think that bags containing nothing even close to a knife contain a knife? Is this a problem inherent in the model architecture, or in the composition of the data?

Was it far-fetched to think that a model trained for binary classification would have accurately learned the shapes of gun and knife? After all, the model may have only learned a single decision boundary, a ‘shortcut’ for spotting the most apparent difference between the two objects, rather than the unique characteristics of each individual object.

I mean, it’s true that for this model, the whole world is composed of only guns and knives. It cannot conceive that anything else exists, like a 🏀 or 🧳. Hmm… But what I just thought could be more human-centric BS. After all, I didn’t teach the model the concept of “gun” or “knife”. It doesn’t know that a gun is a weapon that shoots bullets, or that a knife is so sharp that it can cut into skin. It wouldn’t even know anything about concepts like “sharp” or “dangerous”. It was only given a bunch of 224 by 224 numerical matrices and was told to reduce them to a single number: 0 or 1. It may have picked up some patterns in the images (that we humans will never notice) that offer a shortcut in differentiating between class 0 and class 1 images. It may have missed a lot of details (that we humans pay attention to when looking at a gun or knife) that would have clearly distinguished a knife from a basketball.

But I also think that looking at a confusion matrix or histogram is not the most in-depth analysis of the model’s behaviour. Since humans are visual creatures, I would like to look into more visual analysis methods that can offer some intuition to clear up this riddle a little bit.”

This is how I started playing around with t-SNE plots, which I will talk about in the next post.

The code for all the model analysis methods discussed in this post is included in my Colab notebook; feel free to take a look!

Thanks for reading! 😊

- L ☾₊˚.
