Ch 6. Optimizing Data for Flexible Image Recognition

How can we adjust input data and labels to encourage neural networks to “perceive” images flexibly as humans do?

Lucrece (Jahyun) Shin
Oct 10, 2021 · 12 min read

Flexibility of Human Perception 👼

Visual Inputs: toy blocks, (moving) person, (my) hands, (my) arms, floor, wall

Since we were little, we have learned about the world by observing and interacting with diverse stimuli through our five senses. Human perception involves continuously naming, characterizing, and remembering things in the environment while referring to a database of “what I know”. If I see something close to a particular category in my database, I recognize it. If it’s nothing like anything in my database, I add it as new data. Years of such continuous bottom-up learning have naturally given us flexibility in perception, such as:

  • Recognizing the absence of an object in the environment
  • Recognizing multiple objects at the same time
  • Recognizing the shape of an object across different mediums (e.g. real life vs. sketches)

These also make us robust, capable of taking different perspectives according to the context of the visual stimuli.

Inflexibility of Neural Networks 🤖

How about a neural network trained for vision? These days it’s easy to download an ImageNet pre-trained CNN and fine-tune it to classify different classes of images. But does it have the flexibilities in perception listed above? Can we expect it to naturally infer something outside the given data? How about outside the given format of data? The answer is NO. Its “perception” abilities are strictly limited to the bounds of:

  • Given input images (x)
  • Given format of target labels (y)
  • Given task (loss function, number of classes)

Let me repeat the main idea from my post discussing the central role of data in machine learning. When I train a machine learning model, the model is only interested in mapping the data (x) given by me to a particular form of labels (y), which are also given by me. As a machine learning engineer, I am responsible for setting up the environment and the props for the model. The computer is only the computing hardware. So what can I do to make the model perceive images more flexibly? I can optimize the data I am giving it.

Flexibility Training for CNNs — Data Optimization

In this post, I will discuss how such bounds restrict CNNs from performing visual perception that seems trivial to humans, which can lead to inflexible and often unintuitive model behaviours. I will also present a remedy I found for each inflexibility by annotating or optimizing the data in different ways. Here are the three inflexibilities I will discuss:

  1. Can’t recognize the absence of an object → Include “None-of-the-above” class
  2. Can’t recognize multiple objects at the same time → Use multiple labels for each image
  3. Can’t recognize the same object shape across different image textures → Encourage shape bias instead of texture bias using Stylized ImageNet

I will also share the motivation and application of each data optimization method for the automatic threat detection project from my master’s research at the University of Toronto. The image recognition model I will keep mentioning is trained only with normal camera (web) images and then tested with Xray images, since only a small number of Xray images were available. For detailed background, please refer to this project introduction post and the list of my project posts.


Inflexibility #1: Can’t recognize the ABSENCE of an object

(1.1) Why are you classifying a basketball as a knife?

I had a ResNet50 model that:

  • Was trained for gun vs. knife binary classification
  • Was trained only with images of gun (class 0) and knife (class 1)
  • Showed 99% accuracy on test images of gun and knife
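For concreteness, here is a minimal sketch of this kind of setup in PyTorch/torchvision (my own illustration of a typical fine-tuning recipe, not the exact training code): an ImageNet pre-trained ResNet50 with its final fully connected layer replaced by a 2-way head, trained with cross-entropy loss.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet pre-trained backbone with a new 2-way head: gun (0) vs. knife (1)
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()   # softmax + cross-entropy over the 2 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One fine-tuning step; images: (B, 3, 224, 224), labels: (B,) of 0s and 1s."""
    optimizer.zero_grad()
    logits = model(images)          # (B, 2) raw class scores
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```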

Take a look at the following predictions of the same model:

A gun vs. knife binary classification model confidently classifies most images containing neither class as knife.

I mentioned this problem in my post about transfer learning with ResNet50, where the model confidently classifies most images containing neither gun nor knife as knife. Contrary to my expectation that it would output something close to 50/50 (indecisive) probabilities for the gun and knife classes, the model leaned heavily towards knife when classifying a completely unrelated image. Why?

(1.2) Black-and-white decision boundary

As discussed in my post about t-SNE plots, I got an intuition that a binary classification model learns a “black-and-white” decision boundary. It only checks whether a test image lies on the left side or the right side of the boundary. Thus it can only give us the LEFT (class 0) or RIGHT (class 1) option, and it cannot recognize the absence of both objects even when neither object’s shape is present in the image. Seeing an image of a basketball, for example, the model is absolutely incapable of thinking:

“Oh, this doesn’t look like anything I’ve seen before, so I can’t classify it into one of the two classes. Let me adjust my last fully connected layer’s number of output features from 2 to 3. This way I can classify it as a third ‘neither gun nor knife’ class”.

(1.3) Looking at the context of the problem

This wouldn’t be too much of a problem if we plan to use the model in a situation where incoming images strictly belong to one of the two classes. But what if the model could receive any type of image and be expected to predict that an image contains neither class?

(1.4) Xray scanner that classifies every bag as harmful

In general, most baggage scanned at airport security does not contain a gun or knife. However, when I tested the same gun vs. knife binary classification model with 300 benign (containing neither gun nor knife) Xray scan images, 93.7% of them were classified as knife with high confidence, as shown below.

Examples of benign Xray images
Confusion matrix (left) and “knife” class prediction confidence (right) for benign Xray images

Such a high false alarm rate, with a knife detected in almost every bag, would be impractical and very inefficient.

Remedy #1: Introduce “NONE-OF-THE-ABOVE” Class

To fix this inflexibility, I introduced a third “benign” (neither gun nor knife) class representing all objects in the world that are not a gun or knife. Although infinitely many things besides guns and knives exist, I tried to collect training images of objects that might actually be found in airport baggage. The final search keywords I used for scraping images from Google include: book, car, wire, water bottle, tape, yarn, speaker, box, and sunglasses.

Examples of collected images for benign class
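In code, adding the benign class mostly amounts to a third image folder and a 3-way head. A minimal sketch below; the directory names are hypothetical, just to illustrate the layout:

```python
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical layout: data/train/benign, data/train/gun, data/train/knife
train_set = datasets.ImageFolder("data/train", transform=transform)
print(train_set.class_to_idx)   # e.g. {'benign': 0, 'gun': 1, 'knife': 2}

model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 3)   # now 3 output classes
```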

(1.5) Performance on Xray images: High MISS

Here is the comparison of model performance on Xray images before and after introducing the benign class. Recall that the model does not use Xray images during training (details).

Confusion matrices for 2-class and 3-class classification for Xray images
Gun/Knife Recall Table for Xray images (V1)

The results are drastically different, with only 25% of guns and 18% of knives detected; the model went from high false alarm (2-class) to high miss (3-class). Miss is highly undesirable in threat detection, since we don’t want to miss someone who is carrying a gun or knife in their bag!
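For reference, per-class recall numbers like the ones above can be read off the confusion matrix; here is a quick sketch with scikit-learn, using toy predictions rather than my actual results:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy ground truth / predictions for illustration (0 = benign, 1 = gun, 2 = knife)
y_true = [1, 1, 1, 1, 2, 2, 2, 0, 0, 0]
y_pred = [1, 0, 0, 0, 2, 0, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
# recall = correctly detected images / all images of that class, per class
print(recall_score(y_true, y_pred, labels=[0, 1, 2], average=None))
```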

(1.6) Looking Back at Data

Trying to find the reason for high miss, I looked at the input images :

Xray images and web images containing knife and gun

As shown above, most Xray scan images of baggage contain many other objects besides the gun or knife, while most web images show a distinct, isolated presence of the object. Thus when the model classifies Xray images as benign, it may be looking at the other objects rather than the gun or knife. It is unable to consider the possibility that there can be more than one class present in the image, e.g. a benign object and a gun. This is another fatal inflexibility of an image recognition model.

Inflexibility #2: Can’t recognize MULTIPLE objects at the same time

(2.1) Do you not see the knife?

This inflexibility applies to web images as well as Xray images. Below, the same model recognizes only the gun in an image that clearly contains both a gun and a knife.

Gun vs. Knife binary classification model not recognizing knife in the image

The model might as well have been trained for gun vs. not-gun binary classification rather than gun vs. knife. Its insensitivity to the presence of the knife is also quite alarming.

In general, the most common data annotation for training an image recognition model assigns each image a single-number target label from 0 to N-1, where N is the number of classes. This, however, restricts the model to classifying the image as only one of the available classes. How could we annotate the data differently so that the model can tell whether multiple classes of objects are present in the image?

Remedy #2: Annotate each image with MULTIPLE class labels

The solution I came up with was: instead of making the model classify the image as one of N classes (single-labels), I made it predict N probabilities, one for the image containing EACH of the N classes (multi-labels). This table summarizes the differences in annotating images with single vs. multi target labels:

Using single labels vs. multi labels for data annotation

Number of Target Labels

Using single-labels, the target label for an image is a single number (0, 1, 2, etc.), which represents one of N classes. For multi-labels, the target label is a list of N binary numbers (0 and 1), where 1 means the class is present in the image and 0 means it isn’t.

Soft Label

For my threat detection problem, I found it beneficial to use a soft label for the benign class, due to the model’s tendency to classify images as benign with higher confidence compared to the gun and knife classes. Since detecting benign objects was not as important as detecting a gun or knife, I made the benign signal weaker by replacing the benign class label with 0.5 while keeping the others as 1.
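Concretely, the targets look something like this (a sketch following the [benign, gun, knife] ordering used later in this post):

```python
import torch

# Single-label: one integer per image (0 = benign, 1 = gun, 2 = knife)
single_label = torch.tensor(2)                  # "this image is a knife"

# Multi-label: one probability per class, [P(benign), P(gun), P(knife)],
# with a soft 0.5 for benign to weaken the benign signal
multi_label = torch.tensor([0.5, 0.0, 1.0])     # knife + some benign clutter
```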

Loss Function

Using single-labels, the model classifies an input image into a single target class. Thus we apply a softmax activation to the logits coming from the last fully connected layer, which feed into the cross-entropy loss during training. Using multi-labels, the model predicts the probability of the image containing EACH class object. Thus the softmax activation is replaced with a sigmoid, outputting a probability between 0 and 1 for EACH target class, and the cross-entropy loss is changed to binary cross-entropy (BCE) loss for multi-dimensional targets. Since I used a soft label of 0.5 for the benign class, I used MSE loss instead of BCE loss.
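Here is a minimal sketch of the two setups side by side, assuming a 3-class head (my own illustration, not the exact training code):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)   # (batch, num_classes) output of the last fc layer

# Single-label: integer targets; CrossEntropyLoss applies log-softmax internally
single_targets = torch.tensor([2, 1, 0, 2])
ce_loss = nn.CrossEntropyLoss()(logits, single_targets)

# Multi-label with a soft benign label: sigmoid per class + MSE against
# [P(benign), P(gun), P(knife)] targets
multi_targets = torch.tensor([[0.5, 0.0, 1.0],
                              [0.0, 1.0, 0.0],
                              [1.0, 0.0, 0.0],
                              [0.5, 1.0, 1.0]])
probs = torch.sigmoid(logits)
mse_loss = nn.MSELoss()(probs, multi_targets)

# With hard multi-labels, nn.BCEWithLogitsLoss()(logits, multi_targets)
# would be the usual choice instead of MSE.
```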

(2.2) Re-labelling of all input images

Three different types of images and respective target labels

I re-annotated the target labels of all input images in order to distinguish between images that contain a single, isolated object and ones that also contain other benign objects in the background. For example, image (a) above was given a label of [0, 0, 1], which represents [P(benign), P(gun), P(knife)], with only P(knife)=1 since the image contains a knife without any other objects. Image (b), with a knife and other benign objects such as mushrooms🍄 and meat🥩, was given a label of [0.5, 0, 1], with P(benign) of 0.5 as a soft label and P(knife) of 1. The same applies to image (c), with both gun and knife classes present.

(2.3) Improved recalls for gun and knife in Xray images

Here is the comparison of performance on Xray images for models trained with single-label and multi-label data. You can see that the recalls for both classes nearly doubled.

Confusion matrices for Xray images for models trained with single-label and multi-label data
Gun/Knife Recall Table for Xray images (V2)

For the model trained with multi-label data, I used a prediction threshold of 0.3, meaning that if the predicted probability for a class was larger than 0.3, I considered the image as belonging to that class. This way I can also predict multiple labels for a test image if more than one class’s probability exceeds 0.3.
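In code, the thresholding step might look like this (a sketch with made-up probabilities for one image):

```python
import torch

class_names = ["benign", "gun", "knife"]
probs = torch.tensor([0.84, 0.12, 0.47])   # example sigmoid outputs for one image

predicted = probs > 0.3                     # -> tensor([True, False, True])
labels = [name for name, flag in zip(class_names, predicted) if flag]
print(labels)                               # ['benign', 'knife']
```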

The right confusion matrix shows that the model trained with multi-label data also classified a majority of gun and knife images as benign, which is fine because many other benign objects are indeed present in most Xray images.

Inflexibility #3: Can’t recognize the same object SHAPE across different image TEXTURES

(3.1) Texture Shift

Texture shift from web images to Xray images

Since there is an apparent texture shift from web images to Xray images, it would be optimal to ensure that the model properly learns the shape of each threat object so that the texture shift does not affect its performance.

(3.2) Texture Hypothesis / Texture Bias

Take a look at the following figure showing the accuracies of 4 different CNN architectures and humans for classifying the first four images as “cat” and the last image as “elephant”:

Accuracies of CNN architectures and humans for classifying the first four images as “cat” and the last image as “elephant” (Source: https://arxiv.org/abs/1811.12231)

This result is presented in a paper titled ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness by Geirhos et al., published in 2019. Most humans can easily recognize the first four images as a cat despite the style/texture shift. In contrast, the accuracies of AlexNet, GoogLeNet, VGG-16, and ResNet-50 fall sharply when the texture is changed to silhouettes and edges. This illustrates the Texture Hypothesis (quoting from the same paper):

Object textures are more important than global object shapes for CNN object recognition. Local information such as texture may actually be sufficient to “solve” ImageNet object recognition.

This idea is reflected in the 100% accuracies of all 4 CNN architectures for classifying the last elephant-skin image as elephant. If so, how can we make the model more sensitive to the shape of an object instead of its texture?

Remedy #3: Increase SHAPE BIAS in the model using Stylized ImageNet dataset

As a remedy to the texture bias, the same paper suggests training the model with stylized images from the ImageNet dataset. “Stylizing” an image means:

  • Keeping the content/shapes in the image
  • Replacing the styles/textures in the image with those of a randomly selected painting from the Painter by Numbers dataset (containing 79,434 paintings) using AdaIN style transfer

Here’s an example of an image of a lemur stylized with ten different paintings.

10 Stylized samples of an image of class “ring-tailed lemur”. The samples have content/shape of the original image on the left and style/texture from different paintings (Source: https://arxiv.org/abs/1811.12231)

The stylized images retain the lemur’s shape outline while having diversified textures. This makes the local texture cues no longer highly predictive of the target class, forcing the model to focus more on the global shape of the object. The paper names the stylized dataset Stylized ImageNet (SIN) and the original ImageNet images IN. The experimental results for training with either or both SIN and IN are shown in the following table:

Accuracy comparison on the ImageNet (IN) validation dataset & object detection performance (mAP50) on PASCAL VOC 2007. All models have an identical ResNet-50 architecture.

For my threat detection problem, I used the best performing model (last row), pre-trained on both SIN and IN and then fine-tuned on IN. The code and instructions for downloading model checkpoints are available in the paper authors’ GitHub repository. I once again fine-tuned this model on the web images of benign, gun, and knife classes using multi-labels. I also tried stylizing my own images of gun and knife; however, the validation result was not as good as using the original images.
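For reference, here is roughly how such a checkpoint could be loaded into a torchvision ResNet-50 before swapping its head for the 3-class setup. The file name and the “module.” prefix handling below are assumptions on my part; follow the repository’s own instructions for the exact loading code.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=False)

# Hypothetical checkpoint file, downloaded per the repository instructions
checkpoint = torch.load("resnet50_sin_in_finetuned_on_in.pth", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)

# Strip a possible "module." prefix left over from DataParallel training (assumption)
state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)

# New 3-way head for benign / gun / knife, fine-tuned with the multi-label setup above
model.fc = nn.Linear(model.fc.in_features, 3)
```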

(3.3) Improved Model Performance on Xray images

Here is the comparison of performance on Xray images for models trained with the original ImageNet and the Stylized ImageNet. You can see that the recalls for both classes increased substantially to 71% and 73%.

Confusion matrices for Xray images for models trained using original ImageNet (left) and Stylized ImageNet (right)
Gun/Knife Recall Table for Xray images (V3)

Although the recalls went up to the 70s, this is nowhere near reliable enough for a threat detection system. I will discuss how I used domain adaptation to raise the bar even higher in future posts.

Closing

We looked at three methods of data optimization, which encourage a CNN-based image recognition model to make more flexible decisions that fit the context of the given problem. Some readers might wonder why not just use an object detection or semantic segmentation model that can perform more complex tasks. You can, but they require a much larger, more complex model architecture and expensive data annotation (consider pixel-level labelling for semantic segmentation).

Optimizing data is like learning how to take creative photographs by experimenting with different lights and angles, rather than buying a more expensive camera. I could get good photos with an expensive camera that optimizes every setting for me, but I would never develop the skill of toggling different variables to create the feeling I want in my photos.

Me captured by the light magician Andre 🌙

You can contact me with any questions or feedback about my approaches. I would love to know what other ML researchers think. Thanks for reading! 🌸

- L ☾₊˚.
