Opening the Black Box: How Neural Nets See our World

Published in

Attentive AI Tech Blog

5 min readApr 21, 2019

Convolutional Neural Networks (CNNs) and other deep networks have enabled unprecedented breakthroughs in a variety of Computer Vision tasks, ranging from Image Classification (classify the image into a category from a given set of categories), to semantic segmentation (segment the detected category), image captioning (describe the image in natural language), and more recently, visual question answering (answer a natural language question about an image). Despite their success, when these systems fail, they fail disgracefully, without any warning or explanation, leaving us staring at an incoherent output, wondering why it said what it said. Their lack of decomposability into intuitive and understandable components makes them extremely hard to interpret.

Let us consider a situation where we want to classify between a car and a ship in satellite imagery. Let’s say that we have the data, and trained a CNN to classify the images. Most of the images of cars had a road in the background, while those of ships had water in the background. In such a scenario, it is more feasible for the model to learn to distinguish between the colours and shades of roads and water bodies, rather than reliably learning to classify the actual objects. To identify whether the CNN is actually looking at the target object is, therefore, of paramount use to understand whether the CNN has actually learned what the target object is, or whether it has simply picked up any other pattern which was prevalent in the dataset due to inherent biases.

To identify where CNNs are actually looking, we will dive deep into Gradient-weighted Class Activation Mapping (Grad-CAM). This technique uses the gradients of any target class, flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the class.

In simpler terms, we are simply computing the final convolutional feature map. Every channel in the feature map is weighed with the gradient of the class. It is basically highlighting how intensely the input image activates different channels, depending on how important the channel is with regard to the class. There is no need for any re-training or change in the existing architecture.

Each class is assigned a spatial score Sᶜ, which is computed by the global average pooling over two dimensions i and j, for the gradient of respective class output with respect to the feature map Aᵢⱼᵏ then, we multiply the resulting value with the feature map along with its channel axis k. We then average/pool the resultant. The ReLU activation is applied to the score map and then it is normalized to output positive region predictions.

Now, let us get started with the implementation of Grad-CAM. To demonstrate, we will use a pre-trained VGG16 network, and also import other necessary packages.

The first step is to initialize our model and load the image, depending on the path provided through the command line argument. The VGG network expects input size to be 224x224, therefore, we resize our image to the desired dimensions. Since we are passing only one image through the network, it is required to expand the first dimension noting it as a batch size of 1. We then normalize our image by subtracting mean RGB values from the input image.

We will only be visualizing the map for the top prediction, although we can compute the map for any class. We also take the output from the final convolutional layer.

Now, we compute the gradient of the class output value with respect to the feature map (the equation mentioned above). Then, we pool the gradients over all the axes leaving out the channel dimension. Finally, we weigh the output feature map with the computed gradient values.

We then average the weighted feature map along the channel dimension, which is then normalized to values between 0 and 1, to obtain the final heatmap.

Finally, we use OpenCV to read the image and resize the existing heatmap to the image size.

Output for dumbbell classification:

This technique is not only useful for localization, but also for visual question and answers, image captioning, and other inferential use-cases in computer vision. It is also very helpful in debugging the data requirements for building an accurate model. For example, we can clearly see that the current VGG16 model prioritises arm as an important feature while categorizing dumbells, which implies that we should increase instances of dumbbell images without an associated arm in the training data so that the model learns to localize the dumbbell without the arm as well. Although hyperparameter tuning is not associated with this technique, we can generalize the model better with different data augmentation techniques, and Grad-CAM can help us in identifying inherent data biases that might be present in our dataset already.

Output on aerial imagery, after retraining for classes present in aerial imagery using a custom model:

Grad-CAM is not the only method to visualise how neural nets look at our world. There are other methods like saliency maps and Deconv Nets that can be applied. I would love to hear from you on how you implemented these tools to improve the performance of neural nets. Hit me up in the comments below or reach out to varunjay@attentive.ai. Also, do visit us at Attentive AI to learn more about more such stuffs.

References

[1] R. Selvaraju, et. al. Grad-Cam: Visual Explanations from Deep Networks via Gradient-based Localization. In CoRR, 2016

[2] A. Agrawal, D. Batra, and D. Parikh. Analyzing the Behavior of Visual Question Answering Models. In EMNLP, 2016

[3] L. Bazzani, A. Bergamo, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. In WACV, 2016

[4] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 2016

Opening the Black Box: How Neural Nets See our World

References

Written by Tech@Attentive