Gradient-weighted Class Activation Mapping (Grad-CAM)

Mohamed Chetoui
4 min read · Mar 14, 2019


Introduction

Grad-CAM is a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing the regions of the input that are “important” for the model's predictions, i.e. by producing visual explanations.

This visualization is both high-resolution (when the class of interest is ‘tiger cat’, it identifies important ‘tiger cat’ features like stripes, pointy ears and eyes) and class-discriminative (it shows the ‘tiger cat’ but not the ‘boxer (dog)’).

Proposed approach

Backpropagation

The gradient of the loss (for the category ‘cat’) with respect to the input pixels gives a saliency map over the image.

It’s pretty noisy!
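As a quick illustration, here is a minimal sketch of such a vanilla-gradient saliency map with Keras/TF2 (the pretrained VGG16 and the random stand-in image are assumptions for the sake of a runnable example):

```python
import tensorflow as tf

# Vanilla-gradient saliency: backprop a class score to the input pixels.
model = tf.keras.applications.VGG16(weights="imagenet")
img = tf.random.uniform((1, 224, 224, 3))   # stand-in for a preprocessed image

with tf.GradientTape() as tape:
    tape.watch(img)                          # track gradients w.r.t. the pixels
    preds = model(img)
    class_idx = int(tf.argmax(preds[0]))     # e.g. the 'tiger cat' class
    class_score = preds[:, class_idx]

grads = tape.gradient(class_score, img)      # d(score)/d(pixels)
saliency = tf.reduce_max(tf.abs(grads), axis=-1)[0]   # (224, 224) noisy map
```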

Deconv and Guided Backprop

These give much cleaner results.
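In TF2, Guided Backpropagation can be sketched by overriding the ReLU gradient so that only positive gradients flow back through positive activations (a common pattern, not the original repositories' exact code; `model`, `img` and `class_idx` are reused from the snippet above):

```python
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    def grad(dy):
        # Pass a gradient back only where both the upstream gradient
        # and the forward activation are positive.
        return tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, dy.dtype) * dy
    return tf.nn.relu(x), grad

# Clone the network and swap every ReLU activation for guided_relu.
gb_model = tf.keras.models.clone_model(model)
for layer in gb_model.layers:
    if getattr(layer, "activation", None) is tf.keras.activations.relu:
        layer.activation = guided_relu
gb_model.set_weights(model.get_weights())

with tf.GradientTape() as tape:
    tape.watch(img)
    score = gb_model(img)[:, class_idx]
guided_grads = tape.gradient(score, img)[0]   # much cleaner (224, 224, 3) map
```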

Now let's take a different picture, one that contains two categories (a cat and a dog).

Let's visualize the important regions for each of these two categories using Guided Backpropagation, which gives:

This is bad. The visualization is unable to distinguish between pixels of cat and dog. In other words, the visualization is not class-discriminative.

CAM (Class Activation Mapping)

CAM modifies the base network: it removes all fully-connected layers at the end and adds a tensor product (followed by a softmax) that takes the Global-Average-Pooled convolutional feature maps as input and outputs the probability for each class.

https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/

Note that this modification of the architecture forces us to retrain the network.
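For concreteness, a CAM-style head might look like this in Keras (a sketch; the new Dense layer would have to be trained before the heat-map is meaningful, and `img` and `class_idx` are reused from the snippets above):

```python
import tensorflow as tf

# CAM architecture: conv features -> global average pooling -> Dense softmax.
backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
inputs = tf.keras.Input((224, 224, 3))
features = backbone(inputs)                            # (None, 7, 7, 512)
gap = tf.keras.layers.GlobalAveragePooling2D()(features)
outputs = tf.keras.layers.Dense(1000, activation="softmax")(gap)
cam_model = tf.keras.Model(inputs, outputs)            # must be (re)trained

# CAM for class c: weight each feature map A^k by the Dense weight w_k^c.
feature_model = tf.keras.Model(inputs, features)
A = feature_model(img)[0]                              # (7, 7, 512)
W = cam_model.layers[-1].get_weights()[0]              # (512, 1000)
cam_map = tf.einsum("hwk,k->hw", A, W[:, class_idx])   # (7, 7) heat-map
```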

Can we get these visualizations without changing the base model, and without any re-training?

Let’s see how Grad-CAM computes these importance weights without any retraining.

To obtain the class-discriminative localization map, Grad-CAM computes the gradient of y^c (the score for class c) with respect to the feature maps A^k of a convolutional layer. These gradients flowing back are global-average-pooled to obtain the importance weights α_k^c:
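$$\alpha_k^c \;=\; \underbrace{\frac{1}{Z}\sum_i\sum_j}_{\text{global average pooling}} \; \underbrace{\frac{\partial y^c}{\partial A^k_{ij}}}_{\text{gradients via backprop}}$$

where Z is the number of pixels in the feature map.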

Similar to CAM, the Grad-CAM heat-map is a weighted combination of feature maps, but followed by a ReLU:
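$$L^c_{\text{Grad-CAM}} \;=\; \mathrm{ReLU}\!\Big(\sum_k \alpha_k^c A^k\Big)$$

The ReLU keeps only the features that have a positive influence on the class of interest.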

This results in a coarse heat-map of the same size as the convolutional feature maps (14×14 in the case of the last convolutional layers of VGG and AlexNet).

If the architecture is already CAM-compatible, the weights learned by CAM are precisely the weights computed by Grad-CAM (up to a proportionality constant). Other than the ReLU, this makes Grad-CAM a generalization of CAM, and it is this generalization that allows Grad-CAM to be applied to any CNN-based architecture.
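A one-line check, in the paper's notation: for a CAM architecture the class score is a dot product of the class weights with the Global-Average-Pooled feature maps, so

$$y^c=\sum_k w_k^c\,\frac{1}{Z}\sum_{i,j}A^k_{ij} \;\;\Rightarrow\;\; \frac{\partial y^c}{\partial A^k_{ij}}=\frac{w_k^c}{Z} \;\;\Rightarrow\;\; \alpha_k^c=\frac{1}{Z}\sum_{i,j}\frac{w_k^c}{Z}=\frac{w_k^c}{Z}\;\propto\; w_k^c.$$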

Guided Grad-CAM

While Grad-CAM visualizations are class-discriminative and localize relevant image regions well, they lack the ability to show fine-grained importance the way pixel-space gradient visualization methods (Guided Backpropagation and Deconvolution) do. Take the case of the left image in the figure above: Grad-CAM can easily localize the cat region, but it is unclear from the low resolution of the heat-map why the network predicts this particular instance as ‘tiger cat’. To combine the best aspects of both, we can fuse the Guided Backpropagation and Grad-CAM visualizations via pointwise multiplication, after upsampling the Grad-CAM heat-map to the input image resolution with bilinear interpolation. The Grad-CAM overview figure above illustrates this fusion.
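A minimal fusion sketch (reusing `guided_grads` from the Guided Backpropagation snippet above and a Grad-CAM heat-map `cam` such as the one computed in the Keras example in the Code section below):

```python
import tensorflow as tf

# Upsample the coarse heat-map to the input resolution (bilinear by default),
# normalize it, and multiply it pointwise with the Guided Backprop map.
cam_up = tf.image.resize(cam[..., None], (224, 224))[..., 0]   # (224, 224)
cam_up = cam_up / (tf.reduce_max(cam_up) + 1e-8)
guided_gradcam = guided_grads * cam_up[..., None]              # (224, 224, 3)
```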

Code

Original source code (PyTorch): https://github.com/ramprs/grad-cam

Keras version (Jupyter notebook): https://github.com/eclique/keras-gradcam

TensorFlow version (VGG, ResNet50, ResNet101): https://github.com/insikk/Grad-CAM-tensorflow

An example of Grad-CAM code with Keras:
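This is a minimal sketch along the same lines as the repositories above (not their exact code), assuming a pretrained VGG16 and its last convolutional layer, block5_conv3:

```python
import tensorflow as tf

model = tf.keras.applications.VGG16(weights="imagenet")
last_conv = model.get_layer("block5_conv3")            # 14x14x512 feature maps
grad_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

def grad_cam(img, class_idx=None):
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(img)
        if class_idx is None:
            class_idx = int(tf.argmax(preds[0]))       # top predicted class
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)             # dy^c / dA^k
    alpha = tf.reduce_mean(grads, axis=(1, 2))         # GAP -> (1, 512) weights
    cam = tf.nn.relu(tf.einsum("bk,bhwk->bhw", alpha, conv_out))[0]
    return cam / (tf.reduce_max(cam) + 1e-8)           # normalized 14x14 map

cam = grad_cam(tf.random.uniform((1, 224, 224, 3)))    # stand-in input image
```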

Demo

A live demo of Grad-CAM applied to image classification can be found at http://gradcam.cloudcv.org/classification.

A quick video showing some of its functionalities is available there as well.

arXiv paper: https://arxiv.org/abs/1610.02391
