Review — Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks (Weakly Supervised Object Localization)

From CAM, Grad-CAM, to Grad-CAM++. Better Visual Explanations Than Grad-CAM

Sik-Ho Tsang
The Startup
7 min read · Jan 3, 2021


All dogs (multiple objects) are better visualized (1st and 2nd rows), and the entire region of the class is localized (3rd and 4th rows), in the Grad-CAM++ and Guided Grad-CAM++ saliency maps, while the Grad-CAM heatmaps only exhibit partial coverage.

In this story, Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks (Grad-CAM++), by the Indian Institute of Technology Hyderabad and Cisco Systems, is presented. In this paper:

  • Grad-CAM++, built on Grad-CAM, provides better visual explanations of CNN model predictions, in terms of better object localization as well as explaining occurrences of multiple object instances in a single image.
  • A weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score is used as the weights to generate a visual explanation for the corresponding class label.

This is a paper in 2018 WACV with over 260 citations. (Sik-Ho Tsang @ Medium)

This series of CAM techniques has been widely used in many SOTA papers, such as EfficientNet (which uses CAM), for visual explanation of models. This series of papers is worth reading.

Outline

  1. Brief Review of CAM
  2. Brief Review of Grad-CAM
  3. Grad-CAM++
  4. Experimental Results

1. Brief Review of CAM

  • In CAM, the CNN needs to be modified, thus requiring retraining. Fully connected layers need to be removed.
  • Instead, Global Average Pooling (GAP) is used before softmax.
  • The final classification score Yc for a particular class c can be written as a linear combination of its global average pooled last convolutional layer feature maps Ak: Yc = Σk wck · Σi Σj Akij.
  • Each spatial location (i, j) in the class-specific saliency map Lc is then calculated as: Lcij = Σk wck · Akij (a minimal sketch of this computation follows this list).
  • Lcij directly correlates with the importance of a particular spatial location (i, j) for a particular class c and thus functions as a visual explanation of the class predicted by the network.
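
A minimal NumPy sketch of this computation, assuming the last convolutional feature maps and the learned GAP-layer weights for the class of interest are already available as arrays (all names are illustrative):

```python
import numpy as np

def cam_saliency(feature_maps, class_weights):
    """Class Activation Map for one class c.

    feature_maps:  (K, H, W) last convolutional layer activations A^k
    class_weights: (K,) weights w_k^c of the GAP + softmax layer for class c
    """
    # Lc_ij = sum_k w_k^c * A^k_ij
    return np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
```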

2. Brief Review of Grad-CAM

  • Grad-CAM was built to address the issues of CAM, so that no retraining or architectural modification is needed.
  • Grad-CAM backpropagates the class score to calculate the weights.
  • The weights wck for a particular feature map Ak and class c are defined as wck = (1/Z) Σi Σj ∂Yc/∂Akij, where Z is a constant (the number of pixels in the activation map). A minimal sketch of this computation follows this list.
  • Grad-CAM can thus work with any deep CNN where the final Yc is a differentiable function of the activation maps Ak.
  • To obtain fine-grained pixel-scale representations, the Grad-CAM saliency maps are upsampled and fused via point-wise multiplication with the visualizations generated by Guided Backpropagation. This visualization is referred to as Guided Grad-CAM.
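
A minimal NumPy sketch of the Grad-CAM weighting, assuming the gradients of the class score with respect to the last convolutional feature maps have already been obtained by backpropagation (the backward pass itself is framework-specific and not shown; names are illustrative):

```python
import numpy as np

def grad_cam_saliency(feature_maps, gradients):
    """Grad-CAM saliency map for one class c.

    feature_maps: (K, H, W) last convolutional layer activations A^k
    gradients:    (K, H, W) partial derivatives dY^c / dA^k_ij
    """
    # w_k^c = (1/Z) * sum_ij dY^c/dA^k_ij, with Z = H*W
    weights = gradients.mean(axis=(1, 2))                   # (K,)
    saliency = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    # ReLU keeps only the features with a positive influence on class c
    return np.maximum(saliency, 0)
```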

However, it is found that Grad-CAM fails to properly localize objects in an image if the image contains multiple occurrences of the same class.

Another consequence of the unweighted average of partial derivatives is that the localization often does not correspond to the entire object, but only to bits and parts of it.

  • This is shown in the first figure at the top.

3. Grad-CAM++

An overview of all three methods: CAM, Grad-CAM, and Grad-CAM++
  • As mentioned and shown above, CAM needs network modification.
  • Grad-CAM estimates the weight by dividing by Z, i.e., the size of the feature map. If the response, or the area of the response, is small, the weight becomes small.
  • Grad-CAM++ solves the above issues with a more sophisticated weighting of the gradients (a sketch follows this list): wck = Σi Σj αkcij · relu(∂Yc/∂Akij),
  • such that: αkcij = (∂²Yc/(∂Akij)²) / (2·∂²Yc/(∂Akij)² + Σa Σb Akab · ∂³Yc/(∂Akij)³).
  • If αkcij = 1/Z, Grad-CAM++ reduces to the formulation for Grad-CAM.
  • Thus, Grad-CAM++, as its name suggests, can be (loosely) considered a generalized formulation of Grad-CAM.
  • (I only have a brief review of Grad-CAM++ here. A more detailed derivation of the equations is given in the paper. If interested, please feel free to read it.)
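
A minimal NumPy sketch of the Grad-CAM++ weighting under the same assumptions as above; the first-, second- and third-order derivatives are taken as given (the paper obtains the higher-order terms in closed form by passing the class score through an exponential):

```python
import numpy as np

def grad_cam_pp_saliency(feature_maps, grads1, grads2, grads3):
    """Grad-CAM++ saliency map for one class c.

    feature_maps:           (K, H, W) activations A^k
    grads1, grads2, grads3: (K, H, W) first, second and third partial
                            derivatives of Y^c w.r.t. A^k_ij
    """
    # alpha^kc_ij = grads2 / (2*grads2 + sum_ab A^k_ab * grads3)
    global_sum = feature_maps.sum(axis=(1, 2), keepdims=True)   # (K, 1, 1)
    denom = 2.0 * grads2 + global_sum * grads3
    alpha = grads2 / np.where(denom != 0.0, denom, 1e-8)        # avoid /0
    # w_k^c = sum_ij alpha^kc_ij * relu(dY^c/dA^k_ij)
    weights = (alpha * np.maximum(grads1, 0)).sum(axis=(1, 2))  # (K,)
    return np.maximum(np.tensordot(weights, feature_maps, axes=1), 0)
```
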
A hypothetical example elucidating the intuition behind Grad-CAM++
  • Comparing with the input image I, it is evident that the spatial footprint of an object in an image is important for Grad-CAM’s visualizations to be strong.
  • Hence, if there were multiple occurrences of an object with slightly different orientations or views (or parts of an object that excite different feature maps), different feature maps may be activated with differing spatial footprints, and the feature maps with lesser footprints fade away in the final saliency map.

This is shown in the example in the figure above: with Grad-CAM, feature maps with smaller spatial footprints obtain smaller weights (0.26 and 0.13), which makes them less important.

With Grad-CAM++, this issue of Grad-CAM is solved: all weights in the example are 1.0, so the feature maps are treated as equally important.

  • Similar to Grad-CAM, to generate the final saliency maps, we carry out pointwise multiplication of the upsampled (to image resolution) saliency map Lc with the pixel-space visualization generated by Guided Backpropagation. The representations thus generated are hence called Guided Grad-CAM++.
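
A minimal sketch of this fusion step, assuming the saliency map has already been upsampled to the input resolution and the Guided Backpropagation output is available as an array (names are illustrative):

```python
import numpy as np

def guided_fusion(upsampled_saliency, guided_backprop):
    """Guided Grad-CAM(++): point-wise product of the upsampled
    class-discriminative saliency map with the Guided Backpropagation output.

    upsampled_saliency: (H, W) saliency map resized to the input image size
    guided_backprop:    (H, W, C) Guided Backpropagation visualization
    """
    return guided_backprop * upsampled_saliency[..., None]
```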

4. Experimental Results

4.1. Object Recognition

Grad-CAM++ and Grad-CAM on the ImageNet validation set
Grad-CAM++ and Grad-CAM on the PASCAL VOC 2007 validation set
  • VGG-16 is used.
  • The pixel-wise weighting adopted by Grad-CAM++ in generating the visual explanations is more model-appropriate and consistent with the model's prediction.
Sample visual explanations on ImageNet generated by Grad-CAM and Grad-CAM++
  • 13 human subjects were asked which explanation invoked more trust in the underlying model.
  • Grad-CAM++ achieved a score of 109.69 as compared to 56.08 for Grad-CAM. The remaining 84.23 points were labeled as "same" by the subjects.

4.2. Object Localization

IoU results for object localization on the PASCAL VOC 2012 val set
  • The results show Grad-CAM++'s improvement over Grad-CAM on this metric too. In particular, the IoU improvement increases with greater values of δ, the threshold used to generate the binary map (a sketch of the metric follows).
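
A minimal sketch of the localization metric described above; the min-max normalization and the exact evaluation protocol are assumptions here:

```python
import numpy as np

def localization_iou(saliency, gt_mask, delta=0.25):
    """Binarize the saliency map at threshold delta and compute IoU
    against a binary ground-truth object mask."""
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    pred = s >= delta
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / max(union, 1)
```
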
Object localization capabilities of Grad-CAM and Grad-CAM++
  • Grad-CAM's heatmaps of the objects have lower intensity values in general when compared to Grad-CAM++.

4.3. Knowledge Distillation

  • A WRN-40-2 teacher network (2.2 M parameters) is trained on the CIFAR-10 dataset. To train a student WRN-16-2 network (0.7 M parameters), a modified loss Lexp_student is used, which is a weighted combination of the standard cross entropy loss Lcross_ent and an interpretability loss Linterpret: Lexp_student = Lcross_ent + α·Linterpret, with α a balancing weight.
  • Here, Linterpret is defined as the squared L2 distance between the student's and the teacher's explanation maps for the correct class (a minimal sketch follows this list).
  • The explanation map L in Linterpret is based on the weights found by Grad-CAM or Grad-CAM++.
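
A minimal sketch of the combined student loss, assuming the interpretability term is the squared L2 distance between the student's and the teacher's explanation maps (the balancing coefficient alpha and the function names are illustrative):

```python
import numpy as np

def interpretability_loss(student_map, teacher_map):
    """L_interpret: squared L2 distance between the student's and the
    teacher's (Grad-CAM or Grad-CAM++) explanation maps."""
    return np.sum((student_map - teacher_map) ** 2)

def student_loss(cross_entropy, student_map, teacher_map, alpha=0.5):
    """L_exp_student = L_cross_ent + alpha * L_interpret."""
    return cross_entropy + alpha * interpretability_loss(student_map, teacher_map)
```
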
Results for knowledge distillation to train a student (WRN-16-2) from a teacher network (WRN-40-2)
  • Grad-CAM++ provides better explanation-based knowledge distillation than Grad-CAM.
  • The student considered had a 68.18% reduction in the number of parameters.
Results for training a student network with explanations from the teacher (VGG-16 fine-tuned) and with knowledge distillation on PASCAL VOC 2007 dataset
  • There is an increase in mean Average Precision (mAP) of about 35% as compared to training the student network solely on the VOC 2007 train set.
  • The student network was a shallower 11-layer CNN with 27M parameters (an 80% reduction).

4.4. Image Captioning and 3D Action Recognition

Visual explanations of image captions
  • The architecture used here is the Show-and-Tell model, which consists of a CNN to encode the image followed by an LSTM to generate the captions.
  • Grad-CAM++ produces more complete heatmaps than Grad-CAM.
  • Grad-CAM++’s visualization gives insight into what the network focused on.
Results on the 3D action recognition task for visual explanations
  • A C3D model trained on the Sports-1M dataset is used.
  • The performance of Grad-CAM++ is better than Grad-CAM in all the metrics.
  • The explanations generated by Grad-CAM++ are more semantically relevant.
Example explanation maps for video frames generated by Grad-CAM and Grad-CAM++ for a particular predicted action.
  • Grad-CAM++ tends to highlight the context of the video (similar to images) as less bright regions and the most discriminative parts as brighter regions in the video explanations.

Reference

[2018 WACV] [Grad-CAM++]
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks

Weakly Supervised Object Localization (WSOL)

2014 [Backprop] 2016 [CAM] 2017 [Grad-CAM] [Hide-and-Seek] 2018 [Grad-CAM++] [ACoL] [SPG] 2019 [CutMix] [ADL] 2020 [Evaluating WSOL Right] [SAOL]

My Other Previous Paper Readings
