Review — Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks (Weakly Supervised Object Localization)
In this story, Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks (Grad-CAM++), by Indian Institute of Technology Hyderabad and Cisco Systems, is presented. In this paper:
- Grad-CAM++, built on Grad-CAM, provides better visual explanations of CNN model predictions, in terms of better object localization as well as explaining occurrences of multiple object instances in a single image.
- A weighted combination of the positive partial derivatives of a specific class score with respect to the last convolutional layer feature maps is used as the weights to generate a visual explanation for the corresponding class label.
This is a paper in 2018 WACV with over 260 citations. (Sik-Ho Tsang @ Medium)
1. Brief Review of CAM
- In CAM, the CNN needs to be modified, thus requiring retraining. Fully connected layers need to be removed.
- Instead, Global Average Pooling (GAP) is used before softmax.
- The final classification score Yc for a particular class c can be written as a linear combination of its global average pooled last convolutional layer feature maps Ak: Yc = Σk wck · (1/Z) · Σi Σj Akij, where Z is the number of pixels in a feature map.
- Each spatial location (i, j) in the class-specific saliency map Lc is then calculated as: Lcij = Σk wck · Akij.
- Lcij directly correlates with the importance of a particular spatial location (i, j) for a particular class c and thus functions as a visual explanation of the class predicted by the network.
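The CAM computation above can be sketched as follows; all shapes and values here are illustrative, and only NumPy is used:

```python
import numpy as np

# Illustrative shapes: K feature maps of size H x W from the last conv
# layer, and the learned GAP->softmax weights wck for each class c.
K, H, W = 4, 7, 7
rng = np.random.default_rng(0)
A = rng.random((K, H, W))        # feature maps A^k
w = rng.random((10, K))          # class weights w^c_k (10 classes)

def cam(A, w, c):
    """Class activation map: Lc_ij = sum_k wck * Ak_ij."""
    return np.tensordot(w[c], A, axes=1)   # shape (H, W)

L = cam(A, w, c=3)
# The class score Yc is the global average pool of the same map:
# Yc = sum_k wck * (1/Z) * sum_ij Ak_ij, with Z = H * W.
Yc = L.mean()
```

Note that averaging the saliency map gives exactly the class score, which is why CAM's map is a faithful decomposition of the prediction.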
2. Brief Review of Grad-CAM
- Grad-CAM was built to address the issues of CAM, so that it needs no retraining or architectural modification.
- Grad-CAM backprops the class score to the last convolutional layer to calculate the weights.
- The weight wck for a particular feature map Ak and class c is defined as: wck = (1/Z) · Σi Σj ∂Yc/∂Akij.
- where Z is a constant (number of pixels in the activation map).
- Grad-CAM can thus work with any deep CNN where the final Yc is a differentiable function of the activation maps Ak.
- To obtain fine-grained pixel-scale representations, the Grad-CAM saliency maps are upsampled and fused via point-wise multiplication with the visualizations generated by Guided Backpropagation. This visualization is referred to as Guided Grad-CAM.
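The Grad-CAM weights and saliency map above can be sketched as below, assuming the gradients ∂Yc/∂Ak have already been obtained by backprop (NumPy only, no deep learning framework):

```python
import numpy as np

def grad_cam(A, dYdA):
    """Grad-CAM saliency map.
    A:    (K, H, W) last-conv feature maps.
    dYdA: (K, H, W) gradients dYc/dAk from backprop.
    wck = (1/Z) * sum_ij dYc/dAk_ij, then Lc = ReLU(sum_k wck * Ak)."""
    K, H, W = A.shape
    Z = H * W
    w = dYdA.reshape(K, -1).sum(axis=1) / Z   # global-average-pooled grads
    L = np.tensordot(w, A, axes=1)            # weighted combination, (H, W)
    return np.maximum(L, 0.0)                 # ReLU keeps positive evidence
```

Upsampling this map to the input resolution and multiplying it point-wise with the Guided Backpropagation map would then give Guided Grad-CAM.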
3. Grad-CAM++
However, it is found that Grad-CAM fails to properly localize objects in an image if the image contains multiple occurrences of the same class.
Another consequence of an unweighted average of the partial derivatives is that the localization often does not correspond to the entire object, but only to bits and parts of it.
- This can be shown in the first figure at the top.
- As mentioned and shown above, CAM needs network modification.
- Grad-CAM estimates the weight by dividing by Z, i.e. the size of the feature map. If the response is small, or the area of the response is small, the weight becomes smaller.
- Grad-CAM++ solves the above issues by a more sophisticated backprop, taking a weighted sum of only the positive partial derivatives: wck = Σi Σj αkcij · ReLU(∂Yc/∂Akij).
- Such that the pixel-wise weights αkcij are computed from higher-order derivatives: αkcij = (∂²Yc/(∂Akij)²) / (2 · ∂²Yc/(∂Akij)² + Σa Σb Akab · ∂³Yc/(∂Akij)³).
- if αkcij = 1/Z, Grad-CAM++ reduces to the formulation for Grad-CAM.
- Thus, Grad-CAM++, as its name suggests, can be (loosely) considered a generalized formulation of Grad-CAM.
- (I only have a brief review on Grad-CAM++. More detailed derivation of equations is described in the paper. If interested, please feel free to read the paper.)
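Under the paper's assumption that the final score is an exponential of the penultimate class score, the higher-order derivatives in αkcij reduce to powers of the first-order gradient, which is how Grad-CAM++ is commonly implemented. A minimal sketch under that assumption:

```python
import numpy as np

def grad_cam_pp(A, dYdA, eps=1e-8):
    """Grad-CAM++ saliency map, using the closed form that arises when
    Yc = exp(Sc), so higher-order derivatives become powers of g = dSc/dA:
        alpha_kij = g^2 / (2*g^2 + (sum_ab Ak_ab) * g^3)
        wck = sum_ij alpha_kij * ReLU(g_kij)
        Lc  = ReLU(sum_k wck * Ak)"""
    g = dYdA
    g2, g3 = g ** 2, g ** 3
    sum_A = A.sum(axis=(1, 2), keepdims=True)        # sum_ab Ak_ab, per map
    denom = 2.0 * g2 + sum_A * g3
    alpha = g2 / np.where(denom != 0.0, denom, eps)  # pixel-wise weights
    w = (alpha * np.maximum(g, 0.0)).sum(axis=(1, 2))
    return np.maximum(np.tensordot(w, A, axes=1), 0.0)
```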
- Comparing with the input image, it is evident that the spatial footprint of an object in an image is important for Grad-CAM’s visualizations to be strong.
- Hence, if there were multiple occurrences of an object with slightly different orientations or views (or parts of an object that excite different feature maps), different feature maps may be activated with differing spatial footprints, and the feature maps with lesser footprints fade away in the final saliency map.
This can be seen in the example in the figure above: feature maps with smaller footprints obtain smaller weights (0.26 & 0.13), making them less important.
With Grad-CAM++, this issue is resolved: all the weights in the example are 1.0, so every occurrence is treated as equally important.
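A tiny numeric illustration of this footprint effect, using a binary feature map with m active pixels and unit gradients on them (an illustrative setup, not the paper's exact example):

```python
import numpy as np

def toy_weights(m, Z=49):
    """Weight given to a binary feature map with footprint m (out of Z
    pixels), with unit gradient on the footprint (illustrative setup)."""
    A = np.zeros(Z)
    A[:m] = 1.0                      # feature map with m active pixels
    g = A.copy()                     # gradient: 1 on the footprint, else 0
    w_gradcam = g.sum() / Z          # (1/Z)*sum g -> shrinks with footprint
    # Grad-CAM++ alpha (the "+ (g == 0)" term only avoids 0/0 off-footprint):
    alpha = g**2 / (2.0 * g**2 + A.sum() * g**3 + (g == 0))
    w_gcpp = (alpha * np.maximum(g, 0.0)).sum()   # = m / (m + 2)
    return w_gradcam, w_gcpp
```

For m = 5 vs m = 40, the Grad-CAM weights are about 0.10 vs 0.82, an 8x gap, while the Grad-CAM++ weights are about 0.71 vs 0.95, nearly equal regardless of footprint.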
- Similar to Grad-CAM, to generate the final saliency maps, we carry out pointwise multiplication of the upsampled (to image resolution) saliency map Lc with the pixel-space visualization generated by Guided Backpropagation. The representations thus generated are hence called Guided Grad-CAM++.
4. Experimental Results
4.1. Object Recognition
- VGG-16 is used.
- The pixel-wise weighting adopted by Grad-CAM++ in generating the visual explanations is more model-appropriate and consistent with the model’s prediction.
- 13 human subjects were asked which visualization invokes more trust in the underlying model.
- Grad-CAM++ achieved a score of 109.69 as compared to 56.08 for Grad-CAM. The remaining 84.23 was labeled as “same” by the subjects.
4.2. Object Localization
- The results show Grad-CAM++’s improvement over Grad-CAM on this metric too. In particular, the IoU improvement increases with greater values of δ, the threshold used to generate the binary map.
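The localization metric described here can be sketched as a generic IoU-at-threshold computation (the paper's exact evaluation protocol may differ in details):

```python
import numpy as np

def loc_iou(saliency, gt_mask, delta):
    """Binarize an (H, W) saliency map at threshold delta and compute
    IoU against a binary ground-truth object mask."""
    pred = saliency >= delta
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union else 0.0
```

A high δ keeps only the strongest saliency responses, so a method whose heatmap covers the whole object (rather than bits of it) degrades less as δ grows.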
- CAM’s heatmaps of the objects have lower intensity values in general, when compared to Grad-CAM++.
4.3. Knowledge Distillation
- A WRN-40-2 teacher network (2.2M parameters) is trained on the CIFAR-10 dataset. To train a student WRN-16-2 network (0.7M parameters), a modified loss Lexp_student is used, which is a weighted combination of the standard cross entropy loss Lcross_ent and an interpretability loss Linterpret:
- where Linterpret penalizes the mismatch between the teacher’s and the student’s explanation maps for the correct class (an l2-style distance between the two saliency maps).
- The L in Linterpret is based on the weights found by Grad-CAM or Grad-CAM++.
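The distillation loss can be sketched as below; the l2 form of Linterpret and the weighting factor alpha are illustrative assumptions rather than the paper's exact hyperparameters:

```python
import numpy as np

def interpret_loss(L_teacher, L_student):
    """Linterpret: squared l2 distance between the teacher's and the
    student's saliency maps (illustrative form)."""
    return float(((L_teacher - L_student) ** 2).sum())

def student_loss(ce_loss, L_teacher, L_student, alpha=0.5):
    """Lexp_student = Lcross_ent + alpha * Linterpret
    (the weighting alpha here is an illustrative choice)."""
    return ce_loss + alpha * interpret_loss(L_teacher, L_student)
```

The saliency maps fed in would come from Grad-CAM or Grad-CAM++ applied to the teacher and the student on the same input.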
- Grad-CAM++ provides better explanation-based knowledge distillation than Grad-CAM.
- The student considered had a 68.18% reduction in the number of parameters.
- There is an increase in the mean Average Precision (mAP) of about 35% as compared to training the student network solely on the VOC 2007 train set.
- The student network was a shallower 11-layer CNN with 27M parameters (an 80% reduction).
4.4. Image Captioning and 3D Action Recognition
- The architecture used here is the Show-and-Tell model, which includes a CNN to encode the image followed by an LSTM to generate the captions.
- Grad-CAM++ produces more complete heatmaps than Grad-CAM.
- Grad-CAM++’s visualization gives insight into what the network focused on.
- C3D model is trained on Sports-1M dataset.
- The performance of Grad-CAM++ is better than Grad-CAM in all the metrics.
- The explanations generated by Grad-CAM++ are more semantically relevant.
- Grad-CAM++ tends to highlight the context of the video (similar to images) as less bright regions and the most discriminative parts as brighter regions in the video explanations.
[2018 WACV] [Grad-CAM++]
Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks