Explainable AI using Grad-CAM
Explainable AI refers to methods that AI experts, business users, and others use to analyze deep learning algorithms. It provides much-needed visibility into the inner workings of an algorithm: the why and how of its decisions.
Gradient-weighted Class Activation Mapping (Grad-CAM) uses the gradients of a particular target class, flowing back through the convolutional network, to localize and highlight the regions of the image that matter for that target.
Grad-CAM has wide scope in computer vision tasks such as image classification, object detection, semantic segmentation, image captioning, and visual question answering. It also helps interpret models: “why they predict what they predict”.
A good visual explanation of the model for a target category should be:
Class discriminative: it localizes the category in the image
High resolution: it captures fine-grained details
Grad-CAM is fused with existing pixel-space gradient visualizations to create Guided Grad-CAM that results in high resolution detail of the target class in an image.
In the above figure: (a) original image of a cat and a dog; (b) Guided Backpropagation, which highlights all contributing features; (c) Grad-CAM, which localizes class-discriminative regions; (d) Guided Grad-CAM, which combines (b) and (c) to provide a high-resolution, class-discriminative visualization; (e) occlusion sensitivity; (f) Grad-CAM for ResNet.
Panels (g–l) repeat the computations in (a–f) for the “Dog” class. In the Grad-CAM panels, red regions correspond to a high score for the class, while in the occlusion sensitivity panels, blue corresponds to evidence for the class.
In the above figure, the image and the class of interest are forward propagated through the CNN and then through the task-specific computation to obtain a score for the category. The gradient is set to 1 for the desired class and to 0 for all other classes. This signal is then backpropagated to the rectified convolutional feature maps of interest, which are combined to compute the coarse Grad-CAM localization (blue heatmap). Finally, the heatmap is multiplied pointwise with the guided backpropagation output to produce the Guided Grad-CAM visualization.
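The step "set the desired class to 1 and the rest to 0" can be illustrated with a toy example. The sketch below assumes a hypothetical task head made of global average pooling followed by a linear layer (the names `one_hot_signal` and `backprop_through_gap_linear` are made up for illustration). In practice a deep learning framework's autograd would do this for an arbitrary head.

```python
import numpy as np

def one_hot_signal(num_classes, target):
    """Gradient signal at the output: 1 for the class of interest, 0 elsewhere."""
    signal = np.zeros(num_classes)
    signal[target] = 1.0
    return signal

def backprop_through_gap_linear(weights, signal, height, width):
    """Backpropagate `signal` through scores = weights @ GAP(A).

    weights: (num_classes, K) matrix of the assumed linear layer.
    Returns d(selected score)/dA with shape (K, height, width).
    """
    grad_pooled = weights.T @ signal                      # (K,): gradient w.r.t. pooled features
    grad_maps = grad_pooled[:, None, None] / (height * width)  # GAP spreads it evenly
    return np.broadcast_to(grad_maps, (weights.shape[1], height, width)).copy()

# Toy head: 3 classes, 2 feature maps of size 2x2.
weights = np.array([[1.0, -2.0],
                    [0.5, 3.0],
                    [0.0, 1.0]])
grads = backprop_through_gap_linear(weights, one_hot_signal(3, target=1), 2, 2)
```

Because the signal is one-hot, only the row of `weights` belonging to the chosen class contributes, so `grads` is exactly the gradient of that class score with respect to the feature maps.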
Computation of class-discriminative localization map through Grad-CAM:
(i) Compute the gradient of the score for a class c with respect to the feature map activations of a convolutional layer:
(ii) The gradients flowing back are global average pooled over the width and height dimensions (indexed by i and j respectively):
(iii) This yields the neuron importance weights:
(iv) A weighted combination of the forward activation maps is computed, followed by a ReLU, to obtain:
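The equations for these steps appear to have been images that are no longer present. As a reconstruction in the notation commonly used for Grad-CAM (assumed here: y^c is the class score, A^k is the k-th feature map, and Z is the number of spatial locations in a feature map), they can be written as:

```latex
% (i) gradient of the class score w.r.t. the feature map activations
\frac{\partial y^c}{\partial A^k}

% (ii)-(iii) global average pooling of the gradients gives the importance weights
\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}

% (iv) weighted combination of the activation maps, followed by ReLU
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_k \alpha_k^c \, A^k \right)
```

The ReLU keeps only the features that have a positive influence on the class of interest.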
Grad-CAM generalizes CAM:
(i) In the equation below, A^k_ij refers to the activation at location (i, j) of the feature map A^k. Feature maps are spatially pooled using Global Average Pooling (GAP) and linearly transformed to produce a score Y^c for each class c,
(ii) Let’s define F to be the global average pooled output,
(iii) CAM computes final scores
(iv) Gradient score with respect to feature map
(v) Taking the partial derivative of (4), it can be seen that
Substituting this in (6):
(vi) From (5), we get that,
(vii) Summing both sides of (8) over all pixels (i,j),
(viii) This could be rewritten as
(ix) Z is the number of pixels in the feature map. Thus, the terms can be reordered:
Thus, Grad-CAM is a strict generalization of CAM.
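The equations for the steps above appear to be missing. A reconstruction is sketched below, with w_k^c assumed to denote the weight connecting feature map k to class c. The numbering is chosen to match the references to (4), (5), (6) and (8) in the text.

```latex
\begin{align}
Y^c &= \sum_k w_k^c \, \frac{1}{Z} \sum_i \sum_j A^k_{ij} \tag{3} \\
F^k &= \frac{1}{Z} \sum_i \sum_j A^k_{ij} \tag{4} \\
Y^c &= \sum_k w_k^c \, F^k \tag{5} \\
\frac{\partial Y^c}{\partial F^k}
    &= \frac{\partial Y^c / \partial A^k_{ij}}{\partial F^k / \partial A^k_{ij}} \tag{6} \\
\frac{\partial F^k}{\partial A^k_{ij}} = \frac{1}{Z}
    \;\Rightarrow\; \frac{\partial Y^c}{\partial F^k}
    &= \frac{\partial Y^c}{\partial A^k_{ij}} \cdot Z \tag{7} \\
\frac{\partial Y^c}{\partial F^k} = w_k^c
    \;\Rightarrow\; w_k^c &= Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{8} \\
\sum_i \sum_j w_k^c &= \sum_i \sum_j Z \cdot \frac{\partial Y^c}{\partial A^k_{ij}} \tag{9} \\
Z \, w_k^c &= Z \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{10} \\
w_k^c &= \sum_i \sum_j \frac{\partial Y^c}{\partial A^k_{ij}} \tag{11}
\end{align}
```

Comparing (11) with the Grad-CAM importance weights, the CAM weight w_k^c equals the pooled gradient up to the constant factor 1/Z, which drops out when the heatmap is normalized for display.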
Though Grad-CAM is class discriminative and localizes relevant image regions, it lacks the ability to highlight fine-grained details. To overcome this, the Guided Backpropagation and Grad-CAM visualizations are fused via element-wise multiplication to produce visualizations that are both high-resolution and class discriminative. For example, when the class of interest is ‘tiger cat’, the result identifies important ‘tiger cat’ features such as stripes, pointy ears and eyes.
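The fusion described above can be sketched with NumPy. This is a minimal illustration, not the reference implementation: it assumes the activations and gradients of the chosen convolutional layer have already been extracted as arrays of shape (K, H, W), and it uses nearest-neighbour upsampling as a simplification where an implementation would more likely use bilinear interpolation. All function names are hypothetical.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse heatmap from activations and gradients, both of shape (K, H, W)."""
    alphas = gradients.mean(axis=(1, 2))                    # global average pool -> (K,)
    weighted = np.tensordot(alphas, activations, axes=1)    # sum_k alpha_k * A^k -> (H, W)
    return np.maximum(weighted, 0.0)                        # ReLU

def upsample_nearest(heatmap, factor):
    """Repeat every pixel `factor` times along both axes (simplifying assumption)."""
    return np.repeat(np.repeat(heatmap, factor, axis=0), factor, axis=1)

def guided_grad_cam(heatmap, guided_backprop, factor):
    """Element-wise product of the upsampled heatmap and a guided backprop map.

    guided_backprop: (H*factor, W*factor) or (H*factor, W*factor, channels).
    """
    upsampled = upsample_nearest(heatmap, factor)
    if guided_backprop.ndim == 3:
        upsampled = upsampled[..., None]                    # broadcast over channels
    return upsampled * guided_backprop
```

Usage would be along the lines of `guided_grad_cam(grad_cam(A, dA), gb, factor=input_size // map_size)`, where `A`, `dA` and `gb` come from whatever framework runs the model.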
Grad-CAM can also highlight the regions that would make the network change its prediction. This is done by negating the gradient of y^c with respect to the feature maps A of a convolutional layer. Thus, the importance weights now become:
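As a small sketch of this idea, the only change relative to regular Grad-CAM is the sign of the gradients before pooling. The helper names and the toy arrays below are assumptions made for illustration.

```python
import numpy as np

def importance_weights(gradients, counterfactual=False):
    """Global-average-pooled gradients of shape (K,); negated for counterfactual maps."""
    sign = -1.0 if counterfactual else 1.0
    return (sign * gradients).mean(axis=(1, 2))

def heatmap(activations, weights):
    """ReLU of the weighted sum of activation maps."""
    return np.maximum(np.tensordot(weights, activations, axes=1), 0.0)

# Toy layer with two 2x2 feature maps: map 0 supports the class, map 1 opposes it.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.stack([np.full((2, 2), 1.0), np.full((2, 2), -1.0)])

supporting = heatmap(activations, importance_weights(gradients))
opposing = heatmap(activations, importance_weights(gradients, counterfactual=True))
```

In this toy case `supporting` lights up the top-left location, while `opposing` lights up the bottom-right one, the location whose removal would make the network more confident in the class.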
Localization ability of Grad-CAM:
Weakly-supervised localization: For a given image, class predictions are obtained first. Grad-CAM maps are then generated for each predicted class and binarized with a threshold of 15% of the maximum intensity. This results in connected segments of pixels, and a bounding box is drawn around the single largest segment. This is weakly supervised localization because the models were never exposed to bounding box annotations during training.
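The thresholding and bounding-box procedure described above can be sketched as follows. This is an illustrative version only: the function name is hypothetical, and 4-connectivity between pixels is an assumption, since the text does not say how connected segments are defined.

```python
from collections import deque
import numpy as np

def largest_segment_bbox(heatmap, threshold_ratio=0.15):
    """Bounding box (row_min, col_min, row_max, col_max) of the largest segment.

    Pixels above threshold_ratio * max intensity are kept; segments are found
    with a breadth-first search over 4-connected neighbours (assumed).
    Returns None if no pixel passes the threshold.
    """
    mask = heatmap > threshold_ratio * heatmap.max()
    visited = np.zeros(mask.shape, dtype=bool)
    rows, cols = mask.shape
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or visited[r, c]:
                continue
            segment, queue = [], deque([(r, c)])
            visited[r, c] = True
            while queue:
                y, x = queue.popleft()
                segment.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not visited[ny, nx]):
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            if len(segment) > len(best):
                best = segment
    if not best:
        return None
    ys, xs = zip(*best)
    return min(ys), min(xs), max(ys), max(xs)
```

A library routine for connected-component labelling could replace the explicit search; it is written out here to keep the sketch dependency-free beyond NumPy.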
Evaluation of Grad-CAM localization with pretrained models:
Weakly-supervised Segmentation — The task of weakly-supervised segmentation involves segmenting objects with just image-level annotation, which can be obtained relatively cheaply from image classification datasets. The CAM maps were replaced with Grad-CAM from a standard VGG-16 network to obtain an Intersection over Union (IoU) score of 49.6 (compared to 44.6 obtained with CAM) on the PASCAL VOC 2012 segmentation task.
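For reference, the Intersection over Union between a predicted and a ground-truth binary mask can be computed as in the sketch below. The function name is hypothetical, and treating two empty masks as a perfect match is an assumed convention. Benchmark scores such as the ones quoted above are typically averaged over classes and reported as percentages, which this single-mask helper does not do.

```python
import numpy as np

def intersection_over_union(pred_mask, true_mask):
    """IoU of two binary masks with the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, true).sum() / union)
```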
Pointing Game: This experiment evaluates how discriminative different visualization methods are when localizing target objects in scenes. Here too, the Grad-CAM output is significantly better.
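A simplified version of the pointing game can be sketched as follows, assuming the basic rule that a visualization scores a hit when its maximum falls inside the ground-truth region of the target object. The function names are hypothetical, and any refinements used in the actual evaluation are not reproduced here.

```python
import numpy as np

def pointing_hit(heatmap, object_mask):
    """True if the heatmap's maximum lies on the target object."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(object_mask[row, col])

def pointing_accuracy(heatmaps, object_masks):
    """Fraction of heatmaps whose maximum lands on the corresponding object."""
    hits = sum(pointing_hit(h, m) for h, m in zip(heatmaps, object_masks))
    return hits / len(heatmaps)
```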
Diagnosing image classification CNNs with Grad-CAM
Analyzing failure modes for VGG-16:
In these cases, the model (VGG-16) failed to predict the correct class in its top 1 (a and d) and top 5 (b and c) predictions. Humans would find it hard to explain some of these predictions without looking at the visualization for the predicted class. But with Grad-CAM, these mistakes seem justifiable.
Effect of adversarial noise on VGG-16:
(a-b) Original image and the generated adversarial image for category “airliner”. (c-d) Grad-CAM visualizations for the original categories “tiger cat” and “boxer (dog)” along with their confidence. Despite the network being completely fooled into predicting the dominant category label of “airliner” with high confidence (>0.9999), Grad-CAM can localize the original categories accurately. (e-f) Grad-CAM for the top-2 predicted classes “airliner” and “space shuttle” seems to highlight the background.
Identifying bias in dataset:
In the first row, we can see that even though both models made the right decision, the biased model (model1) was looking at the face of the person to decide if the person was a nurse, whereas the unbiased model was looking at the short sleeves to make the decision. For the example image in the second row, the biased model made the wrong prediction (misclassifying a doctor as a nurse) by looking at the face and the hairstyle, whereas the unbiased model made the right prediction looking at the white coat, and the stethoscope.
Grad-CAM for Image Captioning and Visual Question Answering
Qualitative Results for our word-level captioning experiments: (a) Given the image on the left and the caption, we visualize Grad-CAM maps for the visual words “bike”, “bench” and “bus”. Note how well the Grad-CAM maps correlate with the COCO segmentation maps on the right column. (b) shows a similar example where we visualize Grad-CAM maps for the visual words “people”, “bicycle” and “bird”.
Visual Question Answering:
To conclude, Gradient-weighted Class Activation Mapping (Grad-CAM) is a simple but very powerful technique for producing visual explanations.
Github link: https://github.com/amitaug1984/Grad-CAM