Class Activation Maps

Gagana B
3 min read · May 31, 2019


This work is a part of the AI Without Borders Initiative.

Co-contributors: Ninad Shukla, Chinmay Pathak, Kevin Garda, Tony Holdroyd, Daniel J Broz.

Read about Grad-CAM here.
Read about Grad-CAM++ here.

Class Activation Maps (CAMs) help identify which regions of an input image influence a convolutional neural network’s output prediction. The technique produces a heat-map representation that highlights the pixels of the image that trigger the model to associate the image with a particular class. In this setting, the layers of the CNN behave like unsupervised object detectors.

The class activation map technique relies on a global average pooling (GAP) layer added after the final convolutional layer. This layer spatially reduces the feature dimensions and the number of parameters, which helps minimize over-fitting.

The global average pooling layer works as follows.

Each image category in the dataset is associated with one activation map, and the layer computes the average of each feature map. A pictorial representation is shown below.

A feature map is the output of one filter applied to the previous layer.

For example, if the last layer has 20 feature maps of size 3×3 (20×3×3), global average pooling computes the average of all pixel values within each feature map and outputs one data point per map. Thus, 20 feature maps yield 20 data points.
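The 20-feature-map example above can be sketched in a few lines of NumPy (the shapes and random values are purely illustrative):

```python
import numpy as np

# Hypothetical final-layer output: 20 feature maps, each 3x3 (shape 20x3x3).
feature_maps = np.random.rand(20, 3, 3)

# Global average pooling: average over the two spatial axes of each feature map.
pooled = feature_maps.mean(axis=(1, 2))

print(pooled.shape)  # (20,) -> 20 feature maps yield 20 data points
```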

The CAM model assumes that the final class score, as described below, can always be expressed as a linear combination of the pooled average feature maps.

The generic flow through a CAM supported architecture is as described below.

After the last convolutional layer of a typical network with N filters, the output has shape N×i×i, where i×i is the spatial size of each feature map (for a k×k input image, i is determined by the network’s down-sampling). The global average pooling layer takes the N channels and returns their spatial averages, so channels with higher activations produce stronger signals. A weight is then assigned to each pooled output per category, either by appending a dense linear layer with softmax or by stacking linear classifiers on top of the GAP output. A heat map is then created per class by taking the weighted sum of the feature maps from the previous layer.
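The GAP-then-linear-classifier flow just described can be sketched with NumPy; the sizes below (N = 4 feature maps of 7×7, three classes) are illustrative assumptions, not values from the original architecture:

```python
import numpy as np

# Assumed shapes: N = 4 feature maps of spatial size i x i = 7 x 7, 3 classes.
N, i, num_classes = 4, 7, 3
conv_out = np.random.rand(N, i, i)       # output of the final convolutional layer
W = np.random.rand(num_classes, N)       # weights of the linear layer after GAP

pooled = conv_out.mean(axis=(1, 2))      # GAP: one spatial average per channel, shape (N,)
scores = W @ pooled                      # class scores: linear combinations of pooled maps
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over class scores
```

Because each class score is a linear combination of pooled averages, the same weights can later be applied directly to the (unpooled) feature maps to form the heat map.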

In order to obtain the class activation map, we need to compute the weighted sum defined by:

w1⋅f1 + w2⋅f2 + … + w2048⋅f2048

By up-sampling the class activation map to the size of the input image, we can identify the image regions most relevant to any particular category.
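Under the assumptions of the formula above (2048 feature maps, here taken to be 7×7 as in a ResNet-style backbone, with a 224×224 input), the weighted sum and up-sampling can be sketched as follows; the shapes and the nearest-neighbour up-sampling method are illustrative choices:

```python
import numpy as np

# Hypothetical setup: 2048 feature maps of size 7x7, input image of size 224x224.
feature_maps = np.random.rand(2048, 7, 7)
w = np.random.rand(2048)                     # weights w1..w2048 for the chosen class

# Weighted sum w1*f1 + ... + w2048*f2048 -> a 7x7 class activation map.
cam = np.tensordot(w, feature_maps, axes=1)  # shape (7, 7)

# Nearest-neighbour up-sampling to the 224x224 input size (factor 32).
cam_up = np.kron(cam, np.ones((32, 32)))
```

In practice a smoother interpolation (e.g. bilinear) is usually used for the final heat map, but any up-sampling maps the 7×7 activations back onto input-image regions.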

The drawback of CAM is that it requires changing the network structure and then retraining it. This also means that existing architectures lacking the final convolutional layer → global average pooling layer → linear dense layer structure cannot be used directly with this heat-map technique. The technique is also constrained to visualizing the later stages of image classification, i.e. the final convolutional layers.

References:

  1. https://jacobgil.github.io/deeplearning/class-activation-maps
  2. https://towardsdatascience.com/visual-interpretability-for-convolutional-neural-networks-2453856210ce
  3. https://towardsdatascience.com/multi-label-classification-and-class-activation-map-on-fashion-mnist-1454f09f5925
