How Could Saliency Maps Help Improve Model Performance

An Exploration of Taking Saliency Maps as Input to Improve Model Performance with Transfer Learning

Authors: Cynthia (Xinyue) Wang, Yiming Xu

This is the final project for the Harvard University AC295 Spring 2020 term. Thanks to Prof. Pavlos Protopapas and the Teaching Fellows — Michael Emanuel, Andrea Porelli, and Giulia Zerbini — for all the help and support throughout the semester.

Introduction

Saliency maps are a popular visualization tool for understanding how and why a deep neural network makes certain decisions. The majority of previous work on saliency methods has focused on providing interpretability for neural networks and/or giving researchers directions to optimize their deep learning models. However, we have been curious about whether saliency maps can make direct contributions to model performance. We would like to study whether saliency maps can benefit a model in terms of training speed, accuracy, and robustness. In this project, we explore the capability of saliency maps to improve model accuracy and training speed.

In addition, training a complicated model from scratch is time consuming and may not yield desirable results when only scarce data is available. Therefore, it is efficient and effective to use transfer learning, where a pre-trained model preserves the “knowledge” obtained from time-intensive learning on large datasets. In this project, we apply transfer learning to our model. We use the ResNet50 image classification model pre-trained on the ImageNet dataset to initialize model parameters, and extract saliency information from the pre-trained model as a form of transferred “knowledge”.

This project focuses on exploring how saliency maps could directly help to improve the performance of an image classification model. The main task of the model is to classify facial images of celebrities from the PubFig dataset. We integrate saliency information into the model and apply transfer learning to transfer “knowledge” from models pre-trained on a larger dataset in order to accelerate training. For reference, we name our model structure the Saliency Modulation Model.

Overview of Model Structure

Model Structure

In this project, we use a two-branch model structure, with one RGB branch and one saliency branch, to take both the original image and the saliency image as input. The two branches are joined together with a modulation method. Implementation details are presented in the Saliency Modulation Model section.

Related Work

The key question in utilizing saliency to directly improve model performance is how to take the saliency map as input to the neural network. One direction is to fuse the saliency map with the original image and take them jointly as input; we call this direction early fusion of saliency. Another direction is to fuse the saliency information at later layers, modulating a CNN layer of the original image with the corresponding CNN layer of the saliency image; we call this direction delayed fusion of saliency [1].

  • Early Fusion: Murabito et al. proposed the strategy of adding a 1-channel saliency image as a fourth channel to the corresponding original image [2]. The 4-channel images are then fed to the neural network as input. We therefore call this strategy early fusion of saliency.
  • Delayed Fusion: Flores et al. presented an alternative that sets up a two-branch model structure [1]. The model has two CNN branches in its first few layers: one branch takes the original images as input and the other takes the corresponding saliency maps as input. A modulation layer then combines the outputs of the two branches at the fork. The authors call this strategy delayed fusion of saliency.

In this project, we explore the idea of delayed fusion of saliency. We adopt the two-branch model structure proposed in previous work [1] and integrate our own customized design.

In delayed fusion, saliency maps are used as input. Flores et al. experimented with different saliency methods, including top-performing methods such as iSEEL and SALICON, as well as the White and Center baselines. In this project, we apply the SmoothGrad saliency method proposed by Smilkov et al. [3].

  • SmoothGrad [3]: The idea of the SmoothGrad technique is to take an image of interest, sample similar images by adding Gaussian noise to the original input, and then take the mean of the saliency maps computed for these noisy samples as the output saliency.

We regard SmoothGrad as a relatively robust saliency method and would like to study how SmoothGrad saliency maps can improve model performance.
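To make the idea concrete, here is a minimal sketch of SmoothGrad written directly with TensorFlow gradients (the function and parameter names are illustrative, not the implementation we used; the sample count and noise level follow the suggestions in [3] only loosely):

```python
import numpy as np
import tensorflow as tf

def smoothgrad(model, image, class_index, n_samples=25, noise_level=0.15):
    """Average gradient saliency over noisy copies of a single input image.

    `image` is a NumPy array of shape (H, W, C); `noise_level` scales the
    Gaussian noise std relative to the input's value range.
    """
    sigma = noise_level * (image.max() - image.min())
    grads = []
    for _ in range(n_samples):
        noisy = tf.convert_to_tensor(
            image + np.random.normal(0.0, sigma, image.shape), dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(noisy)
            # Score of the class of interest for this noisy sample.
            score = model(tf.expand_dims(noisy, 0))[0, class_index]
        grads.append(tape.gradient(score, noisy).numpy())
    # Mean over samples, absolute value, then collapse channels into one map.
    return np.abs(np.stack(grads)).mean(axis=0).max(axis=-1)
```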

Data

Since we apply transfer learning in our model structure, let us be clear about terminology: we call the model we transfer knowledge from the source model, and the dataset used to train it the source data; likewise, we call the model we transfer knowledge to the target model, and the dataset used to train it the target data.

(source: http://image-net.org)
  • Source data: We use the TensorFlow pre-trained ResNet50, which is trained on the ImageNet dataset. Therefore, our source data is ImageNet, with 1000 classes and roughly 1000 images per synset on average [4].
(source: https://www.cs.columbia.edu/CAVE/databases/pubfig/)
  • Target data: The PubFig dataset contains 58,797 images of 200 people collected from the internet [5]. We use it as the target dataset for the image classification task. The original dataset contains some noise, such as irrelevant people in the background or duplicated images. After scraping the images, we cleaned the data for our model with face detection, cropping, alignment, and data augmentation where needed. Take one of the images of Abhishek Bachchan as an example in Figure 1:
Figure 1

In addition, we also balance the dataset across different people. After balancing, we have 4197 images of 21 different people. The distribution of the number of images per person is shown in Figure 2:

Figure 2
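A minimal sketch of the balancing step, assuming the cleaned images are stored in one folder per person (the folder layout and the per-person cap are assumptions for illustration):

```python
import os
import random

def balance_dataset(root_dir, max_per_person=200, seed=0):
    """Cap the number of images kept per person to reduce class imbalance."""
    random.seed(seed)
    kept = {}
    for person in sorted(os.listdir(root_dir)):
        person_dir = os.path.join(root_dir, person)
        if not os.path.isdir(person_dir):
            continue
        images = sorted(os.listdir(person_dir))
        random.shuffle(images)
        kept[person] = images[:max_per_person]  # keep at most max_per_person images
    return kept  # mapping: person name -> list of retained image file names
```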

Pre-trained Model–ResNet50

We use the TensorFlow pre-trained ResNet50 [7]. As mentioned previously, it was trained on ImageNet with 1000 classes and roughly 1000 images per class on average. The structure of ResNet50 [6] is summarized below:

The ‘ID BLOCK’ refers to the identity block and the ‘CONV BLOCK’ refers to the convolutional block in the ResNet. In particular, for each block the skip connection skips over three layers.

The detailed structure of each block is summarized below [8]:

ID BLOCK
CONV BLOCK
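For illustration, a minimal Keras sketch of the identity block, where the skip connection skips over three convolutional layers (the 1x1/3x3/1x1 bottleneck; the filter sizes below are placeholders rather than the exact ResNet50 configuration):

```python
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    """Identity block: three conv layers with a skip connection around them.

    `filters` is a tuple (f1, f2, f3); f3 must equal the channel count of `x`
    so that the addition with the shortcut is valid.
    """
    f1, f2, f3 = filters
    shortcut = x  # the skip connection passes the input through unchanged

    x = layers.Conv2D(f1, 1)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    x = layers.Conv2D(f2, kernel_size, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    x = layers.Conv2D(f3, 1)(x)
    x = layers.BatchNormalization()(x)

    x = layers.Add()([x, shortcut])  # merge the skip connection back in
    return layers.Activation('relu')(x)
```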

Saliency Map as Input

Why do we need saliency maps as input?

The saliency branch in our model structure (structure details are discussed in the Saliency Modulation Model section) requires saliency maps as input. Therefore, we need to generate saliency map images for both the source data (ImageNet) and the target data (PubFig).

How did we generate the saliency maps?

In this project, we applied the SmoothGrad saliency method [3]. The SmoothGrad saliency function is implemented in recent versions of the tf-keras-vis library for TensorFlow.
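A sketch of how such maps can be generated with tf-keras-vis (the exact API differs between library versions; the sample count, noise level, and class index below are illustrative assumptions):

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tf_keras_vis.saliency import Saliency
from tf_keras_vis.utils.model_modifiers import ReplaceToLinear
from tf_keras_vis.utils.scores import CategoricalScore

# Source model: ResNet50 pre-trained on ImageNet.
model = ResNet50(weights='imagenet')

# Replace the final softmax with a linear activation before taking gradients.
saliency = Saliency(model, model_modifier=ReplaceToLinear(), clone=True)

# Placeholder input; in practice these are the preprocessed ImageNet/PubFig images.
images = preprocess_input(np.random.rand(1, 224, 224, 3).astype('float32') * 255.0)

# Class to take gradients with respect to (for PubFig, an ImageNet class whose
# features resemble human faces); index 0 here is purely illustrative.
score = CategoricalScore([0])

# smooth_samples / smooth_noise turn plain gradient saliency into SmoothGrad.
saliency_maps = saliency(score, images, smooth_samples=20, smooth_noise=0.20)
print(saliency_maps.shape)  # (1, 224, 224): one single-channel map per image
```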

To pre-train the saliency modulation model, we need to generate saliency maps of the source data (ImageNet) with respect to each corresponding class using the source model. Note that the pre-trained saliency modulation model, which is set up and trained in this project, is not the same as the source model, which is pre-trained by TensorFlow; please don’t be confused by the naming. Details are presented in the Saliency Modulation Model section.

To train the target model, we also need to generate saliency maps of the target data (PubFig) using the source model (pre-trained ResNet50). In this case, we need to calculate the saliency map of each PubFig image with respect to an ImageNet class that ideally would respond to human facial features. Unfortunately, there are no human-related image classes in the dataset used to pre-train ResNet50. However, the work of Yosinski et al. [9] shows that an image classification model can learn to recognize human faces even when there are no human-face classes, as long as there are other animal faces in the dataset. Hence, we can use an image class whose features are similar to human facial features. In our experiments, the results are surprisingly good.

Here is an example of the image and saliency map of Alyssa Milano:

There may be other ImageNet classes that are more suitable for transferring human facial features to the feature learning of PubFig images. This is left for future work.

Details of generating saliency maps and instructions for replicating the work can be found in GenerateSaliency.ipynb.

Baseline Transfer Model

Baseline Transfer Model

The baseline transfer model simply uses the TensorFlow pre-trained ResNet50 [7] to initialize the model weights. The structure of the ResNet50 model is presented in the previous section. After flattening, we add multiple fully connected layers with 1024 nodes and a dropout layer with rate 0.5 to prevent overfitting.
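As a sketch, the baseline transfer model can be assembled in Keras as follows (the number of 1024-node dense layers and the input size are assumptions for illustration; the 21-way output matches our balanced PubFig subset):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_CLASSES = 21  # balanced PubFig subset

# Convolutional base initialized with the ImageNet pre-trained weights.
base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

x = layers.Flatten()(base.output)
x = layers.Dense(1024, activation='relu')(x)   # fully connected layers with 1024 nodes
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.5)(x)                     # dropout to prevent overfitting
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)

baseline_model = models.Model(inputs=base.input, outputs=outputs)
```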

Saliency Modulation Model

Model Structure

For the saliency modulation model, we apply a two-branch structure: an RGB branch that takes the original image as input (with shape (H, W, C)) and a saliency branch that takes the pre-computed corresponding saliency map as input (with shape (H, W, 1)). The model outputs a predicted class label for each image and is trained against the ground-truth labels. We introduce the architecture step by step.

  • Input Images: As described in the previous section, two datasets are used: the source dataset ImageNet, which was used to train the TensorFlow pre-trained ResNet50, and the target dataset PubFig, which is used for training the target model. ImageNet images with their corresponding saliency maps are used for pre-training the Saliency Modulation Model. Likewise, PubFig images with their corresponding saliency maps are used to train the Saliency Modulation Model for the main image classification task.
  • RGB Branch: For the RGB branch, the structure is the same as in the baseline transfer model. It takes the 3-channel original image as input. The structure is shown below:
  • Saliency Branch: For the saliency branch, the structure is almost the same as the RGB branch, so that its spatial dimensions match those of the RGB branch during the fusion stage. However, several design choices should be noted: (1) the input of the saliency branch is the 1-channel saliency map; (2) unlike the RGB branch, which uses the ReLU non-linearity, a sigmoid activation is applied at the end of the saliency branch. This ensures the output of the saliency branch lies within [0, 1], which provides a suitable range for feature modulation [1]. The structure is shown below:
  • Modulation: After several experiments, we chose to combine the two branches by modulation (the x symbol, which stands for the element-wise product) after stage 1 of the ResNet but before the max-pooling layer. Fusing before max-pooling makes full use of saliency at a higher resolution. In addition, we borrow the idea of a skip connection, which prevents the model from completely ignoring the features from the RGB branch [10]. A parameter can additionally be assigned to control the importance of this skip connection; see the sketch after this list. The structure is shown below:
  • Weight Initialization: We tried two weight initialization methods:
    (1) Use the pre-trained ResNet50 weights for the RGB branch (without the fully connected layers) and use Xavier uniform initialization for the rest. Details are shown in Figure 4. We refer to this as ‘half pre-trained’ in the following sections.
    (2) Pre-train the whole two-branch structure on ImageNet, and then use those weights to initialize all weights of our network except the fully connected layers. Details are shown in Figure 5. We refer to this as ‘fully pre-trained’ in the following sections.
Figure 4
Figure 5
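A minimal sketch of the modulation step described above, assuming `rgb_features` and `saliency_gate` are the stage-1 outputs of the two branches with matching spatial and channel dimensions (the skip weight is the optional parameter mentioned above):

```python
from tensorflow.keras import layers

def modulate(rgb_features, saliency_gate, skip_weight=1.0):
    """Fuse the branches: element-wise product plus a weighted skip connection."""
    # The saliency branch ends in a sigmoid, so saliency_gate lies in [0, 1].
    modulated = layers.Multiply()([rgb_features, saliency_gate])
    # The skip connection keeps the raw RGB features from being ignored entirely.
    skip = layers.Lambda(lambda t: skip_weight * t)(rgb_features)
    fused = layers.Add()([modulated, skip])
    # Fusion happens before max-pooling so saliency is used at full resolution.
    return layers.MaxPooling2D(pool_size=3, strides=2, padding='same')(fused)
```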

Performance Results

For the PubFig dataset, we have 2218 images for training, 555 for validation, and 309 for testing. We mainly compare the performance of three models: the baseline transfer model, the saliency modulation model (half pre-trained), and the saliency modulation model (fully pre-trained). For all models, we use cross-validation and fine-tune for 50 epochs. We use SGD as the optimizer, with a learning rate of 0.0001 and momentum of 0.9. For all layers that do not have pre-trained weights, we use Xavier uniform initialization.
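Continuing the baseline sketch above, the training configuration looks roughly as follows (the loss choice and the data loaders are assumptions; `baseline_model` refers to the earlier sketch):

```python
import tensorflow as tf

# Newly added layers without pre-trained weights use Xavier (Glorot) uniform init,
# e.g. layers.Dense(1024, kernel_initializer=tf.keras.initializers.GlorotUniform()).

optimizer = tf.keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9)
baseline_model.compile(optimizer=optimizer,
                       loss='categorical_crossentropy',  # assumes one-hot labels
                       metrics=['accuracy'])

# Fine-tune for 50 epochs; train_ds / val_ds are assumed tf.data pipelines.
# history = baseline_model.fit(train_ds, validation_data=val_ds, epochs=50)
```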

First, let’s look at the performance of the baseline transfer model and of the saliency modulation model with pre-trained weights only for the RGB branch.

Performance of Baseline Transfer Model

From the plot and the test accuracy statistics above, we can see that the two models seem to converge to equally good solutions. However, we can compare the convergence speed of the two models by zooming in on their learning curves over the first 10 epochs, shown in the plot below:

Training Plot of Baseline Transfer Model

From the plot above we can see that the saliency modulation model converges faster than the baseline transfer model.

Let’s then compare the performance of the half pre-trained and the fully pre-trained models.

It is surprising to see that the test performance decreases dramatically and that overfitting becomes more severe, which contradicts both the findings in [1] and our intuition. After careful inspection, we find that the likely reason is that, within our limited time and resources, we did not obtain a decent pre-trained two-branch model. The pre-trained model itself already overfits severely, and therefore cannot serve as a good initialization; it may even hurt the performance of the target model.

To be more specific, the ImageNet data (original ImageNet images with their corresponding saliency maps) we were able to construct for pre-training has 8828 training, 2057 validation, and 1143 test images. By contrast, the half pre-trained model uses weights trained on the full ImageNet training set, which contains more than a million images. The best pre-trained model we obtained has 0.9983 accuracy on the training set but less than 0.60 accuracy on the test set, which already indicates strong overfitting. However, during our experiments we found that when we increase the amount of valid data for the pre-trained model, its performance improves, and the performance of our target model improves as well. Therefore, it is promising that with sufficient resources, the performance of the target model could be largely improved, and could even outperform the baseline transfer model in both speed and accuracy.

In conclusion, we found that when both models are initialized only with the ResNet50 weights trained on ImageNet, the saliency modulation model learns noticeably faster than the baseline, while the improvement in the converged result is subtle. In addition, based on the experiments we have done so far, we believe that if the saliency modulation model can be fully pre-trained properly, it will outperform the baseline transfer model not only in speed but also in accuracy.

Conclusion & Future Work

In this project, we studied how saliency maps could directly help to improve model performance. We explored modeling approaches that take saliency maps as model input and applied the delayed fusion technique to integrate saliency information into a two-branch model structure. In addition, we used transfer learning to transfer knowledge from models pre-trained on large datasets to training on scarce datasets.

As the experiments show, the Saliency Modulation Model trains faster. The intuition behind this is straightforward: saliency maps generated from the pre-trained model contain “knowledge” about separating objects from the background, and when we fuse this saliency information into the model, the model can quickly detect the most representative area of the object and thus learn useful features more efficiently. Due to time and resource constraints on pre-training the saliency modulation model, the target model accuracy is not as high as expected. However, we can see that as we enlarge the dataset for model pre-training, the target model’s test accuracy increases. This suggests that the current pre-trained saliency modulation model is overfitted. In other words, if we pre-train the saliency modulation model on a larger source dataset, the target model should achieve better results. Theoretically, we expect that the target saliency modulation model should reach better accuracy in fewer epochs than the baseline model.

In addition, the source data does not contain object classes related to humans. Therefore, the current saliency maps may not provide optimal saliency information for detecting human facial features. Although using a non-human image class generates reasonably good saliency maps, it would be even better to pre-train the source model with human facial image classes, which would, however, require substantial computational resources and time.

Reference

[1] Flores, Carola Figueroa, et al. “Saliency for Fine-Grained Object Recognition in Domains with Scarce Training Data.” Pattern Recognition, Pergamon, 4 May 2019, www.sciencedirect.com/science/article/pii/S0031320319301773.

[2] Murabito, Francesca, et al. “Top-down Saliency Detection Driven by Visual Classification.” Computer Vision and Image Understanding, Academic Press, 21 Mar. 2018, www.sciencedirect.com/science/article/pii/S1077314218300407.

[3] Smilkov, Daniel, et al. “SmoothGrad: Removing Noise by Adding Noise.” ArXiv.org, 12 June 2017, arxiv.org/abs/1706.03825.

[4] TensorFlow. “imagenet2012: TensorFlow Datasets.” TensorFlow, www.tensorflow.org/datasets/catalog/imagenet2012.

[5] “PubFig: Public Figures Face Database.” www.cs.columbia.edu/CAVE/databases/pubfig/.

[6] He, Kaiming, et al. “Deep Residual Learning for Image Recognition.” ArXiv.org, 10 Dec. 2015, arxiv.org/abs/1512.03385.

[7] “tf.keras.applications.ResNet50.” www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50.

[8] “Convolutional Neural Networks.” Coursera, www.coursera.org/learn/convolutional-neural-networks?specialization=deep-learning#syllabus.

[9] Yosinski, Jason, et al. “Understanding Neural Networks Through Deep Visualization.” ArXiv.org, 22 June 2015, arxiv.org/abs/1506.06579.

[10] Zagoruyko, Sergey, and Nikos Komodakis. “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer.” International Conference on Learning Representations, 2017.

[11] Levine, Alexander, et al. “Certifiably Robust Interpretation in Deep Learning.” ArXiv.org, 17 Oct. 2019, arxiv.org/abs/1905.12105.

For more details about the course, please refer to the AC295 web page. Thank you!
