Paper Club: Visualizing and Understanding Convolutional Networks

James Vanneman
Jul 17, 2017


For this week's paper, we chose to read a paper from 2013 that introduced some interesting techniques for understanding what's happening inside a convolutional net.

Big question:

How can we better understand what’s happening inside neural networks?

Background summary:

Visualizing convolutional networks beyond the first layer is challenging because, after the first layer's convolutions, we can't remap the feature maps back to pixel space (i.e. see what features the network is learning). Other approaches to visualization don't completely capture which parts of the input image stimulate various feature maps.

How can you even remap first layer convolutions back to pixel space? This seems like it would have to be before a pooling operation.

Specific Questions:

  1. Can we identify what patterns in the input image will activate feature maps at different layers?
  2. How long do the feature maps take to converge? (i.e. how much training do we need before the network learns how to identify a part of an image?)
  3. Can we use visualization to design better convolutional networks?
  4. When doing image recognition, we want the net to recognize the object, as opposed to predicting the object based on its surroundings. Can we use these techniques to determine if the net is identifying the intended object?
  5. CNNs aren’t explicitly told about the relationships of an object’s features, for example how a mouth is spaced relative to the nose and eyes. Are CNNs implicitly computing these relationships during training?

Methods:

Experiment 1: Visualization

  • Choose a random set of feature maps from different layers, intercept their activations, and apply the convolutional steps in reverse: unpooling -> ReLU -> convolution (with the transpose of the corresponding convolutional layer's weights). A code sketch of this reversed pathway follows this list.
  • Monitor output of maps over training epochs
  • Apply transformations to the input image and see what happens to the feature maps.
  • Use the visualization to try and architect a better network.
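A minimal sketch of the reversed pathway, assuming PyTorch; the layer sizes and names are illustrative, not the paper's exact architecture. The forward pass records the max-pooling switches, and the reverse pass applies unpooling -> ReLU -> a transposed convolution that reuses the same filter weights.

```python
import torch
import torch.nn.functional as F

# Forward pieces for a single conv block (illustrative sizes, not the paper's).
conv = torch.nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)
pool = torch.nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # keep the "switches"

image = torch.randn(1, 3, 224, 224)          # stand-in for an input image
features = F.relu(conv(image))               # conv -> ReLU
pooled, switches = pool(features)            # record where each max came from

# Reverse pass: unpool using the recorded switches, rectify, then apply the
# transpose of the same convolutional filters to map back toward pixel space.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=3, stride=2,
                          output_size=features.size())
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, conv.weight, stride=2, padding=1)
print(reconstruction.shape)                  # approximately back to (1, 3, H, W)
```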

Experiment 2:

  • Block out part of the object in the input image; if prediction accuracy doesn't suffer, we can infer that the network isn't learning specific features of the object.

If the net still predicts "dog" with the same accuracy when the dog is occluded, there's a problem.
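A minimal sketch of this occlusion test, assuming a trained PyTorch classifier `model`, an input tensor `image` of shape (3, H, W), and the index of the true class; the patch size, stride, and grey value are arbitrary choices.

```python
import torch

def occlusion_map(model, image, true_class, patch=50, stride=25, grey=0.5):
    """Slide a grey square over the image and record P(true class) at each position."""
    model.eval()
    _, height, width = image.shape
    scores = []
    for top in range(0, height - patch + 1, stride):
        row = []
        for left in range(0, width - patch + 1, stride):
            occluded = image.clone()
            occluded[:, top:top + patch, left:left + patch] = grey  # grey square
            with torch.no_grad():
                probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
            row.append(probs[0, true_class].item())
        scores.append(row)
    return torch.tensor(scores)  # low values mark regions the network relies on
```

A sharp drop in the true-class probability when the occluder sits on the object itself is the behaviour we want to see.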

Experiment 3:

  • A slightly more detailed version of experiment 2: use a grey square to occlude the same region in two different images, and measure the Hamming distance between the resulting changes in their feature maps.

Look at how similar the change in feature maps is.

Experiment 4:

  • Use the visualization to design a new architecture and test it against current state-of-the-art ImageNet results
  • Test for transfer learning
  • Test the model when removing various layers
  • Test the model with varying layer sizes

Results:

Experiment 1:

By tracking the feature projections over time, we can see that the lower layers converge early; that is, they quickly learn to identify a particular aspect of the image. The higher layers of the network learn more sophisticated structures but take much longer to converge.

Layer 1: feature projections at epochs [1, 2, 5, 10, 20, 30, 40, 64].

The early layers represent the most basic parts of an image: edges, corners, etc.

Layer 5: feature projections at epochs [1, 2, 5, 10, 20, 30, 40, 64].

You can see much more detail being recognized in layer 5, but it takes 40+ epochs before these features emerge.
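A minimal sketch of how the per-epoch tracking could be set up, assuming a PyTorch `model` with a layer reachable as `model.layer5` and a fixed `probe_image`; both names are hypothetical, not the paper's code. A forward hook captures the layer's activations so they can be projected and compared across epochs.

```python
import torch

captured = {}

def save_activation(module, inputs, output):
    # Forward hook: stash the layer's feature maps for later projection/visualization.
    captured["layer5"] = output.detach()

def snapshot(model, probe_image, epoch):
    handle = model.layer5.register_forward_hook(save_activation)
    with torch.no_grad():
        model(probe_image.unsqueeze(0))
    handle.remove()
    torch.save(captured["layer5"], f"layer5_epoch_{epoch}.pt")

# e.g. call snapshot(model, probe_image, epoch) at epochs [1, 2, 5, 10, 20, 30, 40, 64]
```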

How much of an effect does this convergence have on the accuracy and loss of the network? The authors do not give numbers on this.

How do these visualizations change as the depth of the model changes? Does layer 5 always converge in the same way?

How would using global average pooling instead of max pooling affect these projections?

Is it possible to use these visualizations to iteratively build better models?

Experiment 2:

The image on the right shows the network's prediction when the grey square covers a given region. The light blue area indicates that when the grey square covers the dog's face, the network predicts "Tennis Ball". This result indicates that the network is indeed learning to identify the object itself, as opposed to predicting it from the surrounding context of the image.

Occluded portion of the image on the left and prediction probability as a function of the occluded location

Experiment 3:

Hamming Distances

This experiment measured the Hamming distance (defined in the vocabulary list below) between how different images' feature maps change as the same region is occluded by a grey square. A low value indicates that the feature maps of different images change in the same way. If we occlude an important part of the image (nose, eyes) and the feature maps of different images change in the same way, this shows that the network is learning about that specific feature (e.g. the eyes). By looking at this measure across important features, we can tell that the network is learning about each eye as well as the nose.
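A minimal sketch of that comparison, assuming a `features(image)` function that returns a flattened feature-map vector for the chosen layer, plus lists of `images` and matching `occluded` versions with the same part (e.g. the left eye) masked in each; all of these names are assumptions. The change in features is summarized by its sign, and pairs of images are compared with the Hamming distance of those sign vectors.

```python
import itertools
import torch

def correspondence_score(features, images, occluded):
    # Sign of the change each occlusion causes in the layer's feature vector.
    changes = [torch.sign(features(img) - features(occ))
               for img, occ in zip(images, occluded)]
    total = 0.0
    for a, b in itertools.combinations(changes, 2):
        total += (a != b).float().mean().item()  # normalized Hamming distance
    return total  # lower = the occlusion changes features consistently across images
```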

Experiment 4:

The authors use the visualization to identify problems in the current state-of-the-art model. The picture below shows feature maps from the second-layer convolutions of the current state of the art (Figure B, top) and the authors' model (Figure C, bottom). The visualizations reveal certain "dead" feature maps.

For layer five, the authors identified feature maps that show errors introduced into the network.

To fix these two issues, the authors reduced the filter size from 11x11 to 7x7 and halved the stride from 4 to 2. The results are shown in the second set of images, where the "dead" feature maps and aliasing artifacts are no longer present.
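For illustration, the change amounts to something like the following PyTorch layer definitions (the channel count here is an assumption, not taken from the paper):

```python
import torch

# Baseline first layer: large 11x11 filters applied with stride 4.
baseline_layer1 = torch.nn.Conv2d(3, 96, kernel_size=11, stride=4)

# The authors' modification: smaller 7x7 filters with stride 2, which they report
# removes the "dead" feature maps and the aliasing artifacts.
modified_layer1 = torch.nn.Conv2d(3, 96, kernel_size=7, stride=2)
```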

The authors claim that the dead feature maps are a result of the convolution size (11x11) and that the aliasing artifacts come from the stride. They do not clearly describe how they reached these conclusions.

Testing against the ImageNet benchmark, they are able to achieve state-of-the-art performance with their improved model.

Tweaking the model by removing either the convolutional layers or the fully connected layers resulted in only a mild degradation in performance, but removing both severely reduced performance.

They conclude that this result indicates the importance of depth in neural networks. I don't think this is a valid conclusion to draw: it's possible that this model simply lost too many free parameters and could no longer fit the data well. A better experiment for this conclusion would be to test two models with the same number of free parameters, one wide and one deep.

To test for transfer learning, they ran the model against several different datasets. The model does extremely well, beating previous state-of-the-art results on some and coming very close on others. This indicates that the model generalizes very well and can handle new, unfamiliar data.
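A minimal sketch of this kind of transfer-learning setup, assuming a `features(image)` function that returns the fixed, pretrained convolutional features as a flattened NumPy vector and using scikit-learn's LinearSVC as the classifier on top; the function and dataset variables are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_transfer_classifier(features, train_images, train_labels):
    # Keep the pretrained convolutional features fixed; only the linear
    # classifier on top is trained on the new dataset.
    X = np.stack([features(img) for img in train_images])
    clf = LinearSVC(C=1.0)
    clf.fit(X, train_labels)
    return clf
```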

Abstract:

The abstract for this paper is succinct and well written. It states the main goal of the paper, introducing a visualization technique, and how the authors use it to improve current networks.

What others say:

To date, this paper has been cited 2,096 times, indicating that it is widely read and built upon in later work.

Viability as a Project:

The authors were quite detailed in their paper. Reproducing it would be possible, although they omitted some information, such as specific learning rates and regularization parameters, that would make reproducing their results easier.

Word I don’t know:

  • Hessian — the matrix of second-order partial derivatives of a function; it describes the function's curvature in each direction.
  • Switch-variable — used to keep track of the indices where the maxima occurred during max pooling, so that unpooling can place values back in their original locations; I couldn't find a good definition for this elsewhere.
  • Euclidean distance (of vectors) — how far apart two points are. On a 1-dimensional number line, 1 and 3 are 2 units apart. Generalizing to N dimensions: d(p, q) = sqrt((p1 − q1)^2 + (p2 − q2)^2 + … + (pN − qN)^2).
  • Aliasing Artifact — Error introduced by the technique which causes different signals to be indistinguishable
  • Aliasing — an effect that causes different signals to be indistinguishable
  • Artifact — Error introduced into a signal by the equipment or techniques used to process it
  • Hamming Distance — a measure of the difference between two sequences: the number of positions at which the items in Vector 1 need to change to match Vector 2 (e.g. the Hamming distance between "karolin" and "kathrin" is 3). Small worked examples of the Euclidean and Hamming distances follow this list.
  • Linear SVM — a support vector machine with a linear kernel, i.e. a classifier that learns a linear decision boundary separating classes with maximum margin.
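Small worked examples for the two distance measures above, in plain Python:

```python
def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate differences.
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

def hamming_distance(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance([1], [3]))            # 2.0, the 1-D example above
print(hamming_distance("karolin", "kathrin"))  # 3
```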
