Deep Classifiers Ignore Almost Everything They See (and how we may be able to fix it)

Jörn Jacobsen, Jens Behrmann, Rich Zemel, and Matthias Bethge — 25.3.2019

Excessive Invariance: All images shown cause a competitive ImageNet-trained network to output the exact same probabilities over all 1000 classes (logits shown above each image).

Understanding what deep networks discard with depth has been a topic of heated debate in the recent past. A way to quantify this loss of information is measuring mutual information throughout the layers. Unfortunately, this quantity is intractable in high dimensions and thus it is yet to be shown what kind of information deep nets compress and how this is related to their success.

In this post, I discuss an analytic approach to investigating classifier invariance, and hence loss of information, that led us to the following surprising insights:

  • Deep classifiers are invariant not only to class-irrelevant variations, but also to almost everything humans consider relevant for a class; we term this property excessive invariance (see the figure above for an example)
  • Excessive invariance gives an alternative explanation for the adversarial example phenomenon
  • We identify the commonly-used cross-entropy objective as a major reason for the striking invariance we observed
  • There may be a way to control and overcome this problem …

The content of this post is based on our recent paper [1], which is going to be presented at ICLR 2019.

Exploring Invariances of Learned Classifiers

Investigating what a classifier does not look at, that is, what it is invariant to, requires access to everything the classifier throws away throughout its layers. This is hard to do in general and has been the subject of extensive study (e.g. [2]). Fortunately, recent advances in invertible deep nets have led to networks that do not build up any invariance until the final layer [3,4]. As everything but the final layer is a lossless 1-to-1 mapping, the projection from the invertible representation to the class scores is the only place where invariance is created. What remains is to simplify this final layer, so that we can manipulate and investigate the pre-image of particular class scores.

We split the output of the invertible network into two subspaces: Zs represents the class scores and Zn everything not seen by the classifier.

To achieve this, we remove the final classifier from the invertible network and split its output into two subspaces: Zs and Zn (see figure).

  • The semantic subspace Zs: the logits, also often called class scores.
  • The nuisance subspace Zn: the remaining dimensions the classifier does not see.

Because the network is invertible, the whole of Z has the same dimensionality as the input. Zs has as many dimensions as there are classes (1000 for ImageNet, 10 for MNIST), and Zn has the remaining dim(Zn) = dim(Z) - dim(Zs) dimensions.
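To make the bookkeeping concrete, here is a minimal sketch of the split. The helper names `split_output` and `nuisance_dim` are hypothetical, and the input shapes (3×224×224 for ImageNet, 1×28×28 for MNIST) are common choices assumed here for illustration, not taken from the post.

```python
def split_output(z, num_classes):
    """Split a flat invertible-network output z into (z_s, z_n).

    z_s: the first num_classes entries, read as logits.
    z_n: all remaining entries, which the classifier never sees.
    """
    if not 0 < num_classes <= len(z):
        raise ValueError("num_classes must be in 1..len(z)")
    return z[:num_classes], z[num_classes:]


def nuisance_dim(input_shape, num_classes):
    """dim(Zn) = dim(Z) - dim(Zs), where dim(Z) equals the input dimensionality."""
    total = 1
    for size in input_shape:
        total *= size
    return total - num_classes


# Assumed input shapes, for illustration only.
print(nuisance_dim((3, 224, 224), 1000))  # 149528
print(nuisance_dim((1, 28, 28), 10))      # 774
```

Under these assumed shapes, the logits make up well under 1% of the ImageNet representation; everything else lands in Zn.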

Analytically Analyzing Logit Pre-images: Compute the hidden representation for one image (left), throw away Zn, but keep the logits Zs. Compute the hidden representation for an arbitrary image from another class (right), throw away Zs, but keep Zn. Concatenate the resulting Zs and Zn, invert the network and look at the result!

This split allows us to compute a logit vector Zs from one image and concatenate arbitrary Zn vectors from other images to it. We can then compute the pre-image of these activations and investigate the inputs that would have produced them (see the figure above for an illustration). The image we get from this procedure (question mark above) will produce exactly the same probabilities over all classes, no matter which Zn we concatenated to the given Zs. This gives us a tool to investigate the decision space of learned classifiers.
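The procedure can be sketched on a toy scale. The network below is a stand-in, not the architecture from [3,4]: a small stack of additive coupling layers with random weights, chosen only because it is trivially invertible. The names `AdditiveCoupling` and `swap_nuisance`, the dimensions and the weights are all assumptions made for illustration.

```python
import math
import random


class AdditiveCoupling:
    """Additive coupling layer: one half is shifted by a function of the other.

    The shift is undone by subtraction, so the layer is invertible
    regardless of what the shift function is.
    """

    def __init__(self, dim, rng, swap):
        assert dim % 2 == 0, "toy layer expects an even dimension"
        self.half = dim // 2
        self.swap = swap  # alternate which half gets shifted
        self.weights = [[rng.uniform(-1, 1) for _ in range(self.half)]
                        for _ in range(self.half)]

    def _shift(self, a):
        return [math.tanh(sum(w * v for w, v in zip(row, a)))
                for row in self.weights]

    def _apply(self, x, sign):
        a, b = x[:self.half], x[self.half:]
        if self.swap:
            a, b = b, a
        b = [bi + sign * ti for bi, ti in zip(b, self._shift(a))]
        if self.swap:
            a, b = b, a
        return a + b

    def forward(self, x):
        return self._apply(x, +1.0)

    def inverse(self, z):
        return self._apply(z, -1.0)


def forward(layers, x):
    for layer in layers:
        x = layer.forward(x)
    return x


def inverse(layers, z):
    for layer in reversed(layers):
        z = layer.inverse(z)
    return z


def swap_nuisance(layers, x_logits, x_nuisance, num_classes):
    """Keep Zs from x_logits, take Zn from x_nuisance, and invert."""
    z_s = forward(layers, x_logits)[:num_classes]
    z_n = forward(layers, x_nuisance)[num_classes:]
    return inverse(layers, z_s + z_n)


NUM_CLASSES, DIM = 2, 6  # toy sizes, assumed for illustration
rng = random.Random(0)
layers = [AdditiveCoupling(DIM, rng, swap=(i % 2 == 1)) for i in range(4)]
x_a = [rng.uniform(0, 1) for _ in range(DIM)]
x_b = [rng.uniform(0, 1) for _ in range(DIM)]

x_adv = swap_nuisance(layers, x_a, x_b, NUM_CLASSES)
logits_a = forward(layers, x_a)[:NUM_CLASSES]
logits_adv = forward(layers, x_adv)[:NUM_CLASSES]
# Logit difference between x_a and x_adv: zero up to floating-point rounding.
logit_gap = max(abs(p - q) for p, q in zip(logits_a, logits_adv))
print(logit_gap)
```

In this sketch `x_adv` is a different input from `x_a`, yet the two share their logits, because the only part of Z that changed is the part the classifier never reads.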

Top row: images from which logit vectors Zs are taken. Bottom row: images from which nuisance vectors Zn are taken. Middle row: resulting inverted images with identical logit configurations as top row images. We have analytically computed adversarial examples!

The figure above shows that, despite our hopes to learn about the decision space of the classifier, Zn dominates the image completely. The classifier, represented by the information encoded into the logits, seems to be almost completely invariant to any change of the input. We have stumbled upon an analytic adversarial attack. We can swap class-content arbitrarily without changing the predicted probabilities over 1000 ImageNet classes.

How is this Related to Adversarial Examples?

It is well known from adversarial example research that tiny perturbations of an input can completely change the output of a deep network. This shows how deep neural networks, despite their impressive performance, exhibit striking failures on slightly modified inputs. As such, adversarial examples are a powerful tool for analyzing how learned models generalize under distribution shifts.

Identifying the root causes of these unintuitive failures, and mitigating them, is necessary to train models that generalize well in real-world scenarios. For models to be robust to such distribution shifts, they need to develop a holistic understanding of the tasks they are solving, instead of identifying the easiest way to maximize accuracy under the training distribution.

So far, most adversarial example research has focused on small perturbations, as other types of adversarial examples are hard to formalize. However, sensitivity to bounded perturbations only reveals a very specific failure mode of deep networks and needs to be complemented with other viewpoints to understand the whole picture.

The classical viewpoint (short orange arrow): perturbation-based adversarial examples x* apply changes to an input x such that x* stays in the same ground truth class as x, while crossing the decision-boundary (dashed line) of the model. Our alternative viewpoint (long pink arrow): invariance-based adversarial examples x* apply changes to an input x that change the ground truth class of x*, without crossing the learned decision-boundary.

Our results above suggest we should also consider invariance in the context of adversarial examples. Norm-bounded adversarial examples investigate directions in which deep networks are too sensitive to task-irrelevant changes of their inputs. Our approach instead focuses on directions in which deep networks are too invariant to task-relevant changes of their inputs (see the figure above for a conceptual illustration). In other words, we investigated whether we can change the task-specific content of an input without changing the hidden activations and decision of the classifier. It turns out we can do this for arbitrary images.
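One way to write the two viewpoints down side by side is sketched below. The notation is assumed here for illustration: D is the classifier's decision, F its logits, o an oracle that returns the ground-truth class (e.g. a human observer), and ε a perturbation budget.

```latex
% Perturbation-based: small change, same true class, different decision.
\|x^* - x\| < \epsilon, \qquad o(x^*) = o(x), \qquad D(x^*) \neq D(x)

% Invariance-based: different true class, same logits and hence same decision.
o(x^*) \neq o(x), \qquad F(x^*) = F(x), \qquad D(x^*) = D(x)
```

Note that the second definition places no norm bound on the change from x to x*; what matters is only that the classifier cannot tell the two inputs apart.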

Why are Deep Classifiers so Invariant?

To understand why deep classifiers exhibit the excessive invariance we have observed above, we need to investigate the loss function used to train them.

When training a classifier, we typically use the vanilla cross-entropy objective. Minimizing the cross-entropy between the softmax of the logits and the labels maximizes a lower bound on the mutual information between labels and logits. If there are multiple similarly predictive explanations for a given label in the classification problem, this objective encourages the network to pick up on only one of them. As soon as one highly predictive feature is used to make the prediction, the objective is minimized and there is no reward for explaining anything more about the task.
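The link between cross-entropy and mutual information can be made explicit with a standard variational bound. In the sketch below, q_θ(y | z_s) denotes the softmax of the logits and CE(θ) the expected cross-entropy; this notation is assumed here for illustration.

```latex
\begin{aligned}
I(y; z_s) &= H(y) - H(y \mid z_s) \\
          &= H(y) + \mathbb{E}_{p(y, z_s)}\left[\log p(y \mid z_s)\right] \\
          &\ge H(y) + \mathbb{E}_{p(y, z_s)}\left[\log q_\theta(y \mid z_s)\right] \\
          &= H(y) - \mathrm{CE}(\theta)
\end{aligned}
```

The inequality holds because the gap between the second and third lines is an expected KL divergence, which is non-negative. Since H(y) is fixed by the data, lowering CE(θ) raises the bound on I(y; z_s). Nothing in this expression involves Zn, so the objective places no constraint on how much label information ends up there.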

Left: Cross-entropy trained networks are easily attacked with our analytic invariance-based attack. Right: Independence cross-entropy trained model. Our attack is no longer successful: it is only able to change the style of the digit, not its semantic content.

Solving this problem requires changing the objective function we use to train our classifiers. In our paper [1], we introduce an alternative to cross-entropy termed independence cross-entropy. This objective function gives explicit control over invariance in the learned representation. We show theoretically and empirically that it reduces, and in some cases solves, the problems of invariance described above. A classifier trained with independence cross-entropy can no longer be attacked by our invariance-based analytic attack (see figure above).

As further evidence that deep classifiers are too invariant, we created a dataset called shiftMNIST. At train time, we introduce a new predictive feature into MNIST digits: in one case (a), a binary code (highlighted with red circles) that is predictive of the digit label, and in the other case (b), a background texture that is perfectly predictive of the digit label. At test time, we remove or randomize the newly introduced features. In both cases, state-of-the-art classifiers drop to almost random performance at test time. They become invariant to the digit itself and learn to look only at the “easy” feature. Here again, our newly introduced independence cross-entropy allows us to control the invariance and reduces the error by 30–40%.
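The binary-code variant can be mimicked with a toy sketch. This is not the actual shiftMNIST construction: the code length, its placement in the first pixels, the placeholder images and the helper names are all assumptions made for illustration. The `shortcut_classifier` stands in for a model that has learned to read only the easy feature.

```python
import random

NUM_BITS = 4  # enough bits to encode the labels 0..9 (assumed code length)


def add_binary_code(image, label):
    """Train time: write the label as a binary code into the first pixels."""
    coded = list(image)
    for bit in range(NUM_BITS):
        coded[bit] = float((label >> bit) & 1)
    return coded


def randomize_code(image, rng):
    """Test time: overwrite the code pixels with random bits."""
    coded = list(image)
    for bit in range(NUM_BITS):
        coded[bit] = float(rng.randint(0, 1))
    return coded


def shortcut_classifier(image):
    """Predict the label from the code pixels alone, ignoring the digit."""
    return sum(int(image[bit]) << bit for bit in range(NUM_BITS))


rng = random.Random(0)
labels = [rng.randrange(10) for _ in range(1000)]
digits = [[0.5] * 784 for _ in labels]  # placeholder pixels, not real MNIST
train = [add_binary_code(img, y) for img, y in zip(digits, labels)]
test = [randomize_code(img, rng) for img in train]


def accuracy(images):
    hits = sum(shortcut_classifier(img) == y for img, y in zip(images, labels))
    return hits / len(labels)


train_acc = accuracy(train)  # the code is perfectly predictive at train time
test_acc = accuracy(test)    # near chance once the code is randomized
print(train_acc, test_acc)
```

A model that relies on the code alone is perfect on the training data and close to chance once the code is randomized, which mirrors the failure described above.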

If you found this interesting and are curious to understand more, please read our paper!

Main Reference:

Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel, Matthias Bethge, “Excessive Invariance Causes Adversarial Vulnerability”; ICLR, 2019.

[1] Jörn-Henrik Jacobsen, Jens Behrmann, Richard Zemel, Matthias Bethge, “Excessive Invariance Causes Adversarial Vulnerability”; ICLR, 2019.
[2] Mahendran & Vedaldi, “Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images”; IJCV, 2016.
[3] Jörn-Henrik Jacobsen, Arnold W. M. Smeulders, Edouard Oyallon, “i-RevNet: Deep Invertible Networks”; ICLR, 2018.
[4] Jens Behrmann*, Will Grathwohl*, Ricky T.Q. Chen, David Duvenaud, Jörn-Henrik Jacobsen*, “Invertible Residual Networks”; Under submission, 2019.