Paper Summary: Explaining and Harnessing Adversarial Examples

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/22, with better formatting.

Explaining and Harnessing Adversarial Examples (2015) Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy

By now everyone’s seen the “panda” + “nematode” = “gibbon” photo (below). It’s an unfortunate feature of modern image classifiers that a small, well-crafted perturbation to an input image can cause an arbitrarily targeted misclassification. Direct access to the model isn’t even necessary — inputs that are adversarial on one network are likely to be adversarial on another, so an attacker can craft malicious inputs offline. Nor is the problem limited to image-shaped input: spam filters and virus detectors are classifiers too and are — at least in principle — open to similar attacks.

In the last few years the problem of adversarial examples has transitioned from curiosity to persistent thorn in the side of ML researchers to, well, something more complicated. I’m particularly interested in how adversarial examples illuminate the black box that is neural networks, and more importantly point to fundamental questions about what it means to build robust systems that perform reasonably and safely in the real world. This will be the first of several summaries of papers on adversarial examples, starting off where it all began (sort of), with Goodfellow 2015.

Adversarial examples were actually described long before Goodfellow 2015, and indeed there was another paper that got some attention the year before (“Intriguing properties of neural networks” Szegedy 2014). Previously there was speculation that deep networks were particularly susceptible to adversarial attacks, and that this was due to their non-linear characteristics. In this paper the authors argue instead that it’s the linear nature of modern deep networks, the very thing that makes them easy to train, that makes them susceptible. The authors also present a fast way to generate adversarial examples, introduce an adversarial training method, and show that this is an effective regularizer.

The paper starts with an argument that we should expect linear models in high dimensions to have adversarial examples close to training data. For images with 8-bit color channels, e.g., we expect changes in pixel values of 1/255 not to affect how an image is classified. After all we’re throwing away information that’s finer grained than this and this is somehow ok. Speaking more precisely, we’d like a perturbed input not to affect classification if the perturbation has infinity-norm ≤ 1/255. Yet this intuition breaks down in high-dimensional spaces. In a linear model, the input (and therefore the perturbation) will be multiplied by some n-dimensional weight vector w (with average element magnitude m). The magnitude of the perturbation’s dot product can be as large as mn/255 in the worse case of choosing the perturbation to be sign(w). This has a good chance of crossing a decision boundary.

Another way to look at this: the low order bits are usually unimportant because they are essentially random and their contributions will tend to cancel. At least they’re random for non-adversarial input. But if you align them all in directions that have the most effect on the output, the effect will add up.

Much early concern over adversarial examples was about deep networks, but shallow linear models are susceptible. And radial basis function (RBF) networks, which are highly non-linear, are highly resistant to adversarial examples. We choose linear or linear-ish components for modern networks like ReLU and LSTM precisely because they make the network easier to train. Even sigmoids, though non-linear, are carefully kept towards the roughly linear central part of the curve for the same reason. This all suggests a method the authors call the “fast gradient sign method” for finding adversarial examples: evaluate the gradient of the loss function wrt the inputs, and perturb the inputs by epsilon * sign(gradient). See below for this method applied to a logistic regression model trained on MNIST 3s and 7s:

Left to right: the model weights, the maximally damaging perturbation, and the perturbation applied to some examples with epsilon = 0.25. The error rate with this perturbation was 99%. (Note that the perturbation was in inverted on true 3s to make them classify as 7s.)

Interestingly the fast sign gradient method can be pulled directly into the loss function as a way to do adversarial training (see §5 and §6 in the paper for details). For logistic regression this looks a bit like L1 regularization, but doesn’t hurt the final accuracy in the same way. For deeper nets it makes sense to train on generated adversarial inputs (following Szegedy 2014) in addition to including the adversarial loss in the objective. This can make training more difficult: for their MNIST maxout network they also had to increase model capacity and adjust the early stopping criterion, but overall adversarial training improved final model accuracy! It also increased robustness: the error rate on adversarial examples went from 89.4% to 17.9% — which is much better, but still far from perfect.

The authors turn at this point to why adversarial examples generalize. The claim is that for linear models, adversarial examples lie in linear spaces — the direction of a perturbation is the most important thing, not the magnitude. And further that modern neural classifiers generally resemble some reference linear model, especially outside the thin manifold of training data (the analogy is to Potemkin villages with elaborate facades and nothing behind). They test this by comparing misclassifications on different architectures based on roughly linear models (shallow softmax and maxout) and also on the highly non-linear RBF. The shallow softmax agreed with maxout on 84.6% of its misclassifications (I’m assuming this is still MNIST, but regardless this is quite substantial), whereas the RBF only agreed on 54.3% (which is still a surprising amount of agreement). This felt a bit handwavy to me, but I also didn’t follow all of the discussion.

They also take aim at what turn out to be flawed hypotheses about adversarial examples: that generative training should be confident only on “real” data (they provide a counterexample) and that individual models have uncorrelated quirks that should be fixable with ensembles (ensembles do help, but not much).

And finally they contextualize adversarial examples as a subset of a much larger space they call “rubbish class examples”, random-looking inputs that the classifier happily and confidently classifies as a specific class. We’d like an ideal classifier to give low class probability to these inputs (max-entropy uniform probability if using softmax, all low probability if using separate classifiers per class). But linear (or linear-like) models like to extrapolate and don’t know how to calibrate their confidence level the way we’d like.

This has been a general overview of the problem of adversarial examples. Over the next few papers we’ll cover some of the cat and mouse game of finding defenses and getting around them, as well as some more theoretical models that yield some surprising results.

Szegedy et al 2014 “Intriguing properties of neural networks”