A Simple But Pretty Good Understanding of Adversarial Examples

Nick Lord · Published in Five Blog · Jun 23, 2022 · 17 min read

This post is about adversarial examples on deep networks, and what they represent. Whether you’ve never heard of them before, or whether you’ve spent years studying them (and already know that they represent “features”), you’ll hopefully come away from this with some new insights.

Above all, I’d like you to see them as being fundamentally simple. But also, to see how important they are in understanding what a network has actually learned about its inputs.

The Problem

Let’s quickly review what this problem is. An adversarial example on a network N which outputs a decision (e.g. a classification) for an input I is a distortion D such that N(I) != N(D(I)). That is, distorting the input I with D makes N change its output. The reason that interest in this problem took off in computer vision¹ was this demonstration that, if N is a deep image classification network, then such a D can seemingly be found around any given real image I, with D very small. And that idea that D is in some sense “small” is the one additional detail that completes our definition of an adversarial example.

D can take many forms², but the original and still most common one uses D(I) = I + ΔI, i.e. that the image is additively perturbed by a ΔI with a small norm, typically either the 2-norm or infinity-norm. In either case, it’s pretty easy to understand what we mean by “very small”. An infinity-norm bound is just a cap on the amount that any single pixel is allowed to change. We can construct adversarial ImageNet examples for which this cap is 1 or 2 out of a possible 256 intensity levels, making them literally imperceptible to humans, as in the following famous example:

GoogLeNet classifies the left image as “panda”, while classifying the one on the right as “gibbon”. The factor of .007 represents a change on the order of the least significant bit in the image representation. See Fig. 1 in original work.

The 2-norm, which measures the image perturbation as a vector length, can likewise be used to produce distortions that are clearly “small”, as in another famous example:

In each block of images, the first column is the original, the second is a visualisation of the adversarial perturbation, and the third is the perturbed image (or “adversarial example”). All three images in the third column of each block are classified as ostriches. See Fig. 5 in original work.
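To make “small” concrete, here’s a minimal Python sketch of how those two norms are measured. The arrays are random stand-ins for real images, and the 2/255 cap is just an illustrative choice matching the “1 or 2 out of 256 intensity levels” mentioned above:

```python
import numpy as np

# Stand-ins for a natural image and a perturbation, as float arrays in [0, 1].
I = np.random.rand(224, 224, 3)
delta = np.random.uniform(-2/255, 2/255, I.shape)  # capped at ~2 intensity levels per pixel
I_adv = np.clip(I + delta, 0.0, 1.0)

# The two notions of "small" used above:
linf = np.abs(I_adv - I).max()              # infinity-norm: largest change to any single pixel
l2 = np.linalg.norm((I_adv - I).ravel())    # 2-norm: length of the perturbation as a vector

print(f"L-inf: {linf:.4f} (cap was {2/255:.4f}), L2: {l2:.2f}")
```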

Taking a slightly more geometric view, we can make a conceptual drawing like this:

Despite the large distance between the natural images of the dog and ostrich, it turns out that the dog image I sits very close to the border between the two classes, meaning that there exists another image I + ∆I near to it which sits in the ostrich region of the classifier. There is nothing special about this particular dog image: for common deep classification networks, this property of sitting right next to a decision boundary turns out to be true of all natural images. Below, we will come to understand a bit better why that is.

A Simple Explanation

Before we go any further, let’s cut to the chase and state plainly what that little distortion actually represents: it’s a feature. By “feature”, we mean a visual pattern/signal in the input image which the network uses when classifying that image. That’s all that adversarial perturbations really are: patterns that the network uses for classification, just not in the way that humans expect them to be used. And the best part of adversarial examples as feature visualisations is that they inherently prove that they are what they claim to be. An adversarial perturbation doesn’t just claim to be a feature relevant for classification: it proves that it is by actually changing the network’s classification.

This fact has been stated in different ways at different times. For example, one could say that “adversarial examples are not bugs, they are features”. Or, as was stated before that, that “the features and functions thereof that [deep convolutional networks] currently rely on to solve the classification problem are, in a sense, their own worst adversaries”. Or, as was stated before that, that “we can think of all… approaches [to feature visualisation] as living on a spectrum, based on how strongly they regularize the model. On one extreme, if we don’t regularize at all, we end up with adversarial examples.”

If that dog in I + ΔI “is” an ostrich, then the difference vector ΔI is in some sense a local “ostrich-rather-than-dog feature”, by definition.

What we’ve said above is sometimes regarded as something of a revelation. We don’t think that it needs to be. Here is an embarrassingly simple way of seeing that it’s true:

What is a “feature”?

Let’s talk more about the idea that there are features in the input that are used for classification by the network. It’s pretty intuitive to imagine a network checking its input for visual evidence (features) of different classes, weighing that evidence up, and outputting the ID of whichever class appears to have the strongest support. To be useful, we expect that features will be more strongly associated with certain classes than others, at least in certain contexts.

But how can we make that high-level idea more concrete? Given an actual network, how do we extract an actual feature, associate it with an output class, and justify our claim that it represents what we say it does?

We’ve referred before to our network N outputting a decision: the class it assigns to its input image. We’ll now get more specific about the form we assume N to take as a function: we assume that it outputs a vector of numbers with as many dimensions as there are possible output classes, with each of those numbers representing the “score” Nₛ of the corresponding class s. The output class is then taken to be the one corresponding to the entry with the highest score. (To anyone familiar with modern deep classification nets, this should all sound pretty standard.)
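To pin that interface down, here is a minimal sketch of what we’re assuming, with a stand-in linear model in place of a real classifier (any deep net that outputs a vector of class scores would do):

```python
import torch

# A minimal sketch of the interface we assume for N: given an image x, the network
# returns one score N_s per class s, and its decision is the class with the largest score.
num_classes = 1000
model = torch.nn.Sequential(                      # stand-in for a real deep classifier
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 224 * 224, num_classes),
)

x = torch.rand(1, 3, 224, 224)       # stand-in for an input image
scores = model(x)                    # shape (1, num_classes): the vector of class scores
decision = scores.argmax(dim=1)      # the output class: index of the highest score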

Given this setup, then in a local sense (meaning around a particular image), why not just use the gradients of the class scores with respect to the input image as our features? By definition, they represent the directions away from a given image along which particular class scores increase the fastest, so in the neighbourhood of that image, they clearly fit our notion of what a “feature” is. This exact view was proposed and explored for a variety of pre-DL classifiers many years ago, and then applied to deep classifiers a bit later.

So, writing down just a little bit of maths, our feature is ∇ₓNₛ, where x is the input image and Nₛ is the score of the network N for class s.³
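In code, assuming Nₛ is differentiable in x (the same assumption flagged in the footnotes), extracting this local feature is a one-liner with automatic differentiation. The model and the class index below are stand-ins for illustration:

```python
import torch

# A sketch of the "local feature" for class s around image x: the gradient of the
# class score with respect to the input. The model here is a stand-in linear net.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))
x = torch.rand(1, 3, 224, 224, requires_grad=True)
s = 243  # hypothetical class index of interest

score_s = model(x)[0, s]                      # N_s(x): the score of class s
feature_s, = torch.autograd.grad(score_s, x)  # grad of N_s w.r.t. the input: our local feature
```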

How do you find an adversarial example?

We’ve already seen examples of adversarial perturbations above, but how are those actually calculated? Let’s design a basic but functional attack ourselves, right now.

We can start with a simple general formulation much like the one given here: we have a network N, and an input image x with true label y. We can then define a cost function J(N, x, y), which we, as the adversary, want to increase with respect to the input x. That is, we want to change the image to make the net predict the wrong thing.

Let’s say we want to change the label of our input from the true (source) label s to a different (target) label t: a “targeted” attack. Then a straightforward choice of J would be J = Nₜ(x) − Nₛ(x). That is, we want to push the target class (say, “ostrich”) score up while pushing the source class (“dog”, perhaps) score down.

Now we just need an optimisation method to try to increase J. What could be easier than a single step of gradient ascent? We’ll take a step in the direction of the gradient with respect to the variable we’re optimising, the input image x:

x ← x + ε∇ₓJ = x + ε(∇ₓNₜ − ∇ₓNₛ)

In fact, what we’ve just derived here is a version of the “fast gradient method” (FGM).
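As a sketch (with a stand-in step size and assumed class indices, not any particular published implementation), that single targeted gradient-ascent step looks like this:

```python
import torch

# A sketch of the single targeted gradient-ascent step derived above:
# increase J = N_t(x) - N_s(x) by stepping along its gradient with respect to x.
def fgm_targeted(model, x, source_class, target_class, eps=1e-2):
    x = x.clone().detach().requires_grad_(True)
    scores = model(x)                                       # one score per class
    J = scores[0, target_class] - scores[0, source_class]   # the adversary's objective
    grad, = torch.autograd.grad(J, x)
    return (x + eps * grad).detach()                        # x <- x + eps * grad_x J
```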

Hang on a second…

Let’s compare the expressions we just wrote for our adversarial attack update and our feature definition. On a moment’s inspection, we can see that our “FGM attack” is nothing but the addition of a local target-class feature and the subtraction of a local source-class feature! In terms of the dog-and-ostrich picture above, it’s saying, “Take out some dog-ness and put in some ostrich-ness”.

That’s it. This is not an analogy. It’s not “like” it’s doing that. It is that. The fact that an adversarial perturbation is a feature is not something that has to be inferred or discovered. If we just look at the update formula that defines it, we see an old, well established notion of a feature sitting right there.

But that was just about the FGM attack. What about other attacks? Well, the very popular “projected gradient descent” (PGD) attack is the exact same thing, but iterated and kept to a certain maximum size. The “FGSM” attack is what you get when you just take the sign of the gradient before taking the step (to make the change in each pixel equal to ε, corresponding to an infinity-norm bound). The very effective and efficient DeepFool is, again, the same thing, but where the step length ε is varied at each step according to the gradients’ own estimates of where the scores should cross one another. These are clearly minor variations on one another. In fact, most popular attacks of this type are essentially the same thing.⁴ And you can point to the outputs of any of those attacks and define them as local features, using exactly the same interpretation that led people to use class-score gradients as features before we even began this discussion about adversarial examples on deep vision nets.
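For concreteness, here is a hedged sketch of the L∞ PGD variant just described: the signed (FGSM-style) step, iterated, with the perturbation projected back into an ε-ball after every step. The loss, step sizes, and iteration count are illustrative placeholders:

```python
import torch

# A minimal sketch of L-infinity PGD: repeat the signed gradient step, and after each
# step project the perturbation back into the eps-ball around the original image.
def pgd_linf(model, x, y, eps=8/255, step=2/255, iters=10):
    loss_fn = torch.nn.CrossEntropyLoss()
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)                         # untargeted: push up the loss on the true label
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                  # FGSM-style signed step
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                       # stay a valid image
    return x_adv.detach()
```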

Simple, right? In fact, so simple as to raise a question: how did we end up studying this “problem” in the first place? Isn’t this kind of obvious and expected? Isn’t it normal?

Do I look like a guy with a plan?

Well, only sort of.

Why the Problem is a Problem

See, the remarkable part of this phenomenon isn’t the fact that you can perturb an image until its classification changes. If you’re sufficiently unconstrained in the additive perturbation you’re allowed to apply, then that fact is trivial. Consider Iₐ + (Iᵦ − Iₐ) = Iᵦ, where Iₐ and Iᵦ are real images of two different classes, and (Iᵦ − Iₐ) is the additive perturbation that changes Iₐ into Iᵦ. If Iₐ is an image of a cat, and Iᵦ is an image of a dog, then (Iᵦ − Iₐ) would represent one notion of “dogness” (between those two specific images), and adding it to the specific cat Iₐ trivially produces the specific dog Iᵦ.

This is not remarkable.

No, the remarkable fact is the one that we had to add to complete our definition of adversarial examples at the beginning: it’s that the distortions turn out to be so small. That is, natural images always seem to end up sitting right next to the decision boundaries that classifiers draw between their class and other classes. And on top of that, the patterns often aren’t immediately comprehensible: they’re “small” not just in the sense of length, but in the sense of apparent meaning.

Why would something like this happen? Why would classifiers behave like this instead of just drawing more “normal” and arguably “better” boundaries, that correspond to something more like what we learn as humans?

The short answer is: because they tend to, and because they can get away with it (sort of). The longer answer is…

What It’s Trying to Tell Us

Let’s look at some actual adversarial examples, bearing in mind that they’re just features associated with one or more classes, and see what those examples are telling us about how the network has learned to classify. To make this especially clear, we’re going to look at “universal adversarial perturbations”, which are somewhat special in that their effect doesn’t depend too much on the image to which they’re applied. This helps us to look through the local specifics of adversarial examples computed at different points, and see the forest for the trees in terms of what the net has actually learned.

Take this group of perturbations, any of which transforms most images into African grey parrots when added to them, according to GoogLeNet:

“Universal” perturbations that mostly target the African grey parrot class, even though they weren’t explicitly constrained to do so! See Fig. 5 in original work.

At a glance, that might seem bizarre. But zoom in and look closely, and you’ll see that the images are not meaningless noise at all. Deep nets are actually typically pretty robust to the addition of random noise! Rather, they’re “soups” of mid-level structures associated with the class of interest⁵: little pieces of parrot, scrambled over a dense, disorganised texture⁶. Once you know what you’re looking for, you’ll start to see the eyes, beaks, and characteristic ruffled feather textures, in what may initially have appeared to be a meaningless mess.

What this example is telling us is the following: if you want to increase the score of the African grey class (at the expense of other class scores), the best thing to do is to load the image with disorganised pieces associated with that class, because the net will recognise those and rapidly increase that class’s score, without being bothered by the fact that they don’t represent a coherent higher-level structure. Because that’s not what the net actually recognises.
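What “universal” means operationally is easy to state in code. The sketch below just measures how often one fixed perturbation pushes a batch of images into a chosen target class; the model, image batch, perturbation, and class index are all assumed inputs for illustration, not anything from the original paper’s code:

```python
import torch

# One fixed delta, added to many different images, pushes most of them into the same
# class: that is what makes a perturbation "universal".
def targeted_fooling_rate(model, images, delta, target_class):
    with torch.no_grad():
        preds = model((images + delta).clamp(0.0, 1.0)).argmax(dim=1)
    return (preds == target_class).float().mean().item()
```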

But how is that possible? How could the net have managed to (mostly) successfully solve a relatively large-scale classification problem like ImageNet using a representation like this? Wouldn’t it have had to have learned something more sophisticated and, well, “correct” than that? Basically, no. This kind of thing is enough to take care of most of the problem on its own.

This is exactly what the BagNet paper was about. It’s not about adversarial examples per se, but it directly connects to the understanding we’re developing here.

The point of that paper was that, using a network architecturally constrained to only use deep learning at the level of relatively small patches (i.e. to score local features) and then add their (independent) scores up, you can match the performance of the AlexNet model that kicked off the deep learning revolution in image classification in the first place. The patch size required to do that is 17x17. Expand the patches to 33x33, and you don’t fall far short of VGG-16. That architecture, BagNet, conclusively demonstrates that no processing of higher-level structure is necessary for achieving its level of performance, because it cannot do such processing by design.⁷

The BagNet architecture of this paper. The architecture can’t do anything but produce class scores for each patch as it scans the image, then add up the class scores of all of the patches to produce the final result. It has no idea where in the image each patch is when scoring it: all are treated equivalently. Despite this, it loses relatively little in performance compared to “real” CNNs trained on the same problem. See Fig. 1 in original work.
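To make the architectural constraint concrete, here is a rough sketch of the bag-of-local-features idea (not the authors’ actual BagNet implementation): score each small patch independently with a shared patch classifier, and simply sum the per-patch class scores, discarding all information about where each patch was. The patch scorer below is a stand-in linear model, with the 17x17 patch size matching the text above:

```python
import torch

# Score every patch independently with a shared classifier and sum the scores.
# The position of each patch is never used: all patches are treated equivalently.
def bag_of_patches_logits(patch_scorer, image, patch=17, stride=8, num_classes=1000):
    _, H, W = image.shape
    total = torch.zeros(num_classes)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            p = image[:, i:i + patch, j:j + patch].reshape(1, -1)
            total += patch_scorer(p).squeeze(0)
    return total

patch_scorer = torch.nn.Linear(3 * 17 * 17, 1000)   # stand-in for the patch-level deep net
logits = bag_of_patches_logits(patch_scorer, torch.rand(3, 224, 224))
```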

And conv nets⁸ turn out to be really good at solving problems that allow for those sorts of solutions. That property of CNNs has actually been studied directly. And while it seems possible to wean them off of that type of behaviour a bit, our current power to do that is limited. The obvious solution of just feeding adversarial examples back into the training set and forcing the net to learn that they should have the same labels as their source images, i.e. “adversarial training”, was proposed quite a while ago. And while there is some merit to the approach, the reality is that the classifiers it produces have much lower accuracy on standard datasets than the results people have grown accustomed to, tend to suffer from greater generalisation issues than standard nets do, and are only able to satisfy a very limited notion of “robustness” in any case. CNNs really “like” certain solutions more than others, and they don’t much like what adversarial training tries to teach them.
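For reference, here is a minimal sketch of that adversarial-training idea, with a single signed-gradient inner step for brevity; the model, optimiser, and hyperparameters are placeholders rather than any specific published recipe:

```python
import torch

# One training step of the adversarial-training idea: craft a perturbed version of the
# batch on the current model, then train the model to give it the original labels.
def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    loss_fn = torch.nn.CrossEntropyLoss()
    # Inner step: single signed-gradient attack on the current model.
    x_pert = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_pert), y).backward()
    x_adv = (x_pert + eps * x_pert.grad.sign()).clamp(0.0, 1.0).detach()
    # Outer step: train on the adversarial examples with the clean labels.
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```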

Simplifying our high-dimensional image space and nonlinear deep network down to a 2D image of a linear classification problem for the purposes of illustration and intuition, we can picture what’s going on as follows.

The naive picture of how different classes are being distinguished from one another might look something like this:

But that’s not the case at all. What the net is actually doing looks a lot more like this:

… where the normal to that boundary represents the pattern whose correlation with the input image the classifier thresholds to decide which side of the boundary it’s on.

Remember: a binary linear classifier is defined by the fact that it bases its decision, always and entirely, on whether the dot product w·x between the input x and the decision boundary normal w is above or below a given threshold. Taking a geometric view, the classifier is looking at the vector component of x in the direction of w (i.e. the projection of x onto w), and basing its decision on that vector’s magnitude: the orthogonal components, parallel to the boundary, are all irrelevant. From our conceptual viewpoint, that is a classifier that has a single feature w which it “checks” the input for, ignoring everything else.
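Here is a tiny numerical illustration of that point, with made-up data: moves parallel to the boundary, however large, change nothing, while a small move along w flips the decision.

```python
import numpy as np

# The classifier only "sees" the component of x along w: moves parallel to the
# boundary are invisible, while a small move along w is enough to cross it.
rng = np.random.default_rng(0)
w = rng.normal(size=1000); w /= np.linalg.norm(w)        # the single "feature" being checked for
x = rng.normal(size=1000)                                # a stand-in "image"
decide = lambda v: np.dot(w, v) > 0.0                    # the whole classifier

parallel = rng.normal(size=1000)
parallel -= np.dot(parallel, w) * w                      # a direction parallel to the boundary

print(decide(x), decide(x + 100.0 * parallel))           # unchanged, however big the parallel move
step = -1.01 * np.dot(w, x) * w                          # a small step along w, just past the boundary
print(decide(x), decide(x + step))                       # the decision flips
print(np.linalg.norm(step), np.linalg.norm(x))           # ...and the step is tiny compared to x
```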

Why does the picture end up looking like this one? Because the sorts of patterns that CNNs are learning to look for don’t correlate strongly with any natural image, and so no natural image is that far from the decision boundary. With nets that behave in this way, it’s not possible for a natural image to match the scores we’re able to generate synthetically. Natural images of recognisable objects don’t generally contain dense scrambles of specific low- or mid-level features: they are much more structurally constrained, and are in fact partly defined by such structure, to us. But those lower-level features do exhibit class-correlated statistical properties all on their own, and while the correlations are “weak” in one sense, they can actually be quite powerful if the goal is to tell a certain class apart from certain others.

That is, although no natural image ends up containing more than a small fraction of the (synthetically) possible total class score, natural examples of images of a certain class tend to contain a bit more of the patterns associated with that class than sample images of other classes do⁹. Returning to our picture of cat Iₐ and dog Iᵦ, (Iᵦ − Iₐ) is almost parallel to the linear decision boundary, but not quite. That weak correlation represented by the orthogonal component has quite a lot of discriminative power, even though it isn’t stable or robust. And it turns out that these are the kinds of solutions that conv nets tend to be good at finding.
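Here is a toy illustration of that last claim, on purely synthetic data with made-up numbers: a class-correlated direction that accounts for only a couple of percent of each example’s norm still separates the two classes most of the time.

```python
import numpy as np

# A class signal that is only ~2% of each example's norm, buried in high-dimensional
# "everything else", still separates two synthetic classes most of the time.
rng = np.random.default_rng(1)
d, n = 10_000, 500
w = rng.normal(size=d); w /= np.linalg.norm(w)   # the weakly class-correlated pattern

class_a = 2.0 * w + rng.normal(size=(n, d))      # class A: faint bias along w, plus noise
class_b = -2.0 * w + rng.normal(size=(n, d))     # class B: faint bias the other way

accuracy = ((class_a @ w > 0).mean() + (class_b @ w < 0).mean()) / 2
signal_fraction = 2.0 / np.sqrt(d)               # ~0.02 of a typical sample norm (~100)
print(f"accuracy: {accuracy:.3f}, signal as fraction of norm: {signal_fraction:.3f}")
```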

This is arguably remarkable.

This kind of statistical hair-splitting works a lot of the time. Even almost all of the time. But if it were totally satisfactory, we wouldn’t be here talking about this right now. And if we can insert our own signal into the image, then it doesn’t take much to accomplish our goal as the adversary.

Again, for this part of the explanation, we’ve been drawing simplified pictures of 2D linear classifiers. In light of that, let’s go all the way back to a very early paper on adversarial vulnerability and reconsider their explanation. There, the phenomenon of adversarial examples was attributed to nets being “too linear”, the basic idea being that the reaction of a linear net can be made to grow with the dimension of the input space without the input perturbation needing to grow accordingly (or at all, in the infinity-norm). What do we think of that? Well, deep net boundaries are certainly not linear at a higher-level view. They in fact tend to look like this:

Decision boundaries in 2D slices of image space. The x-axes represent a gradient direction from VGG-16, and the y-axes a randomly chosen direction. See Fig. 3 in original work.

Beyond that, even if we accept the assumption of linearity, points about its inherent brittleness in high dimensions can be stretched too far. You can see Section 3 of this paper for some nicely argued refutations of the idea that that setup per se leads unavoidably to the existence of adversarial examples.

But local linearity is nonetheless important in how we formulate and understand adversarial attacks, and thus represents an important aspect of how popular deep nets behave. As we’ve covered above, most popular attacks are just variants on gradient descent, meaning that they inherently rely on locally linear approximations of the true function. And since attack perturbations are typically very small in magnitude, and often found in a single iteration, that linear approximation is generally good over the interesting region: the one between the image and the boundary. Linearity is an important concept in understanding adversarial examples; it’s just also important to keep it in context. This paper goes a bit further into explaining what we can do entirely with linear explanations, and part of why we get nonlinear boundaries like the ones depicted above.

Simple, But Complicated

Now, just because this phenomenon is quite simple at its core doesn’t mean that our work is done. Characterising the inductive biases of the networks and the statistical properties of the training data that lead to this kind of behaviour takes us down a deep rabbit hole. As a community, we are not yet done explaining why, of all the solutions to the problem that exist, these are the sorts that our current networks inevitably find, and why efforts to steer them towards different solutions prove so much harder. But hopefully, this has shed some light on the situation that we currently have.

And now that you understand all of this, we can talk about how well adversarial perturbations (which are just features) transfer between networks. That is, we can talk about how similar different deep nets trained on the same problem end up being to one another. Our next post will be all about that.

[1] Before being studied in computer vision, a version of this problem had already come up in other machine learning contexts, including spam and malware filtration. See e.g. this, this, and this. An adversarial attacker in that context is someone delivering a malicious payload while exploiting the filter’s naïve notion of what malicious inputs look like in order to bypass it. It goes without saying (despite my saying it) that this problem has real-world consequences.

[2] As the study of adversarial examples has progressed, people have also considered e.g. warps and anything verified by a human to be semantically irrelevant, where the “smallness” of D is defined accordingly.

[3] We’ve also assumed here that the gradient is in fact defined for net N, i.e. that Nₛ is in fact a differentiable function of its input x. This will also seem standard to anyone already accustomed to dealing with these networks, and, as we will see below, is a common assumption of adversarial attack methods. But even if this quantity doesn’t technically exist, we can use numerical methods to compute a quantity that basically represents the same thing.
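For instance, a central-difference estimate along one coordinate looks like the following sketch; this is far too slow to run per pixel on real images, and is shown only to make the idea concrete:

```python
import numpy as np

# If we can only query the class score N_s(x) as a black box, a finite-difference
# estimate along a coordinate still approximates that component of the gradient.
def finite_difference(score_fn, x, index, h=1e-3):
    e = np.zeros_like(x)
    e.flat[index] = h
    return (score_fn(x + e) - score_fn(x - e)) / (2 * h)   # central difference along one coordinate
```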

[4] “With apologies to the many authors that have written papers on different attacks, most of these are variants of PGD for different norm bounds.” (Slide 21)

[5] When the original work was published, some of what we’re discussing here wasn’t as clearly understood, so the class-targeting aspect of universal adversarial perturbations was noted as a curious fact that emerged unexpectedly. They actually discovered UAPs before characterising how they did what they did. For that, you can see this and this. UPSET and TREMBA are both techniques for learning how to inject class-specific adversarial patterns which rely on this underlying principle.

[6] Where one draws the line between a “texture” and “a disorganised mess of mid-level features” is a matter of semantics. You can take the image itself as an example of how we’re defining a texture here.

[7] Note that this isn’t the same thing as claiming that using higher-level structure cannot further improve performance on ImageNet, or even that some deep networks are not already doing that to some extent. As performance increases, this becomes a more subtle point: look closely at the results in the paper to understand what performance is empirically added by modern architectures.

[8] Note that we’re mostly focusing on CNNs here, but this is not to claim that these vulnerabilities are completely unique to them. It’s just that that’s the case that we’ve had time to study in great depth. Recent work has got into looking at Transformers as well and establishing a basis for comparing the robustness of the two architecture families.

[9] Consult the heatmaps in the BagNet paper, e.g. Figs. 2 and 4, to get a sense of what this looks like on real images, and how it can involve an interplay between the strength of local evidence and the spatial support of it over the whole image. Fig. 4 gives a good idea of how this can go wrong in practice.

Acknowledgements

Thanks to John Redford for very helpful feedback on earlier drafts.

Besides the images reproduced from the respective authors’ original papers for the purposes of commentary (as noted in the captions), we would like to acknowledge the participation of Struthio molybdophanes, Smudge the Cat, and Gotham’s favourite low-resolution antagonist.
