# Paper Summary: **Synthesizing Robust Adversarial Examples**

Part of the series *A Month of Machine Learning Paper Summaries*. Originally posted here on 2018/11/27.

Synthesizing Robust Adversarial Examples (2017) Anish Athalye, Logan Engstrom, Andrew Ilyas, Kevin Kwok

This paper is the one that produced the 3D printed turtle that, when viewed from almost any angle, modern ImageNet-trained image classifiers misclassify as a rifle. This work was done roughly concurrently with Eykholt et al 2017, which also deals with adversarial examples in the real world, but the approaches are different enough that I think it’s worth looking at both papers. So look out for another summary on this topic tomorrow.

Early on, adversarial examples were considered theoretically interesting, but were hard to take seriously. Small changes in lighting, camera noise, and viewpoint would destroy adversarial misclassification, and this fragility was thought to be fundamental to the nature of the attack. This viewpoint has started to change with new work on bringing adversarial examples into the physical world, and this paper (along with Eykholt et al 2017) is one of the first major results to demonstrate how serious the risk under this threat model is. In contrast to Eykholt et al 2017, this paper develops a general method for incorporating transformation invariance, given a distribution of transformations of the input object.

Incidentally, the idea of making adversarial examples more robust reminded me a lot of Carlini and Wagner’s (C&W) work on defeating several defensive measures. Indeed, the approaches are markedly similar.

The approach taken here is called Expectation Over Transformation (EOT), which produces adversarial examples that are robust — that is, effective over an entire distribution of transformations. Generating 3D adversarial physical examples is just a special case of EOT. The idea is relatively straightforward: where current methods try to maximize the log-likelihood of the target class near an input, EOT extends this to maximize an expectation of the log-likelihood given transformed inputs.
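Written out, the two objectives look roughly as follows. The notation is assumed here for illustration: *y_t* is the target class, *P* is the classifier's output probability, *T* is a distribution over transformations *t*, *d* is a distance function, and ε is a perturbation budget.

```latex
% Standard targeted attack: maximize the target log-likelihood near x
\hat{x} = \arg\max_{x'} \; \log P(y_t \mid x')
\quad \text{subject to} \quad \lVert x' - x \rVert_p < \epsilon

% EOT: maximize the expected target log-likelihood over t ~ T,
% with the distance also measured on transformed inputs
\hat{x} = \arg\max_{x'} \; \mathbb{E}_{t \sim T}\left[ \log P(y_t \mid t(x')) \right]
\quad \text{subject to} \quad \mathbb{E}_{t \sim T}\left[ d\big(t(x'), t(x)\big) \right] < \epsilon
```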

In a little bit more detail, we start by choosing a distribution *T* of differentiable transformations. Each *t* in *T* maps *x*’, the input controlled by the adversary, to *t*(*x*’), the input as perceived by the classifier. Where previous methods minimize the *Lp* distance between the original *x* and the adversarial *x*’, EOT’s distance metric is applied to the transformed points (the idea here is that distance in the transform’s codomain more closely tracks human perceptual distance). The expected distance and the expected target-class log-likelihood are jointly optimized (in Lagrangian form) with SGD: a *t* is sampled from *T* in each forward-backward pass, and since *t* is differentiable, gradients are backpropagated through it as usual.
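To make the loop concrete, here is a minimal sketch of the sample-then-step idea on a toy problem. Everything in it is an assumption for illustration rather than the authors' setup: the classifier is a tiny linear softmax model, the transformation family is a random brightness scale and offset, the distance is squared L2 in input space (not LAB), and names such as `eot_attack` and `sample_transform` are hypothetical. Gradients are written by hand so that it runs with the standard library only.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sample_transform(rng):
    # Hypothetical transformation family t(x) = a * x + c (differentiable in x).
    return rng.uniform(0.8, 1.2), rng.uniform(-0.1, 0.1)

def eot_attack(W, x, target, lam=0.1, lr=0.1, steps=300, seed=0):
    """Stochastic gradient ascent on E_t[log P(target | t(x')) - lam * ||t(x') - t(x)||^2]."""
    rng = random.Random(seed)
    x_adv = list(x)
    for _ in range(steps):
        a, c = sample_transform(rng)          # one sampled t per step
        tx = [a * v + c for v in x_adv]       # what the classifier "sees"
        p = softmax(logits(W, tx))
        # d log p[target] / d logits = onehot(target) - p
        dz = [(1.0 if k == target else 0.0) - p[k] for k in range(len(p))]
        for i in range(len(x_adv)):
            # Chain rule through t: d t(x')_i / d x'_i = a
            grad_ll = a * sum(W[k][i] * dz[k] for k in range(len(dz)))
            # ||t(x') - t(x)||^2 = a^2 * ||x' - x||^2
            grad_dist = 2.0 * a * a * (x_adv[i] - x[i])
            x_adv[i] += lr * (grad_ll - lam * grad_dist)
            x_adv[i] = min(1.0, max(0.0, x_adv[i]))  # keep a valid input
    return x_adv

def mean_target_prob(W, x, target, n=500, seed=1):
    """Average target-class probability over freshly sampled transformations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a, c = sample_transform(rng)
        total += softmax(logits(W, [a * v + c for v in x]))[target]
    return total / n

# Toy 3-class model: class k responds to input component k.
W = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
x = [0.9, 0.1, 0.1]                 # confidently class 0
x_adv = eot_attack(W, x, target=1)  # push it toward class 1 under all sampled t
```

Comparing `mean_target_prob(W, x, 1)` with `mean_target_prob(W, x_adv, 1)` shows the effect: the optimized input is pushed toward the target class across the whole sampled family, not just for a single transformation.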

The choice of the distribution *T* depends on the use case. For 2D printed patches (see also Brown et al 2016’s “adversarial patches”) the authors use random affine transformations. For 3D they use a rendering pipeline that takes a texture and 3D object shape and generates an image under varying lighting conditions and camera angles. EOT requires differentiability and the authors were able to extract a differentiable mapping from screen space to texture space from the renderer. (They don’t provide a lot of details here, but it seems to me that an approach similar to the one from Spatial Transformer Networks ought to work.)
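For the 2D case, sampling from a family of random affine transformations might look something like the sketch below. The parameter ranges and function names are made up for illustration and are not the values used in the paper.

```python
import math
import random

def affine_matrix(angle, scale, tx, ty):
    """2x3 matrix: rotate by `angle`, scale uniformly, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s, tx],
            [scale * s,  scale * c, ty]]

def sample_affine(rng):
    # Hypothetical ranges, chosen only for illustration.
    return affine_matrix(
        angle=rng.uniform(-math.pi / 8, math.pi / 8),
        scale=rng.uniform(0.9, 1.4),
        tx=rng.uniform(-0.2, 0.2),
        ty=rng.uniform(-0.2, 0.2),
    )

def apply_affine(M, point):
    """Map an (x, y) coordinate through the 2x3 affine matrix M."""
    x, y = point
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

rng = random.Random(0)
M = sample_affine(rng)
warped_corner = apply_affine(M, (1.0, 1.0))
```

This only transforms coordinates. To warp an actual image in a way that gradients can flow through, one would typically sample pixel values at the transformed coordinates with bilinear interpolation, which is the mechanism Spatial Transformer Networks use.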

Another detail is the distance metric used. The authors use the L2 norm in LAB color space, which was designed to be approximately “perceptually uniform”, meaning that Euclidean distance in this space roughly tracks the color difference a human perceives.
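As a rough sketch of what such a distance involves, the snippet below converts sRGB pixels (components in [0, 1]) to CIELAB using the standard sRGB and D65 formulas, then takes the L2 norm of the difference. The function names are hypothetical, and the paper's actual implementation and scaling may differ.

```python
import math

def _srgb_to_linear(c):
    # Undo the sRGB gamma curve.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _f(t):
    # CIELAB nonlinearity: cube root with a linear segment near zero.
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

def srgb_to_lab(rgb):
    """Convert one sRGB pixel (r, g, b in [0, 1]) to (L, a, b), D65 white point."""
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = _f(x / 0.95047), _f(y / 1.0), _f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))

def lab_l2(image_a, image_b):
    """L2 norm of the LAB difference between two equal-length lists of pixels."""
    total = 0.0
    for pa, pb in zip(image_a, image_b):
        la, lb = srgb_to_lab(pa), srgb_to_lab(pb)
        total += sum((u - v) ** 2 for u, v in zip(la, lb))
    return math.sqrt(total)

original = [(0.2, 0.5, 0.3), (0.8, 0.8, 0.8)]
perturbed = [(0.22, 0.5, 0.3), (0.8, 0.79, 0.8)]
distance = lab_l2(original, perturbed)
```

In an attack, this conversion would need to be expressed in an autodiff framework so that the distance term contributes gradients. The pure-Python version here is only meant to show the arithmetic.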

Results were evaluated on InceptionV3 and tested both in simulation and in the physical world. The evaluation metric is an *adversariality score*, the probability that an example is classified as the target class, averaged over a sample of transformations drawn from *T*. The 2D case had a score of 96.4% (with a mean L2 norm of 5.6e-5, which sounds pretty small, though some of the perturbations are clearly visible, e.g. “caldron” / velvet). The simulated 3D case had an adversariality score of 83.4%, though again some of the examples in the appendix have clearly been messed with (the speedboat / crossword puzzle and the orange / power drill are the clearest).
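The adversariality score can be estimated by plain Monte Carlo sampling: draw transformations, classify each transformed input, and count how often the target class comes out. Below is a minimal sketch with toy stand-ins for the classifier and the transformation distribution. All names are hypothetical and none of it comes from the paper's code.

```python
import random

def adversariality(classify, x, sample_transform, target, n=10000, seed=0):
    """Fraction of sampled transformations under which x is classified as target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t = sample_transform(rng)   # draw t ~ T
        if classify(t(x)) == target:
            hits += 1
    return hits / n

# Toy stand-ins: a 1D threshold "classifier" and a random shift "transformation".
def toy_classify(v):
    return 1 if v > 0.25 else 0

def toy_sample_transform(rng):
    shift = rng.uniform(-0.5, 0.5)
    return lambda v: v + shift

score = adversariality(toy_classify, 0.5, toy_sample_transform, target=1)
```

In this toy setup the input lands on the target side whenever the shift exceeds -0.25, so the estimate should come out near 0.75.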

The authors produced two physical adversarial examples, one turtle / rifle, one baseball / espresso. The turtle was classified as a rifle 82% of the time and correctly classified only 2% of the time; for the baseball those numbers are 59% and 10%. The authors note in the discussion — and this should not be surprising — that the larger the space of transformations the larger the perturbation required to reach adversarial levels.

There was some discussion of potential improvements to the basic method. For one, the results could probably be improved with a more sophisticated understanding of human perception and better associated metrics. Another improvement for the physical world examples might be made in modeling printer output, explicitly correcting for the printer’s color gamut rather than using noisy color mapping transformations. This wasn’t mentioned in the paper, but I also get the sense that strategically choosing the source and target could make adversarial attacks much stronger, based on the observation that the distortions in some examples were far more visible than others.

So nice formulation, cool result, even a little scary! I certainly wouldn’t want to carry that turtle/rifle around, especially as surveillance with image classification becomes more and more ubiquitous. I wonder if (or when) this will become subject to regulation — will there be turtle open carry laws? Will there be 3D printed guns that misclassify as turtles? Who will be the first person to get an adversarial patch tattoo, and what class will it target?

Brown et al 2016 “Defensive distillation is not robust to adversarial examples” https://arxiv.org/abs/1607.04311

Carlini and Wagner 2017 “Towards evaluating the robustness of neural networks” https://arxiv.org/abs/1608.04644

Eykholt et al 2017 “Robust Physical-World Attacks on Deep Learning Models” https://arxiv.org/abs/1707.08945