Paper Summary: Universal adversarial perturbations

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/25.

Universal adversarial perturbations (2016) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard

Adversarial perturbations, to this point, have been image-specific: given an input image, one finds a perturbation that pushes that particular image into a different class. Perhaps surprisingly, there also exist universal perturbations that can be applied to almost any image to cause misclassification. These work against all the state-of-the-art classifiers the authors test, and even fine-tuning against the attack only partially protects (more on that below). This paper shows how such a perturbation can be constructed. The perturbation is what the authors call “quasi-imperceptible”, which is to say quite perceptible if you’re looking for it. Despite this I think the result is really cool, especially since it jogs some intuitions about how these classifiers work.

We’d like to find a single universal perturbation that reliably causes misclassifications. Note that we don’t require strict universality, only that “most” perturbed inputs are misclassified. The authors call the desired proportion the fooling rate; the proportion actually achieved on the validation set is called the fooling ratio. (One wonders whether the choice of terminology was influenced in any way by the lead author’s name.) The approach is to take a small set X of images sampled from the training distribution. For each image x, find the minimal extra perturbation that sends the current perturbed point x + v across the decision boundary. These per-image perturbations are accumulated into a candidate universal perturbation v, which is constrained to have a sufficiently small size (Lp norm, for p = 2 and p = ∞).
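As a quick sketch of how the fooling ratio would be measured (the `classifier` callable and the image arrays here are hypothetical stand-ins, not the authors' code):

```python
import numpy as np

def fooling_ratio(classifier, images, v):
    """Fraction of images whose predicted label changes under perturbation v."""
    clean = np.array([classifier(x) for x in images])
    fooled = np.array([classifier(x + v) for x in images])
    return float(np.mean(clean != fooled))
```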

Some details on the algorithm: empirically X doesn’t have to be very large (some numbers below). To solve the optimization problem of finding the minimal per-image perturbation, the authors use DeepFool (Moosavi-Dezfooli et al. 2016; again, love the nominative determinism). I haven’t looked too closely at it, but it seems cool: slower than Goodfellow’s fast gradient sign method, but it yields qualitatively better results. The accumulation of perturbations is done by iteratively adding in the latest per-image perturbation and projecting back onto the Lp ball of radius ξ (the constraint on the size of the perturbation).
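The iterate-and-project loop can be sketched roughly as follows. This is a minimal NumPy sketch under my own assumptions: `minimal_perturbation` stands in for DeepFool (any routine returning a small class-flipping perturbation would do), and the stopping logic is simplified relative to the paper’s:

```python
import numpy as np

def project_lp(v, xi, p):
    """Project v onto the Lp ball of radius xi (p = 2 or np.inf)."""
    if p == 2:
        norm = np.linalg.norm(v.ravel())
        if norm > xi:
            v = v * (xi / norm)
    elif p == np.inf:
        v = np.clip(v, -xi, xi)
    return v

def universal_perturbation(images, classifier, minimal_perturbation,
                           xi=10.0, p=np.inf, target_rate=0.8, max_epochs=10):
    """Accumulate per-image minimal perturbations into a universal one."""
    v = np.zeros_like(images[0])
    for _ in range(max_epochs):
        for x in images:
            # Only update v if x + v is not already misclassified
            if classifier(x + v) == classifier(x):
                dv = minimal_perturbation(x + v)  # DeepFool in the paper
                v = project_lp(v + dv, xi, p)
        rate = np.mean([classifier(x + v) != classifier(x) for x in images])
        if rate >= target_rate:
            break
    return v
```

Note that the projection happens after every per-image update, so v never leaves the budget ball, rather than being clipped once at the end.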

Some example perturbations, scaled up to be easily visible:

Interestingly this approach exhibits impressive levels of generalization. As you’d expect, the fooling ratio is higher for the sample set X than for the validation set, but only slightly (e.g. 82.9% vs 82.0% on GoogLeNet). The authors used a set X of 10,000 images, but a set as small as 500 yielded significant results (30% fooling ratio on GoogLeNet), and that amounts to less than one sample image per ImageNet class! Just as impressive is the transferability across architectures, with cross-model fooling ratios ranging from roughly 40% to 74%. I expect that this is sensitive to the similarity in training data, and that networks trained on different natural image datasets might not fool each other as well.

Another curious feature is the topology of the fooling graph (with an edge from label A to label B when images with true label A are typically misclassified as B). The graph looks like a collection of disjoint components, each mostly pointing inward to a single dominant label. The full diagram is in the appendices.

A sobering, if unsurprising, result came from the fine-tuning experiments. The authors fine-tuned the network on universally perturbed adversarial examples (actually using adaptive perturbations, keeping the direction fixed and varying the magnitude to avoid overfitting to a single perturbation). This reduced the fooling rate from 93.7% to 76.2%, which sounds like a good start. But further iterations of this procedure were actually worse than the first iteration.

The most interesting part of the paper is the discussion of what universal perturbations tell us about the shape of decision boundaries in image classifiers. Aside from the observation, noted above, that small sample sets generalize surprisingly well, the authors also compare universal perturbations to random noise perturbations. They note first that the magnitude of random perturbation necessary to cause misclassifications was at least an order of magnitude larger than for the universal perturbation. Additionally, the singular values of a matrix whose rows are normal vectors to the decision boundary near natural images were atypically distributed, indicating that a low-dimensional subspace likely contains most of those normal vectors. To test this, they sampled random vectors from the subspace spanned by the first 100 singular vectors. These vectors achieved a fooling ratio of 38% (compared to only about 10% for fully random vectors of the same norm). Super cool! A rare look into the hidden inner workings of deep nets. (This paragraph may not have made a ton of sense, since I had to compress a lot of somewhat complicated stuff; if you want to know the details go ahead and read §4 of the paper.)
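The singular-vector experiment could be sketched as below. I'm assuming `normals` is a matrix with one decision-boundary normal per row (how those normals are computed is DeepFool territory and omitted here):

```python
import numpy as np

def subspace_random_perturbation(normals, k=100, xi=10.0, rng=None):
    """Sample a random direction from the span of the top-k right singular
    vectors of the matrix of decision-boundary normals, scaled to norm xi."""
    rng = np.random.default_rng() if rng is None else rng
    _, _, vt = np.linalg.svd(normals, full_matrices=False)
    basis = vt[:k]                       # top-k right singular vectors (rows)
    coeffs = rng.standard_normal(len(basis))
    v = coeffs @ basis                   # random vector in the subspace
    return v * (xi / np.linalg.norm(v))  # scale to the perturbation budget
```

The punchline is that vectors drawn this way fool the network far more often than fully random vectors of the same norm, which is the evidence for the low-dimensional subspace.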

So, cool result, a little flashy, probably not too useful, but heavy enough on the insight that I found it valuable. Side note, this paper was a pleasure to read, clear and well organized. I’ve noticed there’s wide variation in writing quality in these papers (sorry, Papernot) and the quality makes a big difference in how quickly the reader can understand the material.

Moosavi-Dezfooli et al. 2016, “DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks”