Paper Summary: Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Mike Plotz

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/26, with better formatting.

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods (2017) Nicholas Carlini, David Wagner

This paper is like eight papers in one: Carlini and Wagner look at 10 defensive strategies against adversarial examples from 7 different papers and try to break them. So we’re biting off a lot here. The background is that attempts to correctly classify adversarial examples have mostly failed, so the community has mostly fallen back on just detecting adversarial inputs. C&W (as the authors call themselves) show that even this is quite difficult and that most such approaches can be defeated by a “zero-knowledge” attack.

C&W consider three threat models (following Biggio 2013):

  • A generic “zero-knowledge” attack; the attacker isn’t even aware that there’s a detector in place
  • “Perfect-knowledge” / white-box attack: the other extreme, in which the attacker has access to intimate details (architecture, model parameters) of the detection system
  • “Limited-knowledge” / black-box attack: the attacker knows which detection scheme is in place, but doesn’t have access to model details — somewhere in between the other options

A successful zero-knowledge attack clearly works against the other two scenarios, so this is checked first. Perfect-knowledge attacks can sometimes be adapted to the limited-knowledge situation by building a substitute network and performing a white-box attack against the substitute. C&W point out that limited-knowledge attacks are only interesting if zero-knowledge attacks fail and perfect-knowledge attacks succeed, so these are tried last.

The C&W attack is from the authors’ other paper (C&W 2017 “Towards evaluating the robustness of neural networks”), explained briefly in §2.6. The idea is to construct a loss function, given an input x, consisting of L2 distance to x plus a constant c times another loss function ℓ:

ℓ(x′) = max(max{Z(x′)_i : i ≠ t} − Z(x′)_t, −κ).

Let’s break that down a bit. Z(x′)_t is the target class logit for a candidate point x′, max{Z(x′)_i : i ≠ t} is the largest non-target logit, and κ is a tunable confidence threshold. The idea is that we want the target class to be some amount more likely than the next most likely class, but not maximally likely — that would fight directly against the L2 norm constraint. The loss function is minimized with gradient descent and the constant c is found through binary search. The result is a higher-quality adversarial example than the cheaper fast gradient sign (FGS) method (Goodfellow 2014) and JSMA (Papernot 2016).
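
To make the margin loss concrete, here is a minimal sketch in PyTorch, assuming logits is the vector Z(x′) for a single candidate input; variable names are mine, and details of the original attack such as the box-constraint change of variables are omitted.

```python
import torch

def cw_margin_loss(logits, target, kappa=0.0):
    """Margin loss l(x'): push the target logit above the best
    non-target logit by at least kappa (a sketch, not the full attack)."""
    target_logit = logits[target]
    other = logits.clone()
    other[target] = float('-inf')       # mask out the target class
    max_other = other.max()             # largest non-target logit
    return torch.clamp(max_other - target_logit, min=-kappa)

def cw_objective(x_adv, x_orig, logits, target, c, kappa=0.0):
    """L2 distance to the original input plus c times the margin loss;
    minimized by gradient descent, with c found via binary search."""
    l2 = torch.sum((x_adv - x_orig) ** 2)
    return l2 + c * cw_margin_loss(logits, target, kappa)
```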

C&W evaluate 10 defensive approaches from 7 papers. Each defense is first evaluated on the FGS and JSMA baselines to ensure it was correctly implemented — these simple attacks should be defeated. Then the defense is attacked in earnest with the stronger C&W attack described above.

The 10 approaches fall into roughly four categories, the first of which is secondary classification — that is, using a neural network to detect adversarial examples directly. Grosse 2017 add a new class to their network specifically for adversarial examples. Gong 2017 train a separate binary classifier. Both defenses do well against C&W’s zero-knowledge attack… on MNIST. They do much less well on the larger, more complex CIFAR dataset. The perfect-knowledge attack completely overruns the defenses, though for Gong 2017 the attack required a loss function modified to use logits from both networks (the discussion of the choice of loss function is interesting, but out of scope of this summary — see the second half of §3.1).
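
As a rough illustration of the secondary-classification idea (architecture and names here are illustrative, not taken from Grosse 2017 or Gong 2017), a Gong-style defense trains a separate binary detector on a mix of natural and adversarial inputs:

```python
import torch
import torch.nn as nn

# Illustrative binary detector for 28x28 single-channel (MNIST-sized) inputs.
detector = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 2),  # class 0 = natural, class 1 = adversarial
)

def detector_loss(x_natural, x_adversarial):
    """Cross-entropy on a mixed batch of natural and adversarial images."""
    x = torch.cat([x_natural, x_adversarial])
    y = torch.cat([torch.zeros(len(x_natural), dtype=torch.long),
                   torch.ones(len(x_adversarial), dtype=torch.long)])
    return nn.functional.cross_entropy(detector(x), y)
```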

Metzen 2017 go further and use internal convolutional activations as inputs to their detector. Yet this defense did little better than Gong 2017 against a zero-knowledge attack and utterly failed against perfect-knowledge and limited-knowledge attacks.

The second category is PCA and dimensionality reduction. Hendrycks and Gimpel 2017 apply PCA to the input images and notice that natural and adversarial examples have visibly different distributions over the later components. So they build a classifier that looks at these later, low-variance PCA components. The defense works against a zero-knowledge attack on MNIST. But this turns out to be mostly an artifact of the MNIST dataset, specifically that border pixels are almost entirely zero in natural MNIST images. That isn’t true of CIFAR, and there the defense fails.
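
A minimal sketch of this kind of statistic, assuming natural_images is an array of flattened natural training images; the component cutoff and the detection threshold are free parameters, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on flattened natural images, e.g. shape (n, 784) for MNIST.
pca = PCA().fit(natural_images)  # natural_images is assumed to exist

def trailing_component_energy(x, keep_leading=100):
    """Energy a single input puts into the trailing (low-variance) components."""
    coeffs = pca.transform(x.reshape(1, -1))[0]
    return np.sum(coeffs[keep_leading:] ** 2)

# Flag an input as adversarial if this energy exceeds a chosen threshold.
```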

Bhagoji 2017 use dimensionality reduction, throwing away all but the first 25 principal components. This is a successful defense in zero-knowledge scenarios. In perfect- and limited-knowledge scenarios, though, the approach is even less robust than a standard undefended CNN.
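
In sketch form the defense just projects every input onto the top 25 components before classification; train_classifier below is a hypothetical stand-in for whatever (sklearn-style) model is trained on the reduced inputs.

```python
from sklearn.decomposition import PCA

# Project onto the top 25 principal components of the natural training data.
pca_25 = PCA(n_components=25).fit(train_images_flat)        # assumed (n, d) array
reduced_train = pca_25.transform(train_images_flat)
classifier = train_classifier(reduced_train, train_labels)  # hypothetical helper

def predict(x):
    """Classify a single flattened input after dimensionality reduction."""
    return classifier.predict(pca_25.transform(x.reshape(1, -1)))
```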

Li and Li 2016 use a scheme that involves applying PCA to the convolutional layers of the classifier, then passing the result into a cascade of linear SVMs. This defense fails in the zero-knowledge scenario (mostly because of extremely high false positive rates).
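
A very rough sketch of the cascade idea, glossing over how the per-layer filter statistics are actually computed in Li and Li 2016; pca_per_layer and svm_per_layer are assumed to be already-fitted sklearn-style stages.

```python
def cascade_detect(conv_activations, pca_per_layer, svm_per_layer):
    """conv_activations: list of per-layer feature arrays for one input.
    The input counts as natural only if every SVM stage accepts it."""
    for acts, pca, svm in zip(conv_activations, pca_per_layer, svm_per_layer):
        stats = pca.transform(acts.reshape(1, -1))
        if svm.predict(stats)[0] == 1:   # 1 = flagged adversarial at this stage
            return True                  # rejected: treated as adversarial
    return False                         # passed every stage: treated as natural
```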

The third category is classical statistical approaches that look at the distribution of adversarial examples. Grosse 2017 use a Maximum Mean Discrepancy (MMD) test, but this fails in the zero-knowledge case. Somewhat more promising is Feinman 2017’s kernel density estimation (KDE) approach, which uses activations from the final hidden layer of the classifier. On MNIST the zero-knowledge attack fails, and interestingly even the perfect-knowledge attack fails, which the authors note is similar to Goodfellow 2014’s results with RBF networks. The defense does not work on CIFAR, though.
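
Here is a minimal sketch of a KDE score of this flavor, assuming z_train_same_class holds final-hidden-layer activations of training points from the predicted class; the bandwidth and the decision threshold are free parameters.

```python
import numpy as np

def kde_score(z_test, z_train_same_class, bandwidth=1.0):
    """Gaussian-kernel density of a test activation (shape (d,)) under the
    training activations of the predicted class (shape (n, d))."""
    diffs = z_train_same_class - z_test
    sq_dists = np.sum(diffs ** 2, axis=1)
    return np.mean(np.exp(-sq_dists / bandwidth ** 2))

# A density below a chosen threshold flags the input as adversarial.
```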

The fourth and final category of defense is randomization and blur. Feinman 2017 again mount the strongest defense with a method called Bayesian neural network uncertainty. This is based on the simple idea of adding randomness (in this case with dropout), hoping that classification of natural images will be impacted less than adversarial inputs, which certainly seems plausible. The defense turns out to work against zero-knowledge attacks, and even does reasonably well against a first attempt at a perfect-knowledge attack (which used a similar loss function adjustment to the attack against Gong 2017). A second perfect-knowledge attack that sampled several networks with different dropout randomness did succeed, but required the largest input perturbations (visible on MNIST, though not on CIFAR). Limited-knowledge attacks were similarly difficult. From the paper: “We consider this the only defense we evaluate that is not completely broken” — high praise.
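
A sketch of the dropout-uncertainty measure, assuming model contains dropout layers and is left in train mode so dropout stays active at test time; the uncertainty expression follows the mean-squared-norm-minus-norm-of-mean form reported by Feinman 2017, everything else is illustrative.

```python
import torch

def dropout_uncertainty(model, x, n_samples=30):
    """Run the network several times with dropout active and measure how
    much the softmax outputs disagree. x: a batch of inputs."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])   # (T, B, classes)
    mean = preds.mean(dim=0)
    return (preds ** 2).sum(dim=-1).mean(dim=0) - (mean ** 2).sum(dim=-1)

# Natural inputs tend to score low, adversarial inputs high; threshold to detect.
```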

Li and Li 2016 also propose a 3x3 blur over the input. Zero-knowledge attacks needed to use high-confidence adversarial examples (high κ), which increases distortion. Perfect-knowledge attacks defeat this defense without additional distortion.
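
As a sketch, the blur defense amounts to a depthwise 3×3 mean filter applied before the classifier; the exact filter used by Li and Li 2016 may differ.

```python
import torch
import torch.nn.functional as F

def blur_then_classify(model, x):
    """x: (B, C, H, W) image batch; average each pixel with its 3x3 neighborhood."""
    c = x.shape[1]
    kernel = torch.full((c, 1, 3, 3), 1.0 / 9.0, dtype=x.dtype, device=x.device)
    blurred = F.conv2d(x, kernel, padding=1, groups=c)  # depthwise mean blur
    return model(blurred)
```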

From these evaluations it seems that randomization is most promising, while detection networks are most easily bypassed. Statistical operations on raw pixels don’t work well, which shouldn’t be too surprising since the whole point of using convolutional networks on images is to get at more complex macro-level features (KDE was somewhat successful, but this is because it used the last hidden layer). Other takeaways: MNIST is not a good indicator of performance on more complex datasets. Also, robustness to simple attacks like the fast gradient sign method and JSMA should not be considered sufficient; rather, defenders should validate their approaches against stronger attacks. So there you have it. Attackers seem to have the upper hand at this point.

Bhagoji et al 2017 “Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers” https://arxiv.org/abs/1704.02654

Biggio et al 2013 “Evasion attacks against machine learning at test time” https://arxiv.org/abs/1708.06131

Carlini and Wagner 2017 “Towards evaluating the robustness of neural networks” https://arxiv.org/abs/1608.04644

Feinman et al 2017 “Detecting Adversarial Samples from Artifacts” https://arxiv.org/abs/1703.00410

Gong et al 2017 “Adversarial and Clean Data Are Not Twins” https://arxiv.org/abs/1704.04960

Goodfellow et al 2014 “Explaining and harnessing adversarial examples” https://arxiv.org/abs/1412.6572

Grosse et al 2017 “On the (Statistical) Detection of Adversarial Examples” https://arxiv.org/abs/1702.06280

Hendrycks and Gimpel 2017 “Early Methods for Detecting Adversarial Images” https://arxiv.org/abs/1608.00530

Li and Li 2016 “Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics” https://arxiv.org/abs/1612.07767

Metzen et al 2017 “On Detecting Adversarial Perturbations” https://arxiv.org/abs/1702.04267

Papernot et al 2016 “The limitations of deep learning in adversarial settings” https://arxiv.org/abs/1511.07528
