Paper Summary: Towards Deep Learning Models Resistant to Adversarial Attacks

Mike Plotz Sage
6 min read · Nov 30, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/29.

Towards Deep Learning Models Resistant to Adversarial Attacks (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu

This paper is an attempt to cast previous approaches to the problem of adversarial examples into a single, somewhat more formal framework, arguing that all possible strong attacks that use gradient information (e.g. those developed by Carlini and Wagner 2017) are roughly equivalent to projected gradient descent (PGD), and that this attack is in some sense universal. The authors also demonstrate empirical robustness results and relate these results to questions of model capacity and transferability. They have posted public challenges at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.

This is in some sense a philosophical paper, in that — in my view — the main contribution is the attempt to define and formalize the current paradigm. The authors observe that existing approaches, both attacks and defenses, take on some part of a single minimum-maximum (“saddle point”) nested optimization problem. Naive classification is normally construed as empirical risk minimization: just minimizing the expected loss for input-output pairs drawn from a data distribution D, which ceases to perform well once inputs are chosen adversarially. To analyze the adversarial scenario, the authors define an attack model: given an input x the attack model defines a set S of allowed perturbations that an adversary can choose from (in this case an ℓ∞ ball — i.e. a hypercube centered at x). Since adversaries can choose any perturbation in S, it now makes sense to reconsider risk minimization as the nested saddle point problem
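
Spelled out in the paper's notation, with ρ(θ) denoting the adversarial risk:

$$\min_\theta \rho(\theta), \qquad \rho(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{\delta \in S} L(\theta,\, x + \delta,\, y)\Big]$$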

Here L is a loss function parameterized by θ. In words: find the parameters θ that minimize the expected adversarially perturbed loss, which is itself found by maximizing the loss with respect to the perturbation δ. One thing to note, and this is not specifically addressed in the paper, is that x and y are still drawn from the data distribution D, and so this might not be the relevant threat model for real-world attacks.

The above is a unified view of the problem space, the authors argue. The fast gradient sign method (FGSM), for example, is a single step of the inner optimization. FGSM^k (FGSM iterated k times, with projection back onto S), i.e. PGD, is merely a stronger inner optimizer. Adversarial training on FGSM-generated examples, as a defense, approximately optimizes the outer minimization — and strengthening the quality of the adversarial training examples is a natural next step.
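
To make the attacks concrete, here is a minimal PyTorch sketch of FGSM and of PGD as the iterated inner maximizer. This is my own illustration rather than the authors' code; the model, ε (perturbation budget), α (step size), and step count are placeholders, and pixel values are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step attack: one signed-gradient step of size eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha, steps, random_start=True):
    """FGSM iterated, projecting back onto the l-infinity ball of radius eps."""
    x_adv = x.clone().detach()
    if random_start:
        x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step on the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the l-inf ball (the set S)
            x_adv = x_adv.clamp(0, 1)                              # keep pixels in a valid range
    return x_adv.detach()
```

The projection step is what keeps the perturbation inside the allowed set S; without it this would just be unconstrained gradient ascent on the loss.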

The authors now turn to solving the saddle point optimization, using MNIST and CIFAR as test cases. Contra prior claims that the inner maximization is intractable (Huang 2015, Shaham 2015), the authors present empirical evidence that, while global maxima cannot in general be found, the local maxima PGD reaches are all of similar quality. (This parallels the now-conventional wisdom that, more generally, local minima in deep neural nets tend to be of similar quality.) Starting PGD from 1e5 random points in the ℓ∞ ball yielded a tightly concentrated distribution of final loss values with no observed outliers. This doesn't rule out the existence of points with larger losses, but it suggests that any such points are hard to reach with first-order methods.
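
A rough sketch of that experiment, reusing the pgd function from the snippet above (a trained model and a single test pair x, y are assumed; the ε and step settings are in the ballpark of the paper's MNIST setup, and the restart count is shrunk for illustration):

```python
losses = []
for _ in range(1000):  # the paper restarts on the order of 1e5 times
    x_adv = pgd(model, x, y, eps=0.3, alpha=0.01, steps=40, random_start=True)
    with torch.no_grad():
        losses.append(F.cross_entropy(model(x_adv), y).item())
# The paper's observation: a histogram of `losses` is tightly concentrated,
# i.e. the local maxima PGD finds are all of roughly the same quality.
```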

It’s another question entirely whether it’s reasonable to use gradient information from the inner (maximization) problem as a valid update for the outer (minimization) problem. That is, can we in practice use adversarial examples as training points to learn a robust classifier? And is this the correct choice in principle? This does work out empirically, but there’s also a principled argument — Danskin’s theorem says that the gradient of the inner maximum is simply the gradient of the loss evaluated at a maximizing perturbation, so that gradient can be used directly to optimize the saddle point problem. Strictly speaking Danskin requires the loss to be continuously differentiable (and ReLU and max pooling break this assumption), but this does look like a promising direction (the details are in appendix A).
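
In code, the resulting adversarial training loop is short. Again a sketch under the same assumptions (placeholder model, optimizer, and train_loader), reusing the pgd function from the earlier snippet:

```python
# Approximate the saddle point: solve the inner max with PGD, then take a
# gradient step on the loss evaluated at that adversarial point. Danskin's
# theorem is what justifies treating this as a descent direction for the
# outer minimization over theta.
for x, y in train_loader:
    x_adv = pgd(model, x, y, eps=0.3, alpha=0.01, steps=40)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # loss at the (approximate) inner maximizer
    loss.backward()                          # gradients flow to theta only; x_adv is detached
    optimizer.step()
```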

A major result in this paper is the effect of model capacity on the ability to learn robust classifiers. The claim is that classifiers need significantly more capacity to be robust to adversarial examples than they need to classify natural examples well. I found the simplified illustration in figure 3 helpful:

The left image includes just (benign) training examples, in which case the decision boundary is simple, here just a straight line. The middle image adds regions of allowed adversarial perturbation, some of which extend past the simple decision boundary, leading to adversarial examples (red stars). The robust boundary (red squiggle, right) that separates the perturbation regions must be more complex. This formulation does assume that ℓ∞ balls of different classes never overlap, and thus that a good robust boundary can be drawn at all.

The authors looked at models of varying capacity and different levels of adversarial training (none, FGSM only, and PGD). The conclusions they drew were:

  • Capacity alone helps (a little bit)
  • FGSM training only helps against FGSM-generated adversarial examples, not against stronger attacks: models with enough capacity simply overfit to the FGSM perturbations; and FGSM adversarial training hurts smaller models’ accuracy on natural images
  • Small-capacity models can’t classify natural images at all when trained with PGD adversaries (large-capacity models can)
  • Both capacity and strong adversarial training protect against the black-box transfer of adversarial examples.

Following are the results with PGD adversarial training. On MNIST (natural accuracy 98.8%), white-box attacks reduced accuracy to 89.3%, and black-box attacks were less successful (95.7% accuracy). On CIFAR (natural accuracy 87.3%) the results were 45.8% (white-box) and 64.2% (black-box). Of note, the robustness guarantees only hold for perturbations up to a certain size and for the ℓ∞ norm. To test these limits the authors tried different values of ε (perturbation size) and also ℓ2 perturbations; in the resulting plots, accuracy (for MNIST especially) drops off quite quickly as ε grows.
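
For completeness, here is the kind of ε sweep behind those plots, sketched with hypothetical ε values, a made-up step-size heuristic, and placeholder model and test_loader, again reusing the pgd function from above:

```python
for eps in [0.1, 0.2, 0.3, 0.4]:  # hypothetical perturbation budgets
    correct, total = 0, 0
    for x, y in test_loader:
        x_adv = pgd(model, x, y, eps=eps, alpha=eps / 10, steps=40)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
            total += y.size(0)
    print(f"eps={eps:.2f}  adversarial accuracy={correct / total:.3f}")
```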

The appendices are quite extensive. I found the most interesting to be appendix C, on inspecting the MNIST model’s convolutional layers. In the robust models the first convolutional layer had only three active filters! This amounts to making three copies of the input image, thresholded at three different values. Another unusual feature of the robust models was that the softmax output biases were skewed, presumably representing a reluctance to predict classes that are more vulnerable to being targeted by adversarial inputs. The authors point out that these (completely learned) features seem like reasonable approaches a human engineer might take to make a network more robust, but that when they tried building these features in by hand, the resulting network remained thoroughly vulnerable to attack.
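
To see why three active first-layer filters “amount to” thresholding, here is a toy construction of my own (simplified to 1×1 kernels and made-up threshold values, not the paper’s learned weights): a filter that passes each pixel through, combined with a negative bias and a ReLU, zeroes out everything below a threshold.

```python
import torch
import torch.nn as nn

# With weight 1 and bias -t, a filter followed by ReLU computes max(x - t, 0):
# a copy of the input with everything below the threshold t zeroed out.
# Three such filters give three thresholded copies of the image.
thresholds = [0.2, 0.5, 0.8]                    # hypothetical threshold values
conv = nn.Conv2d(1, 3, kernel_size=1, bias=True)
with torch.no_grad():
    conv.weight.fill_(1.0)                      # each filter just passes the pixel through...
    conv.bias.copy_(-torch.tensor(thresholds))  # ...shifted down by its threshold
x = torch.rand(1, 1, 28, 28)                    # fake MNIST-sized input in [0, 1]
out = torch.relu(conv(x))                       # shape (1, 3, 28, 28): three thresholded copies
```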

I really like this paper, especially the attempt to ground the problem in a more solid theory. I am concerned, as I noted above, that the expected-value threat model doesn’t quite match the worst-case nature of real-world adversarial environments. I also wonder how reasonable the ℓ∞ norm is as a metric, and to what extent PGD-trained networks are robust to more sophisticated distortions — those that look normal to a human eye, yet are very large when measured by the ℓ∞ norm — my guess is not at all. [Update: it looks like this paper had a similar objection.]

Carlini and Wagner 2017 “Towards evaluating the robustness of neural networks” https://arxiv.org/abs/1608.04644

Huang et al. 2015 “Learning with a strong adversary” https://arxiv.org/abs/1511.03034

Shaham et al. 2015 “Understanding adversarial training: Increasing local stability of neural nets through robust optimization” https://arxiv.org/abs/1511.05432
