Paper Summary: Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/23.

Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks (2015) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, Ananthram Swami

I read this one so you don’t have to. It’s not that there’s nothing of value in the paper; it’s more that the argument is unfocused, oddly precise where it doesn’t need to be and handwavy elsewhere, and more repetitive than most ML papers. I’ll try to pick out the good parts.

The rough idea is that existing fast methods for finding adversarial examples (§II.B) rely on looking at the gradient and either adjusting all pixels by a small amount (Goodfellow 2015) or adjusting a small number of salient pixels by a large amount (Papernot 2015a). These attacks are most effective when the gradient is large. There are theoretical reasons to expect that distilling a network (more on what this is in a minute) will tend to reduce gradients, encouraging the output to be nearly flat close to training examples. The authors tested this on MNIST and CIFAR10 classifiers, showing strong reductions in the success rate of adversarial example crafting (from 95.89% to 0.45% and from 87.89% to 5.11%, respectively), very little loss in accuracy, and increased robustness.
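To make the "adjust all pixels by a small amount" attack concrete, here is a minimal sketch of the sign-of-the-gradient step from Goodfellow 2015. It assumes a tiny logistic-regression model rather than a deep network, and the weights, input, and eps value are made up purely for illustration; none of them come from the paper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Shift every input coordinate by eps in the direction that increases the loss."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions, not from the paper).
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.2, 0.1, 0.3], 1

x_adv = fgsm(w, b, x, y, eps=0.15)
p_clean = predict(w, b, x)      # above 0.5: classified as class 1
p_adv = predict(w, b, x_adv)    # below 0.5: the prediction has flipped
```

Note that in this sketch a gradient component of exactly zero leaves that coordinate untouched, which hints at why a defense that flattens gradients makes life harder for this style of attack.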

What’s this distillation thing? Originally intended to compress models to run on cheaper hardware, distillation (§II.C) involves a typically smaller network learning another network’s output function. But rather than training on hard class labels (0 or 1), the distilled network is trained to predict the class probabilities generated by the first network. The full probability distribution contains more information than a single label, so intuitively we should be able to take advantage of this. The authors give this example from MNIST: a training example that has high probability for both the 1 and 7 labels “indicates that 7s and 1s look similar and intuitively allows a model to learn the structural similarity between the two digits.” (Emphasis in original.) In this paper, rather than compressing the model into a smaller network, they train a distilled network of the same size as the original.
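Here is a small sketch of why soft labels carry more information than hard ones. It assumes a cross-entropy loss and a hypothetical 3-class problem; the teacher and student probabilities are invented for illustration. Two students that are equally confident in the true class look identical under the hard label, but the soft label prefers the one that agrees with the teacher about which wrong class is the plausible one.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical classes, in order: "1", "7", "3". All numbers are made up.
hard_label = [1.0, 0.0, 0.0]       # one-hot label for a handwritten 1
soft_label = [0.6, 0.35, 0.05]     # teacher output: this 1 looks a lot like a 7

student_a = [0.6, 0.35, 0.05]      # spreads the remaining mass like the teacher
student_b = [0.6, 0.05, 0.35]      # same confidence in "1", but favors "3" over "7"

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)   # equal to hard_a
soft_a = cross_entropy(soft_label, student_a)
soft_b = cross_entropy(soft_label, student_b)   # larger than soft_a
```

The hard-label loss only looks at the probability assigned to the true class, so it cannot tell the two students apart; the soft-label loss can.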

There’s a further detail to distillation, which is that, during training, the softmax layer includes a temperature parameter T:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

Here the z are the activations of the last layer before the softmax. T = 1 is equivalent to standard softmax. Increasing T has the effect of pushing the resulting probability distribution closer to uniform. At test time the temperature is set back to 1. There’s an argument in §IV.B that higher temperatures also tend to flatten out the Jacobian, which is the real motivation behind this approach. The authors used T = 20.
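A quick sketch of the temperature effect, assuming the usual definition where each activation is divided by T before exponentiating. The logits below are hypothetical; only T = 20 comes from the paper.

```python
import math

def softmax(z, T=1.0):
    """Softmax over activations z with temperature T (T = 1 is the standard softmax)."""
    scaled = [zi / T for zi in z]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 0.0]                     # hypothetical last-layer activations

p_t1 = softmax(logits, T=1.0)                # sharply peaked on the first class
p_t20 = softmax(logits, T=20.0)              # much closer to uniform
```

With these logits the T = 1 output puts nearly all its mass on the first class, while at T = 20 the three probabilities end up within a few tenths of each other.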

There are two other main lines of discussion in the paper. One is on a robustness metric (§III.A) for neural nets, building on Fawzi 2015, which defines robustness as the expected minimum distance to an adversarial example (the expectation is over points drawn from the distribution being modeled). Defensive distillation improved robustness by 7.9x and 5.6x for the networks tested, according to this metric.
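As a rough illustration of the metric, here is a sketch that swaps in a linear binary classifier, where the minimum L2 distance to an adversarial example has a closed form (the distance to the decision boundary). This is an assumption made to keep the example exact; a deep network has no such formula, so the distance would have to be estimated, for example from the perturbations an attack actually finds. The weights and sample points are hypothetical.

```python
import math

def min_adversarial_distance(w, b, x):
    """L2 distance from x to the decision boundary w.x + b = 0 of a linear classifier."""
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(margin) / norm

def empirical_robustness(w, b, samples):
    """Average minimum distance over samples, standing in for the expectation."""
    return sum(min_adversarial_distance(w, b, x) for x in samples) / len(samples)

# Hypothetical classifier and sample points (assumptions, not from the paper).
w, b = [3.0, 4.0], -5.0
samples = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]

rho = empirical_robustness(w, b, samples)    # mean of the distances 4, 1 and 2
```

A larger value means an attacker has to move a typical input further before the label changes.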

The other line of discussion (§IV.C) invokes the stable learning theory of Shalev-Shwartz 2010, arguing that the distilled network satisfies the criteria for being an “asymptotic empirical risk minimizer” and for having “epsilon(n)-stability”. There’s a theorem in Shalev-Shwartz about optimizers with these properties that shows that they minimize generalization error in some optimal way. It’s not clear to me how this is connected to the more prosaic and squishy notion of generalization we usually talk about, and indeed the authors admit that this discussion is mostly for building intuition.

So this was another example of a defense against adversarial examples. I suspect that the downsides of the distillation approach are downplayed in this paper, and that there are ways around this particular defense. I’m sure we’ll be reading more about this in the upcoming selection of papers.

Fawzi, Fawzi, Frossard 2015 “Analysis of classifiers’ robustness to adversarial perturbations”

Goodfellow et al 2015 “Explaining and harnessing adversarial examples”

Papernot et al 2015a “The limitations of deep learning in adversarial settings”

Shalev-Shwartz et al 2010 “Learnability, stability and uniform convergence”