Paper Summary: Adversarial Spheres

Mike Plotz Sage
6 min read · Nov 30, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/30.

Adversarial Spheres / The Relationship Between High-Dimensional Geometry and Adversarial Examples (2018) Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, Ian Goodfellow

Of the adversarial examples papers I’ve summarized this month, this is the most out there. And in some ways the most exciting. It’s not about getting concrete results on real-world datasets. Instead the authors look at a contrived synthetic dataset, both analytically and empirically, and show a fundamental — and model-independent — limit on robustness. The claim is that while previous work in the space has looked for flaws in the models we use, a more parsimonious explanation of adversarial examples lies in the nature of high-dimensional data in the presence of any classification errors. The problem is just another manifestation of the curse of dimensionality.

They consider two metrics of interest for a data distribution p(x) with error set E (the set of points that are misclassified by some classifier). The first metric is µ(E), the measure of E over the data distribution (in other words, the misclassification rate). The second is d(E), the average l2 distance from points in p(x) to the nearest error. Surprisingly, or perhaps not so surprisingly since we’ve been studying this for a while, even for very accurate models most data points are very close to an error. That is, even when µ(E) is very small, d(E) is also small. This, they claim, is the problem of adversarial examples.
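To make these two quantities concrete, here's a rough numpy sketch of how you might estimate them by brute force. The function name is mine, and using the nearest *sampled* error as a stand-in for the true nearest error can only overestimate d(E):

```python
import numpy as np

def estimate_metrics(f, X, y):
    """Crude Monte Carlo estimates of mu(E) and d(E).

    f maps an (m, n) array of points to m predicted labels; X holds m
    samples from p(x) and y their true labels. mu(E) is the empirical
    error rate. d(E) is approximated by the mean l2 distance from each
    point to the nearest sampled error (0 for points that are themselves
    errors), which upper-bounds the distance to the true nearest error.
    """
    wrong = f(X) != y                       # membership in the error set E
    mu = wrong.mean()
    if not wrong.any():                     # no errors found in this sample
        return mu, np.inf
    errors = X[wrong]
    # all-pairs distances to the sampled errors; fine for modest m
    dists = np.linalg.norm(X[:, None, :] - errors[None, :, :], axis=-1)
    return mu, dists.min(axis=1).mean()
```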

The dataset is about as simple as it gets: it consists of two concentric spheres in R^n, an inner sphere of radius 1 and an outer sphere of radius R = 1.3 (the choice of R was arbitrary and not deeply explored). Points are labeled with y = 0 for the inner sphere and y = 1 for the outer. This dataset has the attractive properties of being uniformly distributed over the data manifold (not quite true, since the inner sphere has higher probability density) and easy to sample from (sample from an n-dimensional multivariate Gaussian and project to one of the spheres). Additionally the problem can be made more or less difficult by varying n.
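Here's a quick numpy sketch of that sampling recipe (R = 1.3 and the 50/50 label split are from the paper; everything else, including the function name, is my own):

```python
import numpy as np

def sample_spheres(m, n=500, R=1.3, seed=0):
    """Sample m labeled points from the concentric-spheres dataset: draw
    from an n-dimensional standard Gaussian, normalize onto the unit
    sphere, then push half of the points out to radius R."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m, n))
    x = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the unit sphere
    y = rng.integers(0, 2, size=m)                    # 0 = inner sphere, 1 = outer
    x[y == 1] *= R
    return x, y

X, y = sample_spheres(10_000)
print(np.unique(np.round(np.linalg.norm(X, axis=1), 3)))  # radii: [1.0, 1.3]
```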

Some prior work held the hopeful view that detecting off-manifold input would be sufficient to protect against adversarial examples. Responding to this, the authors wanted to focus entirely on points on the data manifold when finding adversarial examples. Their manifold attack consists of maximizing the loss L subject to the constraint that the perturbed point x̂ should have the same radius as the initial point x. The now-familiar solution to this is projected gradient descent (PGD): take a step in the direction of the gradient, and then project back to the sphere. Note that this doesn't necessarily find the closest error point, so this approach can only give (an estimate of) an upper bound for d(E).
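In PyTorch the loop might look like the sketch below. The step size and iteration count are my guesses rather than the paper's settings, and I'm assuming a model that outputs one logit per point (call model.eval() first so batch norm uses its running statistics):

```python
import torch
import torch.nn.functional as F

def manifold_attack(model, x, y, step_size=0.01, steps=1000):
    """Maximize the loss while staying on the data manifold: take a
    gradient ascent step, then rescale back to the starting radius."""
    radius = x.norm(dim=1, keepdim=True)
    x_hat = x.clone()
    for _ in range(steps):
        x_hat = x_hat.detach().requires_grad_(True)
        logits = model(x_hat).squeeze(1)
        loss = F.binary_cross_entropy_with_logits(logits, y.float())
        grad, = torch.autograd.grad(loss, x_hat)
        with torch.no_grad():
            x_hat = x_hat + step_size * grad                          # ascend the loss
            x_hat = radius * x_hat / x_hat.norm(dim=1, keepdim=True)  # project to sphere
    return x_hat.detach()
```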

The authors consider models that minimize sigmoid cross-entropy loss, so, binary classifiers. The first of these is a ReLU network (two 1000-d hidden layers with batch norm) trained with Adam on 50 million randomly sampled datapoints, with n = 500. The model was evaluated on 10 million points on each sphere (so, 2e7 points total) and no errors were found. Nevertheless, a manifold attack found adversarial points on the data manifold!
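For reference, here's my reading of that architecture as a PyTorch sketch; the widths and batch norm are from the description above, anything beyond that is a guess:

```python
import torch
import torch.nn as nn

n = 500
relu_net = nn.Sequential(
    nn.Linear(n, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.BatchNorm1d(1000), nn.ReLU(),
    nn.Linear(1000, 1),   # single logit, trained with sigmoid cross-entropy
)
optimizer = torch.optim.Adam(relu_net.parameters())
```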

To see what’s going on, the authors visualize slices of the decision boundary (see above). Random slices all look nearly perfectly circular (left), but choosing an “adversarial direction” — a basis vector in the direction of a point found with the manifold attack — shows a highly distorted shape that extends well past the outer sphere (center diagram). The right diagram slices in two adversarial directions. It is simultaneously the case that the decision boundary is nearly perfectly spherical in nearly all directions and yet has severe distortions.

How frequent or dense are these distortions? As a baseline, to get an idea of how far apart typical points on the sphere are, they sampled 10,000 random points. The closest pair had an l2 distance of 1.25 (I was curious, so I replicated the result with 1000 points — see this gist). The mean distance to the nearest error is only d(E) = 0.18. So the decision boundary is mostly spherical, but practically bristles with these spiky distortions. This only happens in high dimensions — when n < 100 the attack fails.
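The baseline computation itself is only a few lines of numpy/scipy, roughly this:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 500))
x = z / np.linalg.norm(z, axis=1, keepdims=True)  # 1000 points on the unit sphere
print(pdist(x).min())  # closest pair; a typical pair is near sqrt(2) ≈ 1.41
```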

Next the authors look at a “quadratic network” with a single layer that uses quadratic activations. They chose this unusual network because its decision boundary can be shown to be an ellipsoid centered at the origin (details in Appendix C), which means that it’s possible to know analytically whether any adversarial examples exist. Specifically, there are no errors if and only if the axis coefficients αi are all in the range [1/R^2, 1].
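In code the check is trivial, assuming (as the condition suggests) that the boundary is normalized to Σ αi xi^2 = 1: on the unit sphere the statistic Σ αi xi^2 ranges over [min αi, max αi], and on the outer sphere it's scaled by R^2. The function name is mine:

```python
import numpy as np

def separates_perfectly(alpha, R=1.3):
    """The ellipsoid sum_i alpha_i * x_i**2 = 1 classifies both spheres
    perfectly iff every axis coefficient lies in [1/R**2, 1]."""
    alpha = np.asarray(alpha)
    return alpha.min() >= 1 / R**2 and alpha.max() <= 1
```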

Training on 50 million points resulted in a perfect solution, so they limited the training set to 1 million points. This resulted in no empirical errors on 10 million samples, and an analytically derived error rate estimate of 1e-20. This model had 325 of the 500 αi out of range, yet showed near-perfect accuracy. And of course they were able to find adversarial examples. It is amusing to read the discussion in Appendix E entitled “A Model Can Be Extremely Accurate While Ignoring Most of the Input”. Some highlights: “… if n = 2000 then the model can obtain an estimated accuracy of 1e-14 while ignoring 62% of the input.” And “the size of a network required to achieve very low test error may be significantly smaller than the size of a network required to achieve 0 test error.” This tracks with Madry 2017’s results on robustness and model capacity, though perhaps the specific reasons are different.
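I believe the analytic estimate works roughly like this, though what follows is my own back-of-the-envelope rather than the paper's exact derivation: write a uniform point on a radius-r sphere as r·z/||z|| with z standard Gaussian, note that an inner-sphere point is misclassified exactly when Σ (αi - 1) zi^2 > 0 (and an outer-sphere point when Σ (R^2 αi - 1) zi^2 < 0), and apply a central limit approximation to those sums.

```python
import numpy as np
from scipy.stats import norm

def estimated_error_rate(alpha, R=1.3):
    """CLT-style estimate of the quadratic network's error rate, assuming
    the decision boundary sum_i alpha_i * x_i**2 = 1 and a 50/50 mix of
    the two spheres. Each sum of independent scaled chi-square terms is
    approximated by a Gaussian with matching mean and variance
    (Var[z_i**2] = 2)."""
    alpha = np.asarray(alpha, dtype=float)

    def prob_positive(c):
        # P(sum_i c_i * z_i**2 > 0) under the Gaussian approximation
        return norm.sf(-c.sum() / np.sqrt(2 * (c**2).sum()))

    p_inner = prob_positive(alpha - 1)             # inner point pushed outside
    p_outer = 1 - prob_positive(R**2 * alpha - 1)  # outer point pulled inside
    return 0.5 * (p_inner + p_outer)
```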

This brings us to the main result of the paper, a fundamental bound on d(E) in terms of µ(E), independent of the classifier model. The theorem, proved in Appendix F, says roughly that for a fixed error rate on the sphere dataset the distance to the nearest error is at most O(1/√n) — and this distance goes down (in a way involving the inverse normal cdf function) as the error rate goes up. The intuition is that the most efficient way to distribute the error set E is in a spherical cap — see diagram below — and that the cap has to go nearly all the way to the equator in high dimensions.

This is another one of those counterintuitive results of high-dimensional geometry. Most of the points on the sphere lie close to the equator — in high dimensions very close. Nearly all points in the hemisphere containing the spherical cap (in red) lie between the equator (dashed line) and the blue line. (In the figure, µ(E) = 0.01, so exactly 99% of points lie below the blue line.) Interestingly, all architectures tested were fairly close to the optimal d(E). On the spheres dataset, the only way to improve adversarial robustness is to massively reduce the classification error.
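To put rough numbers on this, here's my own approximation of the cap picture, using the fact that a single coordinate of a uniform point on the unit n-sphere is approximately N(0, 1/n) for large n. A cap of measure µ then starts at height about Φ⁻¹(1 − µ)/√n above the equator, and since typical points hug the equator, that height is roughly the furthest any point can be from an error set of measure µ:

```python
import numpy as np
from scipy.stats import norm

def cap_height(mu, n, R=1.0):
    """Approximate height above the equator at which a spherical cap of
    measure mu begins, on a radius-R sphere in n dimensions (treating one
    coordinate of a uniform point as roughly N(0, R**2 / n))."""
    return R * norm.isf(mu) / np.sqrt(n)

print(cap_height(0.01, 500))  # ~0.10: a 1% error rate puts errors only ~0.1 away
print(cap_height(1e-7, 500))  # ~0.23: even at a 1e-7 error rate, only ~0.23 away
```

Those numbers are in the same ballpark as the measured d(E) = 0.18 above, which (roughly) squares with the claim that the trained models sit near the optimal d(E) for their error rates.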

A question remains: does the theorem above apply to real datasets? The spherical dataset is really quite difficult in one important sense: its PCA components all explain the same amount of variance, whereas e.g. in MNIST a few major components explain most of the variance in the data. It's possible with MNIST to choose an "error set" E that has a comparatively high d(E), just by choosing a half-space whose boundary hyperplane is perpendicular to the top PCA component.
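Here's a sketch of that construction on an arbitrary data matrix. I'm using a random anisotropic stand-in so the snippet runs on its own; you'd substitute the actual MNIST images to reproduce the comparison:

```python
import numpy as np

def halfspace_d_of_E(X, mu=0.01):
    """Take E = {x : <x - mean, v> > t}, where v is the top principal
    component and t is chosen so a fraction mu of the data lands in E.
    For a point outside E the nearest point of E is its orthogonal
    projection onto the boundary hyperplane, so d(E) is exact here."""
    Xc = X - X.mean(axis=0)
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]  # top principal component
    proj = Xc @ v
    t = np.quantile(proj, 1 - mu)         # threshold capturing measure mu
    gaps = np.clip(t - proj, 0, None)     # 0 for points already inside E
    return gaps.mean()

# anisotropic random stand-in for MNIST, just so this runs end to end
X = np.random.default_rng(0).standard_normal((5000, 784)) * np.linspace(5.0, 0.1, 784)
print(halfspace_d_of_E(X))  # large relative to the variance in most directions
```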

This indicates that there should be room to improve robustness on MNIST, which aligns with Madry 2017's strong results. So it seems the answer is mostly no: there's no clear way to apply the theorem directly, which suggests that for specific models on specific datasets the difficulty of ensuring robustness might lie elsewhere.

So this is an exciting paper. It gives me a stronger intuition around the difficulties of worst-case optimization in high-dimensional space and nicely contextualizes the adversarial examples problem. I’d be very curious to see more work on the applicability of theorems like the one in this paper to real-world datasets.

Madry et al. 2017 "Towards deep learning models resistant to adversarial attacks" https://arxiv.org/abs/1706.06083
