Paper Summary: Practical Black-Box Attacks against Machine Learning

Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/24.

Practical Black-Box Attacks against Machine Learning (2016) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, Ananthram Swami

There’s something appealing about seeing someone tear down their own work. Or at least publicly say oops. Or publicly imply oops. This is one such example, in which Papernot et al construct an attack that obviates a previous attempted defense against adversarial examples. It’s been observed that adversarial examples transfer well between neural classifiers that have been trained on similar data, but until now such attacks have been limited to either white-box attacks (where attackers have access to or intimate knowledge of the target system) or at the very least to large amounts of sufficiently similar training data. This paper breaks through that limitation with a new querying heuristic that efficiently extracts information about a classifier’s decision boundaries just by observing its label outputs.

The threat model the authors are considering is a black-box attack against a hosted classifier service. No knowledge about the architecture is assumed, only some idea of the domain of interest (e.g. recognizing handwritten digits). Additionally the number of queries to the target system should be limited, and this means the attacker needs to be strategic about choosing queries. The attack strategy is to train a substitute network on a small number of initial queries, then iteratively perturb inputs based on the substitute network’s gradient information (details below) to augment the training set.

This paper’s attack is based on a heuristic the authors call Jacobian-based Dataset Augmentation. The intuition is that we should be able to make educated guesses about which points to query based on current knowledge of the substitute network. A proxy for this is moving roughly in the direction of the gradient. The procedure goes like this: given a point x, get the oracle (target) label O(x). Then evaluate the Jacobian J of our substitute model and use the column of the Jacobian corresponding to O(x). Symbolically, the perturbed point is x + λ sign(J[O(x)]).

Which λ though? This had to be tuned (see §6.1 “Setting the step size”) to balance a training speed vs stability trade-off, ending on λ = 0.1. I also wondered whether the sign of the update was sensible; it turns out the authors thought of this too and they ended up periodically alternating between positive and negative values of λ. Incidentally, we don’t actually care about how accurate our substitute model is — in fact the authors found that model accuracies are uncorrelated with how well adversarial samples transferred — we only care about how well we approximate the target model’s decision boundaries.

To sum up, the Substitute DNN training algorithm is (1) acquire a small initial dataset, (2) select an architecture for the substitute network that’s sensible for the domain, and iterate the following: (3) query the oracle (target) network for labels to any unlabeled datapoints, (4) train the substitute network on current datapoints, and (5) augment the dataset with JbDA perturbations. To limit query count, new samples are drawn using reservoir sampling. Once the substitute network is trained, it can be used to craft adversarial examples with the standard approaches we’ve seen before — they used Goodfellow 2015.

The authors tried their approach on models hosted on MetaMind, Google, and Amazon and successfully generated adversarial examples against these models. To validate the feasibility of the attack in cases where the attacker doesn’t have access to the training distribution, the authors also tried a handcrafted set of MNIST-like digits. This still worked, but not as well as starting with MNIST digits. The fact that access to the original training data is valuable seems important for understanding the phenomenon as a whole.

The Substitute DNN attack turns out to work against logistic regression models, SVMs, decision trees, kNN, and distilled networks, which can all be thought of as “gradient masked” and therefore not amenable to a standard white-box attack. The authors tested the attack against the defensive distillation from Papernot 2015, which was successful. And which, as I noted in the intro paragraph, was oddly satisfying after reading the distillation paper.

Pretty cool that this is possible. Still, the paper doesn’t quite get at the real concern, which has to be in relation to a model of harm caused by adversarial attacks. Intuitively “adversarial” inputs should only matter to the extent that they “actually” represent a different class from the target model’s label. One such model of harm is that differences between what humans see and what algorithms see can be exploited, since e.g. humans will replace obviously wrong street signs, but won’t replace signs that have been adversarially altered until damage has been done. Other harm models are possible, and we’d be better served if researchers were explicit about their assumptions.

Goodfellow et al 2015 “Explaining and Harnessing Adversarial Examples”

Papernot et al 2015 “Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks”