Accurate, reliable and fast robustness evaluation

A novel class of adversarial attacks that can determine the robustness of machine learning models with greater accuracy and reliability.

Wieland Brendel
Bethgelab
6 min read · Jan 16, 2020

In this post, I describe a novel class of adversarial attacks (published at NeurIPS 2019 [1]) that is state-of-the-art across four Lp metrics (L0, L1, L2, L-infinity), both in terms of query efficiency and the size of the minimal perturbations it finds. The attack is very reliable even in the face of many gradient-masking issues and requires very little hyperparameter tuning. Since it works quite differently from other current gradient-based attacks, I suggest using it alongside any future evaluation of model robustness. Implementations are available in Foolbox Native, and it can be used in CleverHans 4 (beta) through the simple wrapper CleverFool; a simple usage tutorial is available here.
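
For readers who just want to try it, the snippet below is a minimal usage sketch based on Foolbox's native (PyTorch) API. The exact class names and call signature (here I assume Foolbox 3's PyTorchModel wrapper and the L2BrendelBethgeAttack class) may differ slightly between versions, so treat it as illustrative rather than as the canonical tutorial.

```python
import torchvision
import foolbox as fb

# Any differentiable PyTorch classifier works; a pretrained ResNet-18 is only a stand-in here.
model = torchvision.models.resnet18(pretrained=True).eval()
preprocessing = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], axis=-3)
fmodel = fb.PyTorchModel(model, bounds=(0, 1), preprocessing=preprocessing)

# A small batch of example images shipped with Foolbox, already scaled to [0, 1].
images, labels = fb.utils.samples(fmodel, dataset="imagenet", batchsize=8)

# The L2 variant of the Brendel & Bethge attack; L0, L1 and Linf variants exist as well.
attack = fb.attacks.L2BrendelBethgeAttack()

# epsilons=None asks for the smallest perturbation the attack can find for each image.
raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=None)

# Median L2 size of the minimal perturbations found on this batch.
print((clipped - images).flatten(1).norm(dim=1).median())
```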

Illustration. A common type of adversarial attack iteratively synthesizes an image that is as close as possible to a given clean image (e.g. a frog) while changing the decision of the model (e.g. from frog to crocodile). Note that this sequence is only for illustration; in practice, the crocodile image on the right would be perceptually very close to the clean frog image. Picture of frog-crocodile hybrid: worth1000.com.

It is well known by now that Deep Learning has a robustness problem. That is not only true in the sense of adversarial perturbations, the very small perturbations that derail the predictions of high-performance neural networks. It is also true for the process of evaluating the robustness of neural networks itself: usually, model robustness is measured as the mean size of the minimal adversarial perturbations that can be found against a given model. However, we can usually only approximate these minimal adversarial perturbations with optimisation techniques like gradient descent, leaving us wondering how close we actually are to the "true" minimum. There are many examples in the literature where a new defense was thought to increase robustness, only for it to turn out later that the evaluation procedure severely overestimated the robustness of the model [2]. This can happen quite easily: a bit of noise or some saturating elements can subtly alter the backpropagated gradients and prevent gradient-based adversarial attacks from properly finding small adversarial perturbations [2].
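
To make the measured quantity explicit, the robustness score referred to above can be written roughly as follows (my notation, not a formula taken from the paper): for a classifier $f$ and test samples $(x_i, y_i)$, $i = 1, \dots, N$,

$$\mathrm{robustness}(f) \;=\; \frac{1}{N}\sum_{i=1}^{N}\,\min_{\delta}\,\|\delta\|_p \quad \text{s.t.}\quad f(x_i + \delta) \neq y_i,\;\; x_i + \delta \in [0, 1]^d.$$

Since an attack can only ever exhibit some perturbation that flips the decision, every perturbation it reports is at least as large as the true minimum, which is exactly why a weak attack overestimates robustness.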

To make robustness evaluation more reliable, the community, including myself, has developed guidelines [3] to help researchers detect and prevent gradient masking and other issues that might lead to overestimated model robustness. But aside from that, we also need tools that can more reliably find the smallest possible adversarial perturbations. So far, existing gradient-based techniques all work in a similar way: they start somewhere close to the unperturbed image (say, an image of a frog, see Figure 1) and then follow the directions of steepest descent to find the closest image that is classified as a crocodile by the model (see Figure 1, right side). Since basically all gradient-based boundary attacks rely on this principle, they tend to be affected by the same gradient-masking issues. Hence, if one gradient-based attack fails, the others usually fail as well.

Figure 1. Almost all existing gradient-based adversarial attacks start close to the clean image (say again the frog image) and use the gradient to find the closest image for which the decision is changed (say to crocodile). Picture of frog-crocodile hybrid: worth1000.com.
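
To make the contrast with our approach concrete, here is a rough sketch of this standard recipe in PyTorch. It is a generic illustration, not the exact implementation of any particular published attack, and it assumes a model that takes a single image batch of shape (1, C, H, W) with pixel values in [0, 1] and returns logits.

```python
import torch

def descend_to_boundary(model, x, label, step=0.01, max_iters=200):
    """Generic sketch: start at the clean image and take small gradient steps
    that lower the true-class logit until the decision flips. Gradient masking
    around the clean image can silently break exactly this loop."""
    x_adv = x.clone()
    for _ in range(max_iters):
        x_adv = x_adv.detach().requires_grad_(True)
        logits = model(x_adv)
        if logits.argmax(dim=-1).item() != label:
            return x_adv.detach()            # decision changed: adversarial found
        loss = logits[0, label]              # push the true-class logit down
        (grad,) = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv - step * grad / (grad.norm() + 1e-12)
        x_adv = x_adv.clamp(0, 1)            # stay inside the valid pixel range
    return x_adv.detach()                    # may not be adversarial if gradients were masked
```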

Our new class of attacks [1], which we denote as Brendel & Bethge attacks, works substantially differently from current gradient-based techniques (see Figure 2) and is similar in spirit to our purely decision-based Boundary Attack [4]. In essence, rather than starting close to the clean image, we start far away from it. This starting image should be adversarial in the sense that it is classified differently from the clean image (in the untargeted setting) or as part of the target class (in the targeted setting). It could be drawn from the training set (e.g. from the crocodile class) or found with a lightweight attack such as adding Gaussian noise or running DeepFool. We then perform a binary search between the starting image and the clean image in order to find the decision boundary of the model that separates adversarial from non-adversarial images (Figure 2, middle panel). We then follow the boundary towards the clean image until we cannot reduce the distance any further (Figure 2, right side).

Figure 2. Our attack works differently in that it starts far away from the clean image (left side), finds a point on the decision boundary of the model via a binary search between the starting image and the clean image (center) and then follows the decision boundary towards the closest adversarial image using the local gradients. Picture of frog-crocodile hybrid: worth1000.com.
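
In code, the first step (locating the boundary between the adversarial starting image and the clean image) is just a one-dimensional search over the interpolation coefficient. Here is a minimal sketch; as above, it is my own illustration rather than the library implementation, and it assumes a PyTorch model returning logits for a single-image batch.

```python
import torch

def binary_search_to_boundary(model, x_clean, x_start, label, steps=20):
    """Search along the straight line between the clean image (not adversarial)
    and an adversarial starting image for a point right at the decision boundary."""
    low, high = 0.0, 1.0   # interpolation coefficient: 0 -> x_clean, 1 -> x_start
    for _ in range(steps):
        mid = (low + high) / 2
        x_mid = (1 - mid) * x_clean + mid * x_start
        with torch.no_grad():
            is_adv = model(x_mid).argmax(dim=-1).item() != label
        if is_adv:
            high = mid     # still adversarial: move closer to the clean image
        else:
            low = mid      # not adversarial any more: back off towards x_start
    # return the adversarial end of the final interval, i.e. a point just on
    # the adversarial side of the boundary
    return (1 - high) * x_clean + high * x_start
```

From this boundary point, the attack then takes the gradient-guided steps along the boundary described in Footnote 1.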

This new attack has the potential to bypass some widespread gradient-masking issues. Gradient masking typically occurs for inputs on which the model is certain, in particular around the clean image. Our attack, however, always stays close to the decision boundary of the model, where the model is uncertain almost by definition. At the decision boundary you have the best chance of retrieving high-quality gradients that allow gradient-based optimisation techniques to follow the boundary towards the minimal adversarial (see Footnote 1 for a short technical description).

The resulting attack surpasses the current state of the art both in terms of query efficiency and perturbation size. We tested the new attack on six different models (five of which are defended) across several datasets (MNIST, CIFAR-10, ImageNet), four different metrics (L0, L1, L2, L-infinity) and two settings (targeted and untargeted). To be as fair as possible, we performed a large-scale hyperparameter search for all attacks on all models in all settings. In almost all cases we found that the new attack surpasses the current state-of-the-art attacks (such as PGD, Adam-PGD, Carlini & Wagner, DDN, EAD and SparseFool) both in terms of the smallest adversarials it can find and the number of iterations needed to find them. Below you see the query-distortion curves for the MNIST model of Madry et al. [5]; all other plots can be found in the manuscript [1].

Figure 3. Query-distortion curves for the adversarially trained model of Madry et al. on four different metrics (L-infinity, L2, L1, L0) compared to the current state-of-the-art attacks. We performed a large-scale hyperparameter search to make the comparison as fair as possible.

As an additional benefit, the attack has only one hyperparameter (the maximum step length, or trust-region radius), and simply leaving this parameter at its default value already yields very good results.

I believe that this attack should be part of the standard toolbox of anyone trying to evaluate model robustness. It works sufficiently differently from other gradient-based attacks to have a chance of escaping some of the gradient-masking issues that affect other techniques. At the same time, be aware that there is no guarantee that this attack works best on your model. Please follow our advice in [3] and try many different attacks (both gradient-based and non-gradient-based techniques), adapted to your specific model and defense technique, in order to measure robustness as reliably as possible.

References

[1] W. Brendel, J. Rauber, M. Kümmerer, I. Ustyuzhaninov, M. Bethge. Accurate, reliable and fast robustness evaluation, NeurIPS (2019)

[2] A. Athalye, N. Carlini, D. Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples, ICML (2018)

[3] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, A. Kurakin. On Evaluating Adversarial Robustness, arXiv preprint arXiv:1902.06705 (2019)

[4] W. Brendel, J. Rauber, M. Bethge. Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models, ICLR (2018)

[5] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks, ICLR (2018)

Footnote 1: We follow the boundary by using the backpropagated gradient to estimate the local geometry of the boundary, which we then use to find the optimal step that (a) gets us closer to the clean image, (b) stays within the pixel bounds (e.g. [0, 255]), (c) stays on the decision boundary and (d) isn’t too large. Solving this optimisation problem efficiently in each iteration (using the dual and some tricks from proximal operator theory) was the hardest part of the whole work.
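
Written out, each iteration solves (roughly, in my own notation rather than the paper's exact formulation) a small constrained problem of the form

$$\min_{\delta}\;\|x_{\text{clean}} - (x_k + \delta)\|_p \quad \text{s.t.}\quad b_k^\top \delta = c_k,\quad 0 \le x_k + \delta \le 1,\quad \|\delta\|_2 \le r,$$

where $x_k$ is the current point near the boundary, $b_k$ is the backpropagated gradient (the estimated local normal of the boundary), the equality constraint keeps $x_k + \delta$ on the linearised boundary, the box constraint enforces the pixel bounds (here scaled to $[0, 1]$), and $r$ is the trust-region radius, i.e. the single hyperparameter mentioned above.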
