Getting CLEVER(er): Expanding the Scope of a Robustness Metric for Neural Networks

Earlier this year, we introduced CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness), an attack-agnostic metric for evaluating the robustness of any trained neural network classifier against adversarial perturbations. We presented CLEVER in our paper at the Sixth International Conference on Learning Representations (ICLR) in Vancouver, Canada. Our research on adversarial attacks is ongoing, and we recently enlarged the scope of CLEVER so that it takes higher-order information about the model into account and allows robustness evaluation of gradient-obfuscated networks.

Adversarial attacks are one way to tamper with neural networks. They rely on carefully crafted perturbations that, when added to natural examples, produce adversarial examples that lead deep neural network models to misbehave. In image classification tasks, the perturbations can be imperceptible to the human eye, yet they cause a well-trained machine learning model to assign a label very different from the one a human would give. Outside the digital space, adversarial examples can take the form of colorful stickers or 3D-printed objects, giving rise to rapidly increasing concerns about safety-critical and security-critical machine learning tasks.

Figure 1: Measuring robustness of a neural network to adversarial attack.

In Figure 1, the black point in the center represents a natural example (e.g., an image of an ostrich), and the colored curves represent the decision boundaries of a well-trained machine learning model (e.g., a neural network image classifier). Adversarial examples generated with the attack from our AAAI 2018 paper (the blue points in Figure 1) lie very close to the natural example and look visually similar to it, yet the classifier misclassifies them (e.g., as ‘shoe shop’ and ‘vacuum’).

As we explained in a previous blog post, there was no comprehensive way to measure an AI model’s robustness to adversarial attack prior to CLEVER. CLEVER scores are based on the “robustness lower bound”, the least amount of perturbation to a natural example required to deceive a classifier (the grey region in Figure 1). We derived this lower bound theoretically by connecting robustness to the Lipschitz constants of the classifier functions. We then proposed an efficient way to estimate the lower bound using extreme value theory, which yields a CLEVER score. CLEVER scores are (1) attack-agnostic, meaning that they closely estimate a certified robustness lower bound for both existing and unseen Lp-norm attacks; and (2) computationally feasible for large neural networks, meaning that they can be computed efficiently for state-of-the-art ImageNet classifiers.
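The idea behind the score can be illustrated with a minimal sketch, which is not the released CLEVER implementation. It assumes a hypothetical two-dimensional toy classifier whose margin g(x) = f_c(x) - f_j(x) between the predicted class c and another class j has an analytic gradient, and it considers L2 perturbations only. Gradient norms are sampled in a ball around the natural example, the largest norm per batch is recorded, and the margin is divided by the resulting local Lipschitz estimate. One simplification is made on purpose: CLEVER fits a reverse Weibull distribution to the batch maxima and uses its location parameter, whereas this sketch simply takes the largest observed batch maximum. All function names here (margin, margin_grad, sample_in_l2_ball, estimate_score) are illustrative.

```python
import math
import random


def margin(x):
    """Toy margin g(x) = f_c(x) - f_j(x); positive while class c is predicted."""
    return 1.0 + math.tanh(x[0]) - 0.5 * math.tanh(x[1])


def margin_grad(x):
    """Analytic gradient of the toy margin with respect to the input."""
    return [1.0 - math.tanh(x[0]) ** 2, -0.5 * (1.0 - math.tanh(x[1]) ** 2)]


def sample_in_l2_ball(center, radius, rng):
    """Draw a point uniformly from the L2 ball of the given radius around center."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling the radius by u**(1/d) makes the samples uniform in volume.
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return [c + scale * d for c, d in zip(center, direction)]


def estimate_score(x0, radius=2.0, n_batches=50, batch_size=64, seed=0):
    """Estimate an L2 robustness lower bound for x0 from sampled gradient norms."""
    rng = random.Random(seed)
    batch_maxima = []
    for _ in range(n_batches):
        norms = []
        for _ in range(batch_size):
            grad = margin_grad(sample_in_l2_ball(x0, radius, rng))
            norms.append(math.hypot(*grad))
        batch_maxima.append(max(norms))
    # Assumption: the largest batch maximum stands in for the location
    # parameter of a reverse Weibull distribution fitted to batch_maxima.
    lipschitz_estimate = max(batch_maxima)
    # The score is capped at the sampling radius, since the Lipschitz
    # estimate says nothing about the model outside the ball.
    return min(margin(x0) / lipschitz_estimate, radius)


score = estimate_score([0.0, 0.0])
print(f"estimated L2 robustness score: {score:.3f}")
```

Because the local Lipschitz constant is estimated from samples rather than computed exactly, the result is an estimate of the lower bound, not a certificate. The fixed seed only makes the sketch reproducible.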

We recently made two major improvements to CLEVER that extend its previous framework. First, we established a new theoretical certified robustness lower bound by jointly considering the second-order and first-order information of a machine learning model, shedding new light on how to provide formal guarantees for neural networks. Second, inspired by the recently proposed backward pass differentiable approximation (BPDA) technique, we extended CLEVER to evaluate the robustness of gradient-obfuscated networks. Gradient obfuscation is a common side effect of adversarial defense strategies that apply non-differentiable transformations to the network's input, such as JPEG compression of every input image, in order to resist adversarial perturbations.
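To see why gradient obfuscation gets in the way of a gradient-based metric, and what a BPDA-style workaround looks like, consider the following minimal sketch. It illustrates the general BPDA idea only, not our extension of CLEVER. It assumes a hypothetical toy margin function and uses input quantization as a stand-in for a non-differentiable transformation such as JPEG compression. The forward pass goes through the real transformation, while the backward pass treats the transformation as the identity. All names (quantize, defended_margin, bpda_grad, and so on) are illustrative.

```python
import math


def quantize(x, step=0.25):
    """Non-differentiable input transformation (stand-in for e.g. JPEG compression)."""
    return [step * round(v / step) for v in x]


def margin(x):
    """Toy margin g(x) = f_c(x) - f_j(x) of a hypothetical classifier."""
    return 1.0 + math.tanh(x[0]) - 0.5 * math.tanh(x[1])


def margin_grad(x):
    """Analytic gradient of the toy margin with respect to its input."""
    return [1.0 - math.tanh(x[0]) ** 2, -0.5 * (1.0 - math.tanh(x[1]) ** 2)]


def defended_margin(x):
    """Margin of the defended model: the transformation runs before the classifier."""
    return margin(quantize(x))


def finite_difference_grad(f, x, eps=1e-4):
    """Central finite differences, used here to expose the obfuscated gradient."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def bpda_grad(x):
    """BPDA-style gradient: forward through quantize, backward as if it were identity."""
    return margin_grad(quantize(x))


x = [0.3, -0.6]
fd = finite_difference_grad(defended_margin, x)
bpda = bpda_grad(x)
print("finite-difference gradient of the defended model:", fd)
print("BPDA-style gradient:", bpda)
```

The defended model is piecewise constant, so its true gradient vanishes almost everywhere and carries no information about how sensitive the underlying classifier is. The BPDA-style gradient remains informative, which is what makes gradient-norm sampling possible again.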

Our CLEVER implementation is open source and actively maintained. An implementation of CLEVER can also be found in the open-source Adversarial Robustness Toolbox.

Authored by (L-R) Huan Zhang (IBM Research AI intern, PhD candidate at UC Davis), Pin-Yu Chen (IBM Research Staff Member), Tsui-Wei Weng (PhD candidate at MIT), and Luca Daniel (Professor, MIT)