Universal Adversarial Training — Paper Summary

Gowthami Somepalli
ML Summaries
5 min read · Mar 7, 2022


Paper: Universal Adversarial Training
Link: https://ojs.aaai.org/index.php/AAAI/article/view/6017/5873
Authors: Ali Shafahi, Mahyar Najibi, Zheng Xu, John Dickerson, Larry S. Davis, Tom Goldstein
Tags: Adversarial attack, Universal attack, White-box attack
Code: -
Misc. info: Accepted to AAAI’20

What?

In this paper, the authors propose an optimization-based way to find universal adversarial perturbations (first introduced by Moosavi-Dezfooli et al. [1]) for a given model. They also propose a low-cost algorithm to robustify the model against such perturbations.

Why?

Universal Adversarial Perturbations (UAPs) are cheap: a single noise pattern can cause the model to mislabel a large fraction of images, unlike the usual attacks where perturbations are generated on a per-image basis (those are more effective, though). UAPs are also found to transfer across different models, so they can be used in a black-box attack setting too. This makes them important to study.

Pre-requisites:

UAP vs Adversarial Perturbation: In the common per-image adversarial attack, we find a unique perturbation δ for each image so that the model misclassifies it. In the UAP case, we find a single δ and add it to every image, as contrasted below.
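To make the contrast concrete, here is one way to write both problems in loss-maximization form (my notation, not a figure from the paper; the loss-maximization view of the UAP is the one this paper adopts, while [1] uses a different formulation, described next):

```latex
% Per-image attack: a separate perturbation for every example
\delta_i = \arg\max_{\|\delta\|_p \le \epsilon} \, \ell\big(w, x_i + \delta, y_i\big)
  \quad \text{for each } (x_i, y_i)

% Universal attack: one shared perturbation for the whole dataset
\delta = \arg\max_{\|\delta\|_p \le \epsilon} \,
  \frac{1}{N} \sum_{i=1}^{N} \ell\big(w, x_i + \delta, y_i\big)
```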

UAP computation in [1]: UAPs were first introduced in [1]. The technique is simple, but it comes with no convergence guarantees: the authors iterate over the images and keep updating δ until a desired fraction of the images is misclassified, with each per-image update computed via DeepFool [2]. The full attack formulation and algorithm are in [1]; a rough sketch follows.
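A minimal sketch of that loop, under some assumptions: `model.predict` returns a label for a single image, `deepfool_minimal_perturbation` is a hypothetical stand-in for the DeepFool step of [2] (not implemented here), and the projection is written for the l_inf ball:

```python
import numpy as np

def uap_original(images, model, xi=10 / 255, target_fooling_rate=0.8, max_epochs=10):
    """Rough sketch of the iterative UAP construction in [1] (l_inf version)."""
    delta = np.zeros_like(images[0])
    clean_preds = [model.predict(x) for x in images]
    for _ in range(max_epochs):
        for x, clean_pred in zip(images, clean_preds):
            if model.predict(x + delta) == clean_pred:      # image not fooled yet
                # Hypothetical helper: smallest perturbation crossing the decision boundary
                dv = deepfool_minimal_perturbation(model, x + delta)
                delta = np.clip(delta + dv, -xi, xi)        # project back to the xi-ball
        fooling_rate = np.mean(
            [model.predict(x + delta) != p for x, p in zip(images, clean_preds)]
        )
        if fooling_rate >= target_fooling_rate:             # stop at the target fooling rate
            return delta
    return delta
```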

Adversarial training: To make models robust against adversarial attacks, Madry et al. proposed adversarial training, where in every iteration we generate adversarial examples, compute the loss on them, and update the weights using that loss. The formulation is reconstructed below (z = x + δ is the perturbed image).
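The standard min-max formulation, written out here from the description above:

```latex
\min_{w} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \Big[ \max_{\|\delta\|_p \le \epsilon} \ell\big(w, \underbrace{x + \delta}_{z}, y\big) \Big]
```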

Some examples of images with a UAP added and the resulting predicted classes can be found in Figure 3 of [1].

We are all done with the pre-requisites. Beyond this, I will discuss the paper’s contributions.

How?

Improved UAP computation: In this paper, the authors simplify the problem to finding the δ that maximizes the training loss, which means δ can be updated with a standard optimizer. Since the loss is unbounded above, the authors propose a clipped version of it. This formulation searches for a universal perturbation that maximizes the training loss, thus forcing images into the wrong class.

(Left) The formulation proposed in this paper to find UAPs. (Right) The clipped loss; β is a hyperparameter.
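Reconstructed in LaTeX from the description above (β is the clipping hyperparameter):

```latex
% UAP as maximization of the average training loss over a shared delta
\max_{\|\delta\|_p \le \epsilon} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(w, x_i + \delta, y_i\big)

% Clipped loss, so no single image can contribute more than beta to the objective
\hat{\ell}\big(w, x_i + \delta, y_i\big) = \min\Big\{ \ell\big(w, x_i + \delta, y_i\big), \; \beta \Big\}
```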

The above optimization problem can be solved with a stochastic gradient method: compute the loss gradient with respect to δ (call it g), take an ascent step δ ← δ + lr·g, and then project δ back onto the ε-radius l_p ball.
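A minimal PyTorch-style sketch of that loop, assuming a trained `model` and a `loader` over (image, label) batches; the optimizer choice, learning rate, and CIFAR-10-sized δ are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def compute_uap(model, loader, eps=8 / 255, beta=9.0, lr=0.01, epochs=5, device="cuda"):
    """Sketch: maximize the clipped loss w.r.t. a single shared perturbation delta."""
    model.eval()
    delta = torch.zeros(1, 3, 32, 32, device=device, requires_grad=True)  # CIFAR-10-sized
    opt = torch.optim.Adam([delta], lr=lr)   # the paper also reports SGD / PGD-style updates
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x + delta), y, reduction="none")
            loss = torch.clamp(loss, max=beta).mean()   # clipped loss
            opt.zero_grad()
            (-loss).backward()                          # gradient ascent on the loss
            opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)                 # project back to the l_inf eps-ball
    return delta.detach()
```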

Improved UAP adversarial training: Similarly, the authors propose to find a UAP on each batch and train the model on the perturbed inputs (x_i + δ).
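The resulting training objective is a min-max problem over the weights w and a single shared δ (reconstructed from the description above):

```latex
\min_{w} \; \max_{\|\delta\|_p \le \epsilon} \;
  \frac{1}{N} \sum_{i=1}^{N} \ell\big(w, x_i + \delta, y_i\big)
```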

The authors also explore a low-cost ("fast") variant in which the weights and δ are updated simultaneously from the same backward pass. It gives decent performance at a fraction of the cost. The algorithms for both styles of UAP adversarial training are in the figures below, and a sketch of the low-cost variant follows the caption.

(Left) Algorithm for the universal adversarial training formulation above. (Right) Low-cost (fast) universal adversarial training.
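A minimal PyTorch-style sketch of the low-cost variant, where a single backward pass per batch is reused to update both the weights and δ; the SGD hyperparameters and the FGSM-style sign step on δ are illustrative assumptions (see the paper's algorithm for the exact recipe):

```python
import torch
import torch.nn.functional as F

def low_cost_universal_adv_training(model, loader, epochs=10, eps=8 / 255,
                                    delta_step=8 / 255, device="cuda"):
    """Sketch: simultaneously descend on the weights and ascend on a shared delta."""
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    delta = torch.zeros(1, 3, 32, 32, device=device)   # one perturbation for all images
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            delta.requires_grad_(True)
            loss = F.cross_entropy(model(x + delta), y)
            opt.zero_grad()
            loss.backward()             # one backward pass: gradients for both w and delta
            opt.step()                  # weight update (descent)
            with torch.no_grad():       # delta update (ascent, FGSM-style sign step)
                delta = (delta + delta_step * delta.grad.sign()).clamp(-eps, eps)
    return model, delta.detach()
```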

Results:

  • On the CIFAR-10 dataset with a WideResNet 32-10 and ε = 8, the model's accuracy on UAP-attacked test images is 42.56% when the perturbation is found with SGD, 13.30% with ADAM, and 13.79% with PGD; the clean test accuracy of the WRN is 95.2%.
  • In the set of figures below, we can see what the UAP looks like on CIFAR-10 for normally trained and robustly trained models. The robust models seem to have lower-frequency UAPs than the clean models. Also, the UAP can vary a lot with the optimizer.
(Left) WRN 32-10, UAP after 160 iterations across different optimizers. (Right, with 400 iterations) UAPs on CIFAR-10 robust models: adversarially trained with FGSM or PGD, and universally adversarially trained with FGSM (uFGSM) or SGD (uSGD).
  • The attack formulation is more successful than previous methods, as we can see from the left table below.
  • The right table below shows that models trained with the adversarial training proposed in the paper are more robust to UAPs and also generalize better. It is interesting to note that the most ubiquitous adversarial training method (denoted PGD) is not particularly robust to UAPs.
(Left) Test accuracy of clean-trained models on UAP-attacked ImageNet test images; the reported accuracies are 3.9%, 42%, and 28.3%. (Right) Test accuracy of adversarially trained models on UAP-attacked CIFAR-10 test images.
  • The UAPs generated from the UAP-trained model differ significantly from those of the low-cost UAP-trained model. The natural (not adversarially trained) model's UAP looks high-frequency, the UAP-trained model's looks lowest-frequency, and the low-cost model's UAP falls somewhere in between.

Comments:

Overall, it's an interesting paper. The formulation for UAP computation introduced in this paper is computationally efficient, and the resulting perturbations are more effective. I wonder how the norms of the UAPs generated with this method compare to those of the UAPs computed using [1].

This training method seems to have better clean test performance than the usual adversarial training. One thing I wish they had addressed is how effective this UAP adversarial training is against the usual per-image evasion attacks.

Bibliography:
[1] — Moosavi-Dezfooli, Seyed-Mohsen, et al. “Universal adversarial perturbations.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[2] — Moosavi-Dezfooli, Seyed-Mohsen, Alhussein Fawzi, and Pascal Frossard. “DeepFool: a simple and accurate method to fool deep neural networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
