Adversarial learning benchmark: evaluating attack and defence strategies for computer vision

Cédric Goubard
Meetech - We Love Tech
7 min read · Mar 5, 2021


This study was conducted by Wavestone’s Machine Learning & Data Lab — special thanks to Timothee Rio for all of his hard work during his internship with us. After a series of articles (in French; part 1, part 2, part 3) presenting the basic notions behind deep learning and adversarial attacks, we focus here on the more technical aspects of attack and defence strategies for computer vision models.

We assume that the reader is already familiar with the characteristics of adversarial attacks (white/black-box, (un)targeted, single/multiple steps…). Feel free to take a look here (French) or here (English) for a quick reminder.

A final touch before we start: we open-sourced all the code required to reproduce our benchmark on GitHub; feel free to open a PR or an issue if you find a way to improve our work!

Methodology

We used TensorFlow to train deep learning models on several datasets, before running 3 white-box and 2 black-box attacks, which we then tried to mitigate using 3 defence strategies.

Datasets:

  • MNIST — a simple database of 70 000 labelled handwritten digits that has a rather low internal variance
  • CIFAR-10 — made of 60 000 labelled images that belong to 10 classes (aeroplane, automobile, bird, etc.). This dataset has a higher internal variance than MNIST and can be considered much more complex

Model architectures:

  • EfficientNet — a state-of-the-art computer vision model that ranked first on the ImageNet (ILSVRC) benchmark at the beginning of 2020. Several model sizes are available, with the larger ones providing better performance; we used EfficientNet-B7 (the largest), which achieves 84.3% accuracy on ImageNet.
  • A small homemade CNN, made of 2 convolutional layers and 1 dense layer. This network has far fewer parameters to optimize than EfficientNet (see the minimal sketch below).
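
To make the setup concrete, here is a minimal sketch of a small CNN in this spirit, trained on MNIST with Keras. The exact filter counts, pooling layers and training settings are illustrative assumptions, not the benchmark’s actual configuration.

```python
import tensorflow as tf

def build_small_cnn(input_shape=(28, 28, 1), num_classes=10):
    # 2 convolutional layers followed by a single dense layer, as described above;
    # the filter counts and pooling layers are illustrative choices.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Example usage on MNIST; CIFAR-10 only changes the input shape to (32, 32, 3).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

model = build_small_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```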

Metrics:

  • Success rate (SR): the ratio of the number of adversarial examples that fooled the model to the total number of adversarial examples
SR = (number of successful adversarial examples) / (total number of adversarial examples)
  • Degree of change (DOC): the average distance between the adversarial examples and the benign inputs. In other words, it quantifies how much an attack has to modify the inputs to fool a model. Higher values mean the perturbation is more noticeable to the human eye (a short code sketch of both metrics follows this list).
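
For reference, here is how these two metrics can be computed. This is a minimal sketch assuming NumPy arrays of benign inputs, their adversarial counterparts and integer labels; the mean relative L2 distance used for the DOC is an assumption, not necessarily the exact distance used in the benchmark.

```python
import numpy as np

def success_rate(model, x_adv, y_true):
    # Fraction of adversarial examples that the model misclassifies.
    preds = np.argmax(model.predict(x_adv), axis=1)
    return np.mean(preds != y_true)

def degree_of_change(x_benign, x_adv):
    # Average relative L2 distance between adversarial and benign images.
    diff = (x_adv - x_benign).reshape(len(x_benign), -1)
    ref = x_benign.reshape(len(x_benign), -1)
    return np.mean(np.linalg.norm(diff, axis=1) / np.linalg.norm(ref, axis=1))
```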

White-box attacks, i.e. attacks requiring access to the model itself:

  • Fast Gradient Sign Method (FGSM): white-box, single-step attack, which adds to the benign image a perturbation that causes the CNN to misclassify it. This attack can be either targeted or untargeted; in this article, we only considered the untargeted version. For an input image, the method uses the gradients of the loss with respect to the input image to create a new image that maximises the loss (a TensorFlow sketch is given after this list). This can be summarised using the following expression:
x_adv = x + ε · sign(∇x J(θ, x, y)) (Fast Gradient Sign Method to generate adversarial examples, from the original paper)
  • Deepfool: white-box, untargeted, multi-step attack. Intuitively, this attack aims at finding the closest decision boundary to the input and then iteratively pushes the input to the other side of that boundary. In this benchmark, we compare two versions of this attack, by letting it run for a single step or for multiple steps.
Deepfool’s approach, from the original paper
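
Here is a minimal untargeted FGSM sketch in TensorFlow, following the expression above. It assumes a Keras model that outputs class probabilities and inputs scaled to [0, 1]; it is not the exact implementation used in the benchmark.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm(model, x, y, epsilon=0.1):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    grad = tape.gradient(loss, x)             # gradient of the loss w.r.t. the input image
    x_adv = x + epsilon * tf.sign(grad)       # step in the direction that increases the loss
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixel values in a valid range

# Example: x_adv = fgsm(model, x_test[:100], y_test[:100], epsilon=0.05)
```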

Black-box attacks, i.e. attacks that only require observing the inputs and outputs of the model:

  • Jacobian-based Saliency Map Attack (JSMA): black-box, multi-step attack that can be either targeted or untargeted. It relies on a “saliency map” that rates each pixel of the input image according to its influence on the final prediction of the CNN. With this map, the algorithm changes the value of a few pixels so as to have the greatest possible impact on the model’s prediction. This attack is not present in the final results, as it took too much time to run (2 minutes for 1 image on the small CNN, over 15 minutes on EfficientNet).
JSMA’s approach, from the original paper
  • Boundary attack: the main idea is to initialize the attack image with an image that is already adversarial (for instance, random noise that the CNN misclassifies) and to perform a random walk so that the image remains misclassified but gets closer to the target image (the benign input); a simplified sketch is given after this list.
Boundary attack to generate an adversarial example looking like a 7, but classified as anything else (here a 3)
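
Below is a heavily simplified sketch of the boundary attack’s random walk. The original algorithm adapts its step sizes dynamically, which we omit here; the step sizes, iteration count and acceptance rule below are assumptions for illustration only.

```python
import numpy as np

def boundary_attack(model, x_benign, y_true, x_init, n_steps=1000,
                    noise_step=0.1, towards_step=0.05):
    # x_init must already be misclassified (e.g. random noise or an image of another class).
    x_adv = x_init.copy()
    for _ in range(n_steps):
        # Random perturbation plus a small step towards the benign image.
        candidate = x_adv + noise_step * np.random.normal(size=x_adv.shape) \
                          + towards_step * (x_benign - x_adv)
        candidate = np.clip(candidate, 0.0, 1.0)
        pred = np.argmax(model.predict(candidate[None, ...]), axis=1)[0]
        if pred != y_true:       # keep the move only if the image is still adversarial
            x_adv = candidate
    return x_adv
```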

Results #1 — Attack performance

Results of our attack benchmark, showing each attack’s success rate for different degrees of change, models and datasets. 100 images were used from an isolated test set to compute each SR measure for a given DOC.

There is a lot of information in this chart, so let us highlight the key takeaways:

  • Attacks are more effective and less visible on CIFAR (the more complex dataset); the curves in the CIFAR row are steeper, which means that all attacks achieve a higher success rate while keeping a low degree of change. This probably comes from the higher complexity of the CIFAR dataset: since the images are harder to classify, the CNNs are less confident in their predictions, so a smaller amount of noise is enough to fool them. We can expect this phenomenon to be even more pronounced on real-world images.
  • Among white-box attacks, the most complex (Deepfool with multiple steps) is the most effective, followed by Deepfool with a single step, then FGSM. However, complex attacks are also time-consuming, as illustrated in the table below.
Running time for 100 attacks on EfficientNet
  • The boundary attack provides the best results unless you require a particularly small degree of change. Its success rate will by definition always be 1, but it will always incur some change to the image. The degrees of change are lower on CIFAR than on MNIST, because of MNIST’s high number of black pixels.

Results #2 — Focus on black-box attacks

As mentioned, the saliency map attack was not included in the benchmark, as it took too much time to run, especially on EfficientNet.

Despite taking longer to run, JSMA reaches a high SR with quite a small DOC (4–5% on the 10 images we generated, all of which fooled the model). It concentrates its perturbation on a few well-chosen pixels, whereas FGSM and Deepfool apply a more global change.

Results of a Jacobian Saliency Map Attack

On the other hand, the boundary attack is by design more general; it will converge more quickly but will be unable to reach the same low DOC.

About 50 iterations of the boundary attack are required to reach acceptable results

Results #3 — Defence against white-box attacks

Given the results of the attack benchmark, we chose to focus on FGSM, as it provides similar results to the other white-box attacks while being the fastest (even faster than the black-box attacks).

Defence strategies:

  • Adversarial training: this strategy consists of training the CNN with a mix of benign inputs and adversarial images. This training should make the network less sensitive to the adversarial images’ noise, and thus act as a regularization method. We generate a new set of 100 adversarial samples for each model. The custom objective is a weighted mix of the loss on benign samples and the loss on adversarial samples, c being the emphasis put on the adversarial samples (see the training-step sketch after this list).
  • Random input transformation: another possible strategy consists of applying two random transformations, random resizing and random padding, to mitigate the adversarial effects (a preprocessing sketch follows this list).
Random padding approach (image from ScienceDirect)
  • Denoising: the idea is to remove the noise added by the attack before giving an image to the CNN, by training another network (e.g. an autoencoder) that, when given an adversarial image, returns an image as close as possible to the original one (a small autoencoder sketch follows this list).
Effect of the denoiser on an adversarial input image
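
Here is a minimal adversarial-training step, reusing the fgsm sketch from the white-box section. The weighted objective (with c the emphasis put on adversarial samples) reflects our description above; the exact loss and hyperparameters used in the benchmark may differ.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

def adversarial_train_step(model, x, y, c=0.5, epsilon=0.1):
    x_adv = fgsm(model, x, y, epsilon)  # adversarial counterparts of the current batch
    with tf.GradientTape() as tape:
        # Weighted mix of the loss on benign and adversarial samples.
        loss = (1.0 - c) * loss_fn(y, model(x)) + c * loss_fn(y, model(x_adv))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```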
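
The random input transformation can be implemented as a simple preprocessing step applied before every prediction. The size range below is an assumption; the idea is only to show the resize-then-pad mechanism.

```python
import tensorflow as tf

def random_resize_and_pad(x, final_size=32, min_size=26):
    # Resize a batch of images to a random smaller size, then zero-pad back to the
    # expected resolution at a random position, so the attack never sees the same input twice.
    new_size = tf.random.uniform([], min_size, final_size + 1, dtype=tf.int32)
    x = tf.image.resize(x, [new_size, new_size])
    pad_top = tf.random.uniform([], 0, final_size - new_size + 1, dtype=tf.int32)
    pad_left = tf.random.uniform([], 0, final_size - new_size + 1, dtype=tf.int32)
    return tf.image.pad_to_bounding_box(x, pad_top, pad_left, final_size, final_size)

# At inference time: predictions = model(random_resize_and_pad(x))
```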
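
Finally, a small fully convolutional autoencoder is enough to sketch the denoising defence. The architecture below is an assumption, not the denoiser used in the benchmark, and the hypothetical x_adv_train array stands for the adversarial images paired with their benign originals.

```python
import tensorflow as tf

def build_denoiser(input_shape=(28, 28, 1)):
    # Maps an (adversarial) image to a cleaned-up reconstruction in [0, 1].
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same",
                               input_shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.Conv2D(input_shape[-1], 3, activation="sigmoid", padding="same"),
    ])

denoiser = build_denoiser()
denoiser.compile(optimizer="adam", loss="mse")
# Trained on pairs (adversarial image, benign original), e.g.:
# denoiser.fit(x_adv_train, x_train, epochs=10)
# At prediction time the CNN is fed denoiser(x) instead of x.
```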

After implementing these strategies, here are our results:

Results of our defence benchmark on white-box attacks, showing each defence strategy’s impact on the attack success rate for different degrees of change and datasets.

As before, let us highlight the key aspects shown in these charts:

  • All defence strategies are effective against white-box attacks, but the most promising one seems to be the denoiser. Defences tend to be more effective on MNIST than on CIFAR, probably because the denoiser will tend to learn that MNIST border pixels should always be black. Conversely (but for the same reason), random padding on MNIST will have little effect, which explains why this strategy works best on CIFAR.

Results #4 — Defence against black-box attacks

We applied the same 3 strategies to the boundary attack; here are the results.

Since black-box attacks always succeed given a large enough degree of change, we only plot the DOC at successive iterations (and not the success rate).
  • All strategies are effective against black-box attacks, with no clear winner, although the denoiser does not particularly shine on either dataset.
  • Random padding’s performance depends on each specific image. This particular defence strategy adds randomness to the CNN, which makes the attack’s effectiveness more erratic because the decision boundaries also change over time. If a modified pixel had no influence on the CNN’s decision during one step, it might become more important during the next step because of the new padding of the image.

Final words

We hope that our benchmark helped you better understand adversarial learning!

If you are interested in working on similar topics and see how we use these concepts for our clients’ real-life production models, contact me!
