Adversarial Attacks and Defences for Convolutional Neural Networks
Recently, it has been shown that excellent results can be achieved in different real-world applications, including self-driving cars, medical image analysis, and human face recognition. These breakthroughs are attributed to advances in Deep Neural Networks (DNNs), as well as the availability of huge amounts of data and computational power. Characteristic examples are self-driving cars reliable enough to no longer need a human driver inside as backup, systems that outperform human experts at detecting cancer metastases, and face recognition software capable of surpassing human abilities. Despite these impressive results, however, the research community has recently shown that DNNs are vulnerable to adversarial attacks.
About Adversarial Attacks
An adversarial attack consists of subtly modifying an original image in such a way that the changes are almost undetectable to the human eye. The modified image is called an adversarial image; when submitted to a classifier it is misclassified, while the original is correctly classified. The real-life implications of such attacks can be very serious. For instance, one could modify a traffic sign so that it is misinterpreted by an autonomous vehicle, causing an accident. Another example is the risk of inappropriate or illegal content being modified so that it's undetectable by the content moderation algorithms used on popular websites or by police web crawlers.
At Onfido, we are developing state-of-the-art machine learning systems in order to automate a plethora of different problems, including identity verification and fraud detection. For that reason, we’re very interested in understanding these attacks and developing our own defences against them. To this end, three members of our research team recently attended the 2017 Conference on Neural Information Processing Systems (NIPS) in Long Beach, which is considered the most prestigious venue in the field of Machine Learning. This year, the most relevant NIPS event on this topic was the Competition on Adversarial attacks and Defences organized by Google Brain. Here, I’ll summarize some of the most common attacks and defences, as well as the winning methods in the competition.
An adversarial image is an image that has been slightly modified in order to fool the classifier, i.e., in order to be misclassified. The measure of modification is normally the ℓ∞ norm, which measures the maximum absolute change in a single pixel.
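As a minimal illustration of the ℓ∞ measure, the following numpy sketch computes the maximum absolute per-pixel change between an image and its perturbed version (the array values here are made up for the example):

```python
import numpy as np

def linf_distance(x, x_adv):
    """Maximum absolute change in any single pixel (the l-infinity norm)."""
    return np.max(np.abs(x_adv - x))

# Toy example: a 2x2 "image" perturbed in two pixels.
x = np.array([[0.10, 0.20],
              [0.30, 0.40]])
x_adv = np.array([[0.12, 0.20],
                  [0.30, 0.37]])

print(linf_distance(x, x_adv))  # largest single-pixel change: 0.03
```

An attack is usually constrained to keep this value below some budget ε, which is what makes the perturbation imperceptible.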
In white box attacks the attacker has access to the model’s parameters, while in black box attacks they do not; instead, they use a different model, or no model at all, to generate adversarial images in the hope that these will transfer to the target model.
Non-targeted attacks aim simply to make the model misclassify the adversarial image, while in targeted attacks the attacker aims to have the image classified as a specific target class, which is different from the true class.
Most successful attacks are gradient-based methods: the attacker modifies the image in the direction of the gradient of the loss function with respect to the input image. There are two major approaches to performing such attacks: one-shot attacks, in which the attacker takes a single step in the direction of the gradient, and iterative attacks, in which several steps are taken instead. Three of the most common attacks are briefly described next. The first two are examples of one-shot attacks, and the last one is an iterative attack.
Fast gradient sign method (FGSM)
This method computes an adversarial image by adding a pixel-wise perturbation of magnitude ε in the direction of the gradient. The perturbation is computed in a single step, and is thus very efficient in terms of computation time:

X_adv = X + ε · sign(∇_X J(X, y_true))
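The step above can be sketched in numpy. As a stand-in for a real CNN, this hedged example uses a linear softmax classifier (logits = W @ x), whose cross-entropy gradient with respect to the input has the closed form Wᵀ(p − y); the weights and input are random, purely for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_loss_wrt_input(W, x, y_true):
    """Gradient of the cross-entropy loss w.r.t. the input of a
    linear softmax classifier (logits = W @ x)."""
    p = softmax(W @ x)
    y = np.zeros_like(p)
    y[y_true] = 1.0
    return W.T @ (p - y)

def fgsm(W, x, y_true, eps):
    """Single gradient-sign step that increases the classifier's loss."""
    return x + eps * np.sign(grad_loss_wrt_input(W, x, y_true))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))        # 3 classes, 8 input "pixels"
x = rng.normal(size=8)
y_true = int(np.argmax(W @ x))     # start from a correctly classified input

x_adv = fgsm(W, x, y_true, eps=0.25)
print(np.max(np.abs(x_adv - x)))   # perturbation stays within eps
```

Note that every pixel moves by exactly ±ε, so the ℓ∞ norm of the perturbation equals the budget ε.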
Targeted fast gradient sign method (T-FGSM)
Similar to FGSM, this method computes a single gradient step, but in this case in the direction of the negative gradient with respect to the target class:

X_adv = X − ε · sign(∇_X J(X, y_target))
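The targeted variant only flips the sign of the step, descending the loss for the chosen target class. A minimal sketch, again using a hypothetical linear softmax model rather than a real CNN:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_fgsm(W, x, y_target, eps):
    """Step along the *negative* gradient of the loss for the target
    class, pushing the classifier towards y_target."""
    p = softmax(W @ x)
    y = np.zeros_like(p)
    y[y_target] = 1.0
    grad = W.T @ (p - y)            # d loss(target) / d x
    return x - eps * np.sign(grad)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))
x = rng.normal(size=8)
y_target = int(np.argmin(W @ x))    # aim for the least likely class
x_adv = targeted_fgsm(W, x, y_target, eps=0.3)
print(softmax(W @ x)[y_target], softmax(W @ x_adv)[y_target])
```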
Iterative fast gradient sign method (I-FGSM)
The iterative method takes T gradient steps of magnitude α = ε / T instead of a single step:

X_adv(0) = X;  X_adv(t+1) = Clip_{X,ε}( X_adv(t) + α · sign(∇_X J(X_adv(t), y_true)) )

where Clip_{X,ε} keeps each pixel within ε of the original image.
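The loop can be sketched as follows, again with a linear softmax classifier standing in for a real network; the clip keeps the accumulated perturbation inside the ℓ∞ ball of radius ε around the original image:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ifgsm(W, x, y_true, eps, T):
    """T gradient-sign steps of size alpha = eps / T, re-clipping the
    running perturbation into the eps-ball around the original image."""
    alpha = eps / T
    x_adv = x.copy()
    for _ in range(T):
        p = softmax(W @ x_adv)
        y = np.zeros_like(p)
        y[y_true] = 1.0
        grad = W.T @ (p - y)
        x_adv = x_adv + alpha * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the l-inf ball
    return x_adv

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 8))
x = rng.normal(size=8)
y_true = int(np.argmax(W @ x))
x_adv = ifgsm(W, x, y_true, eps=0.25, T=5)
print(np.max(np.abs(x_adv - x)))   # never exceeds eps
```

Each small step recomputes the gradient at the current iterate, which is what lets the iterative attack follow a curved loss surface more closely than the one-shot step.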
Both one-shot methods (FGSM and T-FGSM) have lower success rates than the iterative method (I-FGSM) in white box attacks; however, when it comes to black box attacks, the one-shot methods turn out to be more effective. The most likely explanation is that iterative methods tend to overfit to a particular model.
Winning attacks at NIPS 2017 competition
The winning attack, Boosting Adversarial Attacks with Momentum (MI-FGSM), uses momentum to improve the performance of the iterative gradient methods. At each step, the gradients of the previous steps, decayed by a factor µ, are accumulated together with the current gradient, and this accumulated gradient is used to update the adversarial image at step t+1. The results show that this method outperforms all other methods in the competition and shows good transferability, i.e., it performs well in black box attacks, as seen in the figure below.
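A minimal sketch of the momentum update, in the same toy linear-softmax setting as the earlier attack examples (the ℓ1 normalization of each gradient before accumulation follows the description of the momentum method; everything else here is illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mi_fgsm(W, x, y_true, eps, T, mu):
    """Iterative attack with momentum: an exponentially decayed sum of
    l1-normalized gradients drives each sign step."""
    alpha = eps / T
    g = np.zeros_like(x)            # accumulated gradient (velocity)
    x_adv = x.copy()
    for _ in range(T):
        p = softmax(W @ x_adv)
        y = np.zeros_like(p)
        y[y_true] = 1.0
        grad = W.T @ (p - y)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)  # momentum update
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv

rng = np.random.default_rng(3)
W = rng.normal(size=(3, 8))
x = rng.normal(size=8)
y_true = int(np.argmax(W @ x))
x_adv = mi_fgsm(W, x, y_true, eps=0.25, T=10, mu=0.9)
print(np.max(np.abs(x_adv - x)))
```

The velocity term smooths out noisy, model-specific gradient directions across steps, which is the intuition behind the improved transferability.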
In order to produce effective attacks against ensemble defence methods, i.e. methods that use a number of different base classification models, a modification to the original algorithm is proposed in which the logits of all the target models are fused before computing the combined cross-entropy loss:
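The fusion step can be sketched as follows. This hedged example again uses linear models as ensemble members, so the gradient of the fused cross-entropy with respect to the input has a closed form via the chain rule; the equal weights are an assumption for illustration:

```python
import numpy as np

def fused_loss_grad(Ws, weights, x, y_true):
    """Cross-entropy on the weighted sum of the ensemble's logits, and
    its gradient w.r.t. the input (all members linear in this sketch)."""
    fused = sum(w * (Wk @ x) for w, Wk in zip(weights, Ws))  # fused logits
    e = np.exp(fused - fused.max())
    p = e / e.sum()
    y = np.zeros_like(p)
    y[y_true] = 1.0
    # chain rule: d(fused logits)/dx = sum_k w_k * Wk
    grad = sum(w * Wk for w, Wk in zip(weights, Ws)).T @ (p - y)
    loss = -np.log(p[y_true] + 1e-12)
    return loss, grad

rng = np.random.default_rng(4)
Ws = [rng.normal(size=(3, 8)) for _ in range(3)]  # three ensemble members
weights = [1 / 3, 1 / 3, 1 / 3]
x = rng.normal(size=8)
loss, grad = fused_loss_grad(Ws, weights, x, y_true=0)
print(loss, grad.shape)
```

Attacking the fused logits, rather than each member separately, yields a single gradient direction that increases the loss of the whole ensemble at once.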
The most common defence consists of training a more robust network by adding adversarial images, generated using the target model itself, to the training set. This approach has been shown to have limitations; in particular, this kind of defence is less effective against black box attacks, in which the adversarial images are generated using a different model, than against white box attacks. This is due to gradient masking: adversarial training perturbs the model’s gradients, which makes white box attacks less effective, but the decision boundary remains mostly unchanged after the adversarial training. An alternative approach, ensemble adversarial training, has been proposed, in which the generation of the adversarial examples is decoupled from the parameters of the model being trained. This is achieved by drawing the adversarial samples from pre-trained models; these are then added to each batch or used to replace part of the non-adversarial images in the batch.
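The batch-augmentation idea can be sketched as below. The static "pre-trained model" is a hypothetical fixed linear classifier, the attack used is FGSM, and the 50% replacement fraction is an assumption for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm_static(W_static, x, y, eps):
    """Adversarial example crafted against a *fixed* pre-trained model,
    decoupled from the model currently being trained."""
    p = softmax(W_static @ x)
    t = np.zeros_like(p)
    t[y] = 1.0
    return x + eps * np.sign(W_static.T @ (p - t))

def augment_batch(batch_x, batch_y, W_static, eps, frac, rng):
    """Replace a fraction of the batch with adversarial versions drawn
    from the static model (ensemble adversarial training style)."""
    out = batch_x.copy()
    n_adv = int(frac * len(batch_x))
    idx = rng.choice(len(batch_x), size=n_adv, replace=False)
    for i in idx:
        out[i] = fgsm_static(W_static, batch_x[i], batch_y[i], eps)
    return out

rng = np.random.default_rng(5)
W_static = rng.normal(size=(3, 8))       # frozen pre-trained "model"
batch_x = rng.normal(size=(10, 8))
batch_y = rng.integers(0, 3, size=10)
mixed = augment_batch(batch_x, batch_y, W_static, eps=0.1, frac=0.5, rng=rng)
print(mixed.shape)
```

Because the adversarial samples never depend on the gradients of the model being trained, this defence avoids the gradient-masking failure mode described above.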
Winning defences at NIPS 2017 competition
The High Level Representation Guided Denoiser was the winning submission on the defences track. It builds on the observation that, despite adversarial perturbations being quite small at the pixel level, they are amplified throughout the network, eventually producing a misclassification. To address this, several high-level denoisers are proposed: a feature guided denoiser (FGD), a logits guided denoiser (LGD) and a class label guided denoiser (CGD). All three use a denoising network (DUNET), which is similar to a denoising autoencoder but uses a UNET-like structure, i.e., it has direct connections between corresponding layers in the encoder and decoder. A fixed pre-trained convolutional neural network (CNN) is used to guide the training of the denoiser: the FGD uses the responses of the last feature layer of the CNN (for the original and denoised images), the LGD uses the logits activations of the CNN, and the CGD uses the classification output.
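The feature-guided training signal can be sketched as a loss comparing the fixed network's features for the denoised adversarial image against those for the clean image. Everything here is a toy stand-in: a random linear-ReLU "feature extractor" replaces the pre-trained CNN, and a shrinking function (which cheats by knowing the clean image) replaces a learned DUNET:

```python
import numpy as np

def guided_denoiser_loss(feature_fn, denoise_fn, x_clean, x_adv):
    """Feature-guided loss: distance between the fixed network's
    high-level features for the denoised adversarial image and for
    the clean image (mean absolute difference here)."""
    f_clean = feature_fn(x_clean)              # target representation
    f_denoised = feature_fn(denoise_fn(x_adv))
    return np.abs(f_denoised - f_clean).mean()

rng = np.random.default_rng(6)
F = rng.normal(size=(16, 8))                   # frozen "feature extractor"
feature_fn = lambda x: np.maximum(F @ x, 0.0)  # ReLU features
x_clean = rng.normal(size=8)
x_adv = x_clean + 0.2 * np.sign(rng.normal(size=8))  # fake perturbation

identity = lambda x: x                              # no denoising
shrink = lambda x: x_clean + 0.1 * (x - x_clean)    # toy "denoiser"

loss_no_denoise = guided_denoiser_loss(feature_fn, identity, x_clean, x_adv)
loss_denoised = guided_denoiser_loss(feature_fn, shrink, x_clean, x_adv)
print(loss_no_denoise, loss_denoised)
```

Swapping `feature_fn` for the logits or the classification output of the guiding network gives the LGD and CGD variants of the same loss.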
The final submission uses an ensemble of four feature guided denoisers (FGD). Each is trained using adversarial samples from 14 known attacks and one of four pre-trained CNNs (ensV3, ensIncResV2, ResNet152, and ResNet101). The final prediction is made by averaging the logits activations of the four networks.
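The logits-averaging step itself is simple; a minimal sketch with made-up logits from three hypothetical ensemble members:

```python
import numpy as np

def ensemble_predict(logits_list):
    """Average the members' logits, then take the argmax class."""
    return int(np.argmax(np.mean(logits_list, axis=0)))

# Three hypothetical members scoring 5 classes (values are made up).
logits = [np.array([2.0, 0.5, 0.1, 0.0, -1.0]),
          np.array([1.5, 2.1, 0.0, 0.2, -0.5]),
          np.array([2.2, 0.4, 0.3, 0.1, -0.9])]
print(ensemble_predict(logits))  # → 0
```

Averaging in logit space, rather than averaging the softmax probabilities, preserves each member's confidence margins before the final normalization.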
Final results from the NIPS competition
Defences:
1st place: High Level Representation Guided Denoiser
2nd place: 92.4% | Mitigating Adversarial Effects Through Randomization
3rd place: 91.5% | MMD

Non-targeted attacks:
1st place: 78.2% | Boosting Adversarial Attacks with Momentum
2nd place: 77.7% | Ensemble Adversarial Training: Attacks and Defenses
3rd place: 77.4%

Targeted attacks:
1st place: 40.2% | Boosting Adversarial Attacks with Momentum
2nd place: 36.9% | Ensemble Adversarial Training: Attacks and Defenses
3rd place: 36.8%
The official results can be found here.