Adversarial Attacks and Defences for Convolutional Neural Networks

Joao Gomes
Onfido Product and Tech
Jan 16, 2018

Recently, it has been shown that excellent results can be achieved in different real-world applications, including self-driving cars, medical image analysis and human face recognition. These breakthroughs are attributed to advances in Deep Neural Networks (DNNs), as well as the availability of huge amounts of data and computational power. Characteristic examples of these breakthroughs are self-driving cars that are so reliable they no longer need human drivers inside as a backup; systems that are better than human experts at detecting cancer metastases; and face recognition software capable of surpassing human performance. But despite these impressive results, the research community has recently shown that DNNs are vulnerable to adversarial attacks.

About Adversarial Attacks

An adversarial attack consists of subtly modifying an original image in such a way that the changes are almost undetectable to the human eye. The modified image is called an adversarial image, and when submitted to a classifier it is misclassified, while the original one is correctly classified. The real-life implications of such attacks can be very serious: for instance, one could modify a traffic sign so that it is misinterpreted by an autonomous vehicle, causing an accident. Another example is the risk of inappropriate or illegal content being modified so that it’s undetectable by the content moderation algorithms used on popular websites or by police web crawlers.

Example attack from Explaining and Harnessing Adversarial Examples.

At Onfido, we are developing state-of-the-art machine learning systems to automate a plethora of different tasks, including identity verification and fraud detection. For that reason, we’re very interested in understanding these attacks and developing our own defences against them. To this end, three members of our research team recently attended the 2017 Conference on Neural Information Processing Systems (NIPS) in Long Beach, which is considered the most prestigious venue in the field of machine learning. This year, the most relevant NIPS event on this topic was the Competition on Adversarial Attacks and Defences organized by Google Brain. Here, I’ll summarize some of the most common attacks and defences, as well as the winning methods in the competition.

Some definitions

An adversarial image is an image that has been slightly modified in order to fool the classifier, i.e., in order to be misclassified. The measure of modification is normally the ℓ∞ norm, which measures the maximum absolute change in a single pixel.
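In symbols, if x is the original image and x_adv the adversarial one, the constraint is ‖x_adv − x‖_∞ = max_i |x_adv,i − x_i| ≤ ε, where the maximum is taken over all pixels i and ε is the maximum allowed perturbation.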

In white box attacks the attacker has access to the model’s parameters, while in black box attacks the attacker has no access to these parameters, i.e., they use a different model, or no model at all, to generate adversarial images in the hope that these will transfer to the target model.

The aim of non-targeted attacks is simply to force the model to misclassify the adversarial image, while in targeted attacks the attacker aims to have the image classified as a specific target class, which is different from the true class.

Common attacks

Most successful attacks are gradient-based methods, i.e., the attacker modifies the image in the direction of the gradient of the loss function with respect to the input image. There are two major approaches to performing such attacks: one-shot attacks, in which the attacker takes a single step in the direction of the gradient, and iterative attacks, in which several steps are taken instead of a single one. Three of the most common attacks are briefly described next. The first two are examples of one-shot attacks, and the last one is an iterative attack.

Fast gradient sign method (FGSM)

This method computes an adversarial image by adding a pixel-wise perturbation of magnitude ε in the direction of the gradient. This perturbation is computed with a single step, and is therefore very efficient in terms of computation time:
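Concretely, the update from Explaining and Harnessing Adversarial Examples is x_adv = x + ε · sign(∇_x J(x, y_true)), where J is the classification loss (e.g. cross-entropy) and y_true is the correct label. Below is a minimal sketch of this step, assuming a PyTorch classifier that maps images in [0, 1] to logits (the function name and interface are illustrative, not from the original post):

import torch
import torch.nn.functional as F

def fgsm(model, x, y_true, eps):
    # Single gradient-sign step of size eps; images are assumed to lie in [0, 1].
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()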

Targeted fast gradient sign method (T-FGSM)

Similarly to FGSM, in this method a single gradient step is computed, but in this case in the direction of the negative gradient with respect to the target class:
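In this case the single step is x_adv = x − ε · sign(∇_x J(x, y_target)), i.e., the step decreases the loss for the chosen target class y_target instead of increasing it for the true class.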

Iterative fast gradient sign method (I-FGSM)

The iterative method takes T gradient steps of magnitude α = ε / T instead of a single step:
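Starting from x_0 = x, each step is x_{t+1} = clip_{x,ε}(x_t + α · sign(∇_x J(x_t, y_true))), where clip_{x,ε} projects the result back into the ε-ball around the original image (and into the valid pixel range).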

Both one-shot methods (FGSM and T-FGSM) have lower success rates when compared to the iterative methods (I-FGSM) in white box attacks, however when it comes to black box attacks the basic single-shot methods turn out to be more effective. The most likely explanation for this is that the iterative methods tend to overfit to a particular model.

Winning attacks at NIPS 2017 competition

Boosting Adversarial Attacks with Momentum (MI-FGSM) was the winning attack in both the non-targeted and targeted adversarial attack competitions.

This method makes use of momentum to improve the performance of the iterative gradient methods, as described in the following algorithm.

Basic momentum algorithm from Boosting Adversarial Attacks with Momentum. This method uses the gradients of the previous t steps, with a decay of µ, together with the gradient of step t+1 in order to update the adversarial image at step t+1.
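In the notation of the previous attacks, the momentum update from the paper is, roughly, g_{t+1} = µ · g_t + ∇_x J(x_t, y) / ‖∇_x J(x_t, y)‖_1 followed by x_{t+1} = x_t + α · sign(g_{t+1}), so that the normalized gradients of all previous steps are accumulated with decay µ.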

The results show that this method outperforms all other methods in the competition and shows good transferability results, i.e., it performs well in black box attacks as seen in the figure below.

Success rate vs number of iterations graph from Boosting Adversarial Attacks with Momentum. All the attacks shown in this graph were generated using the Inc-v3 model. The target models are Inc-v3 itself (white-box attack) and three other networks: Inc-v4, IncRes-v2 and Res-152 (black-box attacks). The results show that the momentum-based approach achieves the same level of success as the basic iterative method in white box attacks, but consistently outperforms it in all black box attacks. It is clear that the black box performance of the basic iterative method decreases very rapidly with the number of iterations, while in the momentum-based method it keeps increasing for a larger number of iterations, and even after that the decrease is very slow.

In order to produce effective attacks against ensemble defence methods, i.e., methods that use a number of different base classification models, a modification of the original algorithm is proposed in which the logits of all the target models are fused before computing the combined cross-entropy loss:
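Roughly, the fused logits are a weighted sum l(x) = Σ_k w_k · l_k(x) of the logits l_k(x) of the K target models (with the weights w_k summing to 1), and the cross-entropy loss used by the momentum update above is computed on softmax(l(x)).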

Algorithm to generate an attack using multiple target models from Boosting Adversarial Attacks with Momentum. The winning solution for the non-targeted attacks used 8 target models (several Inception and ResNet variants, including ensembles and adversarially trained networks). The winning solution for the targeted attacks used two different graphs: one for perturbations smaller than 8, which uses Inc-v3 and IncRes-v2ens, and another that uses 5 networks (several Inception variants, including ensembles and adversarially trained networks).

Common Defences

The most common defence consists of introducing adversarial images, generated using the target model itself, into the training set in order to obtain a more robust network. It has been shown that this approach has some limitations: in particular, this kind of defence is less effective against black-box attacks, in which the adversarial images are generated using a different model, than against white-box attacks. This is due to gradient masking: the defence introduces a perturbation of the gradients, making white box attacks less effective, but the decision boundary remains mostly unchanged after the adversarial training. An alternative approach has been proposed in which the generation of the adversarial examples is decoupled from the parameters of the model being trained. This is achieved by drawing the adversarial samples from pre-trained models; these samples are then added to each batch or used to replace part of the non-adversarial images in the batch, as in the sketch below.
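As a rough sketch of this decoupled approach, and reusing the fgsm helper from above, a training step might replace part of each batch with adversarial images generated on a randomly chosen pre-trained source model rather than on the model being trained (names and default values here are illustrative):

import random
import torch
import torch.nn.functional as F

def adversarial_train_step(model, optimizer, x_clean, y, source_models, eps=0.03, adv_fraction=0.5):
    # Generate the adversarial part of the batch on a pre-trained source model,
    # so the adversarial examples are decoupled from the parameters of `model`.
    n_adv = int(adv_fraction * x_clean.size(0))
    source = random.choice(source_models)
    x_adv = fgsm(source, x_clean[:n_adv], y[:n_adv], eps)
    x = torch.cat([x_adv, x_clean[n_adv:]], dim=0)

    # Standard supervised update of the model being trained on the mixed batch.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()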

Winning defences at NIPS 2017 competition

The high-level representation guided denoiser was the winning submission on the defences track. This solution is built on the observation that, despite adversarial perturbations being quite small at the pixel level, they are amplified throughout the network, producing an adversarial attack. In order to address this, several higher-level denoisers are proposed: a feature guided denoiser (FGD), a logits guided denoiser (LGD) and a class label guided denoiser (CGD). All three methods use a denoising network (DUNET) which is similar to a denoising autoencoder and uses a network structure similar to UNET, i.e., it has direct connections between corresponding layers in the encoder and decoder. A fixed pre-trained convolutional neural network (CNN) is also used to guide the training of the denoiser. The FGD uses the responses of the last feature layer of the CNN (for the original and denoised images) in order to guide the denoiser, the LGD uses the logits activations of the CNN, and the CGD uses the classification output.
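As an illustration of the guiding idea (the exact loss is in the paper), in the LGD case the denoiser D is trained to make the CNN’s logits for the denoised adversarial image match those of the clean image, e.g. by minimizing a distance such as ‖logits(D(x_adv)) − logits(x)‖_1; FGD uses the last feature layer of the CNN in the same way, while CGD uses the classification output instead.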

Illustration of the three different guided denoisers proposed.
Comparison of the perturbations caused by an adversarial image, an image perturbed with random noise, a pixel guided denoised adversarial image (PGD) and a logits guided denoised image (LGD) across the different layers of the CNN. These perturbations are defined as E_l(x_p, x) = |f_l(x_p) − f_l(x)| / |f_l(x)|, where E_l is the perturbation at layer l, f_l is the feature map at layer l, and x_p is the perturbed input. The small pixel-level (layer 0) perturbations obtained with the pixel guided denoiser (PGD) are amplified throughout the network, while the larger input perturbations produced by the logits guided denoiser (LGD) are much less amplified, resulting in much smaller perturbations in the other layers of the network.

The final submission uses an ensemble of four feature guided denoisers (FGD). Each one is trained using adversarial samples from 14 known attacks and one of four pre-trained CNNs (ensV3, ensIncResV2, ResNet152 and ResNet101). The final prediction is made by averaging the logits activations of the four networks.

Final results from the NIPS competition

Defenses

1st place: 95.3% | Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser

2nd place: 92.4% | Mitigating Adversarial Effects Through Randomization

3rd place: 91.5% | MMD

Baseline: 77.3%

Non-targeted attacks

1st place: 78.2% | Boosting Adversarial Attacks with Momentum

2nd place: 77.7% | Ensemble Adversarial Training: Attacks and Defenses

3rd place: 77.4%

Baseline: 34.6%

Targeted attacks

1st place: 40.2% | Boosting Adversarial Attacks with Momentum

2nd place: 36.9% | Ensemble Adversarial Training: Attacks and Defenses

3rd place: 36.8%

Baseline: 20%

The official results can be found here.
