The unusual effectiveness of adversarial attacks

Kirthi Shankar Sivamani
6 min read · Jul 31, 2019


There has been growing security concern in the machine learning community since the advent of deep neural networks. People have questioned the interpretability of neural networks, and naturally, questions have been raised about the security consequences of deploying a deep learning model. Adversarial attacks are methods for finding examples of images or data on which a machine learning model fails completely. Moreover, the output of the network on these adversarial examples can be crafted to be any desired output class. This latter outcome is particularly disturbing.

This post covers the most basic form of adversarial attack, an intuition for why these attacks are so effective, and, more importantly, why they are so hard to defend against.

Advancements in this field have yielded mathematically crafted attacks on deep neural networks that result in misclassification of images. The inputs that cause these failures are called adversarial examples. A very famous example can be seen at the top of the post, where the classifier classifies the stop sign as a clock with extremely high confidence (99%). This is clearly a huge threat to autonomous vehicles, and other fields are not robust to these attacks either.

How is it done?

The discovery of adversarial examples was itself particularly interesting. Researchers at Google were using the CIFAR-10 dataset for an image classification task and tried to transform an image of the truck class into an image of the airplane class. They did so by iteratively changing the pixel values of the truck image: using back-propagation through a pretrained image classifier, they tweaked the pixels of the input image (the truck) so that the classifier would label it as an airplane. By the end of this exercise, they noticed that the classifier was classifying the image as an airplane with pretty high confidence.

They had assumed that the network must have transformed the input image to resemble an airplane for the classifier to label it as one. Seems fairly straightforward, no?

However, this was not the case. The input image still very much looked like a truck. This lucky little experiment gave birth to the idea of adversarial examples and adversarial attacks.
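To make that concrete, here is a minimal sketch of the kind of targeted input optimization described above, written in PyTorch. The names `model` and `truck` are placeholders for any pretrained CIFAR-10 classifier and input image; this illustrates the idea, not the exact code the researchers used.

```python
import torch
import torch.nn.functional as F

def targeted_input_optimization(model, image, target_class, steps=100, lr=0.01):
    """Iteratively tweak the input pixels (not the weights) so the frozen
    classifier assigns the image to `target_class`."""
    model.eval()
    x = image.clone().detach().requires_grad_(True)   # we optimize the pixels
    optimizer = torch.optim.Adam([x], lr=lr)
    target = torch.tensor([target_class])

    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x.unsqueeze(0))                # forward pass through the frozen network
        loss = F.cross_entropy(logits, target)        # low loss means "this is an airplane"
        loss.backward()                               # gradients flow to the pixels, weights untouched
        optimizer.step()
        x.data.clamp_(0.0, 1.0)                       # keep pixel values in a valid range

    return x.detach()

# Hypothetical usage: `model` is any pretrained CIFAR-10 classifier and
# `truck` is a 3x32x32 tensor in [0, 1]; class 0 is "airplane" in CIFAR-10.
# adversarial_truck = targeted_input_optimization(model, truck, target_class=0)
```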

What makes these examples adversarial?

The adversarial part of these attacks is that the predicted class looks nothing like the actual content of the image. The tiny changes made to the input image are imperceptible to humans, which is what makes them adversarial in nature.

The image above shows an example of an adversarial attack, wherein adding a little bit of noise completely changes the predicted class of the image. A pig that was correctly classified as a pig by the classifier is classified as an airliner after the attack. The added noise looks random to you and me, but it is in fact carefully constructed to cause the network to classify the image as an ‘airliner’.

This type of attack, wherein the attacker optimizes a single image directly using the weights and gradients of the neural network, is called a white box attack. White box here signifies the attacker’s complete, open access to the neural network. These attacks are extremely hard to prevent and drive the accuracy of state-of-the-art image classifiers down to an absolute 0% on adversarial examples. More shocking, however, is what follows:
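For concreteness, here is a sketch of one of the simplest white box attacks, the fast gradient sign method (FGSM): a single signed gradient step on the pixels, bounded by a small per-pixel budget epsilon. `model`, `image` and `true_label` are hypothetical placeholders, not part of any specific library API.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, epsilon=8/255):
    """Fast gradient sign method: one signed gradient step on the pixels,
    bounded by `epsilon` per pixel so the change stays imperceptible."""
    model.eval()
    x = image.clone().detach().requires_grad_(True)
    logits = model(x.unsqueeze(0))
    loss = F.cross_entropy(logits, torch.tensor([true_label]))
    loss.backward()                                   # white box: we can read the gradients

    # Move each pixel by at most epsilon in the direction that hurts the model most.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Despite its simplicity, a single step like this is often enough to flip the prediction of an undefended classifier.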

These attacks have been shown to be extremely effective even under less open, not-so-white-box settings. Black box adversarial attacks encompass a set of methods wherein the attacker does not have access to the network’s parameters. In such cases, attackers train their own image classification network (or any machine learning model) and construct adversarial examples against it. These adversarial examples then transfer to the unknown network with a very high success rate. The methods for doing so are fairly advanced but extremely interesting; I won’t go into too much detail in this post, but I encourage you to read more about them here. What I want to discuss is why these attacks are so effective and what makes them so hard to defend against.
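The transfer idea can be sketched in a few lines, reusing the hypothetical `fgsm_attack` from the previous snippet: craft the example on a surrogate model the attacker fully controls, then hand it to the victim model whose internals stay hidden.

```python
import torch

def transfer_attack(surrogate, victim, image, true_label):
    """Black box setting: craft the example on a surrogate we fully control,
    then query the victim model whose weights we never see."""
    x_adv = fgsm_attack(surrogate, image, true_label)    # white box step, but only on the surrogate
    with torch.no_grad():
        victim_pred = victim(x_adv.unsqueeze(0)).argmax(dim=1).item()
    return x_adv, victim_pred    # a surprising fraction of these examples fool the victim too
```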

The intuition

There have been a relatively large number of defenses proposed for tackling the problem of adversarial attacks. However, any new defense mechanism that is proposed seems to be broken by a more advanced form of attack the following year. A good summary of CVPR papers over the years relating to adversarial attacks and defenses is provided in this repository. See for yourself!

There are a few reasons why this might happen:

  1. The attacker always has the edge. In this game of attacking and defending a machine learning model, it’s the defense that makes the first move. People come up with new adversarial defenses and equip their networks with them. But the black box attacker doesn’t care! For the attacker, the defense is just another part of the black box, one that can be broken by a smarter optimization technique or more compute. Moreover, attacks can be created specifically to break certain defense mechanisms, whereas a defense has to fend off all types of attacks (and there are many) to be truly robust.
  2. Attacking is much easier: Optimizing an input image is much easier than training a neural network. It is natural to assume that finding such a specific adversarial example in such a high-dimensional space (the number of pixels in an image) would be hard. But with back-propagation, this task is fairly trivial. This is because the output of a network is linear with respect to its input when using the ReLU activation function. “But…but…neural networks are highly nonlinear, that’s what makes them so effective…” Yes, but the non-linearity of neural networks is with respect to their parameters, not their inputs (again, for ReLU). With respect to the inputs, the neural network is just a piece-wise linear function: each pixel gets multiplied by a number (weight), a number is added to it (bias), and the max of the result and 0 is retained, over and over again. That is perfectly piece-wise linear. This linearity makes it easy to optimize the input space (the image) to generate adversarial examples (see the sketch after this list). Moreover, the small change made to each pixel makes no visual difference (thus, adversarial), but summed across thousands of pixels it shifts the network’s activations drastically, enough for the network to misclassify the image.
  3. The third, and one of the main, reasons is that it is extremely hard to construct a theoretical model of the process of attacking a neural network. To solve any problem, it is very important to have a good problem statement and a theoretical model of what causes the problem. A problem statement such as ‘making neural networks robust to all adversarial attacks’ sounds appealing but is extremely vague. As of now, we do not have a settled model of attacks; they come in all sorts and forms, and formulating defensive strategies against a few of them at random is not a good approach. Current defenses propose a method to shut down one particular attack but, in the process, leave the model vulnerable to ten more. It is very important to first construct a theoretical model that encompasses all adversarial attacks, i.e., one within which every attack can be explained. This model has to be centered on the attacking procedure in general, not on specific forms of attack, so that it also covers new attacks developed in the future.
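To make the linearity point in (2) concrete, here is a small sketch with a toy ReLU network: as long as a perturbation is small enough not to flip any ReLU on or off, the network’s output is exactly its first-order (gradient-based) prediction, which is what makes optimizing in pixel space so easy. The network and numbers here are purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy ReLU network: nonlinear in its weights, piece-wise linear in its input.
net = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))

x = torch.randn(100, requires_grad=True)
y = net(x)

# Local linear model of output logit 0 as a function of the input pixels.
grad = torch.autograd.grad(y[0], x)[0]

delta = 1e-3 * torch.randn(100)                 # tiny perturbation, stays in the same activation region
linear_prediction = (y[0] + grad @ delta).item()
actual = net(x + delta)[0].item()

# The two numbers agree (up to float error) whenever no ReLU flips,
# i.e. the network really is locally linear in its input.
print(linear_prediction, actual)
```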

Final remarks

Overall, as of today, defending neural networks against adversarial attacks is a hard problem. The linearity between inputs and outputs, the mostly non-interpretable parameters, and the vast pool of adversarial attacks are the major reasons behind this.

Currently, the attacking community holds a significant advantage over the defending community in the field of adversarial examples. Once we find the unified logic behind all adversarial attacks, the one thing that makes them all work so effectively, we can think about creating a truly robust adversarial defense. This might take some time, but it can be done. Inspiration can be taken from fields such as differential privacy and cryptography, where the defender holds the upper ground.

Thank you!

Resources

  1. https://www.youtube.com/watch?v=CIfsB_EYsVI
  2. https://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html
  3. http://www.cleverhans.io/security/privacy/ml/2017/02/15/why-attacking-machine-learning-is-easier-than-defending-it.html
  4. https://nicholas.carlini.com/writing/2018/adversarial-machine-learning-reading-list.html
