Why Adversarial Attacks are Important on the Path to Human-like AI

Niranjan Rajesh · Published in Bits and Neurons · Jul 8, 2024

Deep learning based algorithms have been at the forefront of vision tasks for over a decade now. These large artificial neural networks (CNNs and Vision Transformers) are able to achieve human-level performance in visual tasks like image classification and object detection. Despite their apparent successes in the domain, evidence of their shortcomings is not lacking. One such weakness that has garnered significant attention in the deep learning community is the brittleness of these networks. Tiny perturbations, or changes to the pixel values, of an input image can cause unpredictable and undesired behaviour in these networks. In this article, I will outline the problem of adversarial attacks, why they arise, and why they are relevant in furthering the state of Artificial Visual Intelligence.

An Introduction to Adversarial Attacks

That's right, tiny changes to an input image's pixel values can cause a trained and competent Convolutional Neural Network (CNN) to completely misclassify the image. How tiny, you might ask? As tiny as a single pixel!

The phenomenon of adversarial attacks was popularised by the aptly-named 2014 paper Intriguing Properties of Neural Networks by Christian Szegedy, Ian Goodfellow and colleagues. These attacks involve the addition of non-random noise to an image such that a CNN that could successfully classify the original image is now fooled. A famous example that credits Deep Learning with 'getting pigs to fly' is displayed below.

An input image of a pig being adversarially perturbed such that a CNN is fooled into predicting it as an airliner

To the human eye, the image on the right is still, clearly, of a pig. However, a state-of-the-art CNN sees an airplane. What is going on here?

Some insight into this phenomenon can be gained by going over how such adversarial examples are constructed. It is important to note that these examples are, in fact, constructed and not random or naturally occurring. Nevertheless, they expose a glaring weakness in machine vision today.

How do you make an adversarial attack?

Before we dive into making adversarial attacks, how do you teach a CNN to classify images? The teaching is, in its basic form, an optimisation problem. Given many [image, label] data points, we want our CNN to get as many predicted labels of images right as it can. This is where a loss function comes in. This special function takes the CNN’s predicted label and the true label from our dataset and tells us ‘how wrong’ the prediction is. We, of course, want to keep this loss value as low as possible. So, the optimisation problem we have at hand can be summarised in the following equation where we find the CNN parameters (θ) that minimise our loss function for all N images and labels in our dataset. Yes, ML is that simple!

A simplified formulation for training a CNN by minimising loss
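Written out explicitly (this is a reconstruction of the formula the caption above refers to, with L denoting the loss function and f_θ the CNN), the training objective is roughly:

```latex
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_{\theta}(x_i),\, y_i\big)
```

i.e. find the parameters θ that make the average loss over all N [image, label] pairs as small as possible.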

We can think of crafting an adversarial attack in a similar way. Since our intentions have now 'reversed', we just need to go in the opposite direction to fool the model: increase the loss! Given an input image, we add a perturbation (δ) to it such that the loss function is maximised. The resulting δ needs to be bounded such that the perturbed image still looks the same to the human eye.

A simplified formulation for generating an adversarial example by maximising loss
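Again as a reconstruction of the formula above: with the trained parameters θ now held fixed, and for a given image x with true label y, we look for the bounded perturbation δ that pushes the loss up as far as possible:

```latex
\max_{\delta \,:\, \|\delta\| \le \varepsilon} \; \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big)
```

where ε is a small budget that keeps x + δ visually indistinguishable from x.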

This is the foundational idea behind constructing an adversarial attack when we are given an input image and the CNN's parameters. More detail on how exactly these perturbations are computed from the model's parameters and gradients can be found in a paper by a few of the same authors, Explaining and Harnessing Adversarial Examples.
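To make this concrete, below is a minimal sketch of the Fast Gradient Sign Method (FGSM) introduced in that paper. It assumes a trained PyTorch classifier model that outputs logits, an input batch image with integer class labels label, and a hand-picked perturbation budget epsilon; a stronger attack would iterate this step and project back into the budget.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=8 / 255):
    """One-step FGSM: nudge every pixel in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # 'how wrong' the prediction is
    loss.backward()                              # gradient of the loss w.r.t. the pixels
    # Step *up* the loss surface; sign() keeps the per-pixel change within epsilon
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()        # keep pixel values in a valid range
```

The perturbed image usually looks identical to the original to us, yet the model's prediction for it often changes completely.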

Why are networks prone to Adversarial Attacks?

Now that we are more comfortable with these attacks, we can start to think about why they work. How is it that a state-of-the-art CNN with superhuman classification accuracy on natural images is so easily fooled by such a minute change to the image, especially when the change itself is imperceptible to humans?

Researchers at the Madry Lab propose that adversarial examples, like the one above, are an unavoidable consequence of the way we train CNNs. In their paper, Adversarial Examples Are Not Bugs, They Are Features, they claim that images consist of visual features that can be separated into two kinds: robust and non-robust. Robust features can be thought of as human-centric features, i.e. components of an image that humans use to classify it (e.g. the snout and ears of a dog or the whiskers and tail of a cat). On the other hand, non-robust features, or model-centric features, are imperceptible to humans but perceptible to, and useful for, CNNs when classifying. These features are typically statistical correlations between pixels and the label of an image that are meaningless to humans.

Since a CNN is trained over numerous iterations with the sole objective of minimising the loss function, it learns every feature it can from each image. CNNs thus pick up both robust and non-robust features, as both sets of features are correlated with, and have predictive power over, the label of each image.

An illustration of the distinction between robust and non-robust features in an image

The paper claims that these non-robust features are 'flipped' (within a constraint like an l2 ball to ensure imperceptibility) during an adversarial attack. In other words, an adversarial perturbation changes the non-robust features in a dog image from those of a dog to those of a cat while leaving the robust features untouched. The image still looks like a dog to us, but since a CNN sees both sets of features, it gets confused and often misclassifies the image.

The existence and predictive power of non-robust features were verified by an experiment from the same paper. A new dataset was constructed containing only adversarial examples, under an important constraint: all images of classA were perturbed towards classB, classB towards classC, and so on. This meant that an image originally of classA now had the robust features of classA and the non-robust features of classB. Then, all the labels of the dataset were changed so that they matched the non-robust features rather than the robust ones (so the dataset looks mislabelled to us). An illustration of this new dataset is pictured below.

When a new CNN was trained only on this constructed dataset (where only the non-robust features are relevant to the label), it still managed to achieve good accuracy on the original test set! This implies that when a model is forced to learn only the non-robust features (as the robust features no longer correctly correlate with the labels), it still gains predictive power over the original data. The new CNN was classifying based on the non-robust features alone!

A visualisation of the non-robust feature disentanglement experiment from the Madry paper
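To make the construction concrete, here is a rough sketch of how such a dataset could be assembled. The targeted_attack helper is hypothetical and stands in for a targeted, norm-bounded attack of the kind the paper uses; model is the pretrained classifier and dataset yields (image, label) pairs.

```python
import torch

def build_nonrobust_dataset(model, dataset, num_classes, targeted_attack):
    """Perturb every image towards the 'next' class, then relabel it to that class."""
    images, labels = [], []
    for image, label in dataset:
        target = (int(label) + 1) % num_classes      # classA -> classB, classB -> classC, ...
        adv = targeted_attack(model, image, target)  # robust features still say `label`,
                                                     # non-robust features now say `target`
        images.append(adv)
        labels.append(target)                        # label matches only the non-robust features
    return torch.stack(images), torch.tensor(labels)
```

A fresh CNN trained on the output of this function never sees an image whose label looks correct to humans, which is what makes its accuracy on the original test set so striking.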

Beyond the Madry Lab's distinction between robust and non-robust features, there are a few other (less convincing) explanations for the existence of adversarial attacks. One such explanation, from Adi Shamir (yes, the Shamir of RSA) and colleagues, is the dimpled manifold model. This work and other hypotheses credit the occurrence of adversarial attacks to the properties of the high-dimensional spaces/manifolds spanned by the activations (outputs) of each neuron in the CNN.

The significance of Adversarial Attacks

Now, what is the significance of all this? Why bother investigating an apparent ‘feature’ of deep neural networks?

Naively, the very fact that these networks are misclassifying images that look like pigs (to us) as airplanes suggests that there is a significant gap in the way humans and machines process visual information.

From Madry's work, we know that machines seem to prioritise learning non-robust features alongside the visual features that we consider important. To these networks, the non-robust features help solve the optimisation problem they were given: they help minimise the loss function. However, if we want CNNs to behave predictably and reliably, we need to align their classifying behaviour with that of humans. Unreliable CNNs cannot be integrated into systems where safety and security are paramount, like self-driving cars, defence or healthcare.

In my opinion, the problem of Adversarial Attacks feels more like a diagnostic tool than a traditional obstacle on the way to artificial general intelligence. To bridge the gap between human and artificial visual intelligence, the internal mechanisms of both types of networks need to be studied to understand why the latter is vulnerable to such trivial perturbations whereas the former isn’t. New networks need to be designed with biologically-inspired priors so that the non-robust features are not favoured during the training process. This might be a path towards the next generation of Artificial Visual Intelligence.

By investigating the source of adversarial vulnerability in CNNs and effective solutions to it, we may just find ourselves one step closer to human-like AI.

Niranjan Rajesh · Bits and Neurons

Hey! I am a student at Ashoka interested in the intersection of computation and cognition. I write my thoughts on cool concepts and papers from this field.