What are Adversarial Examples?

Karan Kashyap
Published in Analytics Vidhya · Aug 8, 2020

Source: UCDavis Department of Statistics

In recent times, Machine Learning (a subset of Artificial Intelligence) has been at the forefront of technological advancement, and it appears to be a strong contender for the tool that could catapult human abilities and efficiency to the next level.

While Machine Learning is the term that is commonly used, it is a rather large subset within the realm of AI. Most of the best machine learning-based systems in use today actually belong to a subset of Machine Learning known as Deep Learning. The term Deep Learning refers to a Machine Learning approach that aims to mimic the functioning of the human brain to some extent, giving machines the ability to perform certain tasks that humans can, such as object detection, object classification and much more. The Deep Learning models used to achieve this are known as Neural Networks (since they try to replicate the functioning of the neural connections in the brain).

Just like any other software, however, Neural Networks come with their own set of vulnerabilities, and it is important for us to acknowledge them so that the ethical considerations they raise can be kept in mind as further work in the field is carried out. In recent times, the vulnerability that has gained the most prominence is known as Adversarial Examples. This article aims to shed some light on the nature of Adversarial Examples and some of the ethical concerns that arise in the development of deep learning products as a result of these vulnerabilities.

What are Adversarial Examples?

The “regular” computer systems that most of us are familiar with can be attacked by hackers, and in the same way, Adversarial Examples can be thought of as a way of “attacking” a deep learning model. The concept of Adversarial Examples is best explained using the example of an Image Classification Neural Network. Image classification networks learn features of images from the training dataset and are later able to identify what is present in a new image that they have not seen before. Researchers have found that it is possible to apply a “perturbation” to an image in such a way that the change is too small to be noticed by the human eye, yet it completely changes the prediction made by the Machine Learning model.

The most famous example is an Adversarial Example generated for the GoogLeNet model (Szegedy et al., 2014) trained on the ImageNet dataset.

Source: Explaining and Harnessing Adversarial Examples by I. J. Goodfellow, J. Shlens & C. Szegedy

As can be seen in the image above, the GoogLeNet model predicted that the initial image was a Panda with a confidence of 57.7%. However, after adding the slight perturbation, even though there is no apparent visual change in the image, the model now classifies it as a Gibbon with a confidence of 99.3%.

The perturbation added above might appear to be a random assortment of pixels; in reality, each pixel in the perturbation has a value (represented as a color) that is calculated using a mathematical algorithm, typically one based on the gradients of the model's loss. Adversarial Examples are not limited to image classification models; they can also be crafted for audio and other types of input, but the underlying principle remains the same as what has been explained above.
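
The paper cited above introduces one such algorithm, the Fast Gradient Sign Method (FGSM): take the gradient of the model's loss with respect to the input pixels, keep only its sign, and scale it by a tiny constant ε. Below is a minimal PyTorch sketch of that idea; the classifier `model`, the batched `image` tensor and its true `label` are assumed to already exist, and ε = 0.007 is simply the small value quoted in the paper's panda example.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.007):
    """One-step Fast Gradient Sign Method:
    x_adv = x + epsilon * sign(gradient of the loss w.r.t. x)."""
    image = image.clone().detach().requires_grad_(True)

    # Forward pass and loss with respect to the true label
    loss = F.cross_entropy(model(image), label)

    # Gradient of the loss with respect to the input pixels
    loss.backward()

    # Nudge every pixel a tiny step in the direction that increases the loss
    perturbation = epsilon * image.grad.sign()
    adv_image = (image + perturbation).clamp(0.0, 1.0)  # stay in the valid pixel range
    return adv_image.detach(), perturbation.detach()
```

Roughly speaking, the middle panel of the panda figure above is a visualization of this kind of perturbation, rescaled so that it is visible at all.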

There are many different algorithms with varying degrees of success on different types of models, and implementations of many of them can be found in the CleverHans library (Papernot et al.).
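
As a rough, hedged illustration of how such a library can be used (the import paths below are from the CleverHans 4.x PyTorch interface and may differ in other versions; the trained classifier `model` and the input batch `x` are assumed to already exist):

```python
import torch
# Import paths from the CleverHans 4.x PyTorch interface; older releases differ.
from cleverhans.torch.attacks.fast_gradient_method import fast_gradient_method
from cleverhans.torch.attacks.projected_gradient_descent import projected_gradient_descent

model.eval()  # a trained PyTorch classifier, assumed to exist

# Single-step FGSM attack with an L-infinity perturbation budget of 0.03
x_fgsm = fast_gradient_method(model, x, eps=0.03, norm=float("inf"))

# Iterative PGD attack: many small steps, each projected back into the budget
x_pgd = projected_gradient_descent(model, x, eps=0.03, eps_iter=0.005,
                                   nb_iter=40, norm=float("inf"))

# Compare predictions (assuming a single-image batch)
print("clean:", model(x).argmax(1).item())
print("fgsm: ", model(x_fgsm).argmax(1).item())
print("pgd:  ", model(x_pgd).argmax(1).item())
```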

Generally, Adversarial attacks can be classified into one of two types:

  1. Targeted Adversarial Attack
  2. Untargeted Adversarial Attack

Targeted Adversarial Attack

A targeted Adversarial Attack is an attack in which the aim of the perturbation is to cause the model to predict a specific wrong class.

The image on the left shows that the original image was classified correctly as a Tabby Cat. As part of the Targeted Attack that was conducted, the attacker decided that they would like the image to be classified as guacamole instead. Thus, the perturbation was created in such a manner that it would force the model to predict the perturbed image as guacamole and nothing else (i.e. Guacamole was the target class).

Untargeted Adversarial Attack

As opposed to a Targeted Attack, an Untargeted Adversarial Attack involves generating a perturbation that will cause the model to predict the image as something that it is not. However, the attacker doesn't explicitly choose what the wrong prediction should be.

An intuitive way to think about the difference is that a Targeted Attack aims to generate a perturbation that maximizes the probability of a specific incorrect class chosen by the attacker (i.e. the target class), whereas an Untargeted Attack aims to generate a perturbation that minimizes the probability of the actual class to such an extent that the probability of some other class, whichever it may be, becomes greater than that of the actual class.
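
In gradient-based terms, using FGSM-style steps as a hedged sketch (`model`, `image` and the label indices such as `true_label` and `guacamole_label` are assumed placeholders), the only real difference is which label the loss is computed against and which way the step points:

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, image, label, epsilon, targeted):
    """One FGSM-style step, in either the targeted or untargeted direction."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    if targeted:
        # Targeted: move *down* the loss for the attacker's chosen class,
        # pushing the prediction towards that specific class.
        step = -epsilon * image.grad.sign()
    else:
        # Untargeted: move *up* the loss for the true class,
        # pushing the prediction away from it (any other class may win).
        step = epsilon * image.grad.sign()
    return (image + step).clamp(0.0, 1.0).detach()

# Untargeted: make "tabby cat" (the true label) less likely than something else.
adv_untargeted = fgsm_step(model, image, true_label, 0.01, targeted=False)

# Targeted: make "guacamole" (the attacker's chosen label) the most likely class.
adv_targeted = fgsm_step(model, image, guacamole_label, 0.01, targeted=True)
```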

Ethical Concerns

As you read this, you might begin to think of some of the ethical concerns arising from Adversarial Examples; however, the true magnitude of these concerns only becomes apparent when we consider a real-world example, such as the development of self-driving cars. Self-driving cars tend to use some kind of deep learning framework to identify road signs, which helps the car act on the basis of those signs. It turns out that by making minor physical alterations to road signs, they too can serve as Adversarial Examples (it is possible to generate Adversarial Examples in the real world as well). In such a situation, one could modify a Stop sign in such a manner that cars would interpret it as a Turn Left sign, and this could have disastrous effects.

An example of this can be seen in the image below, where a physical change made to the Stop sign causes it to be interpreted as a Speed Limit sign.

Source: BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain by T. Gu, B. Dolan-Gavitt & S. Garg

It is with good reason, then, that people see ethical concerns in the development of technologies like self-driving cars. While this should in no way serve as an impediment to the development of such technologies, it should make us wary of the vulnerabilities in Deep Learning models. We must ensure that further research is conducted into securing models against such attacks so that advanced technologies that rely on deep learning (like self-driving cars) become safe for use in production.
