Adversarial machine learning: when artificial intelligence goes wrong

Paul Lezeau
Oct 14, 2022


Introduction

In recent years, headlines of the form “Artificial intelligence surpasses humans at _______” have appeared with increasing frequency. While one would be hard pressed to deny the huge advances made in the area, this short article seeks to balance such claims by delving into some weaknesses of deep learning models.

As artificial intelligence (AI) takes an increasingly prevalent place in various parts of society, the need to ensure that it is safe and robust grows ever stronger. By robust, we mean the capacity to perform well under unforeseen circumstances, which are of course inevitable in everyday life. It is not enough that a self-driving car performs well in a well-controlled environment: it must also be capable of facing the chaos of streets in modern cities, and the myriad of jumbled information that comes with it. Equally pressing is the need for AI-powered systems used in sensitive applications to be resilient to potentially malicious encounters. In other words, we need our AIs to be foolproof. This concern is of course well known to the AI community, and is currently at the centre of attention for many researchers under the name of adversarial robustness.

Robustness in ML

Part of the impetus behind current research in adversarial robustness is that many “standard” machine learning models are highly vulnerable to adversarial attacks. The term adversarial attack can be understood as an intentional attempt to fool an AI into outputting erroneous results. Of course, adversarial attacks come in all shapes and forms, and we won’t attempt to list them exhaustively. Rather, we aim to give the reader an idea of what kinds of threats exist through some examples and explain their relevance to applications of machine learning (ML), with an emphasis on deep learning.

Data poisoning

Data occupies a central place in deep learning, and its quality determines to a large extent how well a model can perform. This means that a rather obvious approach to fooling someone’s model would be to tamper with their training data. Such tampering could be done in a variety of different ways, depending on the intent of the attacker (whether the attacker is trying to affect the model’s performance in general or only on a specific type of input: does the attacker want the model to misclassify certain inputs, or do they want to render the classifier useless?), their knowledge of the model, etc. In certain cases, even very small changes to a dataset can be enough to significantly alter the behaviour of a model: some researchers [1] were able to cause systematic misclassification of certain images by adding small, barely visible perturbations (bounded by 8 intensity levels per pixel) to just 0.1% of the training dataset!

This type of attack is known as data poisoning, and it is particularly threatening in cases where users can have some influence on the dataset feeding a model, such as recommender systems. Creating a horde of fake Twitter accounts that post in bulk to push certain hashtags up the trending list is an instance of data poisoning, just as reporting large quantities of non-spam email as spam is. Mitigating the threat posed by data poisoning is essential, since this type of attack can be both dangerous, leading models to misbehave, and costly to fix, especially for large models that are extremely expensive to train (roughly $12M for GPT-3!).
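
To make this concrete, here is a toy sketch of a much cruder form of poisoning than the gradient-matching attack of [1]: flipping the labels of a random fraction of the training set and measuring how a simple classifier’s test accuracy degrades as that fraction grows. The dataset, model, and numbers below are illustrative only.

```python
# Toy data poisoning via label flipping (far cruder than the attack in [1]).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A synthetic binary classification problem standing in for "someone's dataset".
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_after_poisoning(flip_fraction):
    """Flip the labels of a random fraction of the training set, retrain, evaluate."""
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip 0 <-> 1
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in [0.0, 0.05, 0.2, 0.4]:
    print(f"flipped {frac:.0%} of labels -> test accuracy {accuracy_after_poisoning(frac):.3f}")
```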

Evasion

As we have just seen, tampering with the training of a model can pose a credible threat to its performance, making the security of this process a matter of great importance. Supposing we are given a model that has been trained carefully (on secure data), the next logical step would be to wonder whether it is still possible to fool it: after all, well-trained models have been known to attain very high accuracy on their respective tasks. Unfortunately (or fortunately for people trying to make a career in adversarial ML), the answer turns out to be yes.

This is a well-known fact in the spammer community: removing, replacing, or altering “spammy” keywords in an email (e.g., replacing “Nigerian prince” with “Nigeri@n pr1nce”) can be enough to fool a detector into classifying it as non-spam.
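
As a toy illustration (a hypothetical blacklist filter, far simpler than any real spam detector), the sketch below shows how such a keyword-based classifier can be evaded by nothing more than character substitutions:

```python
# Toy keyword-based "spam filter" and a trivial evasion by character substitution.
# Purely illustrative: real detectors are far more sophisticated than this.
SPAMMY_PHRASES = ["nigerian prince", "free money", "act now"]

def naive_spam_filter(email: str) -> bool:
    """Flag the email as spam if it contains any blacklisted phrase verbatim."""
    text = email.lower()
    return any(phrase in text for phrase in SPAMMY_PHRASES)

original = "I am a Nigerian prince and I would like to offer you free money."
evaded = "I am a Nigeri@n pr1nce and I would like to offer you fr3e m0ney."

print(naive_spam_filter(original))  # True  -- caught
print(naive_spam_filter(evaded))    # False -- the same message slips through
```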

More worrying, however, is the possibility of creating so-called adversarial examples: inputs specifically designed to fool a model into giving the wrong output, which one could think of as the machine learning equivalent of optical illusions. One of the most famous instances of this is fooling a CNN used for image classification. To be concrete, let’s consider a CNN trained on the MNIST dataset to classify pictures of numerals. The gist of the issue is that we could take any picture of a numeral, say a 5, and cunningly modify it in a way that doesn’t change its class (i.e., the picture still shows a 5) but fools the model into classifying it as a 9 (or some other numeral: with more effort, one could actually choose which digit the modified image will be classified as), with very high confidence.

[Figures: the original picture and the adversarial example generated from it]

For instance, the above adversarial example, generated using code found here, was classified as a 3 with 98% confidence when tested on a variant of the famous LeNet5 classifier. If you would like to learn where these examples come from and how to generate them, check out this tutorial.
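
To give a flavour of how such examples can be crafted (the linked tutorial is not reproduced here), the fast gradient sign method (FGSM) is one of the simplest recipes: nudge every pixel a small step in the direction that increases the classifier’s loss. Below is a minimal PyTorch sketch, assuming a trained MNIST classifier model that outputs logits, an input image with pixel values in [0, 1], and its true label; all of these names are hypothetical placeholders, not the code linked above.

```python
# Fast Gradient Sign Method (FGSM), sketched for a PyTorch MNIST classifier.
# `model`, `image` (shape [1, 1, 28, 28], values in [0, 1]) and `label` (shape [1])
# are assumed to exist already; they are placeholders, not the article's code.
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.25):
    """Take one signed-gradient step that increases the classifier's loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    # Clamp back to the valid pixel range so the result is still a plausible image.
    return adversarial.clamp(0.0, 1.0).detach()

# adversarial = fgsm_attack(model, image, label)
# model(adversarial).argmax(dim=1)  # often no longer the true digit
```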

Of course, figuring out how to modify the picture in this way isn’t obvious and requires some thought, but the existence of such a flaw is, as discussed earlier, highly problematic. Worse still, moving to a more complex model (e.g., one with more hidden layers) doesn’t change much, nor does the problem disappear when we train the model on a larger dataset or consider more complex inputs (e.g., higher resolution pictures). In fact, recent research has shown that it is possible to create such machine-fooling inputs for various other types of data, such as audio. As with data poisoning, there are many ways to craft adversarial examples, depending on the intentions of the attacker (e.g., do they simply want the model to misclassify a certain input, or do they want more control over which class the modified input will be given?), their knowledge of the model, etc.

Adversarial examples are an instance of evasion attacks, which aim to fool an already trained model during the decision phase. To emphasize a point made earlier, it’s worth noting that data poisoning and evasion attacks exploit vulnerabilities at different stages of the machine learning workflow: the first interferes with the training phase, while the second aims to disrupt the model’s outputs once it has been deployed. Of course, other types of attacks exist (for example, another class we will not discuss here is that of so-called exploratory attacks, which aim to gain information about the model or the data used to train it), making the task of designing robust approaches to ML rather hard.

Conclusion: towards more robust models

The discovery of flaws such as those described above has initiated a large effort to understand and mitigate these new threats. Over the past few years, many approaches to building more robust models have been proposed, with the caveat that, to this day, each solution seems to have drawbacks of its own, often introducing new vulnerabilities that could be exploited.
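
One of the most widely studied of these approaches is adversarial training: generating adversarial examples on the fly during training and fitting the model on them. The sketch below is a minimal PyTorch version built around the same FGSM step as earlier; a classifier model, a train_loader, and an optimizer are assumed to exist (hypothetical names), and it illustrates the general idea rather than any particular paper’s recipe.

```python
# Sketch of adversarial training with FGSM-perturbed batches.
# One standard defence, not a silver bullet; `model`, `train_loader` and
# `optimizer` are assumed to exist and are hypothetical names.
import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=0.25):
    model.train()
    for images, labels in train_loader:
        # 1. Craft adversarial versions of the batch against the current model
        #    (pixel values assumed to lie in [0, 1]).
        images_adv = images.clone().detach().requires_grad_(True)
        F.cross_entropy(model(images_adv), labels).backward()
        images_adv = (images_adv + epsilon * images_adv.grad.sign()).clamp(0, 1).detach()
        # 2. Take an ordinary training step on the perturbed inputs
        #    (many variants mix clean and adversarial batches).
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images_adv), labels)
        loss.backward()
        optimizer.step()
```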

An interesting aspect of research in adversarial robustness is that it provides some insight into how neural networks learn and make predictions: understanding the sources of failure is a great way to gain a deeper understanding of the inner workings of deep learning models. For instance, some recent papers (e.g., [2, 3]) distinguish between robust and non-robust features learnt by deep learning models, with the latter being the ones exploited by adversarial examples. This distinction has interesting consequences for the interpretability of models: robust features tend to be better aligned with human perception [2]. At a more fundamental level, mathematical analysis of these issues indicates that there is a certain trade-off between the accuracy and the robustness of a model: making a model more robust has a cost in terms of its accuracy.

The recent discoveries in adversarial robustness challenge simplistic views of how well our current deep learning techniques perform, and of their ability to grasp and model our world. A sceptical observer of current AI developments might interpret this as proof that, despite all its wonderful achievements, deep learning is nothing but a giant with feet of clay. To balance such a pessimistic outlook, it is worth remembering that in matters of science, challenges have often been at the source of progress. It is when cracks appear in our theories that new ideas can seep in. In the famous words of Thomas Kuhn, “In science, […] novelty emerges only with difficulty, manifested by resistance, against a background provided by expectation.”

Check out this link to learn how to generate adversarial examples.

Bibliography

[1] Witches’ Brew: Industrial Scale Data Poisoning via Gradient Matching, https://arxiv.org/abs/2009.02276

[2] There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits), https://arxiv.org/abs/1805.12152v1

[3] Adversarial Examples Are Not Bugs, They Are Features, https://arxiv.org/abs/1905.02175
