Some Thoughts on Adversarial Attacks in Machine Learning

Heinrich Peters
2 min read · Jun 19, 2023


An adversarial attack in machine learning is a process where carefully crafted "noise," which may appear random to a human, is added to the input data to mislead the model and cause it to make incorrect predictions.

While these changes to the input data are often imperceptible to humans, they can drastically affect the performance of machine learning models. Needless to say, this can have devastating effects in high-stakes settings.
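One of the simplest ways such perturbations are generated is the Fast Gradient Sign Method (FGSM), which nudges the input in the direction that most increases the model's loss. The sketch below is a minimal PyTorch illustration, assuming a trained classifier `model`, an input batch `x`, and true labels `y`; the epsilon value is just a placeholder.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.03):
    """FGSM: add a small, worst-case perturbation in the direction
    of the sign of the loss gradient with respect to the input."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clip to a valid pixel range
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Even with a tiny epsilon, the resulting images usually look unchanged to a person while the model's predictions can flip.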

So how can we detect adversarial attacks? Can we design machine learning models that are resistant to such attacks? How can we balance the trade-off between robustness to adversarial attacks and model performance?

One approach is adversarial training, which involves training models on adversarial examples to improve their robustness. However, this often comes at the cost of model performance on non-adversarial data.
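As a rough sketch of that idea, the training step below (reusing the hypothetical `fgsm_perturb` helper above) mixes the loss on clean and perturbed examples; the `adv_weight` parameter is an assumed knob that makes the robustness-versus-accuracy trade-off explicit.

```python
def adversarial_training_step(model, optimizer, x, y, epsilon=0.03, adv_weight=0.5):
    """One training step that mixes clean and adversarially perturbed examples.
    Raising adv_weight tends to improve robustness at some cost in clean accuracy."""
    model.train()
    x_adv = fgsm_perturb(model, x, y, epsilon)
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(x), y)
    adv_loss = F.cross_entropy(model(x_adv), y)
    loss = (1 - adv_weight) * clean_loss + adv_weight * adv_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```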

Another technique is defensive distillation, where a second model is trained to match the soft class probabilities produced by an initial model at a high softmax temperature. This smooths the output surface and shrinks the gradients with respect to the input, which makes it harder to find adversarial examples.
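A minimal sketch of the distillation loss, assuming a `teacher` model already trained at the same temperature and a `student` model being fit to its softened outputs (both names are placeholders):

```python
def distillation_loss(student_logits, teacher_logits, temperature=20.0):
    """Soft-label cross-entropy used in defensive distillation.
    A high temperature smooths the teacher's class probabilities,
    which flattens the student's input gradients."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Hypothetical usage:
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits)
```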

There are also detection methods for adversarial attacks. These methods aim to identify when an input has been manipulated and could include statistical tests, checking the input’s proximity to the training data, or training a separate model to predict if an input is adversarial.
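To make the proximity idea concrete, here is one simple heuristic detector, sketched with scikit-learn: it flags inputs whose feature representation sits unusually far from the training data. The feature extractor, the number of neighbors, and the calibration percentile are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class ProximityDetector:
    """Flags inputs whose features are unusually far from the training set,
    one simple heuristic for spotting potentially manipulated inputs."""

    def __init__(self, train_features, k=5, percentile=99):
        self.nn = NearestNeighbors(n_neighbors=k).fit(train_features)
        # Calibrate a distance threshold from the training data itself
        dists, _ = self.nn.kneighbors(train_features)
        self.threshold = np.percentile(dists.mean(axis=1), percentile)

    def is_adversarial(self, features):
        dists, _ = self.nn.kneighbors(features)
        return dists.mean(axis=1) > self.threshold
```

In practice such detectors are imperfect and are often combined with statistical tests or a dedicated classifier trained to distinguish clean from adversarial inputs.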

Studying adversarial attacks pushes us to think about the security of machine learning models. As these models are increasingly used in sensitive areas like healthcare, finance, and autonomous driving, ensuring their robustness against malicious attacks is of paramount importance.
