From the Edge: Protecting Models from ‘Adversarial Attacks’

Jake Tauscher
3 min read · Jul 28, 2020


This blog is part of a series on recent academic papers in the AI/ML community. By understanding what experts are spending their time researching, we will get a sense of the current limits and the future of the AI/ML world!

Researchers from several universities (George Mason, Amirkabir University of Technology, UC Davis, the University of Maryland, Durham University in the UK, and the Institute for Research in Fundamental Sciences) proposed a new method for protecting neural networks from ‘adversarial’ attacks.

Why is this interesting?

Well, as long as there is tech, there will be people trying to break or circumvent that tech. Neural networks are no different.

Adversarial attacks refer to applying specific, carefully chosen changes to input images to get a network to misclassify them (but in a predictable way). Basically, you can add patches or pixels to pictures that, because the network has never seen them before, will cause the network to misclassify those images (this is a really good blog post explaining adversarial attacks, if you want to learn more).
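To make this concrete, here is a minimal sketch of one well-known attack of this kind, the ‘fast gradient sign method’ (FGSM): it nudges every pixel slightly in the direction that most increases the model’s loss. This is only an illustration of the general idea, not the specific attacks studied in the paper, and the model, epsilon value, and function name are placeholders.

```python
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Perturb `image` slightly in the direction that most increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Nudge every pixel by +/- epsilon along the sign of the loss gradient.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```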

There are two primary accepted ways of dealing with adversarial attacks.

The first is adversarial training. It’s exactly what it sounds like: you take examples of adversarial images and use them to train the network. The disadvantage of this approach is obvious: it may improve resiliency to known adversarial images, but not to unknown ones.
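As a rough sketch (assuming the hypothetical fgsm_attack helper from above, plus a standard PyTorch model and optimizer), adversarial training just means mixing attacked copies of each batch back into the training data:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels):
    # Generate adversarial versions of the current batch (known attack only).
    adv_images = fgsm_attack(model, images, labels)
    batch = torch.cat([images, adv_images])      # clean + adversarial examples
    targets = torch.cat([labels, labels])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```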

The second is knowledge distillation. This involves training a second, ‘less sensitive’ model. Basically, this second model will follow the behavior of the first model, but with less certainty. The idea is that it would then be less responsive to adversarial attacks. However, because it is trained on the behavior of the first model, it will inherit the first model’s ‘idiosyncrasies’, which leave it susceptible to many of the same adversarial attacks.
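Here is a minimal sketch of the distillation idea, assuming standard PyTorch models; the temperature value that ‘softens’ the teacher’s predictions is an arbitrary choice for illustration:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften the teacher's outputs so the student learns the "shape" of its
    # predictions rather than hard, overconfident labels.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```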

The researchers behind this paper proposed a third approach!

Tell me the details!

This paper proposes a third method, which is a variant of method two. Similar to knowledge distillation, they train additional models on the predictions of the primary model. However, they also stipulate that the additional models must learn “different latent spaces” than the primary model. What is a latent space? Well, basically, it is the internal representation a model builds along the way, so the requirement means the other models must arrive at their predictions in different ways than the primary model does.
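The paper has its own precise formulation, so the following is only a rough sketch of the intuition: train an additional model to imitate the primary model’s predictions (reusing the hypothetical distillation_loss above) while penalizing any alignment between the two models’ latent representations. The cosine-similarity penalty and its weight here are illustrative assumptions, not the paper’s actual objective.

```python
import torch.nn.functional as F

def diverse_distillation_loss(aux_logits, aux_latent,
                              primary_logits, primary_latent,
                              diversity_weight=1.0):
    # Follow the primary model's (softened) predictions...
    imitation = distillation_loss(aux_logits, primary_logits.detach())
    # ...but penalize the auxiliary latent features for aligning with the
    # primary model's (cosine similarity near 0 means "different features").
    similarity = F.cosine_similarity(
        aux_latent, primary_latent.detach(), dim=1).abs().mean()
    return imitation + diversity_weight * similarity
```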

Why would this work? Well, with this method, the researchers are trying to ensure that the models use “robust” features to predict classes. What is a robust feature? To illustrate, let’s consider a training set of images of dogs and cats. Hypothetically, every image of a dog could have a jet-black top-left pixel (value of 0), while the images of cats have top-left pixels ranging from 1 (really, really black) to 10 (really black), on a 0–255 scale. A neural network could weight this pixel heavily and, in our training set, use it to predict dog vs. cat. This is an example of a non-robust feature: the differences between the classes are tiny, so a small change to the input image (shifting the value by one) could change our prediction drastically! And, intuitively, the top-left pixel should not really influence our prediction of dog vs. cat. Robust features are the exact opposite: features with significant differences between the classes, the sort of features a human would use to distinguish the images.
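To put some (entirely invented) numbers on this, here is a toy classifier that leans on that single top-left pixel; a one-unit change is enough to flip its prediction:

```python
import numpy as np

# All numbers here are invented for illustration.
weights = np.zeros(784)        # one weight per pixel of a 28x28 image
weights[0] = 10.0              # huge weight on the top-left pixel (non-robust)
bias = -5.0                    # boundary sits between pixel values 0 and 1

def predict(image):
    return "cat" if image @ weights + bias > 0 else "dog"

image = np.zeros(784)          # a "dog": top-left pixel is 0
print(predict(image))          # dog
image[0] = 1.0                 # change that one pixel by a single unit
print(predict(image))          # cat -- the prediction flipped
```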

So, in our silly example, traditional knowledge distillation would reduce our sensitivity to this single pixel, but not eliminate the behavior. Our model would still be sensitive to this meaningless top-left pixel.

However, the researchers’ new approach stipulates that our additional models have to use different methods to predict the classes; i.e., they could not rely on this pixel. So, each model learns something different, and by doing this many times, the resulting collection of models can be much less vulnerable to adversarial attacks.
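At prediction time, one simple way to use such a collection of models (again, a sketch rather than the paper’s exact procedure) is to average their outputs so no single model’s quirks dominate:

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, images):
    # Average the softmax outputs of all models, then take the top class.
    probs = [F.softmax(m(images), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```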

So, what did they (and we) learn?

The proposed approach outperformed traditional approaches when faced with adversarial attacks. So, there is reason to be intrigued by this methodology.

More broadly, this shows the value of ‘many uncorrelated predictions’ in driving better models. This is a common approach in machine learning. Every training set will have ‘idiosyncrasies’ that we should ignore when applying the model to new problems (like the pixel example above). But it is very hard to identify and account for these. So, machine learning practitioners use techniques like ‘dropout’ (randomly removing pieces of the model during training) to try to prevent the model from overemphasizing any one feature.
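For example, here is what dropout looks like in a small PyTorch classifier (the layer sizes here are arbitrary):

```python
import torch.nn as nn

# During training, dropout randomly zeroes half of the hidden units each step,
# so the model cannot lean too heavily on any single feature.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 2),   # e.g. dog vs. cat
)
```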

And you can read the paper yourself! arXiv:2006.15127v1
