Fooling neural networks with adversarial examples
This autumn I learned that state-of-the-art neural-network AI classifiers can still be fooled like this:
The image on the left is correctly classified as a truck by the neural network (AlexNet). The middle image shows the perturbation added to it, and the rightmost image is the result, which the classifier labels an “ostrich”. Funny, right? But that’s not all: it works in the physical world, too. Guess what a neural network sees in this picture:
The classifier is 77% sure that it’s a Speed Limit 45 sign. Still amused, considering that self-driving cars might be coming? Don’t feel guilty; it amuses me a bit too, but it’s actually a damn serious matter. We don’t want self-driving cars out on our streets for as long as this is the case.
Images like these, and tapes placed on stop signs specifically to fool neural networks, are called adversarial examples. Their existence was noticed around the same time that neural networks rose to state-of-the-art performance in AI. Yes, attempts have been made to defend against them. No, none have been successful. Some defenses seemed promising when they were invented, but all have failed before long, as stronger ways of crafting adversarial examples have been found.
After studying adversarial examples and neural networks (a huge thanks to Stanford University for making their CS231n course material available to us all!) I have realized something funny: some tools developed for training neural networks are also great for crafting adversarial examples. The funny (?) thing being that after brilliant people have spent so much time building those tools and making neural networks awesome, the very same tools turn out to be so handy for “vandalism”.
For example: back-propagation is the key method that makes training (optimizing the parameters of) neural networks efficient. It calculates how each parameter of the network affects the output, which tells us how the parameters should be changed to make the outputs better. Put back-propagation in the hands of a villain, though, and he’ll use it to calculate the effect of the input pixels instead of the parameters! That way he knows exactly how to tweak the pixels of an image to create a maximal change in the output. And as we’ve seen, if you know exactly what to change, it really doesn’t take much.
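The gradient-in-the-wrong-hands idea can be sketched in a few lines of NumPy. The canonical instance is the Fast Gradient Sign Method: nudge every pixel by ±ε in whichever direction increases the loss. To keep the sketch self-contained I use a toy linear softmax “classifier” with random weights as a stand-in for a trained network (an assumption for illustration; a real attack would get the input gradient from a real network via back-propagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained classifier: a linear softmax model with
# random weights. Illustration only -- a real attack would target a
# trained network and get this gradient via back-propagation.
d, k = 32, 3                      # input size, number of classes
W = rng.normal(size=(d, k))
b = rng.normal(size=k)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_input_grad(x, true_class):
    """Cross-entropy loss and its gradient w.r.t. the INPUT x.

    For this linear model the back-propagation chain is short:
    dL/dlogits = p - onehot(true_class), then dL/dx = W @ dL/dlogits.
    """
    p = softmax(x @ W + b)
    loss = -np.log(p[true_class])
    dlogits = p.copy()
    dlogits[true_class] -= 1.0
    return loss, W @ dlogits

x = rng.normal(size=d)
true_class = int(np.argmax(softmax(x @ W + b)))  # model's own answer as "truth"

# Fast Gradient Sign Method: every "pixel" moves by +-eps in the
# direction that increases the loss.
eps = 0.25
loss0, grad = loss_and_input_grad(x, true_class)
x_adv = x + eps * np.sign(grad)
loss1, _ = loss_and_input_grad(x_adv, true_class)

print(f"loss before: {loss0:.4f}, after: {loss1:.4f}")
print("prediction changed:", int(np.argmax(softmax(x_adv @ W + b))) != true_class)
```

Note that no pixel moves by more than ε, yet the loss is guaranteed to grow: exactly the “small change in, big change out” trick described above.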
But don’t you need to know the neural network to do that? Surely you need its structure and parameters to calculate those gradients? — Nope. The state-of-the-art defenses still aren’t robust in the black-box setting, either:
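One way to see why the black-box setting is still dangerous: even with query-only access, an attacker can estimate the input gradient numerically and run the same sign-step attack. The sketch below is my own toy illustration of this zeroth-order idea (a hidden random linear model standing in for the unknown classifier), not any particular published attack:

```python
import numpy as np

rng = np.random.default_rng(1)

# Black-box setting: the attacker can only QUERY the classifier, never
# inspect it. The hidden model is a toy linear softmax classifier with
# random weights (illustration only).
d, k = 16, 3
_W = rng.normal(size=(d, k))          # hidden from the attacker

def query_loss(x, true_class):
    """The attacker's only access: input in, loss out."""
    z = x @ _W
    p = np.exp(z - z.max())
    p /= p.sum()
    return -np.log(p[true_class])

def estimate_gradient(x, true_class, h=1e-4):
    """Estimate dL/dx with central finite differences: 2*d queries,
    zero knowledge of the model's structure or parameters."""
    g = np.zeros(d)
    for i in range(d):
        step = np.zeros(d)
        step[i] = h
        g[i] = (query_loss(x + step, true_class)
                - query_loss(x - step, true_class)) / (2 * h)
    return g

x = rng.normal(size=d)
true_class = 0
eps = 0.25

loss_clean = query_loss(x, true_class)
x_adv = x + eps * np.sign(estimate_gradient(x, true_class))
loss_adv = query_loss(x_adv, true_class)

print(f"loss from queries alone: {loss_clean:.4f} -> {loss_adv:.4f}")
```

Real black-box attacks are much more query-efficient than this pixel-by-pixel probing, but the principle is the same: you never need to open the box.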
Google, too, sees this as an issue. This autumn they launched a challenge in hopes it will speed up the development of robust classifiers: https://ai.googleblog.com/2018/09/introducing-unrestricted-adversarial.html. Do you think you could build a classifier that never mistakes birds for bikes? If so, you should participate. You’d be the first.
The topic is hot, and the task is not easy. This will obviously be a field of active study for months, years, or even decades to come, and we can expect to see both new attacks and new defenses. I wonder how long it’ll take to fix the problem.
In the meanwhile though: have you noticed how you can search your Google photos for categories of pictures? I wonder if I could make it think my cat is a goldfish.
For those interested in studying further:
- For those interested in learning about neural networks, I warmly recommend the Stanford University course CS231n. The lectures I watched were from 2016: https://www.youtube.com/playlist?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC. The most recent course material is available here: http://cs231n.stanford.edu/.
- A blog written by two big names in adversarial-examples research: http://www.cleverhans.io/.
- The article that pointed out the existence of adversarial examples in 2014: https://arxiv.org/abs/1312.6199.
- A paper related to Google’s birds-and-bikes challenge: https://arxiv.org/abs/1809.08352.