Deeper understanding of visual cognition via adversarial images
Deep learning for object recognition and image processing is one of the greatest technological marvels of recent years — from “super-human accuracy” in categorizing objects, to day-dreaming robot artists, to sinister large-scale surveillance. Yet it flops whenever it sees an adversarial image: one that is perturbed so slightly that it looks the same to human eyes but induces an algorithm to give the wrong answer.
To a technologist, adversarial examples represent a threat: autonomous cars might get into accidents, robots might make a mess when presented with one. Researchers have demonstrated in more than one way that such attacks are possible not only in the digital realm but also in the physical world.
To a cognitive scientist, however, adversarial examples are an opportunity to better understand visual cognition. Why do we perceive the images in the left and the right columns as the same? One possible answer is that they are reduced to the same representation.
It is no secret that humans simplify things. Picasso knew this already in 1945, before the birth of modern cognitive science, when he drew his famous series of abstractions of a bull. Stick figures must have existed thousands of years earlier.
From experience, it is natural to associate simplification with lines but, upon close inspection, it is unlikely that our visual system represents objects with them. Objects always come with a surface, not as wire-frames. The first few times a child holds a pencil, incongruent lines most likely come out — which would be very surprising if lines were the latent representation in the child’s mind. Lines are used not because they are natural but because they are the easiest to draw without making the next strokes harder.
It is much more likely that the human mind simplifies things by scaling them down. Take the photo above and scale it down to 100×72 pixels, and we can see why those very different drawings can all be taken to represent the same bull: they look very similar at a small scale. Being able to recognize small things also gives you an evolutionary advantage, because by the time you see a lion large and clear, it might already be too late.
All machine learning models are based on some assumptions: linear classifiers cut the world into beehive-like regions, support vector machines avoid conflicts by keeping opposing sides as far as possible. This inductive bias, not the amount of data and computation applied, is at the heart of an algorithm and determines what problems it can crack.
The recent revolution in image recognition was unleashed by one such assumption, called translational invariance: the same templates are slid across an image and, at each location, report whether they find an edge, a circle, or a face.
Translational invariance is not alone, though: evolution has had millions of years to devise many others, such as invariance to color, viewpoint, illumination, and size. I believe that scale (size) invariance also enables another simple yet powerful trick: shape (what stays when you scale an image down) matters more than texture (what disappears).
Evidence for this preference, if any is needed, is everywhere: babies are born with blurry vision, so all of us learn to recognize shapes first. In language, “bigger pictures” sound more important than “small details”. In twilight conditions, we see only with our rod cells, which can’t tell colors apart and don’t even detect red light at all.
Equipped with this observation, I suspect that a simple yet effective defense against adversarial attacks is scale invariance and a bias towards shapes. Even if this doesn’t solve all adversarial cases, it would be hugely interesting to see if we could replicate a few more aspects of the human visual system.
One way to test this idea is to create a classifier that does the following:
- An incoming image is scaled down to a very small size, such as 20×20
- The original image and the small image are classified separately by two neural networks, resulting in two sets of probabilities
- The final probability assignment is calculated as:
  p = α × p_small + (1 − α) × p_original
  where 0.5 < α < 1, so the shape-biased small-image classifier dominates
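The steps above can be sketched in a few lines. This is a minimal illustration, not a full implementation: the two neural networks are stand-ins (here, any functions returning probability vectors), and the downscaling is a naive block average assuming the input dimensions divide evenly by the target size.

```python
import numpy as np

def downscale(img, size=20):
    """Naive block-average downscaling of a 2-D grayscale image to size x size.
    Assumes both image dimensions are divisible by `size`."""
    h, w = img.shape
    fh, fw = h // size, w // size
    return img[:fh * size, :fw * size].reshape(size, fh, size, fw).mean(axis=(1, 3))

def combine(p_small, p_original, alpha=0.7):
    """Mix the two classifiers' probability vectors.
    0.5 < alpha < 1 biases the decision toward the shape-preserving small image."""
    assert 0.5 < alpha < 1
    return alpha * p_small + (1 - alpha) * p_original

# Toy usage with made-up probability vectors in place of the two networks:
img = np.arange(10000, dtype=float).reshape(100, 100)
small = downscale(img, size=20)          # 20x20 shape-only version of the input
p_small = np.array([0.9, 0.1])           # hypothetical output of the small-image net
p_original = np.array([0.2, 0.8])        # hypothetical output of the full-image net
p = combine(p_small, p_original, alpha=0.7)
```

With α = 0.7 the mixture here is [0.69, 0.31]: even though the full-resolution network was fooled into preferring class 1, the shape-biased vote wins, which is exactly the behavior the scheme is after.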
It is interesting to note that in the original paper, Szegedy et al. used a blown-up version of MNIST digits (which are originally bi-level 20×20 images). As can be seen in subfigure (c) below, the enlarged image size and value range gave them much more room to add noise to the simple digits. Had they stuck to the original format, their paper and the scientific discussion that followed would have been very different — which highlights the importance of, well, small details.
Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).