Mostafa Gazar
Feb 26 · 2 min read


One widely accepted intuition is that CNNs combine low-level features such as edges into increasingly complex shapes such as wheels and car windows, until an object like a car can be readily classified. The authors call this the shape hypothesis. It is supported by a number of empirical findings; for example, visualisation techniques like Deconvolutional Networks often highlight object parts in high-level CNN features.

Yet CNNs can still classify texturised images perfectly well, even if the global shape structure is completely destroyed. Conversely, standard CNNs are bad at recognising object sketches, where object shapes are preserved but all texture cues are missing. Two studies suggest that local information such as textures may actually be sufficient to “solve” ImageNet object recognition: a linear classifier on top of a CNN’s texture representation loses hardly any classification performance compared with the original network. Let us call this the texture hypothesis.

Resolving these contradictory hypotheses will help increase our understanding of neural network decisions.

To quantify texture and shape biases in both humans and CNNs, the authors used style transfer to create images with a texture-shape cue conflict, such as a cat shape with elephant texture. The human experiments totalled 48,560 psychophysical trials across 97 observers. These experiments provide behavioural evidence in favour of the texture hypothesis: a cat with an elephant texture is an elephant to CNNs, and still a cat to humans.
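One way to turn the cue-conflict idea into a number is sketched below. It assumes each trial records the shape category, the texture category, and the observer's (or network's) decision, and it defines shape bias as the fraction of shape decisions among all decisions that match either cue. The `Trial` type, the `shape_bias` function, and the sample data are hypothetical names and values made up for illustration, not taken from the paper's code.

```python
from typing import Iterable, NamedTuple, Optional


class Trial(NamedTuple):
    """One cue-conflict trial (hypothetical record layout)."""
    shape_label: str    # category of the object's shape, e.g. "cat"
    texture_label: str  # category of the transferred texture, e.g. "elephant"
    decision: str       # category chosen by the human or the CNN


def shape_bias(trials: Iterable[Trial]) -> Optional[float]:
    """Fraction of shape decisions among decisions matching either cue.

    Trials whose decision matches neither cue are ignored, as are trials
    without a real conflict (shape and texture share a category).
    Returns None when no trial matches either cue.
    """
    shape_hits = 0
    texture_hits = 0
    for trial in trials:
        if trial.shape_label == trial.texture_label:
            continue  # no cue conflict, so the trial says nothing about bias
        if trial.decision == trial.shape_label:
            shape_hits += 1
        elif trial.decision == trial.texture_label:
            texture_hits += 1
    total = shape_hits + texture_hits
    if total == 0:
        return None
    return shape_hits / total


# Made-up decisions for illustration only.
example_trials = [
    Trial("cat", "elephant", "elephant"),  # texture decision
    Trial("cat", "elephant", "cat"),       # shape decision
    Trial("cat", "elephant", "elephant"),  # texture decision
    Trial("car", "clock", "dog"),          # matches neither cue, ignored
]
example_bias = shape_bias(example_trials)
```

Under this definition, a value near 1 means decisions follow shape (the human-like pattern described above) and a value near 0 means they follow texture.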


The authors offer a new way of thinking about how machine recognition works, and the reader comes away with the intuition that a shape-based representation may be more beneficial than a texture-based one for robust inference.


They did not mention how this might generalise to other problems/datasets.


All psychophysical experiments were conducted in a well-controlled psychophysical lab setting. In each trial, participants were presented with a fixation square for 300 ms, followed by a 300 ms presentation of the stimulus image. After the stimulus image, the authors presented a full-contrast pink noise mask (1/f spectral shape) for 200 ms to minimise feedback processing in the human visual system and thereby make the comparison to feedforward CNNs as fair as possible.
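The trial timing described above can be written down as a small schedule. This is only a sketch of the sequence as summarised here: the phase names, the `Phase` type, and the helper functions are hypothetical, and it assumes the three phases follow each other without gaps.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Phase(NamedTuple):
    """One phase of a trial (hypothetical naming)."""
    name: str
    duration_ms: int


# Durations taken from the description above; back-to-back presentation
# with no gaps between phases is an assumption.
TRIAL_PHASES = [
    Phase("fixation_square", 300),
    Phase("stimulus_image", 300),
    Phase("pink_noise_mask", 200),
]


def phase_onsets(phases: Sequence[Phase]) -> List[Tuple[str, int]]:
    """Return (phase name, onset in ms from trial start) for each phase."""
    onsets = []
    elapsed_ms = 0
    for phase in phases:
        onsets.append((phase.name, elapsed_ms))
        elapsed_ms += phase.duration_ms
    return onsets


def total_duration_ms(phases: Sequence[Phase]) -> int:
    """Total length of one trial in milliseconds."""
    return sum(phase.duration_ms for phase in phases)


onsets = phase_onsets(TRIAL_PHASES)
trial_length_ms = total_duration_ms(TRIAL_PHASES)
```

Laid out this way, the stimulus is visible only from 300 ms to 600 ms of an 800 ms trial, and the mask replaces it immediately afterwards.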

1-minute papers

Random deep learning papers summary.
