ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
Retrieved from https://arxiv.org/abs/1811.12231
One widely accepted intuition is that CNNs combine low-level features such as edges into increasingly complex shapes (wheels, car windows) until an object such as a car can be readily classified; the authors term this the shape hypothesis. The hypothesis is supported by a number of empirical findings: visualisation techniques like Deconvolutional Networks often highlight object parts in high-level CNN features.
CNNs can still classify texturised images perfectly well, even when the global shape structure is completely destroyed. Conversely, standard CNNs are bad at recognising object sketches, where object shapes are preserved yet all texture cues are missing. Two studies suggest that local information such as texture may actually be sufficient to “solve” ImageNet object recognition: a linear classifier on top of a CNN’s texture representation achieves hardly any classification performance loss compared to the original network. Let us call this the texture hypothesis.
Resolving these contradictory hypotheses will help increase our understanding of neural network decisions.
To quantify texture and shape biases in both humans and CNNs, the authors used style transfer to create images with a texture–shape cue conflict, such as a cat shape with elephant texture, totalling 48,560 psychophysical trials across 97 observers. These experiments provide behavioural evidence in favour of the texture hypothesis: a cat with an elephant texture is an elephant to CNNs, and still a cat to humans.
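The shape bias the paper reports can be summarised as the fraction of trials decided by shape among all trials decided by either the shape or the texture category (predictions matching neither cue are discarded). A minimal sketch of that metric, with hypothetical trial data:

```python
def shape_bias(decisions):
    """Shape bias on cue-conflict trials.

    decisions: iterable of (predicted, shape_label, texture_label).
    Returns shape decisions / (shape + texture decisions);
    trials matching neither cue are ignored.
    """
    shape = sum(1 for pred, s, _ in decisions if pred == s)
    texture = sum(1 for pred, _, t in decisions if pred == t)
    total = shape + texture
    return shape / total if total else float("nan")

# Hypothetical trials: a cat shape rendered with elephant texture.
trials = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "elephant"),       # shape decision
    ("dog", "cat", "elephant"),       # neither cue -> ignored
]
print(shape_bias(trials))  # → 0.6666666666666666
```

A value near 1 indicates shape-driven decisions (as for humans); standard ImageNet-trained CNNs score much lower, reflecting their texture bias.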
The authors offer a new way of thinking about how machine recognition works, leading to the intuition that a shape-based representation may be more beneficial than a texture-based one for robust inference.
They did not mention how this might generalise to other problems/datasets.
All psychophysical experiments were conducted in a well-controlled psychophysical lab setting. In each trial, participants were presented a fixation square for 300 ms, followed by a 300 ms presentation of the stimulus image. After the stimulus image, a full-contrast pink noise mask (1/f spectral shape) was presented for 200 ms to minimise feedback processing in the human visual system and thereby make the comparison to feedforward CNNs as fair as possible.
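A 1/f (pink) noise mask of this kind can be generated by imposing a 1/f amplitude spectrum on random phases and inverting the FFT. A minimal NumPy sketch (the image size and normalisation to [0, 1] are assumptions, not taken from the paper):

```python
import numpy as np

def pink_noise_mask(size=224, seed=0):
    """Generate a full-contrast pink (1/f) noise image.

    Sketch: combine random phases with a 1/f amplitude spectrum,
    invert the 2D FFT, keep the real part, and stretch to [0, 1].
    """
    rng = np.random.default_rng(seed)
    # Spatial frequency magnitude for each FFT bin (cycles per pixel)
    fx = np.fft.fftfreq(size)
    fy = np.fft.fftfreq(size)
    f = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    f[0, 0] = 1.0  # avoid division by zero at the DC component
    # 1/f amplitude spectrum with uniformly random phases
    phases = rng.uniform(0.0, 2.0 * np.pi, (size, size))
    spectrum = (1.0 / f) * np.exp(1j * phases)
    noise = np.real(np.fft.ifft2(spectrum))
    # Normalise to full contrast in [0, 1]
    return (noise - noise.min()) / (noise.max() - noise.min())

mask = pink_noise_mask()
```

Because low frequencies dominate, the mask shows the cloudy, blob-like structure characteristic of natural-image spectra, which is what makes it an effective backward mask.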