CNNs vs Vision Transformers — Biological Computer Vision (3/3)

Niranjan Rajesh · Published in Bits and Neurons · Jan 12, 2023

This is the third and final article in my series, Biological Computer Vision, where I try to explain and compare leading computer vision architectures through a biological lens. In the previous articles, I explained the workings of two state-of-the-art architectures, CNNs and Vision Transformers, and the biological intent behind their design. In this article, I will contrast the two and discuss which one is closer to human vision.

Setting the scene

This article assumes a basic understanding of the two architectures in the context of image classification. Here are the earlier articles in this series that will catch you up if you need it:

  1. CNNs
  2. Vision Transformers

To quickly recap how these architectures achieve state-of-the-art, and occasionally superhuman, performance on computer vision tasks today, we will briefly go over their distinct characteristics.

generated with DALL-E

Convolutional Neural Networks (CNNs) improve upon simple Artificial Neural Networks (ANNs) through the significant addition of inductive biases derived from the animal visual system. The ability to ‘scan’ for important features in an image drastically reduces computational cost while significantly boosting performance. This is done by convolving the input image with learned filters that are trained to identify important features, and passing the resulting feature maps to the next convolutional layer of the network. In this way, it is fair to assume that primitive features like lines, edges and curves are prioritised by earlier layers in the network, while more complex features like outlines and shapes are identified by the later layers. An important thing to note here is that since all the patches convolved by a filter are processed with the same weights, CNNs are translation invariant: the features could be anywhere in the image (again, inspired by the visual cells in the animal cortex).

An inductive bias of a machine learning model is the set of assumptions it uses to make predictions from a given input. These assumptions are built into the model to shape learning and prediction in a specific way that is expected to improve performance.
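To make the convolution idea concrete, here is a minimal sketch in PyTorch. The layer sizes, filter counts and two-block structure are my own illustrative assumptions rather than any specific architecture from the papers discussed; the point is only to show the same small filters being reused across the whole image, with their feature maps passed on to the next layer.

```python
# A minimal sketch of the convolutional idea, using PyTorch.
# Layer sizes and the two-stage structure are illustrative assumptions,
# not the exact architecture of any model discussed in this article.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Early layers: small filters slide ("scan") over the image, so the
            # same weights respond to an edge or line wherever it appears.
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Later layers see larger receptive fields and can combine
            # simple features into more complex ones.
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

    def forward(self, x):
        x = self.features(x)                  # feature maps produced by convolution
        return self.classifier(x.flatten(1))  # flatten and classify

model = TinyCNN()
logits = model(torch.randn(1, 3, 32, 32))     # one 32x32 RGB image
print(logits.shape)                           # torch.Size([1, 10])
```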

Vision Transformers (ViTs), on the other hand, work on an entirely different set of assumptions and inductive biases. They employ a variant of the cognitive mechanism of attention called self-attention. Here, each patch of the input image is queried against all other patches in order to find how significant it is given its context. In other words, self-attention essentially enables the model to ‘pay attention’ to only the few parts of the image that it thinks are important for distinguishing between classes in a classification task.
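For contrast, here is a minimal single-head self-attention sketch over image patches, again in PyTorch. The 8×8 patch size, embedding dimension and the absence of multiple heads, stacked layers and positional embeddings are simplifying assumptions for illustration; the key line is the query–key product, where every patch is compared against every other patch.

```python
# A minimal, single-head self-attention sketch over image patches (PyTorch).
# Patch size, embedding dimension and the missing heads/layers/positional
# embeddings are simplifying assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64                                   # embedding dimension (assumed)
img = torch.randn(1, 3, 32, 32)          # one 32x32 RGB image

# Split the image into 16 non-overlapping 8x8 patches and flatten each one.
patches = img.unfold(2, 8, 8).unfold(3, 8, 8)                       # (1, 3, 4, 4, 8, 8)
patches = patches.reshape(1, 3, 16, 64).permute(0, 2, 1, 3).reshape(1, 16, 192)
embed = nn.Linear(192, d)
x = embed(patches)                       # (1, 16 patches, d)

# Each patch produces a query, key and value vector.
Wq, Wk, Wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
q, k, v = Wq(x), Wk(x), Wv(x)

# Every patch's query is compared against every other patch's key:
# this is where global context enters, with no built-in locality constraint.
attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)          # (1, 16, 16)
out = attn @ v                           # context-weighted patch representations
print(attn.shape, out.shape)
```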

Since the inductive biases deployed in these two models are so different, we are led to the question: which of the models performs more similarly to humans? Do the models process visual information in the same way we do? Do they make the same mistakes we do? We shall find out!

Shapes and Textures

Despite CNNs being a very successful vision architecture that claims to stay close to its roots in the animal visual cortex, recent work has shown that they differ from human visual processing in a drastic way. Hermann et al. and other papers discovered that CNNs classify images on the basis of texture rather than shape. Humans, however, prioritise shape information for classification. When we are asked to determine whether a given image of an animal is a dog or a cat, we weigh the shape or outline of the animal more heavily than its skin or fur. CNNs do the opposite.

Performance of a CNN (ResNet-50) on a texture-shape diagnostic dataset. Source

It is evident that CNNs have a heavy texture bias, as pictured above in their performance on a diagnostic dataset where the shapes of images were maintained and the textures were varied. A human would classify the image on the right as a cat because the shape outline is maintained. This also means that CNNs will have a difficult time classifying sketches or drawings. Why do CNNs end up with a bias towards texture over shape when this was never made explicit in their design? It turns out that the inductive biases chosen at design time may have surprising implications. The design choice to convolve small parts of the image in search of features is to blame here. Convolution places a heavy emphasis on local connectivity rather than global context. The features that CNNs end up learning thus tend to be localised, which explains the over-prioritisation of texture. This potentially sheds some light on what researchers call the texture hypothesis: that textures are the most discriminative aspects of an image in the classification paradigm.
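To illustrate how such a texture-versus-shape preference can be quantified, here is a small sketch in the spirit of the cue-conflict evaluation from Geirhos et al.: each image carries one class's shape and another class's texture, and we count how often the model's prediction follows the shape cue. The labels below are made up for illustration.

```python
# A sketch of how a shape/texture bias could be quantified on "cue-conflict"
# images (e.g. a cat outline filled with elephant skin). The data below is
# made up for illustration; the metric follows the spirit of Geirhos et al.:
# among decisions that match either cue, how often did the model pick shape?

def shape_bias(predictions, shape_labels, texture_labels):
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # predictions matching neither cue are ignored
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Hypothetical model outputs on four cue-conflict images:
preds    = ["elephant", "cat",   "elephant", "dog"]
shapes   = ["cat",      "cat",   "bird",     "dog"]
textures = ["elephant", "clock", "elephant", "bottle"]
print(shape_bias(preds, shapes, textures))  # 0.5 -> no strong shape preference
```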

The reasoning behind this CNN texture bias was made clearer with the advent of Vision Transformers. ViTs are designed with far looser inductive biases. The mechanism of self-attention is a very open-ended bias: it simply increases the relevance of certain information in the image, and exactly which information and regions get upweighted is learned during training. In fact, Cordonnier et al. found that Vision Transformers at times learn to perform operations similar to convolutions (tying back to the texture hypothesis). However, this is not always the case, and the important takeaway is that the inductive biases of a Transformer are far more flexible and not bound to convolution. This leads to ViTs learning the importance of features within a global context, which is analogous to humans identifying shapes within images for classification.

The comparison above, between humans, Vision Transformers (ViTs) and CNNs (the ‘Nets’), makes it clear that Vision Transformers are closer to humans in relying on shape over texture. Source

Making Errors

There is a lot to learn about the strategy used by a classification algorithm, in our brains or inside a neural network, from the errors it makes. Most relevantly, we can compare how ‘human-like’ the strategies employed by CNNs and Vision Transformers are based on the images they misclassify. If there is consistency between the errors humans and a learning algorithm make, we can infer that there is similarity in the strategies they use. To compare strategies, researchers (Geirhos et al.) have come up with an error consistency metric, Cohen’s Kappa (κ), which is computed from the probabilities of misclassification. Here is an amazing blog post by the author (Robert Geirhos) that sets up the context for such a metric. A κ value of 1 signifies an identical strategy, while 0 suggests the errors overlap no more than expected by chance. It is important to note that this metric does not consider what the system misclassifies. To do that, Tuli et al. have come up with Jensen-Shannon (JS) distances, which take class-wise errors into account; a greater distance implies a lower error consistency and vice versa. Bringing all of the above together, Tuli et al. conducted comprehensive experiments on Vision Transformers and CNNs to compute their error consistency results.
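As a rough sketch of how the error consistency metric works, the snippet below computes Cohen's κ from binary correct/incorrect records of two observers (say, a human and a model) on the same trials, following the formulation in Geirhos et al. The per-trial records are made-up toy data.

```python
# A sketch of the error-consistency calculation (Geirhos et al.), computed
# from binary correct/incorrect records of two observers (e.g. a human and a
# model) on the same trials. The example records are made up for illustration.

def error_consistency(correct_a, correct_b):
    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n
    # Observed consistency: fraction of trials where both are right or both are wrong.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Consistency expected by chance if the two made their errors independently.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)
    # Cohen's kappa: how far observed consistency exceeds chance-level consistency.
    return (c_obs - c_exp) / (1 - c_exp)

# Hypothetical per-trial correctness (1 = correct, 0 = error) for a human and a model:
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(error_consistency(human, model), 3))  # ~0.47 on this toy data
```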

Error consistency comparison of Vision Transformers (ViT-B/32) and CNNs. Source

Error consistency tells us how similar the errors made by an architecture are to those made by humans. We can see that for both Cohen’s Kappa and class-wise JS distances, Vision Transformers score better on error consistency than CNNs (despite still being far from the human strategy). They make misclassifications more similar to humans’ than CNNs do. The difference between the JS distance variants is not too relevant to this discussion, and the inter-class comparison can be ignored as a whole for now.

What does all of this mean?

The above sections have established that Vision Transformers seem to be more human-like than CNNs, both in having a shape bias and in showing greater error consistency with humans. What does any of this tell us?

Inductive biases are critical

Firstly, all of this allows us to realise that we are still far away from truly human-like AI. Algorithms tend to learn in the most effective way they can given their inductive biases. Implementing more ‘careful’ and human-like inductive biases may be the key. Nevertheless, we have gotten off to a decent start by taking inspiration from animal visual pathways and cognitive attention. The next step might be to implement learnings from behavioural analysis of early humans (or more commonly, infants) in the inductive biases of a model. What we can gather from how a newborn child learns to use its eyes to make sense of the world could hold massive potential for the inductive biases of future ML vision models.

Transformers are great at learning

Going in the opposite direction from my previous point, we also learn that Transformers are just really good at learning. This is evident in how they first conquered the field of Natural Language Processing and have now started flexing their muscles in Computer Vision. We have got our hands on one of the first high-performing architectures that can be adapted across domains, which has massive implications for getting closer to general-purpose AI. We can theorise that this ability to generalise so well is thanks to their loose inductive biases. Although convolution in CNNs and analogous methods were thought to help in processing visual information, we have come to see that it brought its own drawbacks, like the texture bias. By replacing convolution with the looser inductive bias of self-attention, we get a much more powerful and robust mechanism of learning, at the understandable price of greater training cost. Maybe self-attention is just a very powerful base inductive bias, and we can add more specialised inductive biases on top to keep improving models for particular domains. This can be seen in what looks like the ‘reincarnation’ of convolutions in Swin Transformers (the very recent state of the art).

In Conclusion

We have compared the two prevalent families of architecture in computer vision. On top of narrowly achieving better accuracy than CNNs, Vision Transformers also manage to be more ‘human-like’ than their counterpart. This was seen through an analysis of shape and texture bias as well as error consistency across the architectures. We also learnt that the careful and appropriate selection of inductive biases is crucial for building ML algorithms that use strategies similar to humans’. However, as I write this, a question floats in my mind: do we need AI to be human-like at all? Maybe the texture hypothesis is correct; the self-learned importance of texture in CNNs may be a more effective and efficient way of processing visual information, despite being different from how we do it. I will leave you with that thought, and I hope you were able to take away something from this series.
