Vision Transformers from a Biological Standpoint — Biological Computer Vision (2/3)

Niranjan Rajesh · Published in Bits and Neurons · Sep 14, 2022

This article is the second in my series, Biological Computer Vision, where I try to explain and compare leading architectures in computer vision through a biological lens. In this article, I will tackle Transformers — “the CNN killers” — in the context of visual information. Check out my previous article on CNNs here.

Vision Transformers (ViTs) are the new kids on the Computer Vision block. In fact, they were introduced in the iconic “An Image is Worth 16x16 Words” paper from 2020. ViTs are inspired by Transformers from the field of Natural Language Processing (paper link). The underlying mechanisms are quite similar in both architectures, but ViTs have been modified to process visual information instead of sequential information like sentences. ViTs took the Computer Vision field by storm because they take a very novel approach to processing visual information in Machine Learning tasks compared to previously established architectures like Convolutional Neural Networks (CNNs). Impressively, ViTs achieved comparable (and sometimes better) performance than state-of-the-art CNNs in image classification tasks. The approach employed by ViTs involves the concept of attention, which is modelled after our own biological, cognitive attention.


What is Attention?

Biological Attention

The attention we employ in our everyday activities is the direct inspiration for the core mechanism in Vision Transformers. Let’s take the example of a loud party. There is heavy music playing and multiple people talking over each other, and yet we are able to keep up a conversation with a friend next to us. How are we able to navigate our way and make decisions through this overload of sensory information? The answer is attention. Our brain decides what information is important and what is not (what we pay attention to and what we do not). Let’s take a more relevant example for the case of visual information: driving a car. We are overloaded with visual information — the cars in front of us, the bike in our rear-view mirror, the traffic lights, the pedestrians waiting to cross the road, the traffic signs on either side of the road, the divider, the curb, the dials and indicators in our own car and so many more. How are we possibly able to drive safely with so many visual stimuli to sift through? Our brain decides what to pay attention to depending on our current activity. When we are trying to overtake a car, we pay attention to the car in front of us and the lane we will use to overtake. If we are trying to take a left at a pedestrian intersection, we pay attention to the sidewalks for any pedestrians and check whether any cars are coming from our right before turning left.


You would be right to point out that these two examples of attention are different: filtering out sounds at the party seems innate, while the attention we pay to the road while driving is practiced. Trained or not, what matters here is that our brain is able to decide what kind of information is important for the decision at hand and weigh it more heavily than the less useful information.

It is believed that we are able to pay attention to certain features (from my previous article!) in our visual field with the help of neurons that only fire when the corresponding feature is present in the visual field (paper). A study of feature-based visual attention showed that when subjects were prompted to search for a ‘Blue X’ in a visual field containing the target among an array of non-targets, two types of neurons fired: cells that ‘preferred’ blue targets and cells that ‘preferred’ X-shaped targets. These feature-based cells help produce an attention map — an area of the visual field that our brain thinks holds the important information (here, the likely presence of the blue X target).


The study shows us how human cognition employs mechanisms to compute an attention map over sensory representations, which in turn helps us make decisions more efficiently, since the attention map contains only ‘the important sensory information’.

Attention in Machine Learning

Machine learning researchers have tried to model our cognitive attention algorithmically for various tasks, from visual processing to Natural Language Processing (NLP). In the case of NLP, attention was first brought to prominence in sequence-to-sequence (Seq2Seq) models, where an input sequence is used to produce a predicted output sequence — as in language translation. Here, the model learns which parts of the input sequence deserve more attention in relation to each part of the output sequence, which helps it compute its predictions faster and more efficiently.

Transformers were first introduced in Machine Learning to improve upon NLP tasks. These Transformers used the concept of attention, more specifically Self-Attention. This variant involves computing attention scores for the words of the input sequence with respect to each other (the traditional computational attention previously used in Seq2Seq models paid attention to the input words in relation to the output). Self-Attention finds correlations between different words in the input sequence, which indicate the syntactic and contextual structure of the sentence. For example, in the sentence “The animal didn’t cross the street because it was too tired”, self-attention lets the model associate “it” with “animal” rather than “street”. This Self-Attention is learned over time as the model sees more data and optimises the weights used for calculating attention.

Computing Attention
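
For reference, the scaled dot-product attention expression from the original Transformer paper is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V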

The formula above is the standard expression for calculating scaled dot-product attention. Q, K and V correspond to queries, keys and values, which can be thought of as in any information retrieval system: the Query is the word we want to calculate attention for, while the Keys and Values represent the other words in the sequence that we compare Q with. d_k is the dimensionality of the key vectors, and dividing by its square root keeps the dot products from growing too large. The softmax function maps the scores to values between 0 and 1 that sum to 1, so attention can be read as a probability distribution.

In plain English, the formula compares each element of the Query set (every word in the sequence) with every other word in the sequence. Note that the other words enter the computation twice: first as keys, where the dot products QKᵀ score how relevant each word is to the query, and then as values, which are combined using those scores to produce the final output. The resulting attention vector tells us, for each query word, how related and how important every other word in the sequence is to it (in the case of self-attention).
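
To make this concrete, here is a minimal NumPy sketch of self-attention over a single sequence. The dimensions and weight matrices (W_q, W_k, W_v) are illustrative assumptions rather than values from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) word embeddings for one sequence."""
    Q = X @ W_q                                # queries
    K = X @ W_k                                # keys
    V = X @ W_v                                # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # relevance of every word to every other word
    weights = softmax(scores, axis=-1)         # attention distribution over the sequence
    return weights @ V, weights                # weighted combination of values + attention map

# toy usage: 5 'words' with 8-dimensional embeddings, projected down to 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)   # (5, 4) (5, 5)
```

The returned weights matrix is exactly the attention map discussed later: row i tells us how strongly word i attends to every other word in the sequence.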

How do Vision Transformers use (Self) Attention?

Key Idea

Vision Transformers, proposed by Dosovitskiy et al., use self-attention to classify images. The key mechanism they adopt from NLP Transformers is splitting an image into patches and figuring out how important the features in one patch are in relation to the features in the other patches.

Essentially, ViTs learn how to pay attention to different regions of the image to identify the features that help the model make the necessary decisions (such as assigning a class in a classification problem).

The Pipeline

Vision Transformers use a very intuitive pipeline to process images using self-attention. The image classification pipeline is described in brief below, followed by a minimal code sketch of the same steps:

ViT architecture via Dosovitskiy et al. Ignore the Norm and skip-connection layers — they are used for Machine Learning optimisation reasons and are quite irrelevant to us in this article!
  • The ViT splits the input image into 16x16 patches, which are flattened and projected into patch embeddings (1D vectors) so that they can be processed mathematically.
  • The ViT has no way of remembering where each patch was located in the original image. So each patch goes through positional encoding, and the resulting positional vector is added to the patch embedding. This way, attention can be computed positionally as well and not just semantically!
  • The sequence of patch embeddings is prepended with a learnable class token during training. This token takes part in the attention computation along with every patch, so by the end of the encoder its representation summarises how each patch and its features matter for the class. E.g. a bird image will have circular features (eyes) in some patches, so those features and patches are weighed heavily when the class token gathers its information.
  • The sequence of patch embeddings and the class token are then fed into the Transformer Encoder. The important part of the encoder is the Multi-Head Attention (MHA) layer, where the heavy lifting of computing self-attention occurs. ‘Multi-Head’ refers to multiple self-attention heads working in parallel on the same information. In these heads, the weights for the Queries, Keys and Values are learned during training and are used to compute self-attention. The combined output of the heads gives us a richer representation of attention, since each head works with its own unique weights. The output at this stage can be visualised as an attention map. These maps help us understand which parts of the image the ViT is ‘paying more attention’ to for certain images (pictured below).
Attention maps generated by Vision Transformers for sample images, via Viso
  • The output from the MHA layer is fed into the MLP Head, which is a Multi-Layer Perceptron (just a fancy way of describing a small neural network). This network finally classifies the input image into one of the provided classes, based on the attention-weighted representation of the image.
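
To tie the steps together, here is a minimal PyTorch sketch of the pipeline described above, assuming illustrative hyperparameters (224x224 images, 16x16 patches, a small encoder, ten classes). It is a simplified approximation, not the exact configuration or code from Dosovitskiy et al.:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, d_model=192,
                 depth=4, n_heads=3, n_classes=10):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # patch_size x patch_size patches and projects each one to a d_model vector.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)
        # Learnable class token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        # Transformer encoder: stacked multi-head self-attention + MLP blocks.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # MLP head: classifies from the final class-token representation.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, images):                       # images: (B, 3, H, W)
        x = self.patch_embed(images)                 # (B, d_model, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)             # (B, n_patches, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)               # prepend the class token
        x = x + self.pos_embed                       # add positional information
        x = self.encoder(x)                          # self-attention over patches
        return self.head(x[:, 0])                    # class logits from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 10])
```

The final representation of the class token (x[:, 0]) is what the MLP Head uses to produce the class scores, mirroring the last two bullets above.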

Concluding Thoughts

The self-attention layers in the Multi-Head Attention component of the Transformer encoder are analogous to the feature-preferring neurons in our own cognitive attention mechanism. Just as our mechanism tells us which parts of our visual field to pay attention to while driving a car safely, or which sounds to listen to when carrying out a conversation with a friend in a loud environment, the MHA layer tells the Vision Transformer which parts of the image are important for classifying it.

In theory, the fact that ViTs down-weight parts of the image based on a learned notion of importance should make them efficient. This is evident in faster training times for ViTs compared to state-of-the-art Convolutional Neural Networks (CNNs). However, this comes at the cost of a larger dataset being required to reach comparable accuracies. It takes more data for ViTs to learn all the attention patterns they need, something that is likely to be improved in the future — we are still in the infancy of Vision Transformers!

Vision Transformers seem to be a promising step toward developing Artificial Intelligence in ways that resemble our own ways of processing information. We have adopted attention, a very central mechanism of our own cognitive process, to help machines process information effectively. This takes us one step closer to a more general, human-like and real artificial intelligence.

In my next article, I will cover which of the two leading architectures in Computer Vision — CNNs or ViTs — is more like how humans process visual information.

Niranjan Rajesh
Bits and Neurons

Hey! I am a student at Ashoka interested in the intersection of computation and cognition. I write my thoughts on cool concepts and papers from this field.