Are Transformers outperforming CNNs?

Devangi Purkayastha
SRM ACM Women
8 min read · Dec 31, 2020


As attractive and exciting as the two AI domains of Computer Vision and Natural Language Processing sound, how enthralling would it be if these two fields came together to help each other grow? In this post I would like to briefly explain, without diving into too much technical detail, the significance of the paper “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” (still under review at the time of writing, submitted anonymously to the 2021 ICLR conference to satisfy its double-blind review requirements).

Image Source: https://soulpageit.com/computer-vision-and-the-future-of-work/

Over the past few years, Deep Learning has become an attractive subdomain of Machine Learning and has outperformed traditional Computer Vision algorithms in a wide range of applications such as object detection, classification and semantic segmentation, as well as in other useful applications such as navigation guidance. In the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), Convolutional Neural Networks (CNNs) have shown a promising ability to accurately detect and localize a variety of object types. In 2012, a deep CNN model called AlexNet [1] produced the best results in the image classification challenge. Following this, several CNN models have been proposed and have dominated the ILSVRC competition ever since: the Visual Geometry Group network (VGGNet) [2], Residual Neural Network (ResNet) [3], Dense Convolutional Network (DenseNet) [4], Xception [5], MobileNet [6], NASNet [7] and MobileNetV2 [8], to name a few. In addition, CNN methods have also been employed to perform MRI tumor segmentation tasks [9, 10].

Convolutional Neural Networks have revolutionized the field of Computer Vision over the past few years. It has become accepted and established that convolutions are the go-to operations for tasks on images, be it recognition, detection, classification, image generation, and so on.

With the introduction of Transformers in the paper “Attention Is All You Need” (2017), Natural Language Processing had its own ImageNet moment. The paper introduced an architecture for sequence modeling built entirely on attention, dispensing with recurrence. Transformers have since inspired breakthroughs such as BERT, GPT-2, GPT-3 and T5 in the field of NLP.

Figure 1: The Transformer architecture, from “Attention Is All You Need” by Vaswani et al.

What is Attention, after all?

Recurrent models (like RNNs, GRUs and LSTMs) take in a sequence in the order it is given and produce an output sequence step by step. This prevents parallelization within training examples, which becomes critical at longer sequence lengths, since memory constraints limit batching across examples. In other words, if we rely on sequential processing, we need to compute the beginning of a text before we can compute its ending, so we cannot exploit parallel computing and have to wait until the earlier computations are complete (the short sketch after the list below makes this contrast concrete). So if the sequence is too long:

  1. it will take a long time to be processed
  2. we lose a good amount of information mentioned earlier in the sequence
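To make this limitation concrete, here is a minimal, hypothetical PyTorch sketch (illustrative only, not taken from any of the papers discussed here) contrasting the step-by-step loop of a recurrent cell with the single batched matrix multiplication behind self-attention:

```python
import torch

seq_len, d_model = 8, 16
x = torch.randn(seq_len, d_model)           # one toy input sequence

# Recurrent-style processing: step t depends on the hidden state of step t-1,
# so the loop cannot be parallelized across time steps.
W_x = torch.randn(d_model, d_model)
W_h = torch.randn(d_model, d_model)
h = torch.zeros(d_model)
hidden_states = []
for t in range(seq_len):
    h = torch.tanh(x[t] @ W_x + h @ W_h)    # must wait for the previous h
    hidden_states.append(h)

# Attention-style processing: every position attends to every other position
# in one batched matrix multiplication, so all positions are handled in parallel
# and distant positions are only one step apart.
scores = (x @ x.T) / d_model ** 0.5         # (seq_len, seq_len) pairwise scores
weights = torch.softmax(scores, dim=-1)
context = weights @ x                       # all positions computed at once
```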

For instance, in a statement like “I offered Tiya a muffin, but she refused to take it”, a human reader clearly interprets that “she” refers to Tiya whereas “it” indicates the muffin. However, for a model that only finds patterns in nearby data, this relation may be impossible to detect. Attention mechanisms have therefore become critical for sequence modeling in various tasks, allowing dependencies to be modeled without caring too much about their distance in the input or output sequences. So unlike RNNs, which handle only relatively short sequences well, attention works particularly well on language data because it keeps track of longer-term dependencies and aggregates context extremely well.

At a high level, transformers are composed of an embedding layer, an encoder and a decoder.

  1. The embedding layer embeds each input word into a fixed-size vector (with positional information added to each vector); these vectors are passed through the encoder, yielding an abstract continuous representation.
  2. The encoder comprises a multi-headed attention layer and a multilayer perceptron. The multi-headed attention layer learns the association of every word with every other word; in the example above, it is this layer's job to learn that “she” refers to Tiya while “it” refers to the muffin. The input is split into several heads so that each head can learn a different pattern of self-attention, hence the name multi-headed attention. The outputs of all the heads are concatenated and passed through the multilayer perceptron (a minimal code sketch of the attention layer follows this list).
  3. The decoder converts this abstract representation into meaningful outputs and contains a masked multi-headed attention layer, so that each position can only attend to earlier positions while generating the output.
  4. A few further implementation details include the Layer Normalization used throughout the Transformer and the stacking of several encoder (and decoder) layers to boost the model's predictive power.
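To make the attention block less abstract, here is a minimal PyTorch sketch of multi-headed self-attention. It is an illustrative re-implementation with assumed dimensions (a 64-dimensional model split across 8 heads), not code from the paper:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One linear projection each for queries, keys and values, plus an output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape                               # x: (batch, seq_len, d_model)
        # Split each projection into heads so every head can learn its own attention pattern.
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: every token attends to every other token.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)
        out = weights @ v                               # (batch, n_heads, seq_len, d_head)
        # Concatenate the heads and mix them with the output projection.
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out)

# Usage: one "sentence" of eight 64-dimensional token embeddings.
tokens = torch.randn(1, 8, 64)
attention = MultiHeadSelfAttention()
print(attention(tokens).shape)                          # torch.Size([1, 8, 64])
```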

I highly recommend Jay Alammar’s blog, “The Illustrated Transformer”, for a detailed and elaborate explanation on the topic.

For certain types of sequential data where attention is less significant (such as time-series forecasting on stock prices or daily sales data), recurrent networks are still competitive and may even outperform attention-based models.

While Transformers have shown state-of-the-art results and become the de facto standard for NLP tasks such as translation, sentiment analysis, classification and conversational AI, there have also been experiments with applying them to images. So far these applications are limited: in vision, attention is either applied in conjunction with convolutional networks or used to replace specific components of ConvNets while keeping their overall structure in place.

It has been shown that this reliance on CNNs is not necessary: a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks. Dependencies between distant objects or points matter in sequence models, and they certainly cannot be neglected in image tasks either; classifying an image correctly requires combining information from all parts of the image, however far apart they are.

The main reason why attention models have not (yet) exhibited excellent performance in Computer Vision lies in their quadratic complexity: with N tokens, the attention matrix has N² entries. If every pixel of a 1000x1000 image is a token, that is a million tokens, so a full set of pairwise attention weights would have on the order of a trillion terms (and that's for ONE image!). Additionally, individual pixels in an image don't carry a lot of information by themselves, so using attention to connect them does not seem to accomplish much.
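A quick back-of-the-envelope calculation (the NLP sequence length here is just an illustrative assumption) shows how fast this quadratic cost blows up when tokens are pixels rather than words:

```python
# The attention matrix has N^2 entries for N tokens.
tokens_nlp = 512          # a typical sequence length for an NLP model
pixels = 1000 * 1000      # tokens if every pixel of a 1000x1000 image is its own token

print(f"512-token sentence: {tokens_nlp ** 2:,} attention weights")  # 262,144
print(f"1000x1000 image:    {pixels ** 2:,} attention weights")      # 1,000,000,000,000
```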

The schematic of the Vision Transformer (from https://openreview.net/pdf?id=YicbFdNTTy)

The paper An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale suggests an approach of using attention not on pixels, but on small patches of the image (presumably 16x16 as the title states, although the optimal patch size would depend on the dimensions and contents of the image to which the model is applied).

Since we're dealing with images and not words, the input image is first divided into patches, which are then flattened and passed through a trainable linear projection layer (this imitates the embedding layer); positional information is added to the resulting fixed-size vectors before they are fed into the Transformer. In a variant the authors call the hybrid model, the patch embeddings are instead produced by a convolutional backbone, such as several layers of a ResNet or VGGNet. An extra learnable class token is concatenated to the inputs (position 0 in the figure) as a placeholder for the class to be predicted in the classification task.
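Below is a minimal sketch of this input pipeline in PyTorch. The hyperparameters (224x224 images, 16x16 patches, 768-dimensional embeddings) follow the ViT-Base configuration, but the code is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

image_size, patch_size, d_model = 224, 16, 768
n_patches = (image_size // patch_size) ** 2            # 14 * 14 = 196 patches per image

# Trainable linear projection of flattened patches; a strided convolution performs
# the "split into patches and project" step in one operation.
patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

# Learnable class token and positional embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

images = torch.randn(8, 3, image_size, image_size)     # a toy batch of 8 images
patches = patch_embed(images)                          # (8, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)           # (8, 196, 768) patch embeddings

cls = cls_token.expand(images.shape[0], -1, -1)        # one class token per image
tokens = torch.cat([cls, patches], dim=1) + pos_embed  # (8, 197, 768) Transformer input

# `tokens` would now pass through a stack of Transformer encoder layers; the final
# state of the class token is what the MLP head uses for classification.
```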

The fully-connected MLP head at the output provides the class prediction. As transfer learning is common practice nowadays, the main model can be pre-trained on a large dataset of images, and the final MLP head can then be fine-tuned to a specific task via the standard transfer-learning approach.
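As a rough sketch of that transfer-learning step (an assumed workflow, with a stand-in module playing the role of the pre-trained backbone), one could freeze the pre-trained encoder and train only a newly attached head:

```python
import torch
import torch.nn as nn

n_classes, d_model = 10, 768

# Placeholder for a ViT backbone pre-trained on a large image dataset; in practice
# this would be the full patch-embedding + encoder stack sketched above.
pretrained_encoder = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
for p in pretrained_encoder.parameters():
    p.requires_grad = False                      # freeze the pre-trained weights

head = nn.Linear(d_model, n_classes)             # new task-specific classification head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

cls_features = torch.randn(32, d_model)          # class-token features for a toy batch
labels = torch.randint(0, n_classes, (32,))

logits = head(pretrained_encoder(cls_features))  # forward pass through the frozen backbone
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                  # gradients flow only into the head
optimizer.step()
```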

Results of the Vision Transformer paper

Results:

As one would expect, the bigger the model, the better the results.

Though the reported pre-training compute is about 2.5k TPUv3-core-days (wow!), that is still less than what the SOTA BiT-L and Noisy Student models required.

Accuracy increases on increasing the number of samples

Another significant feature of the new model is that it is more efficient than convolutional networks, attaining the same prediction accuracy with significantly less computation. The size of the pre-training dataset also has a strong impact on the results reported in the Vision Transformer paper. As the model is given more and more data, its performance keeps improving; it is at its best when pre-trained on Google's private JFT-300M dataset of 300 million (woah) images, resulting in SOTA accuracy on various benchmarks (ImageNet has only about 14 million images). It is also observed that the Top-1 accuracy increases as the number of samples drawn from the JFT dataset increases.

It's amazing to see such wonderful advancements and such an intriguing amalgamation of Computer Vision with Natural Language Processing. Let's hope this pre-trained model is made publicly available soon.

That’s it, folks! Hope you had a good read. Stay tuned for more such articles.

Feel free to check out my GitHub profile : github.com/devangi2000

Resources :

  1. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks, vol 60. Association for Computing Machinery. https://doi.org/10.1145/3065386
  2. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition
  3. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol 2016- Decem. https://doi.org/10.1109/CVPR.2016.90
  4. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings — 30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR.2017.243
  5. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings–30th IEEE conference on computer vision and pattern recognition, CVPR 2017, vol 2017-January. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR.2017.195
  6. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications
  7. Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 8697–8710. https://doi.org/10.1109/CVPR.2018.00907
  8. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 4510–4520. https://doi.org/10.1109/CVPR.2018.00474
  9. Naceur MB, Saouli R, Akil M, Kachouri R (2018) Fully automatic brain tumor segmentation using end-to-end incremental deep neural networks in MRI images. Comput Methods Programs Biomed 166:39–49. https://doi.org/10.1016/j.cmpb.2018.09.007
  10. Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, Pal C, Jodoin PM, Larochelle H (2017) Brain tumor segmentation with deep neural networks. Med Image Anal 35:18–31. https://doi.org/10.1016/j.media.2016.05.004
  11. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin — Attention is All you Need : https://arxiv.org/abs/1706.03762

