Convolutional neural networks have been widely used in computer vision tasks in recent years as the state of the art. Classification, detection and segmentation of images use convolutional filters to extract feature maps from the input images, providing elements for the next layers to perform their specified tasks. However, an architecture that does not use convolutions called Vision Transformers (ViT) has been showing encouraging results in tasks such as image classification and object detection. Is the CNN reign coming to an end?
Convolutional Neural Networks
To learn more about CNNs, visit the link below:
A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way
Artificial Intelligence has been witnessing a monumental growth in bridging the gap between the capabilities of humans…
The paper ‘Attention Is All You Need’ introduces a novel architecture called Transformer. As the title indicates, it uses the attention-mechanism we saw earlier. Like LSTM, Transformer is an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it differs from the previously described/existing sequence-to-sequence models because it does not imply any Recurrent Networks (GRU, LSTM, etc.).
What is a Transformer?
An Introduction to Transformers and Sequence-to-Sequence Learning for Machine Learning
The paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale presented a use of transformers for image classification. While CNN uses pixel arrays, Visual Transformer (ViT) divides the image into visual tokens. Firstly, Split an image into patches. Image patches are treated as words in NLP. We have patch embedding layers that are input to transformer blocks. The sequence of pictures will have its own vectors. List of vectors as a picture because a picture is 16 times 16 words region transformer.
“The image is divided into small patches here let’s say 9, and each patch might contain 16×16 pixels. The input sequence consists of a flattened vector ( 2D to 1D ) of pixel values from a patch of size 16×16. Each flattened element is fed into a linear projection layer that will produce what they call the “Patch embedding”.
Position embeddings are then linearly added to the sequence of image patches so that the images can retain their positional information. It injects information about the relative or absolute position of the image patches in the sequence.
An extra learnable ( class) embedding is attached to the sequence according to the position of the image patch. This class embedding is used to predict the class of the input image after being updated by self-attention.
The classification is performed by just stacking an MLP Head on top of the Transformer, at the position of the extra learnable embedding that we added to the sequence.” (https://www.analyticsvidhya.com/blog/2021/03/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-vision-transformers/)
ViT were also used for Object Detection in the paper: “End-to-End Object Detection with Transformers”. For more information access:
Are You Ready for Vision Transformer (ViT)?
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” May Bring Another Breakthrough to Computer…
Vision Transformers — attention for vision task.
Recently there’s paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” on open-review. It…
Will Transformers Replace CNNs in Computer Vision?
In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new…
Transformers in Remote Sensing Image Analisys
In the analysis of Remote Sensing images, the use of ViT begins to be explored, for example, Bazi et al. (2021) used the Aerial image dataset (AID) and the Merced dataset for the task of image classification.
The results of this work showed that this architecture is very promising, obtaining higher values than consolidated architectures that use convolutions.
Study Case: EuroSat Database
We will use the Eurosat Dataset to evaluate the use of ViT in image classification. This dataset contains 27000 images with 13 spectral bands, divided into 10 different classes of land use and land cover.
We will use Google Colab to build and train our ViT. Taking advantage of Colab’s integration with Google Drive, we will store the dataset in Drive, being easily accessed. First, let’s plot a bar graph with the number of images per class:
We note that the classes contain between 2000 to 3000 images, being partially balanced. Thus, we can carry out the pre-processing steps and the construction of the models that we will use in this comparison.
- Custom CNN
The first model to be used in the classification of the Eurosat dataset will be a customized CNN, with some layers and parameters configured by the user.
After 300 training epochs, the test dataset has a acuraccy of 0.93.
So we can plot the confusion matrix by class, indicating the quantity predicted by the model:
Now we are going to feed our data into a VGG16 and train it to verify the results:
We will also train this model for 300 epochs. The accuracy obtained in the test dataset was 0.95, slightly better than the customized CNN.
We will use ResNet50, a very famous architecture and widely used in computer vision tasks. Like the VGG16, the ResNet-50 is made available by Keras.
After 300 training epochs, the model showed an accuracy of 0.83 in the test dataset, well below the values obtained by previous models.
For the building of this model we will use the example of the link below, where the ViT architecture was created and trained in the cifar10 dataset:
Keras documentation: Image Classification with Vision Transformer
Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision…
So after using 100 epochs for training, the model obtained an accuracy of 0.92 in the test dataset.
As in the works presented above, our case study demonstrates that the Vision Transformers results for image classification are very exciting. There are other works that combine convolutions and ViT for computer vision tasks, but as we have seen in many articles it may be that the reign of convolutional networks is coming to an end…
Follow me on LinkedIn: https://www.linkedin.com/in/jo%C3%A3o-otavio-firigato-4876b3aa/