Vision Transformers Use Case: Satellite Image Classification without CNNs

Published in

Nerd For Tech

6 min readMay 13, 2021

Convolutional neural networks have been widely used in computer vision tasks in recent years as the state of the art. Classification, detection and segmentation of images use convolutional filters to extract feature maps from the input images, providing elements for the next layers to perform their specified tasks. However, an architecture that does not use convolutions called Vision Transformers (ViT) has been showing encouraging results in tasks such as image classification and object detection. Is the CNN reign coming to an end?

Convolutional Neural Networks

To learn more about CNNs, visit the link below:

A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way

Artificial Intelligence has been witnessing a monumental growth in bridging the gap between the capabilities of humans…

towardsdatascience.com

Transformers

The paper ‘Attention Is All You Need’ introduces a novel architecture called Transformer. As the title indicates, it uses the attention-mechanism we saw earlier. Like LSTM, Transformer is an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it differs from the previously described/existing sequence-to-sequence models because it does not imply any Recurrent Networks (GRU, LSTM, etc.).

An Introduction to Transformers and Sequence-to-Sequence Learning for Machine Learning

medium.com

Vision Transformers

The paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale presented a use of transformers for image classification. While CNN uses pixel arrays, Visual Transformer (ViT) divides the image into visual tokens. Firstly, Split an image into patches. Image patches are treated as words in NLP. We have patch embedding layers that are input to transformer blocks. The sequence of pictures will have its own vectors. List of vectors as a picture because a picture is 16 times 16 words region transformer.

“The image is divided into small patches here let’s say 9, and each patch might contain 16×16 pixels. The input sequence consists of a flattened vector ( 2D to 1D ) of pixel values from a patch of size 16×16. Each flattened element is fed into a linear projection layer that will produce what they call the “Patch embedding”.

Position embeddings are then linearly added to the sequence of image patches so that the images can retain their positional information. It injects information about the relative or absolute position of the image patches in the sequence.

An extra learnable ( class) embedding is attached to the sequence according to the position of the image patch. This class embedding is used to predict the class of the input image after being updated by self-attention.

The classification is performed by just stacking an MLP Head on top of the Transformer, at the position of the extra learnable embedding that we added to the sequence.” (https://www.analyticsvidhya.com/blog/2021/03/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-vision-transformers/)

ViT were also used for Object Detection in the paper: “End-to-End Object Detection with Transformers”. For more information access:

Are You Ready for Vision Transformer (ViT)?

“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” May Bring Another Breakthrough to Computer…

towardsdatascience.com

Vision Transformers — attention for vision task.

Recently there’s paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” on open-review. It…

becominghuman.ai

Will Transformers Replace CNNs in Computer Vision?

In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new…

pub.towardsai.net

Transformers in Remote Sensing Image Analisys

In the analysis of Remote Sensing images, the use of ViT begins to be explored, for example, Bazi et al. (2021) used the Aerial image dataset (AID) and the Merced dataset for the task of image classification.

https://www.mdpi.com/2072-4292/13/3/516/htm

The results of this work showed that this architecture is very promising, obtaining higher values than consolidated architectures that use convolutions.

Study Case: EuroSat Database

We will use the Eurosat Dataset to evaluate the use of ViT in image classification. This dataset contains 27000 images with 13 spectral bands, divided into 10 different classes of land use and land cover.

https://arxiv.org/abs/1709.00029

We will use Google Colab to build and train our ViT. Taking advantage of Colab’s integration with Google Drive, we will store the dataset in Drive, being easily accessed. First, let’s plot a bar graph with the number of images per class:

We note that the classes contain between 2000 to 3000 images, being partially balanced. Thus, we can carry out the pre-processing steps and the construction of the models that we will use in this comparison.

NN Models:

Custom CNN
VGG16
ResNet-50
ViT

Custom CNN

The first model to be used in the classification of the Eurosat dataset will be a customized CNN, with some layers and parameters configured by the user.

After 300 training epochs, the test dataset has a acuraccy of 0.93.

So we can plot the confusion matrix by class, indicating the quantity predicted by the model:

VGG16

Now we are going to feed our data into a VGG16 and train it to verify the results:

We will also train this model for 300 epochs. The accuracy obtained in the test dataset was 0.95, slightly better than the customized CNN.

Confusion matrix:

ResNet-50

We will use ResNet50, a very famous architecture and widely used in computer vision tasks. Like the VGG16, the ResNet-50 is made available by Keras.

After 300 training epochs, the model showed an accuracy of 0.83 in the test dataset, well below the values obtained by previous models.

Confusion Matrix:

Visual Transformers

For the building of this model we will use the example of the link below, where the ViT architecture was created and trained in the cifar10 dataset:

Keras documentation: Image Classification with Vision Transformer

Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision…

keras.io

So after using 100 epochs for training, the model obtained an accuracy of 0.92 in the test dataset.