Vision Transformers Use Case: Satellite Image Classification without CNNs

Vision Trasnformers Architecture

Convolutional neural networks have been widely used in computer vision tasks in recent years as the state of the art. Classification, detection and segmentation of images use convolutional filters to extract feature maps from the input images, providing elements for the next layers to perform their specified tasks. However, an architecture that does not use convolutions called Vision Transformers (ViT) has been showing encouraging results in tasks such as image classification and object detection. Is the CNN reign coming to an end?

Convolutional Neural Networks

To learn more about CNNs, visit the link below:


The paper ‘Attention Is All You Need’ introduces a novel architecture called Transformer. As the title indicates, it uses the attention-mechanism we saw earlier. Like LSTM, Transformer is an architecture for transforming one sequence into another one with the help of two parts (Encoder and Decoder), but it differs from the previously described/existing sequence-to-sequence models because it does not imply any Recurrent Networks (GRU, LSTM, etc.).


Vision Transformers

The paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale presented a use of transformers for image classification. While CNN uses pixel arrays, Visual Transformer (ViT) divides the image into visual tokens. Firstly, Split an image into patches. Image patches are treated as words in NLP. We have patch embedding layers that are input to transformer blocks. The sequence of pictures will have its own vectors. List of vectors as a picture because a picture is 16 times 16 words region transformer.

“The image is divided into small patches here let’s say 9, and each patch might contain 16×16 pixels. The input sequence consists of a flattened vector ( 2D to 1D ) of pixel values from a patch of size 16×16. Each flattened element is fed into a linear projection layer that will produce what they call the “Patch embedding”.

Position embeddings are then linearly added to the sequence of image patches so that the images can retain their positional information. It injects information about the relative or absolute position of the image patches in the sequence.

An extra learnable ( class) embedding is attached to the sequence according to the position of the image patch. This class embedding is used to predict the class of the input image after being updated by self-attention.

The classification is performed by just stacking an MLP Head on top of the Transformer, at the position of the extra learnable embedding that we added to the sequence.” (

ViT were also used for Object Detection in the paper: “End-to-End Object Detection with Transformers”. For more information access:

Transformers in Remote Sensing Image Analisys

In the analysis of Remote Sensing images, the use of ViT begins to be explored, for example, Bazi et al. (2021) used the Aerial image dataset (AID) and the Merced dataset for the task of image classification.

The results of this work showed that this architecture is very promising, obtaining higher values than consolidated architectures that use convolutions.

Study Case: EuroSat Database

We will use the Eurosat Dataset to evaluate the use of ViT in image classification. This dataset contains 27000 images with 13 spectral bands, divided into 10 different classes of land use and land cover.

We will use Google Colab to build and train our ViT. Taking advantage of Colab’s integration with Google Drive, we will store the dataset in Drive, being easily accessed. First, let’s plot a bar graph with the number of images per class:

Samples by Class

We note that the classes contain between 2000 to 3000 images, being partially balanced. Thus, we can carry out the pre-processing steps and the construction of the models that we will use in this comparison.

NN Models:

  • Custom CNN
  • VGG16
  • ResNet-50
  • ViT

Custom CNN

The first model to be used in the classification of the Eurosat dataset will be a customized CNN, with some layers and parameters configured by the user.

Custom CNN

After 300 training epochs, the test dataset has a acuraccy of 0.93.

Model Accuracy chart
Model Loss chart

So we can plot the confusion matrix by class, indicating the quantity predicted by the model:

Confusion Matrix


Now we are going to feed our data into a VGG16 and train it to verify the results:


We will also train this model for 300 epochs. The accuracy obtained in the test dataset was 0.95, slightly better than the customized CNN.

Model Acuraccy VGG16
Model Loss VGG16

Confusion matrix:

Confusion Matrix


We will use ResNet50, a very famous architecture and widely used in computer vision tasks. Like the VGG16, the ResNet-50 is made available by Keras.


After 300 training epochs, the model showed an accuracy of 0.83 in the test dataset, well below the values obtained by previous models.

Model Acuraccy ResNet-50
Model Loss ResNet-50

Confusion Matrix:

Confusion Matrix

Visual Transformers

For the building of this model we will use the example of the link below, where the ViT architecture was created and trained in the cifar10 dataset:

Patchs Creation

So after using 100 epochs for training, the model obtained an accuracy of 0.92 in the test dataset.

Model Acuraccy ViT
Model Loss ViT

Confusion Matrix:

Confusion Matrix


As in the works presented above, our case study demonstrates that the Vision Transformers results for image classification are very exciting. There are other works that combine convolutions and ViT for computer vision tasks, but as we have seen in many articles it may be that the reign of convolutional networks is coming to an end…

Follow me on LinkedIn:


Nerd For Tech

From Confusion to Clarification

Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit Don’t forget to check out Ask-NFT, a mentorship ecosystem we’ve started

Joao Otavio Nascimento Firigato

Written by

Deep Learning Computer Vision for Remote Sensing Images

Nerd For Tech

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To know more about us, visit Don’t forget to check out Ask-NFT, a mentorship ecosystem we’ve started

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store