Vision Transformers vs. Convolutional Neural Networks

Fahim Rustamy, PhD
7 min read · Jun 4, 2023


This blog post is inspired by the paper titled AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE from Google's research team. The paper proposes applying a pure Transformer directly to image patches for image classification. When pre-trained on large amounts of data, the Vision Transformer (ViT) matches or outperforms state-of-the-art convolutional networks on multiple benchmarks while requiring substantially fewer computational resources to train.

The code can be found in this GitHub repo:

https://github.com/RustamyF/vision-transformer

Transformers have become the model of choice in NLP due to their computational efficiency and scalability. In computer vision, convolutional neural network (CNN) architectures remain dominant, but some researchers have tried combining CNNs with self-attention. The authors experimented with applying a standard Transformer directly to images and found that when trained on mid-sized datasets, the models had modest accuracy compared to ResNet-like architectures. However, when trained on larger datasets, the Vision Transformer (ViT) achieved excellent results and approached or surpassed the state of the art on multiple image recognition benchmarks.

Figure 1 (taken from the original paper) describes a model that processes 2D images by transforming them into sequences of flattened 2D patches. The patches are then mapped to a constant latent vector size with a trainable linear projection. A learnable embedding is prepended to the sequence of patches and its state at the output of the Transformer encoder serves as the image representation. The image representation is then passed through a classification head for either pre-training or fine-tuning. Position embeddings are added to retain positional information and the sequence of embedding vectors serves as input to the Transformer encoder, which consists of alternating layers of multiheaded self-attention and MLP blocks.
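To make the patching step concrete, here is a minimal sketch of the input pipeline Figure 1 describes: cut the image into patches, flatten them, project them with a trainable linear layer, prepend a learnable class token, and add position embeddings. The shapes and the unfold-based reshape are illustrative assumptions, not the paper's or the repository's exact code.

import torch
import torch.nn as nn

# Illustrative sketch of the ViT input pipeline (16x16 patches, as in the paper's title).
image_size, patch_size, channels, dim = 224, 16, 3, 768
num_patches = (image_size // patch_size) ** 2        # 14 * 14 = 196 patches
patch_dim = channels * patch_size * patch_size       # 3 * 16 * 16 = 768 values per patch

to_patch_embedding = nn.Linear(patch_dim, dim)        # trainable linear projection
cls_token = nn.Parameter(torch.randn(1, 1, dim))      # learnable [class] embedding
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

x = torch.randn(8, channels, image_size, image_size)  # a dummy batch of 8 images
# Cut each image into non-overlapping patches and flatten every patch.
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(8, channels, num_patches, patch_size * patch_size)
patches = patches.permute(0, 2, 1, 3).reshape(8, num_patches, patch_dim)

tokens = to_patch_embedding(patches)                      # (8, 196, dim)
cls = cls_token.expand(8, -1, -1)                         # prepend the class token
tokens = torch.cat([cls, tokens], dim=1) + pos_embedding  # add position embeddings
# `tokens` is the sequence that the Transformer encoder consumes.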

CNNs have long been the go-to choice for image processing tasks. They excel at capturing local spatial patterns through convolutional layers, enabling hierarchical feature extraction. CNNs are adept at learning from large amounts of image data and have achieved remarkable success in tasks like image classification, object detection, and segmentation.

While CNNs have a proven track record in various computer vision tasks and handle large-scale datasets efficiently, Vision Transformers offer advantages in scenarios where global dependencies and contextual understanding are crucial. However, Vision Transformers typically require larger amounts of training data to achieve performance comparable to CNNs. Also, CNNs are computationally efficient thanks to their local connectivity and weight sharing, which makes them more practical for real-time and resource-constrained applications.

Example: CNN vs. Vision Transformer

In this section, we will train an image classifier on the cats-and-dogs dataset available on Kaggle, using both the CNN and the Vision Transformer approach. First, we will download the dataset, which contains 25,000 RGB images, from Kaggle. If you haven't already, you can read the instructions here to learn how to set up your Kaggle API credentials. The following Python code will download the dataset into your current working directory.

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

# we write to the current directory with './'
api.dataset_download_files('karakaggle/kaggle-cat-vs-dog-dataset', path='./')

Once the files are downloaded, you can unzip the files using the following commands.

!unzip -qq kaggle-cat-vs-dog-dataset.zip
!rm -r kaggle-cat-vs-dog-dataset.zip

Clone the vision-transformer GitHub repository using the following command. This repository contains all the code required for the vision transformer under the vision_tr directory.

!git clone https://github.com/RustamyF/vision-transformer.git
!mv vision-transformer/vision_tr .

The downloaded data needs to be cleaned and prepared for training our image classifier. The following utility functions clean the data and prepare it for loading with PyTorch's DataLoader.

import torch.nn as nn
import torch
import torch.optim as optim

from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader, Dataset
from PIL import Image
from sklearn.model_selection import train_test_split

import os
import shutil


class LoadData:
    def __init__(self):
        self.cat_path = 'kagglecatsanddogs_3367a/PetImages/Cat'
        self.dog_path = 'kagglecatsanddogs_3367a/PetImages/Dog'

    def delete_non_jpeg_files(self, directory):
        for filename in os.listdir(directory):
            if not filename.endswith('.jpg') and not filename.endswith('.jpeg'):
                file_path = os.path.join(directory, filename)
                try:
                    if os.path.isfile(file_path) or os.path.islink(file_path):
                        os.unlink(file_path)
                    elif os.path.isdir(file_path):
                        shutil.rmtree(file_path)
                    print('deleted', file_path)
                except Exception as e:
                    print('Failed to delete %s. Reason: %s' % (file_path, e))

    def data(self):
        self.delete_non_jpeg_files(self.dog_path)
        self.delete_non_jpeg_files(self.cat_path)

        dog_list = os.listdir(self.dog_path)
        dog_list = [(os.path.join(self.dog_path, i), 1) for i in dog_list]

        cat_list = os.listdir(self.cat_path)
        cat_list = [(os.path.join(self.cat_path, i), 0) for i in cat_list]

        total_list = cat_list + dog_list

        train_list, test_list = train_test_split(total_list, test_size=0.2)
        train_list, val_list = train_test_split(train_list, test_size=0.2)
        print('train list', len(train_list))
        print('test list', len(test_list))
        print('val list', len(val_list))
        return train_list, test_list, val_list


# Data augmentation
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])


class dataset(torch.utils.data.Dataset):

    def __init__(self, file_list, transform=None):
        self.file_list = file_list
        self.transform = transform

    # dataset length
    def __len__(self):
        self.filelength = len(self.file_list)
        return self.filelength

    # load one image and its label
    def __getitem__(self, idx):
        img_path, label = self.file_list[idx]
        img = Image.open(img_path).convert('RGB')
        img_transformed = self.transform(img)
        return img_transformed, label
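With LoadData and dataset defined, the splits can be wrapped into DataLoaders. A minimal sketch follows; the batch size of 32 is an illustrative assumption, not a value prescribed by the repository.

# Build the train/val/test splits and wrap them in DataLoaders.
# The batch size of 32 is an illustrative choice.
train_list, test_list, val_list = LoadData().data()

train_data = dataset(train_list, transform=transform)
val_data = dataset(val_list, transform=transform)
test_data = dataset(test_list, transform=transform)

train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32, shuffle=False)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)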

CNN Approach

The CNN model for this image classifier consists of three 2D convolutional layers, each with a kernel size of 3 and a stride of 2, followed by batch normalization, a ReLU activation, and 2x2 max pooling. After the convolutional layers come two fully connected layers: the first with 10 nodes and the second with 2 output nodes, one per class. Here is a code snippet that illustrates this structure:

class Cnn(nn.Module):
    def __init__(self):
        super(Cnn, self).__init__()

        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.layer3 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=0, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        # The three conv blocks reduce a 224x224 input to a 3x3x64 feature map.
        self.fc1 = nn.Linear(3 * 3 * 64, 10)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(10, 2)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = self.layer3(out)
        out = out.view(out.size(0), -1)  # flatten to (batch, 3 * 3 * 64)
        out = self.relu(self.fc1(out))
        out = self.fc2(out)
        return out

The training was performed on a Tesla T4 (g4dn-xlarge) GPU for 10 epochs. The Jupyter notebook in the project's GitHub repository contains the training loop along with the per-epoch results.
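For readers who want the shape of that loop without opening the notebook, here is a minimal sketch. It assumes the train_loader built earlier; the Adam optimizer, learning rate, and device handling are assumptions rather than the notebook's exact settings.

# Minimal training-loop sketch (hyperparameters are illustrative).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Cnn().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    model.train()
    epoch_loss, epoch_correct, total = 0.0, 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * images.size(0)
        epoch_correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += images.size(0)
    print(f'epoch {epoch + 1}: loss={epoch_loss / total:.4f}, acc={epoch_correct / total:.4f}')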

Vision Transformer Approach

The Vision Transformer architecture has customizable dimensions that can be adjusted to specific requirements. Even so, for a dataset of this size the architecture below is still relatively large.

from vision_tr.simple_vit import ViT

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = ViT(
    image_size=224,
    patch_size=32,
    num_classes=2,
    dim=128,
    depth=12,
    heads=8,
    mlp_dim=1024,
    dropout=0.1,
    emb_dropout=0.1,
).to(device)

Each parameter in the Vision Transformer plays a key role and is described here; a short sketch after the list shows how image_size and patch_size determine the Transformer's sequence length:

  • image_size=224: This parameter specifies the desired size (width and height) of the input images to the model. In this case, the images are expected to be of size 224x224 pixels.
  • patch_size=32: The images are divided into smaller patches, and this parameter defines the size (width and height) of each patch. In this case, each patch is 32x32 pixels.
  • num_classes=2: This parameter indicates the number of classes in the classification task. In this example, the model is designed to classify inputs into two classes (cats and dogs).
  • dim=128: It specifies the dimensionality of the embedding vectors in the model. The embeddings capture the representation of each image patch.
  • depth=12: This parameter defines the depth or number of layers in the Vision Transformer model (encoder model). A higher depth allows for more complex feature extraction.
  • heads=8: This parameter represents the number of attention heads in the self-attention mechanism of the model.
  • mlp_dim=1024: It specifies the dimensionality of the Multi-Layer Perceptron (MLP) hidden layers in the model. The MLP is responsible for transforming the token representations after self-attention.
  • dropout=0.1: This parameter controls the dropout rate, which is a regularization technique used to prevent overfitting. It randomly sets a fraction of input units to 0 during training.
  • emb_dropout=0.1: It defines the dropout rate specifically applied to the token embeddings. This dropout helps prevent over-reliance on specific tokens during training.
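To see how image_size and patch_size fix the sequence length, here is a quick back-of-the-envelope check (assuming the non-overlapping patching described earlier):

# With image_size=224 and patch_size=32, the image is cut into a 7x7 grid of patches.
image_size, patch_size, channels = 224, 32, 3
num_patches = (image_size // patch_size) ** 2   # (224 // 32) ** 2 = 49 patches
patch_dim = channels * patch_size ** 2          # 3 * 32 * 32 = 3072 values per flattened patch
seq_len = num_patches + 1                       # +1 for the class token -> 50 tokens
print(num_patches, patch_dim, seq_len)          # 49 3072 50

Each flattened 3072-value patch is mapped by the linear projection down to dim=128, so the encoder operates on a sequence of 50 tokens of dimension 128.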

The vision transformer classifier was trained on the same Tesla T4 (g4dn-xlarge) GPU, this time for 20 epochs rather than the 10 used for the CNN, because its training loss converged more slowly. The per-epoch results are available in the notebook.

The CNN approach reached 75% accuracy in 10 epochs, while the vision transformer model reached 69% accuracy and took significantly longer to train.
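Accuracies like these come from a straightforward evaluation pass over the held-out test set. A minimal sketch, assuming the test_loader built earlier, works for either model:

# Minimal evaluation sketch (illustrative, not the notebook's exact code).
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f'test accuracy: {correct / total:.2%}')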

Conclusion

In conclusion, when comparing CNN and Vision Transformer models, there are notable differences in terms of model size, memory requirements, accuracy, and performance. CNN models are traditionally known for their compact size and efficient memory utilization, making them suitable for resource-constrained environments. They have proven to be highly effective in image processing tasks and exhibit excellent accuracy in various computer vision applications.

On the other hand, Vision Transformers offer a powerful approach to capturing global dependencies and contextual understanding in images, resulting in improved performance on certain tasks. However, Vision Transformers tend to have larger model sizes and higher memory requirements than CNNs. While they can achieve impressive accuracy, especially on larger datasets, their computational demands can limit their practicality in scenarios with limited resources.

Ultimately, the choice between CNN and Vision Transformer models depends on the specific requirements of the task at hand, considering factors such as available resources, dataset size, and the trade-off between model complexity, accuracy, and performance. As the field of computer vision continues to evolve, further advancements in both architectures are expected, enabling researchers and practitioners to make more informed choices based on their specific needs and constraints.
