Building Vision Transformer From Scratch using PyTorch: An Image Is Worth 16x16 Words

Shubh Mishra · Published in The Deep Hub · Feb 21, 2024

Hey 👏

I hope you are doing great.

I’m starting a series here on Medium for building various important ViT models from scratch with PyTorch. I’ll explain the code. I’ll explain the theory. I’ll break down things step by step. So without further ado let’s get straight to it.

What Are Vision Transformers

The self-attention Transformer is considered the de facto standard for sequence-to-sequence tasks with large context lengths. In 2021, “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” was published. The main idea was to leverage the global self-attention of Transformers for computer vision tasks.

The ViT model mainly introduces two things.

  1. Patch Embeddings
  2. Using the Transformer’s encoder block

The self-attention Transformer encoder used here is almost the same as the standard one introduced in the holy grail, “Attention Is All You Need”.

This implementation requires einops for performing operations on higher-dimensional tensors.

pip install einops

Patch Embeddings:

We want to take an input image, split it into small patches of size (patch_size, patch_size), and flatten each patch into a vector of length patch_size² × channels.

If the image size is 56×56 and the patch size is 4, the total number of patches is (image_size / patch_size)², i.e. 14×14 = 196.

import torch
from torch import nn, Tensor
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768, img_size: int = 224):
        super().__init__()
        self.patch_size = patch_size
        self.embed = nn.Sequential(
            # Cut the image into patches and project each one to emb_size in a single step
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            # (b, emb_size, h, w) -> (b, h*w, emb_size)
            Rearrange('b e h w -> b (h w) e'),
        )

    def forward(self, x: Tensor) -> Tensor:
        x = self.embed(x)   # (b, n_patches, emb_size)
        return x

self.embed(x), where x has shape (batch, channels, height, width).

The nn.Conv2d changes the shape to (batch, emb_size, height/patch_size, width/patch_size).

Then Rearrange changes the shape to (batch, n, emb_size), where n is the number of patches.

We could also form the patches first, flatten them, and project them with a Linear layer, but using a conv2d shows a significant performance gain.

Ex: If the input image has shape (3, 224, 224) and patch_size = 14, the patch grid would be H/patch_size × W/patch_size, i.e. 16 × 16.

Flattening that grid gives us 16 × 16 = 256 patches. In the manual approach we would flatten each patch into a vector of length patch_size² × channels, giving a tensor of shape (batch, 256, 588), and pass it through a Linear layer to embed each token, making it (batch, 256, emb_dim). A conv2d achieves the same goal: (batch, 3, 224, 224) -> (batch, emb_dim, 16, 16), which we then rearrange to (batch, 256, emb_dim).
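To make the equivalence concrete, here is a minimal sketch (my own, not from the article’s repository) that builds the patch embedding both ways and checks that the output shapes match. The sizes are just the 224 / patch_size = 14 example above.

import torch
from torch import nn
from einops.layers.torch import Rearrange

B, C, H, W = 2, 3, 224, 224
patch_size, emb_dim = 14, 768
x = torch.randn(B, C, H, W)

# Approach 1: cut out patches, flatten them, then project with a Linear layer
manual = nn.Sequential(
    # (B, C, H, W) -> (B, num_patches, patch_size*patch_size*C)
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
    nn.Linear(patch_size * patch_size * C, emb_dim),
)

# Approach 2: a strided conv does the cutting and the projection in one shot
conv = nn.Sequential(
    nn.Conv2d(C, emb_dim, kernel_size=patch_size, stride=patch_size),
    Rearrange('b e h w -> b (h w) e'),
)

print(manual(x).shape)  # torch.Size([2, 256, 768])
print(conv(x).shape)    # torch.Size([2, 256, 768])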

Class Token

In a Vision Transformer, the image is divided into non-overlapping patches, and each patch is treated as a token. The class token is a learnable token that is prepended to the sequence of patch tokens and serves as a representative of the entire image. During training, the class token accumulates information about the global context of the image and allows the model to capture relationships between different patches.

new sequence length = 1 (class token) + n (number of patch tokens)

Positional Embedding

The position embedding is added to every token in the input (including the class token) so that the model has some information about where each token sits relative to the others.

The simplest way to pass positional information you can think of is to just number the tokens with a plain linear index. That naive encoding turns out to be a poor way to represent the positional information of the tokens, which is why we use sine-cosine positional embeddings instead. There is plenty of good content and video material on why this works better, so I won’t get into the nitty-gritty details here.

However, this is the standard formula for the positional embedding:
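In the notation of “Attention Is All You Need”, for position pos and embedding dimension index 2i (even) or 2i + 1 (odd) of a d-dimensional embedding:

PE(pos, 2i)     = sin(pos / 10000^(2i / d))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d))

The function below implements exactly this with a plain double loop.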

import numpy as np

def PositionEmbedding(seq_len, emb_size):
    embeddings = torch.ones(seq_len, emb_size)
    for i in range(seq_len):
        for j in range(emb_size):
            # Even dimensions use sine, odd dimensions use cosine
            embeddings[i][j] = np.sin(i / (pow(10000, j / emb_size))) if j % 2 == 0 else np.cos(i / (pow(10000, (j - 1) / emb_size)))
    return embeddings

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768, img_size: int = 224):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange('b e h w -> b (h w) e'),
        )

        # Learnable class token, one per model, broadcast over the batch in forward()
        self.cls_token = nn.Parameter(torch.rand(1, 1, emb_size))
        # One position vector per token: (img_size // patch_size)**2 patches + 1 class token
        self.pos_embed = nn.Parameter(PositionEmbedding((img_size // patch_size) ** 2 + 1, emb_size))

    def forward(self, x: Tensor) -> Tensor:
        b, _, _, _ = x.shape
        x = self.projection(x)                        # (b, n, emb_size)

        # Repeat the class token for every image in the batch
        cls_token = repeat(self.cls_token, '() s e -> b s e', b=b)

        x = torch.cat([cls_token, x], dim=1)          # (b, n + 1, emb_size)

        x = x + self.pos_embed
        return x

self.cls_token: We create a tensor of shape (1, 1, emb_size) and then use einops’ repeat function in forward() to expand it to (batch_size, 1, emb_size).

self.pos_embed: The PositionEmbedding function returns a tensor of shape (n + 1, emb_size), where n is the number of patches and the extra row accounts for the class token.

nn.Parameter: A PyTorch class that registers the wrapped tensor as a learnable parameter of the module (here, PatchEmbedding) so that the optimizer can update it during training.

  1. The input image x (shape: B, C, H, W) is passed through the projection layer, which returns a tensor of shape (B, n, emb_size), where the number of patches n = (img_size / patch_size)².
  2. Then the cls_token is concatenated to x (shape: B, n + 1, emb_size).
  3. Finally, we add the position embedding to x. A quick shape check is shown below.
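As a sanity check (this snippet is mine, not from the article’s repository), we can push a random batch through the layer and confirm the shapes match the walkthrough above:

patch_embed = PatchEmbedding(in_channels=3, patch_size=16, emb_size=768, img_size=224)
dummy = torch.randn(4, 3, 224, 224)    # (B, C, H, W)
tokens = patch_embed(dummy)
print(tokens.shape)                    # torch.Size([4, 197, 768]) -> 14*14 patches + 1 class token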

Multihead Self Attention

The Multihead Attention used in ViT is the same as the one used in the encoder module of the original transformer architecture.

What we want here is a way for the tokens in the input x (shape: B, N, emb_size) to communicate with each other and figure out how strongly each token is related to every other token. The output should be a tensor of the same shape as x, but one in which every token now carries information about its correlation with the other tokens.

That’s why we start by creating three projections of x: Query, Key, and Value, each with the same shape as x, so every token gets its own query, key, and value vector. Intuitively, the query is the information a token is looking for, the key represents the information a token holds, and the value is the information the token actually hands over once it is attended to.

class MultiHead(nn.Module):
    def __init__(self, emb_size, num_head):
        super().__init__()
        self.emb_size = emb_size
        self.num_head = num_head
        self.key = nn.Linear(emb_size, emb_size)
        self.value = nn.Linear(emb_size, emb_size)
        self.query = nn.Linear(emb_size, emb_size)
        self.att_dr = nn.Dropout(0.1)

    def forward(self, x):
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)

But in multi-head attention we want to split the embedding into several heads and perform attention in each head separately. So we use einops’ rearrange to divide each token’s embedding into num_head equal parts.

class MultiHead(nn.Module):
    def __init__(self, emb_size, num_head):
        super().__init__()
        self.emb_size = emb_size
        self.num_head = num_head
        self.key = nn.Linear(emb_size, emb_size)
        self.value = nn.Linear(emb_size, emb_size)
        self.query = nn.Linear(emb_size, emb_size)
        self.att_dr = nn.Dropout(0.1)

    def forward(self, x):
        # Split each emb_size-dim token into num_head heads of size emb_size / num_head
        k = rearrange(self.key(x), 'b n (h e) -> b h n e', h=self.num_head)
        q = rearrange(self.query(x), 'b n (h e) -> b h n e', h=self.num_head)
        v = rearrange(self.value(x), 'b n (h e) -> b h n e', h=self.num_head)

Here h is num_head, the number of equal parts the embedding is split into, so each head works with vectors of size emb_size / num_head.

  • Attention scores are calculated by taking the dot product of the Query matrix (Q) and the transpose of the Key matrix (K). The result is divided by the square root of the dimension of the key vectors.
  • This step computes how much each element in the sequence should attend to every other element.
  • The attention scores are passed through a softmax activation function. This converts the scores into probabilities, ensuring that the weights assigned to different positions sum to 1.
  • The softmax-normalized attention scores are used to compute a weighted sum of the Value matrix (V). This step emphasizes the important parts of the input sequence based on the attention scores; all four steps are summarized by the formula below.
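Put together, these steps are the standard scaled dot-product attention from “Attention Is All You Need”, where d_k = emb_size / num_head is the per-head dimension:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V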
import torch.nn.functional as F

class MultiHead(nn.Module):
    def __init__(self, emb_size, num_head):
        super().__init__()
        self.emb_size = emb_size
        self.num_head = num_head
        self.key = nn.Linear(emb_size, emb_size)
        self.value = nn.Linear(emb_size, emb_size)
        self.query = nn.Linear(emb_size, emb_size)
        self.att_dr = nn.Dropout(0.1)

    def forward(self, x):
        k = rearrange(self.key(x), 'b n (h e) -> b h n e', h=self.num_head)
        q = rearrange(self.query(x), 'b n (h e) -> b h n e', h=self.num_head)
        v = rearrange(self.value(x), 'b n (h e) -> b h n e', h=self.num_head)

        # Attention scores, scaled by the square root of the per-head key dimension
        head_dim = self.emb_size // self.num_head
        wei = q @ k.transpose(3, 2) / head_dim ** 0.5   # (b, h, n, n)
        wei = F.softmax(wei, dim=-1)                    # normalize over the keys
        wei = self.att_dr(wei)

        # Weighted sum of the values
        out = wei @ v                                   # (b, h, n, e)

        # Merge the heads back into one emb_size-dim vector per token
        out = rearrange(out, 'b h n e -> b n (h e)')
        return out

Here is the final version of the MultiHead class. We split the embedding across the heads, compute the attention-weighted values, and then merge the heads back so the output has the same shape as the input.
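To make sure the shapes line up (again, this check is mine, not from the article):

mha = MultiHead(emb_size=768, num_head=6)
tokens = torch.randn(4, 197, 768)   # (B, N, emb_size), e.g. the PatchEmbedding output
out = mha(tokens)
print(out.shape)                    # torch.Size([4, 197, 768]) -> same shape as the input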

Transformer Encoder

The Transformer encoder block consists of layer normalization, two skip connections, a multi-layer perceptron (MLP), and multi-head self-attention (MHSA).

Creating a Feed Forward Network

class FeedForward(nn.Module):
    def __init__(self, emb_size):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(emb_size, 4 * emb_size),
            nn.GELU(),   # non-linearity between the two projections, as in the ViT MLP
            nn.Linear(4 * emb_size, emb_size),
        )

    def forward(self, x):
        return self.ff(x)

The Encoder Block

class Block(nn.Module):
    def __init__(self, emb_size, num_head):
        super().__init__()
        self.att = MultiHead(emb_size, num_head)
        self.ll = nn.LayerNorm(emb_size)
        self.dropout = nn.Dropout(0.1)
        self.ff = FeedForward(emb_size)

    def forward(self, x):
        # Pre-norm -> attention -> dropout -> skip connection; shapes stay (b, n, emb_size)
        x = x + self.dropout(self.att(self.ll(x)))
        x = x + self.dropout(self.ff(self.ll(x)))
        return x

Breaking it Down

  1. We pass an input x of shape (B, N, emb_dim)
  2. We apply a layer norm to the last dimension of the x (embeddings) for the sake of stability and pass it down to the Multihead Attention layer.
  3. We apply a dropout (standard for preventing overfitting). Now the output dimension is still the same as the input x (B, N, emb_dim)
  4. We now add a skip connection by adding x to the output of self.att.
  5. We again apply a layer norm to x and this time pass it to the feed-forward network, right after the skip connection. The intuition behind the MLP is roughly: “each token has now gathered information from the other tokens through attention, and the MLP lets every token process what it has collected on its own.”
  6. We then apply our final skip connection and return x. A quick shape check follows below.
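A quick check that one encoder block preserves the token shape (my own snippet, not from the article):

block = Block(emb_size=768, num_head=6)
tokens = torch.randn(4, 197, 768)
print(block(tokens).shape)   # torch.Size([4, 197, 768]) -> the block keeps (B, N, emb_size)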

Vision Transformer

Quoting Optimus Prime (because you know… it’s Transformers ;)

“We are here.” Finally, we put it all together in a single module.

class VisionTransformer(nn.Module):
    def __init__(self, num_layers, img_size, emb_size, patch_size, num_head, num_class):
        super().__init__()
        self.attention = nn.Sequential(*[Block(emb_size, num_head) for _ in range(num_layers)])
        self.patchemb = PatchEmbedding(patch_size=patch_size, emb_size=emb_size, img_size=img_size)
        self.ff = nn.Linear(emb_size, num_class)

    def forward(self, x):                  # x -> (b, c, h, w)
        embeddings = self.patchemb(x)      # (b, (h/patch_size * w/patch_size) + 1, emb_size)
        x = self.attention(embeddings)
        x = self.ff(x[:, 0, :])            # classify from the class token only
        return x

We instantiate the VisionTransformer model. The embedding size of 768 follows ViT-Base from the paper, while the number of layers and heads here is scaled down a bit:

device = 'cuda' if torch.cuda.is_available() else 'cpu'

num_layers = 8
emb_size = 768
num_head = 6
num_class = 10
patch_size = 16

model = VisionTransformer(num_layers=num_layers,
                          img_size=224,
                          emb_size=emb_size,
                          patch_size=patch_size,
                          num_head=num_head,
                          num_class=num_class).to(device)
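As a final sanity check (my own snippet), a dummy batch should come out as one logit vector per image:

imgs = torch.randn(2, 3, 224, 224).to(device)
logits = model(imgs)
print(logits.shape)   # torch.Size([2, 10]) -> one score per class for each image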

If you liked my work, consider a clap or Follow. You can also check the entire code on my GitHub repository: https://github.com/mishra-18/ML-Models/blob/main/Vission%20Transformers/vit.py

Thanks.
