Swin Transformer 🚀: Hierarchical Vision Transformer using Shifted Window — Part II

Microsoft Research, ICCV’21 - 🏆 Marr Prize (Best Paper)

Momal Ijaz
AIGuys
4 min read · Feb 25, 2022


This article covers the third paper in the “Transformers in Vision” series, which summarizes recent papers on transformers in vision submitted to top conferences between 2020 and 2022.

*NerdFacts 🤓 contain additional intricate details; you can skip them and still get the high-level flow of the paper!

1. SWIN Transformer🚀 Background

In Part I of the SWIN Transformer🚀 article, we covered the background, the comparison with ViT and DeiT, and a detailed review of the architecture. If you haven’t read it yet, it’s worth a read, but the high-level idea of the SWIN Transformer is simple: it is a general-purpose backbone from Microsoft Research that aims to replace convolution-based backbones for computer vision tasks like semantic segmentation, object detection, and image classification.

SWIN Transformer 🚀 Architecture — inspired heavily from a typical Convolutional network
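To make the ConvNet-style hierarchy concrete, here is a minimal sketch (assuming torchvision ≥ 0.13, which ships a Swin implementation; the internal layer layout is version-dependent) that runs a dummy image through a Swin-T backbone and prints the feature-map size after each stage:

```python
import torch
from torchvision.models import swin_t

# Swin-Tiny backbone; weights=None skips the download since we only look at shapes.
model = swin_t(weights=None).eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = x
    # In torchvision, model.features alternates patch-embedding/merging layers
    # (even indices) with the four Swin stages (indices 1, 3, 5, 7).
    for i, layer in enumerate(model.features):
        out = layer(out)
        if i in (1, 3, 5, 7):
            # torchvision keeps Swin activations channels-last: (B, H, W, C).
            print(f"stage {i // 2 + 1}: {tuple(out.shape)}")

# Expected: 56x56x96 -> 28x28x192 -> 14x14x384 -> 7x7x768, i.e. a
# ConvNet-like pyramid with shrinking resolution and growing channel count.
```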

In Part II of SWIN Transformer🚀, we will look at how well SWIN performs as a new backbone across different computer vision tasks. So let’s dive in!

2. SWIN Transformer Performance💡

The performance of SWIN was analyzed on three tasks (object detection, semantic segmentation, and image classification) using benchmark datasets. SWIN comes in three main size variants, detailed below.

SWIN Transformer🚀 variants
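For quick reference, the variants compared in this post differ mainly in channel width, the depth of stage 3, and parameter count. A small config table in code (values as reported in the Swin paper, parameter counts approximate) keeps that at hand:

```python
# Rough summary of the Swin variants discussed in this post.
# embed_dim = channels of stage 1, depths = blocks per stage,
# heads = attention heads per stage, params_m = approx. model size in millions.
SWIN_VARIANTS = {
    "Swin-T": {"embed_dim": 96,  "depths": (2, 2, 6, 2),  "heads": (3, 6, 12, 24), "params_m": 28},
    "Swin-S": {"embed_dim": 96,  "depths": (2, 2, 18, 2), "heads": (3, 6, 12, 24), "params_m": 50},
    "Swin-B": {"embed_dim": 128, "depths": (2, 2, 18, 2), "heads": (4, 8, 16, 32), "params_m": 88},
}
```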

2.1 Object Detection 📦🏷️

To evaluate the SWIN Transformer backbone for object detection, the authors used the standard benchmark dataset COCO. For comparison with state-of-the-art (SOTA) networks, they chose well-known ConvNets such as ResNeXt, YOLOv4, GCNet, and Copy-Paste, as well as transformer-based networks such as DeiT.

SWIN Transformer 🚀 Object Detection

As we can see, SWIN was able to outperform well-known convolutional object detection networks by roughly a point on average. The FLOPs-to-accuracy trade-off is not provided in the results, and it is hard to compare anyway because of the diverse range of chosen networks. I wish the authors had compared their work with YOLOv5, which is better and faster than previous YOLOs.
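Since the paper drops Swin into existing detection frameworks purely as a backbone, one way to picture the wiring is to feed its four stage outputs into a feature pyramid. The sketch below is a hypothetical, simplified neck built with torchvision’s FeaturePyramidNetwork, not the authors’ actual mmdetection setup:

```python
import torch
from torch import nn
from torchvision.models import swin_t
from torchvision.ops import FeaturePyramidNetwork

class SwinFPNBackbone(nn.Module):
    """Toy detector backbone: Swin-T stages feeding an FPN neck (illustrative only)."""

    def __init__(self):
        super().__init__()
        # Pretrained weights could be loaded here; omitted to keep the sketch light.
        self.body = swin_t(weights=None).features
        # Channel widths of the four Swin-T stages: 96, 192, 384, 768.
        self.fpn = FeaturePyramidNetwork([96, 192, 384, 768], out_channels=256)

    def forward(self, x):
        feats = {}
        for i, layer in enumerate(self.body):
            x = layer(x)
            if i in (1, 3, 5, 7):                    # end of each Swin stage
                # Swin activations are (B, H, W, C); the FPN expects (B, C, H, W).
                feats[f"p{len(feats) + 2}"] = x.permute(0, 3, 1, 2)
        return self.fpn(feats)

pyramid = SwinFPNBackbone()(torch.randn(1, 3, 224, 224))
for name, f in pyramid.items():
    print(name, tuple(f.shape))   # p2..p5, all with 256 channels
```

A detection head (Cascade Mask R-CNN in the paper) would then operate on this pyramid exactly as it does with a ResNet backbone.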

2.2 Image Classification 🏞

Image classification is analyzed for two versions of the SWIN backbone. The first version is trained on the ImageNet-1K dataset; results are given below.

SWIN Transformer🚀 with less data

SWIN Transformer performed quite well on image classification; we can see it outperformed DeiT and ViT with comparable or fewer parameters. For example, SWIN-B is comparable to DeiT-B in parameter count, but outperforms it by about 3 points in accuracy. SWIN also outperformed convolutional image classification backbones such as EfficientNet and RegNet, not just transformer-based networks. However, it is noticeable that EfficientNet comes quite close to SWIN-B with fewer parameters: SWIN-B reaches 84.5% top-1 accuracy with 88M params, while EfficientNet reaches 84.3% with 66M parameters. In other words, the SWIN backbone needs roughly 22M more parameters to match the performance of a convolutional network.

The second image classification variant of SWIN is pre-trained on ImageNet-22K first and then fine-tuned on ImageNet-1K; results are given below.

SWIN Transformer🚀 pre-trained on ImageNet-22K

This variant of SWIN is first pre-trained on ImageNet-22K, fine-tuned on ImageNet-1K, and then evaluated on ImageNet-1K. We can see that the pre-trained models got a boost of 2–3 points in accuracy and outperformed the convolutional ResNet and ViT.
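The recipe is essentially: take a backbone pre-trained on the larger label set, swap the classifier head, and fine-tune the whole network with a small learning rate (the paper fine-tunes with AdamW at a learning rate of 1e-5 and weight decay of 1e-8). Below is a hedged sketch with torchvision, using its ImageNet-1K checkpoint as a stand-in for the ImageNet-22K weights, which would instead come from the official Swin release or timm:

```python
import torch
from torch import nn
from torchvision.models import swin_b, Swin_B_Weights

# Stand-in for a pre-trained checkpoint: torchvision ships ImageNet-1K weights;
# the ImageNet-22K weights used in the paper would be loaded separately.
model = swin_b(weights=Swin_B_Weights.IMAGENET1K_V1)

# Swap the classification head for the target label set
# (1000 classes for ImageNet-1K; any downstream dataset works the same way).
model.head = nn.Linear(model.head.in_features, 1000)

# Fine-tune the entire network, not just the head, with a small learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-8)
criterion = nn.CrossEntropyLoss()

# Dummy batch to show one fine-tuning step.
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 1000, (2,))

loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```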

2.3 Semantic Segmentation 🍒

SWIN was evaluated for semantic segmentation on ADE20K, the standard benchmark dataset for this task.

SWIN Transformer🚀 ADE20K semantic segmentation

For this task, the authors attached a segmentation head on top of the SWIN backbone to analyze the pixel-level semantic understanding extracted by the model. The method used to assemble the features extracted from the backbone (a.k.a. the neck) was UperNet. We can see that SWIN outperformed DeiT-S by a whopping 2 points while having 8M more parameters. Among convolutional backbones, DeepLabv3+ comes quite close to SWIN-S in terms of parameters, but SWIN is able to outperform it by 1 point.
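To get a feel for what “attaching a segmentation head” means, here is a deliberately over-simplified sketch: a one-layer decode head on the last Swin stage, upsampled to per-pixel class logits for ADE20K’s 150 classes. The paper’s actual neck is UperNet (via mmsegmentation), which fuses features from all four stages, so treat this only as an illustration:

```python
import torch
from torch import nn
import torch.nn.functional as F
from torchvision.models import swin_t

class TinySwinSegmenter(nn.Module):
    """Illustrative only: Swin-T backbone + a one-layer decode head
    (the paper uses UperNet, which fuses all four stages)."""

    def __init__(self, num_classes: int = 150):   # ADE20K has 150 classes
        super().__init__()
        self.backbone = swin_t(weights=None).features
        self.decode_head = nn.Conv2d(768, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.backbone(x)                   # (B, H/32, W/32, 768), channels-last
        feats = feats.permute(0, 3, 1, 2)          # -> (B, 768, H/32, W/32)
        logits = self.decode_head(feats)
        # Upsample back to the input resolution for per-pixel predictions.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

seg = TinySwinSegmenter()
out = seg(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 150, 224, 224])
```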

We can conclude that SWIN is a strong backbone from Microsoft Research; it offers a new way to induce inductive biases in Transformers so that they perform well in the vision domain, and it succeeds at that. The backbone was proposed for images, and a follow-up variant called Video SWIN was later introduced for performing these tasks on videos.

I believe SWIN is a great approach for capturing inductive biases in spatial data, but it could be improved by not being purely transformer-based. If we added some convolutional layers to the architecture along with this spatial attention, I think that would make the backbone stronger.

Brainstorm one way you would make this backbone stronger, or find a weakness in the architecture: something you would have done differently if you were on the team ….🧐 Think!

Happy Learning! ❤️
