Swin Transformer V1 and V2 — Best Vision Models Are Not CNN-based

🔥 Revolutionary computer vision networks that are NOT based on CNNs. V2 further improved V1 and beat SOTA networks on accuracy and speed. Fully explained.

Leo Wang
5 min read · Oct 8, 2022

Table of Contents

· 💡 Why Transformers for Computer Vision?
· 🔥 Swin Transformer (V1)
Procedure
📖 Compared to Vision Transformer
· 🔥 Swin Transformer (V2)
Unstable Training
Pretrain and Train Image Size Discrepancy
· Models’ Stats
· References

💡 Why Transformers for Computer Vision?

Transformers have been widely used in Natural Language Processing (NLP) tasks, and recently they have also been widely applied to Computer Vision tasks, because they model the global, long-range relationships and semantic information of an image better than CNNs, whose operations are more localized.

🔥 Swin Transformer (V1)

Liu et al. proposed the Swin (Shifted WINdow) Transformer in 2021, a general-purpose Transformer adapted to computer vision tasks (general-purpose meaning it suits different CV tasks such as semantic segmentation and image classification). It achieved better performance than the popular ViT (Vision Transformer, by Google) and CNN-based self-attention architectures.

Why is it better? The answer essentially lies in its innovative “Shifted Window” design, as shown in Fig. 1. It is the basic building block of Swin Transformer.

Fig. 1: An illustration of the shifted window design (from Swin Transformer paper).

Procedure

Firstly, for an input image (or a feature map), the network creates local windows. By default, the window’s side length is 1/4 of that of the image (as shown by the red boxes in Fig. 1).

Secondly, each window is further partitioned into smaller patches. By default a patch covers 4×4 pixels, so its raw feature is a 4×4×3 = 48-dimensional vector (3 being the number of channels in an RGB image). Each patch is treated as a “token” (in NLP, for example, the tokens of the sentence “I disliked games” could be [‘I’, ‘dis’, ‘liked’, ‘games’]). A linear embedding layer then converts each token into an embedding vector the Transformer can process, as sketched below.
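As a rough illustration, here is a minimal PyTorch sketch of this patch-partition + linear-embedding step. The `PatchEmbed` name is mine, not the official code; the 4×4 patch size and 96-dimensional embedding are the Swin-T defaults, and the strided convolution is a common shortcut equivalent to “cut into 4×4 patches, flatten the 48 raw values, apply a linear layer”:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 4x4 patches and linearly embed each patch."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A stride-4, kernel-4 convolution = non-overlapping 4x4 patches + linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, 96, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, 96): one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 3136, 96])
```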

Fig. 2: Two successive Swin Transformer blocks (from Swin Transformer paper).

Thirdly, the embeddings are fed into two successive Swin Transformer blocks, as shown in Fig. 2. There are several abbreviated modules in the figure (a sketch of how they are wired together follows the list):

  • LN: Layer Normalization.
  • W-MSA: Window-based multi-head self-attention (attention computed within regular, non-overlapping windows).
  • SW-MSA: Shifted-window multi-head self-attention.
  • MLP: Multi-layer perceptron (multiple fully connected layers).
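Below is a minimal, hedged sketch of how these modules are chained inside the two successive blocks of Fig. 2. Only the wiring (pre-norm residual connections) is shown; the window and shifted-window attention modules are passed in as arguments, and all names are illustrative rather than the official implementation:

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Structure of the two successive blocks in Fig. 2 (layers l and l+1)."""
    def __init__(self, dim, w_msa, sw_msa):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ln3, self.ln4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa, self.sw_msa = w_msa, sw_msa   # window / shifted-window attention
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, num_tokens, dim)
        x = x + self.w_msa(self.ln1(x))        # block l:   LN -> W-MSA  -> residual
        x = x + self.mlp1(self.ln2(x))         #            LN -> MLP    -> residual
        x = x + self.sw_msa(self.ln3(x))       # block l+1: LN -> SW-MSA -> residual
        x = x + self.mlp2(self.ln4(x))         #            LN -> MLP    -> residual
        return x
```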

For the SW-MSA module, which is also the layer (l+1) operation shown on the right of Fig. 1, a different partitioning scheme is adopted, as shown in Fig. 3 below.

Fig. 3: Second scheme of Shifted Window operation. Regions with the same letter and coloring (Blue, Green and Yellow) are the same (replicas).

The two window-partitioning schemes work together to introduce connections between neighboring windows, and this is found to be effective in a variety of tasks.
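In practice, the shifted partition of Fig. 3 is obtained by cyclically rolling the feature map and then reusing the regular window-partition code (the replicated regions in Fig. 3 come from this roll). A minimal sketch, assuming a (B, H, W, C) token layout and the default window size of 7:

```python
import torch

def cyclic_shift(x, shift_size):
    """Roll tokens so the shifted-window partition can reuse regular windowing.
    x: (B, H, W, C) feature map of patch tokens."""
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)               # 56x56 tokens after patch embedding
shifted = cyclic_shift(x, shift_size=7 // 2) # window size 7 -> shift by 3
# After attention, the map is rolled back with shifts=(+3, +3).
```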

Fig. 4 shows how self-attention is calculated. Compared to regular self-attention, the only difference is the addition of a bias matrix B of shape (m*m, m*m), with m*m being the number of patches in a window.

B encodes the relative positions of the patches, so positional information is also built into the model.

Fig. 4: Self-attention calculation in Swin Transformer
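Written out, with d being the query/key dimension, the computation in Fig. 4 is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,
\qquad B \in \mathbb{R}^{m^2 \times m^2}
```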

📖 Compared to Vision Transformer

Fig. 5 shows a standard Vision Transformer building block. As you can see, a Swin Transformer block and a standard Vision Transformer block are similar in many ways (the “Norm” in Fig. 5 is also LN). Their essential differences are in the self-attention mechanism (Swin Transformer uses shifted windows with a modified self-attention) and in the fact that Swin Transformer alternates between two windowing configurations for an input image.

Fig. 5: A building block for a vision Transformer.

In addition, standard Transformer architectures compute attention between each token (image patch) and ALL other tokens in the whole image, so the cost grows quadratically with the number of tokens and becomes extremely expensive for large images.

⭐️ However, Swin Transformer adopts a Shifted Window design that only computes attention between tokens within the same window, uses the two windowing configurations to establish connections across windows, and shifts the windows so that the whole image is covered. Therefore, computational efficiency is dramatically improved, as the complexity comparison below shows.
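Concretely, for a feature map of h × w patches with channel dimension C and window size M, the Swin paper gives the complexities of global and window-based self-attention as:

```latex
\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C,
\qquad
\Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC
```

Since M is fixed (7 by default), W-MSA scales linearly with the number of patches rather than quadratically.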

🔥 Swin Transformer (V2)

Before going into the details of Swin Transformer V2, we need to understand the disadvantages of Swin Transformer V1.

Fig. 6: Comparison between Swin Transformer V1 and V2.

Unstable Training

This is largely caused by the activations in later layers of the model being significantly larger than those in earlier layers: because the output of each residual block is added directly back to the main branch, activation magnitudes accumulate with depth, resulting in unstable training.

To address the first issue, the Layer Normalization (LN) layer is moved from the front to the back of each residual unit (“res-post-norm”), as shown in Fig. 6. Moreover, scaled cosine attention replaces the dot-product (matrix multiplication) self-attention of V1, making the attention values largely insensitive to the magnitude of the activations.

Fig. 7: Cosine based self-attention module.

Together, these two solutions to the first problem improved training stability and model accuracy for larger models; a minimal sketch of scaled cosine attention is given below.
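For intuition, here is a minimal single-head sketch of scaled cosine attention; the function name and shapes are mine, and τ is the learnable temperature (kept above 0.01 in the paper):

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, bias, tau):
    """q, k, v: (num_tokens, head_dim); bias: (num_tokens, num_tokens); tau: scalar."""
    q = F.normalize(q, dim=-1)                   # cosine similarity = dot product of
    k = F.normalize(k, dim=-1)                   #   L2-normalized vectors
    attn = q @ k.transpose(-2, -1) / tau + bias  # activation magnitude no longer matters
    return attn.softmax(dim=-1) @ v

q = k = v = torch.randn(49, 32)                  # one 7x7 window -> 49 tokens, head dim 32
out = scaled_cosine_attention(q, k, v, torch.zeros(49, 49), tau=0.1)
```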

Pretrain and Train Image Size Discrepancy

The image sizes used for pre-training and for fine-tuning (later training, usually at a higher resolution) are often quite different. Currently, naive bicubic interpolation of the relative position biases is the widely practiced fix, but it is not sufficient.

To address this issue, the authors replaced the previous linear-spaced coordinates with log-spaced coordinates: the relative coordinates between patches are mapped to log space before being used to compute the position bias. Because of the nature of the log function, this greatly compresses the coordinate range and thus significantly reduces the gap caused by the pre-training/fine-tuning image- and window-size discrepancy.
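Concretely, the linear-spaced relative coordinates (Δx, Δy) are mapped to log space as in the V2 paper:

```latex
\widehat{\Delta x} = \operatorname{sign}(\Delta x)\cdot\log\!\left(1 + |\Delta x|\right),
\qquad
\widehat{\Delta y} = \operatorname{sign}(\Delta y)\cdot\log\!\left(1 + |\Delta y|\right)
```

For example, transferring from an 8×8 to a 16×16 window stretches the largest relative coordinate from 7 to 15 in linear space (an extrapolation of about 1.14× the original range), but only from about 2.08 to 2.77 in log space (about 0.33×).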

In addition, to cope with the high GPU memory consumption that comes with scaling up the network, a series of techniques including the ZeRO optimizer, activation checkpointing, and sequential self-attention computation are implemented (not articulated in this article).

Models’ Stats

Table 1: Model statistics for Swin Transformer V1 and V2, and some popular computer vision Transformers.

As shown, two variants of V1 and three variants of V2 are listed with varying sizes. V2 successfully improved on the performance of V1 and achieved close-to-SOTA results on ImageNet with more efficient computation.

Thank you! ❤️
If you enjoyed this article, please consider giving it some applause! ❤️
