Swin Transformer and ResNet-50 for object detection and segmentation

Yuanzhe Liu
Jan 10, 2023 · 8 min read


Object detection and segmentation are essential topics in computer vision, and the Transformer has become one of the major approaches for processing images. In this post, I will cover the usage and implementation of the Swin Transformer for object detection and segmentation. To provide a point of comparison for the Swin Transformer's results, I also used ResNet-50.

Here, I use a two-stage training pipeline:

First, I trained the Swin Transformer as a backbone using the self-supervised learning algorithm MoBY. Similarly, I trained the ResNet-50 as a backbone using the self-supervised learning algorithm DINO.

Next, I used each backbone (Swin Transformer or ResNet-50) with Mask R-CNN and a Feature Pyramid Network to perform object detection and segmentation. This step is supervised training.

The whole project is available below:

For the backbone training of the Swin Transformer (this is also the home directory of the project), check here:

For the backbone training of ResNet-50, check here:

For the implementation of Mask R-CNN with the Feature Pyramid Network and the backbones (Swin Transformer and ResNet-50), check here:

It’s time to introduce the datasets and architectures.

I used two datasets: the 2017 ImageNet DET dataset and the 2017 COCO dataset. ImageNet is used for the self-supervised backbone training, and COCO is used for the supervised object detection and segmentation training.

ImageNet:

COCO:

For the architectures, I will introduce the Swin Transformer and Mask R-CNN.

I think of the Swin Transformer as a Transformer built in the fashion of a convolutional neural network: it produces a hierarchy of feature maps at progressively coarser resolutions.

Figure 1: Swin Transformer architecture, from the original paper

The overall architecture is straightforward. First, the input RGB image is split into non-overlapping patches. Each patch covers 4 × 4 pixels, so its raw feature size is 4 × 4 × 3 = 48, much smaller than the 16 × 16 × 3 patches in ViT. The architecture then has four stages. Stage 1 consists of a linear embedding layer and Swin Transformer blocks; the linear embedding layer projects the raw patch features to an arbitrary dimension (C in Figure 1). Stages 2, 3, and 4 each consist of a patch merging layer and Swin Transformer blocks.
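To make the patch partition and linear embedding concrete, here is a minimal PyTorch sketch. It is not the official implementation; the class name, default dimensions, and the use of a strided convolution to fuse the two steps are my own illustrative choices.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and embed them to dim C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A strided convolution is equivalent to cutting out 4 x 4 x 3 = 48-dim
        # patches and applying a linear projection to C (= embed_dim).
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, C)

# Example: a 224 x 224 RGB image becomes 56 * 56 = 3136 patch tokens of dim 96.
tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```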

Now, there are two questions. What is a patch merging layer? What is a Swin Transformer Block?

A patch merging layer merges each group of neighboring patches by concatenating their features. The merging layer in "Stage 2" concatenates the features of each group of 2 × 2 neighboring patches and applies a linear layer to the 4C-dimensional concatenated features (from the Swin Transformer paper), reducing the dimension to 2C. "Stage 3" and "Stage 4" are obtained with merging layers and Swin Transformer blocks that work the same way.
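Below is a minimal sketch of what a patch merging layer does, again with illustrative names rather than the official code: it gathers each 2 × 2 neighborhood of patches, concatenates their features into a 4C-dimensional vector, and applies a linear layer that reduces them to 2C.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        # Linear layer applied to the 4C-dimensional concatenated features;
        # the paper reduces the output dimension to 2C.
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                  # x: (B, H*W, C)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)
        return self.reduction(self.norm(x))      # (B, H/2 * W/2, 2C)
```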

Next, from the paper: the Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, with the other layers kept the same. As illustrated in Figure 1(b), a Swin Transformer block consists of a shifted-window-based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
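The block structure described above can be sketched as follows. The window attention module itself is left as a placeholder argument, since the point here is only the LayerNorm, MLP, and residual wiring; the names are illustrative, not the official implementation.

```python
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """LN -> (shifted) window MSA -> residual, then LN -> 2-layer GELU MLP -> residual."""
    def __init__(self, dim, window_attn, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = window_attn                 # W-MSA or SW-MSA module (placeholder)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (B, num_tokens, C)
        x = x + self.attn(self.norm1(x))        # LayerNorm before MSA, residual after
        x = x + self.mlp(self.norm2(x))         # LayerNorm before MLP, residual after
        return x
```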

Figure 2: shifted window approach for computing self-attention in the Swin Transformer Architecture. Image from the original paper

The shifted window is the approach used for computing self-attention in the Swin Transformer architecture; it enables the network to learn neighboring information across window boundaries. In Figure 2, at layer l, there are four red windows, each containing 16 patches. If every layer had the same window layout as layer l, the network could not learn the relationships between neighboring patches that fall into different windows. So, at layer l+1 in Figure 2, the windows are shifted, and the network can learn different neighboring information.
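Here is a small sketch of how the regular and shifted window partitions can be produced, assuming an 8 × 8 grid of patches and a window size of 4 as in Figure 2 (the helper name and the use of torch.roll for the shift are illustrative, not the exact official implementation):

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) patch grid into (num_windows * B, ws, ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Illustrative feature map: an 8 x 8 grid of patch tokens with dimension 96.
x = torch.randn(1, 8, 8, 96)

# Layer l: regular partition -> 4 windows of 4 x 4 = 16 patches each.
regular = window_partition(x, window_size=4)

# Layer l+1: shift the grid by half a window before partitioning, so patches
# that were separated by a window border now attend to each other.
shifted = window_partition(torch.roll(x, shifts=(-2, -2), dims=(1, 2)),
                           window_size=4)
print(regular.shape, shifted.shape)  # both: torch.Size([4, 4, 4, 96])
```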

For more information about the Swin Transformer, I wrote a separate Medium story about it; please check the link below. That story is also the reference for the introduction above.

Here is the link to the original Swin Transformer paper:

Next, let’s look at Mask R-CNN.

Mask R-CNN is similar to Faster R-CNN. The difference is that Mask R-CNN has an additional branch to predict an object mask, while Faster R-CNN has only two branches, which predict the class and the bounding-box offset, respectively.
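To make the branch structure concrete, here is a rough sketch of the per-RoI heads. The layer sizes and names are illustrative assumptions, not the exact Mask R-CNN configuration.

```python
import torch.nn as nn

class RoIHeadsSketch(nn.Module):
    """Class and box branches (as in Faster R-CNN) plus the extra mask branch."""
    def __init__(self, in_dim=1024, num_classes=80):
        super().__init__()
        self.cls_branch = nn.Linear(in_dim, num_classes + 1)   # class scores (+ background)
        self.box_branch = nn.Linear(in_dim, num_classes * 4)   # bounding-box offsets
        # The branch unique to Mask R-CNN: a small FCN that predicts a
        # per-class binary mask for each RoI.
        self.mask_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, roi_vec, roi_map):
        # roi_vec: (N, in_dim) pooled RoI features for the class/box branches
        # roi_map: (N, 256, 14, 14) RoIAlign feature maps for the mask branch
        return (self.cls_branch(roi_vec),
                self.box_branch(roi_vec),
                self.mask_branch(roi_map))
```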

Figure 3: Mask R-CNN structure, from https://paperswithcode.com/method/mask-r-cnn
Figure 4: RoIAlign, from the original paper

To achieve the pixel-to-pixel alignment that Faster R-CNN lacks, Mask R-CNN uses RoIAlign instead of the RoI Pooling used in Faster R-CNN. For each sampling point, RoIAlign computes a bilinear interpolation from the four nearest grid points on the feature map. This avoids quantizing the RoI boundaries or bins (e.g., rounding x/16) and thus achieves pixel-to-pixel alignment.
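The difference is easy to see with torchvision's ready-made ops; the feature map size, RoI coordinates, and spatial scale below are made-up example values.

```python
import torch
from torchvision.ops import roi_align, roi_pool

# One 256-channel feature map and one RoI given in image coordinates
# (batch_index, x1, y1, x2, y2); spatial_scale = 1/16 maps image coordinates
# onto this stride-16 feature map.
features = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0.0, 87.3, 41.9, 430.6, 312.2]])

# RoI Pooling quantizes the RoI boundaries and bins (effectively rounding x/16).
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)

# RoIAlign keeps floating-point coordinates and bilinearly interpolates each
# sampling point from the four nearest grid points, so nothing is quantized.
aligned = roi_align(features, rois, output_size=(7, 7),
                    spatial_scale=1 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)  # both: torch.Size([1, 256, 7, 7])
```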

Here are the reference and the original Mask R-CNN paper:

For the implementation of Mask R-CNN, OpenMMLab has a nice implementation; I used the MMDetection package to complete the tasks. Feel free to check out the official MMDetection GitHub here, or you can check out my GitHub repo posted at the beginning.
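For a sense of how this fits together in MMDetection, here is a rough sketch of a config. The base config paths, field names, and checkpoint filename are illustrative and may not match the exact MMDetection version or the files in my repo; see the repos linked above for the configs that were actually used.

```python
# Illustrative MMDetection-style config: Mask R-CNN + FPN with a Swin backbone
# initialized from a MoBY self-supervised checkpoint, fine-tuned for 10 epochs.
_base_ = [
    'configs/_base_/models/mask_rcnn_swin_fpn.py',   # model: Mask R-CNN + FPN + Swin
    'configs/_base_/datasets/coco_instance.py',      # data: COCO boxes + masks
    'configs/_base_/schedules/schedule_1x.py',       # optimizer and LR schedule
    'configs/_base_/default_runtime.py',
]

model = dict(
    backbone=dict(
        # Load the self-supervised (MoBY) backbone weights as initialization.
        init_cfg=dict(type='Pretrained', checkpoint='moby_swin_t_backbone.pth'),
    )
)

# 10 epochs of supervised training, matching the setting described below.
runner = dict(type='EpochBasedRunner', max_epochs=10)
```

Training is then launched with MMDetection's tools/train.py script pointing at this config file.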

Results:

For fairness, both the Swin Transformer and ResNet-50 were trained for 100 epochs of self-supervised training and 10 epochs of supervised training.

For Swin Transformer:

Bounding box (10 epochs of training)

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.421

Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.644

Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.461

Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.261

Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.453

Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.557

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.558

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.558

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.558

Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.376

Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.592

Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.708

Segmentation

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.390

Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.615

Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.420

Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.198

Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.417

Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.570

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.521

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.521

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.521

Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.338

Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.556

Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.680

Figure 5: Mask R-CNN + FPN + Swin Transformer, images by author

For ResNet-50:

Bounding Box

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.362

Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.562

Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.395

Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.204

Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.395

Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.468

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.516

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.516

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.516

Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.325

Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.555

Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.647

Segmentation

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.331

Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.533

Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.350

Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.149

Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.355

Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.481

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.470

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.470

Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.470

Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.279

Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.510

Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.610

Figure 6: Mask R-CNN + FPN + ResNet-50, images by author

With the same training budget, the Swin Transformer backbone performs better than ResNet-50 on both tasks: 42.1 vs. 36.2 box AP and 39.0 vs. 33.1 mask AP (IoU=0.50:0.95, all areas).

Thank you for reading.
