Swin Transformer and ResNet-50 for object detection and segmentation
Object detection and segmentation are essential tasks in computer vision, and Transformers have become one of the major approaches to working with images. In this post, I will walk through the usage and implementation of the Swin Transformer for object detection and segmentation, using ResNet-50 as a baseline for comparison.
Here, I use a two-stage pipeline:
First, I pre-trained the Swin Transformer backbone with the self-supervised learning algorithm MoBY, and likewise pre-trained the ResNet-50 backbone with the self-supervised learning algorithm DINO.
Next, I used each backbone (Swin Transformer or ResNet-50) with Mask R-CNN and a Feature Pyramid Network to perform object detection and segmentation. This second stage is supervised training.
The whole project is available below:
For the backbone training of Swin Transformer and also the home directory of the project, please check out here:
For the backbone training of ResNet-50, please check out here:
For the implementation of Mask R-CNN with Feature Pyramid Network and Backbone (Swin Transformer and ResNet-50), please check out here:
It’s time to introduce the datasets and architectures.
I used two datasets: the 2017 ImageNet DET dataset and the 2017 COCO dataset. ImageNet is used for the self-supervised pre-training, and COCO is used for the supervised detection and segmentation training.
ImageNet:
COCO:
For the architectures, I will introduce the Swin Transformer and Mask R-CNN.
I think of the Swin Transformer as a transformer built in the fashion of a convolutional neural network: it produces a hierarchy of feature maps rather than a single-resolution sequence.
The overall architecture is straightforward. First, the input RGB image is split into non-overlapping patches. Each patch has size 4 * 4 * 3 = 48, much smaller than the patches in ViT, which are 16 * 16 * 3 = 768. Then there are four stages in the architecture. Stage 1 consists of a linear embedding layer and Swin Transformer blocks. The linear embedding layer projects each raw patch feature to an arbitrary dimension (C in Figure 1). Stages 2, 3, and 4 each consist of a patch merging layer and Swin Transformer blocks.
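The patch split and linear embedding can be sketched in a few lines of numpy. This is a toy illustration under my own assumptions (a 224 * 224 image, C = 96 as in Swin-T, and a random matrix standing in for the learned embedding layer), not the actual implementation:

```python
import numpy as np

H, W, P, C = 224, 224, 4, 96          # image size, patch size, embed dim (assumed)
img = np.random.rand(H, W, 3)

# Split into non-overlapping 4x4 patches, flatten each to 4*4*3 = 48 values
patches = img.reshape(H // P, P, W // P, P, 3).swapaxes(1, 2)
patches = patches.reshape(H // P, W // P, P * P * 3)   # -> (56, 56, 48)

# Linear embedding: project each 48-dim patch feature to dimension C
W_embed = np.random.rand(P * P * 3, C)                 # stand-in for learned weights
tokens = patches @ W_embed                             # -> (56, 56, 96)
print(patches.shape, tokens.shape)
```

So a 224 * 224 image becomes a 56 * 56 grid of C-dimensional tokens entering Stage 1.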
Now, there are two questions. What is a patch merging layer? What is a Swin Transformer Block?
A patch merging layer merges each group of neighboring patches by concatenating their features. The merging layer in "Stage 2" concatenates the features of each 2 * 2 group of neighboring patches and applies a linear layer to the 4C-dimensional concatenated features (from the Swin Transformer paper). "Stage 3" and "Stage 4" are obtained with merging layers and Swin Transformer blocks that work the same way.
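As a sketch of what this merging does, here is the 2 * 2 concatenation in numpy, again with a random matrix standing in for the learned linear layer (the real layer maps 4C down to 2C, which is what I assume here):

```python
import numpy as np

H, W, C = 56, 56, 96                  # token grid from Stage 1 (assumed sizes)
x = np.random.rand(H, W, C)

# Concatenate the features of each 2x2 group of neighboring patches -> 4C dims
merged = np.concatenate(
    [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
)                                     # -> (28, 28, 4C)

# Linear layer on the 4C-dim concatenated features, reducing to 2C
W_reduce = np.random.rand(4 * C, 2 * C)   # stand-in for learned weights
y = merged @ W_reduce                     # -> (28, 28, 2C)
print(merged.shape, y.shape)
```

Each stage therefore halves the spatial resolution while doubling the channel dimension, which is what gives the CNN-like hierarchy.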
Next, from the paper: Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, with other layers kept the same. As illustrated in Figure 4(b), a Swin Transformer block consists of a shifted window-based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
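The block structure described in the quote (LN before each module, residual after each module) can be written as two residual sub-layers. The sketch below only shows that structure: `w_msa` and `mlp` are random placeholder maps, not real shifted-window attention, and I use ReLU in place of GELU for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token LayerNorm over the channel dimension
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

C = 96
tokens = np.random.rand(49, C)               # 49 tokens = one 7x7 window (assumed)
W1, W2 = np.random.rand(C, 4 * C), np.random.rand(4 * C, C)

def w_msa(x):                                # placeholder for the (S)W-MSA module
    return x @ np.random.rand(C, C) * 0.01

def mlp(x):                                  # 2-layer MLP (ReLU standing in for GELU)
    return np.maximum(x @ W1, 0) @ W2 * 0.01

# x -> x + MSA(LN(x)) -> x + MLP(LN(x)): LN before each module, residual after
x = tokens + w_msa(layer_norm(tokens))
x = x + mlp(layer_norm(x))
print(x.shape)
```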
The shifted window is the approach to computing self-attention in the Swin Transformer architecture; it lets the network learn different neighborhood information for all the patches. In Figure 2, at layer l there are four red windows, each containing 16 patches, and self-attention is computed only within each window. If every layer used the same partitioning as layer l, the network could never model the relationship between patches that are adjacent but separated by a window boundary. So at layer l+1 in Figure 2, the window partition is shifted, and the network can learn these cross-window neighborhood relations.
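A small numpy example makes the shift concrete. I assume an 8 * 8 grid of patch indices and 4 * 4 windows (so the shift is 4 // 2 = 2, matching the paper's half-window shift); the cyclic shift via `np.roll` is the same trick the official implementation uses for efficient batching:

```python
import numpy as np

M = 4                                  # window size (assumed)
grid = np.arange(64).reshape(8, 8)     # 8x8 grid of patch indices

def windows(x, m):
    # Partition a (h, w) grid into non-overlapping (m, m) windows
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).swapaxes(1, 2).reshape(-1, m, m)

# Layer l: regular partition -> four 4x4 windows
regular = windows(grid, M)

# Layer l+1: cyclically shift the grid by (M//2, M//2), then partition again;
# each new window now mixes patches from previously separate windows
shifted = windows(np.roll(grid, shift=(-M // 2, -M // 2), axis=(0, 1)), M)
print(regular[0], shifted[0], sep="\n")
```

Printing the first window of each layout shows that the shifted window straddles the old window boundary, which is exactly the cross-window connection the architecture needs.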
For more information about the Swin Transformer, I wrote a Medium story about it; please check the link below. That story is also a reference for the introduction above.
Here is the link to the original paper of the Swin Transformer:
Next, it’s about Mask R-CNN.
Mask R-CNN is similar to Faster R-CNN. The difference is that Mask R-CNN adds a branch that predicts an object mask, while Faster R-CNN has only two branches, predicting the class and the bounding-box offset, respectively.
To achieve the pixel-to-pixel alignment that Faster R-CNN lacks, Mask R-CNN uses RoI-Align instead of the RoI Pooling used in Faster R-CNN. For each sample point, RoI-Align computes bilinear interpolation from the four nearest grid points on the feature map. This avoids quantizing RoI boundaries or bins (e.g., rounding x/16) and thus preserves pixel-to-pixel alignment.
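Here is a minimal sketch of that bilinear interpolation at a single sample point, on a toy feature map I made up for illustration. The value at a fractional (y, x) location is a weighted average of the four nearest grid points, so no coordinate is ever rounded:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    # Weighted average of the four nearest grid points around (y, x)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx)
            + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx)
            + feat[y1, x1] * dy * dx)

feat = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 feature map
print(bilinear_sample(feat, 1.5, 2.5))            # -> 8.5, mean of 6, 7, 10, 11
```

RoI-Align averages (or max-pools) several such samples per bin to get each output cell.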
Here is the reference and the original paper of Mask R-CNN:
For the implementation of Mask R-CNN, OpenMMLab provides a nice implementation. I used the mmdetection package to complete the tasks. Feel free to check out the official mmdetection GitHub here, or you can check out my GitHub repo posted at the beginning.
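To give a sense of how the pieces plug together, here is an illustrative fragment in mmdetection's Python config style: Mask R-CNN with an FPN neck on a Swin-T backbone. The field names mirror mmdetection's published Swin configs, but exact keys vary by version, so treat this as a sketch rather than a drop-in config:

```python
# Illustrative mmdetection-style config fragment (values assumed, not drop-in)
model = dict(
    type='MaskRCNN',
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,                    # C for Swin-T
        depths=[2, 2, 6, 2],              # Swin blocks per stage
        num_heads=[3, 6, 12, 24],
        window_size=7,
    ),
    neck=dict(
        type='FPN',
        in_channels=[96, 192, 384, 768],  # stage outputs C, 2C, 4C, 8C
        out_channels=256,
        num_outs=5,
    ),
    # rpn_head / roi_head omitted: the Mask R-CNN base config supplies them
    # when a fragment like this is used with _base_ inheritance.
)
print(model['backbone']['type'], model['neck']['in_channels'])
```

Swapping the backbone for ResNet-50 mainly means changing `backbone` and setting the FPN `in_channels` to ResNet's stage widths.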
Results:
For fairness, both the Swin Transformer and ResNet-50 backbones were trained for 100 epochs in the self-supervised stage and 10 epochs in the supervised stage.
For Swin Transformer:
Bounding Box
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.421
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.644
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.461
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.261
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.453
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.557
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.558
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.558
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.558
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.376
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.592
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.708
Segmentation
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.390
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.615
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.420
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.198
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.417
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.570
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.521
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.521
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.521
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.338
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.556
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.680
For ResNet-50:
Bounding Box
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.362
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.562
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.395
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.204
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.395
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.468
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.516
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.516
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.516
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.325
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.555
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.647
Segmentation
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.331
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.533
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.350
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.149
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.355
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.481
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.279
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.510
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.610
The Swin Transformer backbone performs better than the ResNet-50 backbone on both tasks: 42.1 vs. 36.2 box AP for object detection, and 39.0 vs. 33.1 mask AP for segmentation.
Thank you for reading.