A ConvNet for the 2020s: 👍 or 👎?

Momal Ijaz · Published in AIGuys · 7 min read · Apr 22, 2022

Facebook AI Research Lab, UC Berkeley, CVPR2022

This article is the fifth in the “Transformers in Vision” series, which summarizes recent papers (2020–2022) from top conferences on transformers in vision.

✅ Background

Let me tell you the story of how convolutional neural networks were born. Yann LeCun was a curious postdoctoral researcher in Geoffrey Hinton’s group, where he demonstrated that locally connected neural network architectures are well suited to analyzing visual data. He later joined AT&T Bell Labs and proposed the first convolutional neural network, LeNet, for handwritten digit recognition, which became one of his most cited works and showed the world that neural networks could be applied to real-world problems.

Check the video out at: https://www.youtube.com/watch?v=FwFduRA_L6Q

LeNet was the first widely accepted convolutional neural network. It was followed by the revolutionary AlexNet, which set off a wave of research, and we soon saw a whole family of CNN variants for vision tasks such as classification, detection, and segmentation.

Evolution of Neural architectures in the vision domain

That run lasted until the release of the Vision Transformer (ViT) in 2020. ViT was the first convolution-free architecture that, when trained on large amounts of data, outperformed convolutional state-of-the-art networks on ImageNet classification. After ViT, the research community turned to fixing its potential flaws and unlocking its full “potential”, aiming to confirm the hypothesis that transformers are better than CNNs in vision. Famous works of the transformer era include DeiT, SWIN, etc.

🧐 So what now?

This work is an effort from the Facebook AI Research lab to revive convolutional network research in the computer vision community and to show that CNNs are “old but gold” and still better than transformer-based vision models.

⏭️ ConvNeXt

This paper sets out to modernize the traditional convolutional neural network and make it a strong competitor to advanced vision architectures. The authors start from a plain ResNet-50, compare it with a SWIN transformer, and apply a series of macro and micro changes to modernize it, ending up with a model that outperforms SWIN on ImageNet classification and sets a new SOTA.

Let’s go through those changes one by one.

1. Macro-Design Changes 🎂 :

1.1 🍰 Macro-design Change 1: Changing stage compute ratio

Stage compute ratio comparison between ResNet50 and SWIN-Tiny

The authors noted that the stage compute ratio (the number of blocks per stage) in SWIN-Tiny is 1:1:3:1, whereas in ResNet-50 it is 3:4:6:3. They therefore changed ResNet-50’s stage compute ratio to 3:3:9:3, following SWIN-Tiny’s pattern.

Result: This change improved ImageNet classification accuracy by 0.6% over the plain ResNet-50 baseline, from 78.8% to 79.4%.
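For concreteness, here is a tiny illustrative snippet (plain Python; the variable names are invented for this article, not taken from the official code) showing the change in per-stage block counts:

```python
# Illustrative sketch of the stage-compute-ratio change.
resnet50_depths   = [3, 4, 6, 3]   # blocks per stage in a vanilla ResNet-50
convnext_t_depths = [3, 3, 9, 3]   # blocks per stage after matching SWIN-Tiny's 1:1:3:1 ratio

# More blocks land in stage 3, shifting compute toward the lower-resolution stages.
print(sum(resnet50_depths), sum(convnext_t_depths))  # 16 vs. 18 blocks in total
```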

1.2 🍰 Macro-design Change 2: “Patch-ifying ” stem

The stem cell is the first layer of a neural network that processes the input image. In ResNet, it is a 7x7 convolutional layer with a stride of 2 (followed by a max pool), whereas SWIN-Tiny’s stem is a patchifying layer that divides the input image into non-overlapping patches of size 4x4.

(Left) The stem cell in SWIN-Tiny is patch-ifying. (Right) The stem cell in ResNet-50 is a 7x7 conv with a stride of 2

So the authors replaced the stem cell of ResNet-50 with a 4x4 convolutional layer with a stride of 4, producing non-overlapping “patches” with a receptive field of equivalent size.

Results: This change improved accuracy by 0.1% over the previous step, from 79.4% to 79.5%.
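As a hedged sketch in PyTorch (the 96-channel width follows SWIN-Tiny; the variable names are illustrative, not from the official implementation), the two stems look like this:

```python
import torch
import torch.nn as nn

# ResNet-style stem: an overlapping 7x7 conv with stride 2, followed by a stride-2 max pool.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single non-overlapping 4x4 conv with stride 4.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```

Both stems downsample a 224x224 input by 4x, but the patchify stem does it with a single non-overlapping convolution.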

1.3 🍰 Macro-design Change 3: Depthwise Convolutions

The authors observed that the weighted-sum operation in the self-attention mechanism is similar to how depthwise convolution works. [Nerd fact 🤓: depthwise convolution is like a normal convolution, except that each channel of the kernel is convolved only with its corresponding input channel, producing an output feature map with the same number of channels as the input.]

Top: Self-Attention, Bottom: Depthwise convolution. Value matrix is analogous to feature map, whereas kernel is analogous to attention map.

Result: The authors replaced the normal convolutions with depthwise convolutions in all ResNet blocks and increased the network width at the stem output (doubling the number of channels from stage to stage). As a result, accuracy improved by 1.0%, from 79.5% to 80.5%.
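A minimal PyTorch sketch of the idea (the channel count of 96 is an illustrative choice): setting `groups` equal to the number of channels turns a regular convolution into a depthwise one, so each filter mixes information only spatially, per channel, much like the per-token weighted sum in self-attention mixes values.

```python
import torch
import torch.nn as nn

dim = 96
regular_conv   = nn.Conv2d(dim, dim, kernel_size=3, padding=1)              # mixes channels and space
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # one filter per channel

x = torch.randn(1, dim, 56, 56)
print(depthwise_conv(x).shape)  # torch.Size([1, 96, 56, 56]) -- same channel count as the input
```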

1.4 🍰 Macro-design Change 4: Inverted Bottleneck

The authors compared the multi-layer perceptron (MLP) block of the SWIN transformer with a block of the modified ResNet. In SWIN-Tiny’s MLP block, the hidden dimension is 4x the input dimension, whereas in a ResNet bottleneck the 1x1 conv layers first project the input features down, apply the 3x3 conv, and then project them back up. The authors therefore inverted the bottleneck so that the hidden dimension is wider than the input, matching SWIN-Tiny’s MLP block.

Modified ResNet block, inspired by ResNeXt and SWIN-Tiny’s MLP block

Results: This change gave a further 0.1% improvement, which is not much, but the model’s accuracy still ticked up from 80.5% to 80.6%.
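A hedged sketch of the two block shapes in PyTorch (the widths are illustrative, not the paper’s exact configuration):

```python
import torch
import torch.nn as nn

# Classic ResNet bottleneck: wide -> narrow -> wide (1x1 reduce, 3x3, 1x1 expand).
resnet_bottleneck = nn.Sequential(
    nn.Conv2d(384, 96, kernel_size=1),            # project down
    nn.Conv2d(96, 96, kernel_size=3, padding=1),  # spatial conv on the narrow tensor
    nn.Conv2d(96, 384, kernel_size=1),            # project back up
)

# Inverted bottleneck, as in SWIN-Tiny's MLP block: narrow -> 4x wide -> narrow.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(96, 384, kernel_size=1),                          # expand 4x
    nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=384),  # depthwise spatial mixing
    nn.Conv2d(384, 96, kernel_size=1),                          # project back
)

x = torch.randn(1, 96, 56, 56)
print(inverted_bottleneck(x).shape)  # torch.Size([1, 96, 56, 56])
```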

1.5 🍰 Macro-design Change 5: Larger Kernel Sizes

The authors observed that the convolutional kernels in ResNet-50 are 3x3, whereas the local attention windows in the MSA blocks of a SWIN transformer are 7x7. They presumed these larger window sizes might be one reason for the SWIN transformer’s better performance over conv nets.

Kernel size comparison between modified ResNet50 and SWIN-Tiny

Result: The authors did not observe any further improvement from this change; kernel sizes of 5 and 7 gave performance similar to the previous model, so they kept the kernel size at 7 in the new architecture.
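Because the convolution is depthwise, growing the kernel from 3x3 to 7x7 is cheap in parameters, since each channel only needs one kxk filter. A quick illustrative check in PyTorch (the width of 96 is arbitrary):

```python
import torch.nn as nn

dim = 96

def param_count(module: nn.Module) -> int:
    # Total number of learnable parameters (weights + biases).
    return sum(p.numel() for p in module.parameters())

dw3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
dw7 = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
print(param_count(dw3), param_count(dw7))  # 960 vs. 4800
```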

After all macro changes, this is what the modified ResNet50, which is now “almost ConvNeXt”, looks like.

“Almost ConvNeXt” after Macro Changes

2. 🍕 Micro-Design Changes:

To complete the series of modifications, the authors also made some minor changes to the network to arrive at the final ConvNeXt block. These micro changes are listed below (a code sketch of the resulting block appears after the list):

a. Replace the ReLU activation with GELU in the blocks, because SWIN-Tiny’s MLP block uses GELU, not ReLU.

b. Use fewer activation functions: instead of a non-linearity after every layer, keep just one GELU per block, placed between the two 1x1 (pointwise) conv layers.

c. Use fewer normalization layers: keep just one per block, placed after the main depthwise conv layer.

d. Use LayerNorm in place of BatchNorm, as LN is what the MSA blocks of SWIN-Tiny use.

e. Use separate downsampling layers: instead of downsampling inside the residual blocks with strided convolutions as ResNet does, a standalone downsampling layer (a 2x2 conv with stride 2, preceded by a LayerNorm) sits between stages, as in SWIN.

Final ConvNeXt block vs. ResNet and SWIN-Tiny block
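Putting the macro and micro changes together, here is a minimal, hedged sketch of a ConvNeXt-style block in PyTorch (the class name, widths, and layer names are illustrative, not the official implementation):

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Illustrative block: 7x7 depthwise conv, one LayerNorm, inverted 4x MLP, one GELU."""
    def __init__(self, dim: int = 96):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # large-kernel depthwise
        self.norm = nn.LayerNorm(dim)                            # single normalization layer (LN, not BN)
        self.pwconv1 = nn.Conv2d(dim, 4 * dim, kernel_size=1)    # expand 4x (inverted bottleneck)
        self.act = nn.GELU()                                     # single activation, between the 1x1 layers
        self.pwconv2 = nn.Conv2d(4 * dim, dim, kernel_size=1)    # project back

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        # LayerNorm expects channels last, so permute around it.
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return residual + x

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtStyleBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```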

Results: Overall, the micro changes improved accuracy by a total of 1.4%, from 80.6% to 82.0%, and at this point ConvNeXt outperformed SWIN-Tiny, which achieves 81.3% on ImageNet classification.

3. 🔍 Results

3.1 🔥 Evaluation on ImageNet

The authors observed that ConvNeXt outperforms DeiT, ViT, SWIN, EfficientNet, and RegNet on the ImageNet-1K classification task, both with and without ImageNet-22K pretraining. [Source: Table 1 in the paper]

They also observed that if the feature map is kept at 14x14 throughout the ConvNeXt, like in ViT, it performs on par with ViT on the ImageNet-1K classification task too. [Source: Table 2 in the paper]

3.2 🔥️ Evaluation on downstream tasks

One of the reasons for picking the SWIN transformer as the reference for modernizing ResNet-50 was its generalizability. Among the available vision transformer variants, SWIN received the Marr Prize because it was presented as a generic backbone that does well not only on image classification but also on detection and segmentation.

The authors evaluated ConvNeXt on COCO object detection [Source: Table 3 in the paper] and ADE20K semantic segmentation [Source: Table 4 in the paper] and observed that ConvNeXt outperforms SWIN and ResNet on these tasks too.

Comparison of ConvNeXt on ImageNet-1K classification with transformer-based vision models including ViT, SWIN, and DeiT.

That’s all folks!

Happy Learning❤️


Machine Learning Engineer @ Super.ai | ML Researcher | Fulbright scholar '22 | Sitar Player