A ConvNet for the 2020s: 👍 or 👎?
Facebook AI Research Lab, UC Berkeley, CVPR2022
This article is the fifth paper of the “Transformers in Vision” series, which comprises summaries of recent papers (submitted between 2020 and 2022 to top conferences) on transformers in vision.
- NerdFacts 🤓 contain additional intricate details; you can skip them and still get the high-level flow of the paper!
- Paper Link: https://arxiv.org/pdf/2201.03545.pdf
✅ Background
Let me tell you the story of how convolutional neural networks were born. Yann LeCun was a curious postdoctoral researcher in Geoffrey Hinton’s group, where he demonstrated that locally connected neural network architectures are well suited to analyzing visual data. LeCun then joined AT&T Bell Labs and proposed the first-ever convolutional neural network, LeNet, for handwritten digit recognition. It became one of his most cited works ever and showed the world that neural networks could be applied to real-world problems.
LeNet was the first widely accepted convolutional neural network. Then came the revolutionary AlexNet, which set the research community in motion: a whole family of CNN variants appeared for vision tasks like classification, detection, and segmentation.

That run lasted until the release of the Vision Transformer (ViT) in 2020. ViT was the first convolution-free architecture which, when trained on large amounts of data, outperformed convolutional state-of-the-art networks on ImageNet classification. After ViT, the research community turned towards fixing its potential flaws and unlocking its full “potential”, to confirm the hypothesis that transformers are better than CNNs in vision! Famous works of the transformer era include DeiT, Swin, etc.
🧐 So what now!
This work is an effort from the Facebook AI Research lab to revive convolutional-network research in the computer vision community, arguing that CNNs are “old gold” and can still beat transformer-based vision models.
⏭️ ConvNeXt
This paper sets out to modernize the traditional convolutional neural network into a strong competitor to advanced vision architectures. The authors start with a plain ResNet-50, compare it against a Swin Transformer, and apply a series of macro and micro design changes; the result is a model that outperforms Swin on ImageNet classification and sets a new SOTA.
Let’s go through those changes one by one.
1. Macro-Design Changes 🎂 :
1.1 🍰 Macro-design Change 1: Changing stage compute ratio
The authors noted that the ratio of blocks across the stages of Swin-Tiny is 1:1:3:1, whereas in ResNet-50 it is 3:4:6:3. So they changed the stage compute ratio of ResNet-50 to 3:3:9:3, matching Swin-Tiny’s 1:1:3:1.
Result: This change improved ImageNet classification accuracy by 0.6% over the plain ResNet-50 baseline, from 78.8% to 79.4%.
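As a quick sanity check, the ratios above can be verified in a few lines of plain Python. The block counts are from the paper (Swin-Tiny stacks 2, 2, 6, 2 transformer blocks, which reduces to the 1:1:3:1 ratio quoted above); the helper function is my own illustration.

```python
from functools import reduce
from math import gcd

resnet50_stages = (3, 4, 6, 3)   # original ResNet-50 block counts
swin_t_stages = (2, 2, 6, 2)     # Swin-Tiny -> ratio 1:1:3:1
convnext_stages = (3, 3, 9, 3)   # modernized ResNet-50

def stage_ratio(stages):
    """Reduce block counts per stage to the smallest integer ratio."""
    g = reduce(gcd, stages)
    return tuple(s // g for s in stages)

print(stage_ratio(swin_t_stages))    # (1, 1, 3, 1)
print(stage_ratio(convnext_stages))  # (1, 1, 3, 1)
```

Both Swin-Tiny and the modified ResNet now spend their compute in the same proportions across stages, even though the absolute block counts differ.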
1.2 🍰 Macro-design Change 2: “Patch-ifying ” stem
The stem cell is the first layer of a neural network, the one that processes the input image. In ResNet, this layer is a 7x7 convolution with a stride of 2, whereas Swin-Tiny’s stem cell is a patchifying layer that divides the input image into non-overlapping patches of size 4x4.

So the authors replaced the stem cell of ResNet-50 with a 4x4 convolution with a stride of 4, producing non-overlapping “patches” with a receptive field of equivalent size.
Results: This change gave a further 0.1% improvement over the previous stage, moving the model’s accuracy from 79.4% to 79.5%.
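A minimal PyTorch sketch of the two stems makes the difference concrete (channel counts follow ResNet-50 and Swin-Tiny; the rest is my own illustration). Both reach the same 4x downsampling, but the patchify stem does it in a single non-overlapping convolution:

```python
import torch
import torch.nn as nn

# ResNet-50 stem: 7x7 conv with stride 2, then a 3x3 max-pool with stride 2
# -> overlapping windows, 4x total downsampling.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a 4x4 conv with stride 4 -> non-overlapping patches,
# the same 4x downsampling in one layer. 96 output channels follows Swin-T.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```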
1.3 🍰 Macro-design Change 3: Depthwise Convolutions
The authors observed that the weighted-sum operation in the self-attention mechanism is similar to how depthwise convolution works. [Nerd fact 🤓: depthwise convolution is similar to normal convolution except that each kernel channel is convolved only with the corresponding input channel, producing an output feature map with the same number of channels as the input.]
Result: The authors replaced the normal convolutions with depthwise convolutions in all ResNet blocks and, to compensate, widened the network from 64 to 96 channels to match Swin-Tiny. As a result, accuracy improved by 1.0%, from 79.5% to 80.5%.
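In PyTorch, a depthwise convolution is just a regular convolution with `groups` set to the number of channels. A small sketch (the width of 96 matches the widened stem above; the comparison itself is my own illustration) shows how drastically this cuts the parameter count, which is why the authors could afford to widen the network:

```python
import torch.nn as nn

c = 96  # channel width after widening the network (64 -> 96, matching Swin-T)

# Regular 3x3 conv: every output channel mixes all input channels.
regular = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)

# Depthwise 3x3 conv: groups=c means each channel is convolved only with
# its own 3x3 kernel -- spatial mixing without channel mixing, loosely
# analogous to the per-channel weighted sum in self-attention.
depthwise = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False)

n_reg = sum(p.numel() for p in regular.parameters())   # 96 * 96 * 3 * 3 = 82944
n_dw = sum(p.numel() for p in depthwise.parameters())  # 96 * 1 * 3 * 3 = 864
print(n_reg, n_dw)
```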
1.4 🍰 Macro-design Change 4: Inverted Bottleneck
The authors compared the multi-layer perceptron (MLP) block in the Swin Transformer with the bottleneck block in the modified ResNet. They observed that in Swin-Tiny’s MLP block the hidden dimension is 4x the input dimension (wide in the middle), whereas a ResNet bottleneck first projects the input down with 1x1 convolutions, applies the 3x3 convolution on the narrow features, and then projects back up (narrow in the middle). The authors inverted the ResNet bottleneck so that its feature dimensions follow the same wide-in-the-middle pattern as Swin-Tiny’s MLP block.
Results: This change brought a modest 0.1% improvement, moving the model’s performance from 80.5% to 80.6%.
1.5 🍰 Macro-design Change 5: Larger Kernel Sizes
The authors observed that the convolutional kernels in ResNet-50 are 3x3, whereas the local attention windows in the MSA blocks of a Swin Transformer are 7x7. They presumed these larger window sizes might be one reason for the Swin Transformer’s better performance over ConvNets.

Result: The authors did not observe further improvement beyond a certain point; kernel sizes of 5 and 7 performed similarly, so they kept a 7x7 kernel in the new architecture.
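One detail from the paper worth noting: to make the 7x7 kernel affordable, the depthwise convolution is also moved to the top of the block, so the expensive spatial mixing runs on the narrow features and the 1x1 layers do the channel mixing afterwards, mirroring the MSA-then-MLP order in a Swin block. A hedged sketch (names and input size are my own):

```python
import torch
import torch.nn as nn

dim = 96

# Depthwise conv moved up: 7x7 spatial mixing on the narrow dim-channel
# features first, then the 1x1 layers handle channel mixing.
large_kernel_block = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # project
)

x = torch.randn(1, dim, 28, 28)
print(large_kernel_block(x).shape)  # torch.Size([1, 96, 28, 28])
```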
After all macro changes, this is what the modified ResNet50, which is now “almost ConvNeXt”, looks like.
2. 🍕 Micro-Design Changes:
To complete the series of modifications, the authors also made some minor changes to the network to arrive at the final ConvNeXt block. These are called micro changes and are listed below:
a. Replace ReLU activation with GELU in blocks, because Swin-Tiny’s MLP block uses GELU, not ReLU.
b. Use fewer activation functions: instead of a non-linearity after every layer, use a single non-linearity between the two 1x1 conv layers of a block.
c. Use fewer normalization layers: just one, after the depthwise conv layer of a block.
d. Use LayerNorm in place of BatchNorm, as LN is what Swin-Tiny’s blocks use.
e. Separate downsampling layers: ResNet downsamples inside the first block of each stage (via strided convolutions), whereas Swin uses a dedicated patch-merging layer between stages. ConvNeXt follows Swin and downsamples with a standalone 2x2 conv of stride 2 between stages, adding a normalization layer around the resolution change to keep training stable.
Results: Overall, the micro changes improved accuracy by a total of 1.4%, from 80.6% to 82.0%. At this point ConvNeXt outperforms Swin-Tiny, which achieves 81.3% on ImageNet classification.
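Putting the macro and micro changes together, the resulting block can be sketched in PyTorch as below. This is a simplified sketch of my own: the official block additionally uses layer scale and stochastic depth, which are omitted here for clarity.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block with the micro changes applied:
    a single LayerNorm (after the depthwise conv) and a single GELU
    (between the two pointwise layers)."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # LN over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv expressed as Linear
        self.act = nn.GELU()                    # the block's only activation
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C) for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return residual + x

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```

Note how the depthwise 7x7 conv plays the role of Swin’s windowed attention (spatial mixing), while the expand-then-project Linear pair plays the role of its MLP (channel mixing).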
3. 🔍 Results
3.1 🔥 Evaluation on ImageNet
The authors observed that ConvNeXt outperformed DeiT, ViT, Swin, EfficientNet, and RegNet on the ImageNet-1K classification task, both with and without pre-training on ImageNet-22K. [Source: Table 1 in the paper]
They also observed that a ConvNeXt that keeps the feature size fixed at 14x14 throughout, like ViT, performs on par with ViT on the ImageNet-1K classification task too. [Source: Table 2 in the paper]
3.2 🔥️ Evaluation on downstream tasks
One of the reasons for picking the Swin Transformer as the reference for modernizing ResNet-50 was its generalizability. Among the available vision transformer variants, Swin received the Marr Prize because it was presented as a generic backbone that does well not only on image classification but on segmentation and detection too.
The authors evaluated ConvNeXt on COCO object detection [Source: Table 3 in the paper] and ADE20K segmentation [Source: Table 4 in the paper] and observed that ConvNeXt outperforms Swin and ResNet on both tasks.
That’s all folks!
Happy Learning❤️