ResNet: The Most Popular Network in the Computer-Vision Era
Must-Know Background for Computer-Vision Practitioners
Classifying images with a computer algorithm seems challenging. Astonishingly, recent work in computer vision achieves a 1.3% top-5 error on the ImageNet dataset. As of 2020, the state of the art in image classification had shifted to EfficientNet, published by the Google Research team. However, ResNet performed strongly in image classification for a long period before that. Moreover, many researchers still use ResNet as the backbone of their networks to improve performance. This article will help you understand what ResNet is and the intuition that motivates it.
Link: https://paperswithcode.com/sota/image-classification-on-imagenet
Degradation Problem
Deep neural networks suffer from many difficulties in the learning process. Computer-vision researchers have proposed solutions to several of them, such as mitigating vanishing/exploding gradients with Batch Normalization (https://arxiv.org/pdf/1502.03167.pdf). The ResNet paper introduces another challenging problem, the “degradation problem.” Before reading on, think about the question below.
More layers, better accuracy?
It seems intuitive that adding layers to a network can only enlarge the set of functions it can represent. If every added layer is an identity mapping, the new network outputs exactly the same values as the original network. Thus, it seems persuasive that more layers in a well-trained network should mean higher (or at least no worse) classification accuracy. Unfortunately, that is not what happens in practice.
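To make the identity-layer argument concrete, here is a tiny sketch (the shallow network and its layer sizes are made up for illustration): appending identity layers to a network leaves its outputs unchanged, so the deeper network can in principle do at least as well.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A hypothetical shallow "well-trained" network (sizes chosen arbitrarily).
shallow = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

# A deeper network built from the shallow one by appending identity mappings.
deeper = nn.Sequential(shallow, nn.Identity(), nn.Identity())

x = torch.randn(4, 16)
# The deeper network reproduces the shallow network exactly,
# so in principle it should never perform worse.
assert torch.allclose(shallow(x), deeper(x))
```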
When you measure the accuracy of plain networks (networks without shortcuts, as used before ResNet), accuracy degrades rapidly as model depth increases. This is the degradation problem. It is not an overfitting problem: the network’s performance drops even on the training set as the model grows deeper. The authors argue that plain networks are not well suited to approximating identity mappings, so adding layers does not guarantee that the deeper network can express everything the original network could. The motivation of ResNet is to build a network for which identity mappings are easy to approximate.
Shortcut-Connection
To build such a network, the authors used a method named the shortcut connection. The main intuition is that, rather than forcing the stacked layers to learn a target mapping H(x) directly, they learn the residual F(x) = H(x) − x, and the block outputs F(x) + x. This makes an identity mapping easy to learn: if the layer weights are all driven to 0, the block produces an identity mapping instead of a zero mapping. Moreover, the addition is differentiable, so the network remains end-to-end trainable.
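A minimal sketch of such a residual block is shown below, written in PyTorch for illustration. The 3x3 convolution and batch-normalization layout follows the spirit of the paper’s basic block, but the exact configuration here is an assumption, not the authors’ reference implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A basic residual block: the layers learn F(x) and the block outputs F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))  # F(x)
        # If the weights are driven to 0, F(x) -> 0 and the block reduces to an identity mapping.
        return F.relu(out + x)           # F(x) + x
```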
Another consideration for shortcut connections is whether to add a projection to the identity path. Since the dimensions on the two ends of a shortcut can differ, the paper compares three options: (A) zero-padding for the increased dimensions, (B) projection shortcuts only where the dimensions change, and (C) projections for all shortcuts. The table below evaluates each case (the A, B, and C after ResNet-34 mean options A, B, and C applied to ResNet-34).
The results reveal that using projections on the identity path does not seriously impact performance, while changing the number of parameters makes the comparison with plain networks harder. Thus, the authors simply used identity mappings in the network.
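For shortcuts where the dimensions do change, a projection (option B or C) is typically implemented as a 1x1 convolution on the shortcut path. The sketch below is one common way to do this; the stride-2, 1x1-convolution choice is an assumption for illustration, not the paper’s exact code.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionBlock(nn.Module):
    """Residual block whose shortcut changes dimensions via a 1x1-convolution projection."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: matches the spatial size and channel count of F(x).
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.proj(x))
```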
Overall Backbone
For the detailed structure of the network, please refer to the paper.
Link: https://arxiv.org/pdf/1512.03385.pdf
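As a rough orientation, ResNet-34 stacks four stages of 3, 4, 6, and 3 residual blocks after an initial 7x7 convolution and max pooling, and ends with global average pooling and a fully connected layer. The sketch below reuses the ResidualBlock and ProjectionBlock classes from the earlier snippets; it is a simplified layout, and Table 1 of the paper is the authoritative specification.

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, num_blocks, downsample):
    # The first block of a stage may change width/resolution via a projection shortcut;
    # the remaining blocks keep identity shortcuts.
    first = ProjectionBlock(in_ch, out_ch, stride=2) if downsample else ResidualBlock(out_ch)
    rest = [ResidualBlock(out_ch) for _ in range(num_blocks - 1)]
    return nn.Sequential(first, *rest)

# ResNet-34-style layout: stages of 3, 4, 6, and 3 basic blocks.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    make_stage(64, 64, 3, downsample=False),
    make_stage(64, 128, 4, downsample=True),
    make_stage(128, 256, 6, downsample=True),
    make_stage(256, 512, 3, downsample=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 1000),  # 1000 ImageNet classes
)
```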
Experiments
The authors compared two networks: a plain network and ResNet. The two networks use the same layers; only ResNet has shortcut connections. They experimented on two datasets, ImageNet and CIFAR-10. The graphs below show the results of the experiments.
(The thin curves denote training errors, and the bold curves denote validation errors)
As you can see from the graph, the plain network’s training error increases as the number of layers increases, which means the plain network suffers from the degradation problem. How about ResNet?
No more degradation problem: as the number of layers increases, the training error decreases.
The authors added more layers to ResNet to build more complex models. As expected, increasing the number of layers improved performance. The tendency was similar when the experiment was run on CIFAR-10.
However, we can observe that with 1202 layers, performance drops significantly. The paper argues that this is due to overfitting. Even with this drop, the network still outperforms the earlier methods.
Conclusion
ResNet was motivated by the degradation problem. Through an intuitive approach, the authors designed a network well suited to approximating identity mappings. The experiments show that ResNet addresses the degradation problem remarkably well, although performance still degrades for extremely deep networks.
I appreciate any feedback about my articles. For any discussion, you are welcome to email me. If something is wrong or misunderstood, please tell me. :)
Contact me: jeongyw12382@postech.ac.kr
For further reading
D2-matching Explanation: