ResNet Paper Summary

Moe chabot
4 min read · May 19, 2022


ResNet is a paper that introduced a new architecture for image recognition; here is the full paper.

Problem

Many models before ResNet simply added more and more layers, with the expectation that the deeper model would do at least as well as the shallower one. The logic behind this is that the deeper model has all the same parts as the shallow model, plus more.

Shallow model

Deep Model

For the deep model to do at least as well as the shallow model, the new hidden layers (the ones in blue in the picture above) only need to pass along the same output as the old hidden layers (in fancy words, the new layers have to learn the identity mapping).
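As a toy illustration of that argument (a PyTorch sketch of my own, not something from the paper): if the extra layers could simply compute the identity, the deeper model would reproduce the shallow model's output exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shallow = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# The "deeper" model reuses the shallow layers and appends extra layers
# that compute the identity mapping, so its output is unchanged.
deep = nn.Sequential(*shallow, nn.Identity(), nn.Identity())

x = torch.randn(4, 8)
print(torch.allclose(shallow(x), deep(x)))  # True
```

The catch, as the experiment below shows, is that real stacked layers have to learn this identity behaviour from data, and in practice they struggle to do so.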

So the authors of the ResNet paper asked a very basic question: "Is learning better networks as easy as stacking more layers?" To determine whether you could just keep adding layers and get better and better models, they ran an experiment: they trained two otherwise similar models, one with 20 layers and the other with 56, and compared them.

Trained on CIFAR-10; the figure is from the paper.

As you can see above, the model with more layers actually did worse than the model with fewer layers. One possible explanation is that the larger model suffered from the vanishing gradient problem. While this was a problem for large models in the past, it has been mostly fixed by batch normalization (and other normalized initializations). So the main problem with the deeper model seems to be that its new layers can't learn to do nothing (they can't learn the identity mapping).

Solution

The solution the authors came up with was to add shortcut connections to the model. A shortcut connection basically splits the input into two parts: one part waits on the side while the other goes through a few new layers, and at the end the two are combined through addition.

The thought process behind why this should work is as follows: as we saw in the last part, the deeper model was not able to find the identity easily, but in this new model the identity is given for free, and the new layers only have to learn what to add on top of it.
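To make the idea concrete, here is a minimal sketch of such a block in PyTorch (my choice of framework, not the paper's code). The stacked layers compute F(x) and the block outputs F(x) + x, so if the layers learn to output zero, the block reduces to the identity, which is exactly what the plain deep network struggled to learn.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        # Two 3x3 conv layers form the residual branch F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # one part waits on the side
        out = self.relu(self.bn1(self.conv1(x)))   # the other goes through new layers
        out = self.bn2(self.conv2(out))
        out = out + identity                       # combined through addition
        return self.relu(out)

# Quick check: the block leaves the tensor shape unchanged.
x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```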

Results

In order to test their new idea, they created two different kinds of networks (for more details about the architecture, look in the paper):

  1. Plain network: no shortcut connections (18 layers and 34 layers)
  2. Residual network: has shortcut connections (18 layers and 34 layers)

Plain Network: with the plain network, the problem discussed before appears; the 34-layer version does worse than the 18-layer version.

Residual Network: with the residual network, the problem goes away; the 34-layer model did better than the 18-layer model.
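If you want to poke at the 18- and 34-layer residual networks yourself, torchvision ships implementations under the same names. A small usage sketch, assuming a recent torchvision (older versions take `pretrained=False` instead of `weights=None`); the post itself doesn't mention torchvision.

```python
import torch
from torchvision import models

# Untrained 18- and 34-layer residual networks (no pretrained weights).
resnet18 = models.resnet18(weights=None)
resnet34 = models.resnet34(weights=None)

x = torch.randn(1, 3, 224, 224)              # one ImageNet-sized image
print(resnet18(x).shape, resnet34(x).shape)  # both: torch.Size([1, 1000])
```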

The shortcut connections used for this task look like this:

A shortcut connection can skip two or more layers, but not just one, because then the block would only be a linear function of its input (plus the identity) and would offer no advantage. Another possible issue is that for the shortcut to add the identity to the new features, the two must be the same size. There are two solutions to this problem: zero padding and projection. Zero padding just adds zeros to make the feature maps the same size, while projection maps the input to the correct size with learned weights. The benefit of zero padding is that it adds no computational complexity, whereas projection does. In the paper they decided to use a mix of the two options.
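Here is a rough sketch of the two options for a shortcut whose residual branch halves the spatial size and doubles the channel count (again PyTorch; the exact stride and padding details are my assumptions, not copied from the paper's code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)     # input to the block
out = torch.randn(1, 128, 16, 16)  # output of the residual branch (stride 2, doubled channels)

# Option A: zero padding -- subsample spatially, then pad the missing
# channels with zeros. Adds no parameters and no extra computation.
identity = x[:, :, ::2, ::2]                     # stride-2 subsampling: 64 x 16 x 16
identity = F.pad(identity, (0, 0, 0, 0, 0, 64))  # pad channel dim: 128 x 16 x 16
print((out + identity).shape)                    # torch.Size([1, 128, 16, 16])

# Option B: projection -- a strided 1x1 convolution maps the input to the
# right shape, at the cost of extra parameters and computation.
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
print((out + projection(x)).shape)               # torch.Size([1, 128, 16, 16])
```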

Conclusion

Congrats, we did it: now we can add more and more layers and the model will do better and better. Or can we? In the last part of the paper, the authors built a very large residual network with 1202 layers, and it did worse than a residual network with 110 layers. The reasoning they give in the paper is that the 1202-layer model overfit the data. They also suggested a possible fix, adding regularization methods such as dropout, but did not run any experiments with it.
