ResNet — “Deep Residual Learning for Image Recognition” paper explanation

Nurlan Imanov
11 min read · Nov 21, 2021


Making a network deeper can hurt its ability to do well on the training set. But that is not true when you train a ResNet!

Source: https://blog.acolyer.org/2017/03/22/convolution-neural-networks-part-3/

Introduction

Deep neural networks perform better than shallower ones. As the depth of the network increases, the model becomes able to learn more complex patterns in the training data. Thus, we might expect that harder problems can be solved simply by making the network deeper. At that point, a question arises: is learning better networks as easy as stacking more layers?

Problems when training very deep neural networks

It turns out there are two main obstacles to answering the question above. The first is the “vanishing/exploding gradients” problem, which is addressed by weight initialization and batch normalization techniques, and the second is the “degradation” problem. The paper proposes a “deep residual learning” approach that solves this second problem. Let’s see what the “degradation” problem is.

“Degradation” problem

As mentioned above, the “vanishing/exploding gradients” problem is addressed by weight initialization and batch normalization, which enable networks with dozens of layers to start converging. Before these techniques, vanishing/exploding gradients were a huge barrier to training deep neural networks. Once deeper networks are able to start converging, a “degradation” problem is exposed: as the network depth increases, accuracy gets saturated and then degrades rapidly.

In fact, this is not caused by overfitting, since adding more layers to the model unexpectedly leads to higher training error. As we see in the picture below, the 20-layer “plain” network (a plain network is a typical feed-forward network, not a ResNet) performs better than the 56-layer one on the training set, which is strange: we would at least expect the 56-layer “plain” network to do better on the training set.

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error.

The problem we saw above is exactly the “degradation” problem. In theory, adding a new layer to a neural network should make performance better, or at least no worse, than the shallower network. In reality, as we have seen, adding more layers can actually spoil the model’s performance.

This picture illustrates the “degradation” problem. Source: https://medium.com/swlh/resnet-a-simple-understanding-of-the-residual-networks-bfd8a1b4a447

So why does this happen? Why does adding more layers not give at least the same performance as the shallower network, and sometimes even give a worse result?

Let’s understand the problem with the image below. Imagine that we have achieved 80% accuracy on the training set with “Neural Net #1” (2 hidden layers). Then we decide to add 3 more layers to it, creating “Neural Net #2”. After training “Neural Net #2” we expect to achieve at least 80% accuracy, because the first 2 hidden layers already reach 80%, so the 3 newly added layers only have to learn new, more complex patterns to boost performance. If there are new patterns to learn, the 3 added layers will learn them and improve the result; if there is nothing new to learn, they can simply act as an identity function and carry what the first 2 hidden layers learned through to the output, which would still yield 80% accuracy. That is the scenario we expect. Let’s see what happens in reality.

The example described above. Source: https://medium.com/@realmichaelye/intuition-for-resnet-deep-residual-learning-for-image-recognition-39d24d173e78

However, as we saw in Figure 1, in reality the deeper network fails to match the shallower network’s performance when new layers are added, and it even performs worse. This is the “degradation” problem.

As we might have already guessed, the problem is that the network cannot learn the identity function for the newly added layers during training when it is needed, so it cannot preserve the outputs of the previous layers (as in the example above) and deliver at least the same performance without hurting it.

Since the depth of representations is of central importance for many visual recognition tasks, the “degradation” problem is a huge barrier. First, it prevents us from adding new layers to our model, so we cannot learn more complex patterns (a really serious drawback). Second, it does not even let us match the performance of the shallower network; in other words, it spoils the information learned in the previous layers because the added layers cannot learn the identity function when needed.

It would be ideal if we could give the network the ability to figure out for itself when to force a layer to act as an identity function and when to act as a typical layer and learn something new. Some layers will need to act as an identity function to carry forward the information learned in previous layers without hurting it, while others will not, so this has to be flexible and decided by the network itself. ResNets solve this!

How do ResNets solve the “degradation” problem?

As we may know, each layer (or group of layers) of a neural network tries to learn some mapping function; let’s call it H(x).

In each layer we want to learn some H(x) function.

As we saw in the example above, the newly added layers in deeper networks fail to learn even the identity function that is required to carry the result to the output. In other words, it is really hard for a neural network to learn the identity function during training. Let’s see how ResNet helps the network learn the identity function when needed.

In a deep residual network, instead of hoping that each few stacked layers directly fit a desired underlying mapping H(x) (and, in particular, hoping that H(x) becomes the identity function when needed), we explicitly let these layers fit a residual mapping F(x) = H(x) - x. In other words, within the block we learn the F(x) function instead of H(x), and output F(x) + x. Now, when the block needs to act as an identity function, it only has to learn the zero function rather than the identity: when F(x) = 0, H(x) = 0 + x = x, so H(x) becomes the identity mapping.

It turns out that the zero function is much easier for a neural network to learn during training than the identity function. The network can learn the zero function simply by pushing all the weights of the corresponding layers toward zero (which L1/L2 regularization already encourages), whereas to learn the identity function it has to find a very specific combination of weights and biases. In a plain deep network it often fails to find that combination for every layer that should act as the identity, so it spoils the information learned in previous layers, and we get worse results. By letting the network learn the zero function (which is much easier for it) instead of the identity when needed, we make H(x) the identity mapping: H(x) = 0 + x = x.

In the residual network we learn the F(x) function with 2 layers instead of learning H(x) directly. This structure is also called a “residual block”.
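To make this concrete, here is a minimal sketch of such a residual block in PyTorch (my own illustration, not the paper’s code; the class name BasicResidualBlock and the channel sizes are made up). The two stacked convolutions learn F(x), and the skip connection adds x back before the final activation:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """A two-layer residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions with padding=1 ("same" padding) so F(x) keeps x's shape.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first layer of F(x)
        out = self.bn2(self.conv2(out))           # second layer of F(x)
        return self.relu(out + x)                 # H(x) = F(x) + x

# If the conv weights are pushed to zero, F(x) = 0 and the block reduces to
# (a ReLU of) the identity mapping H(x) = x.
x = torch.randn(1, 64, 32, 32)
print(BasicResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```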

You may wonder why F(x) is learned with 2 layers rather than 1. The paper addresses this point directly: “The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: y = W1 x + x, for which we have not observed advantages.” In other words, they tried using only 1 layer in their experiments but did not observe any improvement. They also emphasize that there is no rule that F must always have exactly 2 layers; it can have more. The most common ResNets (ResNet-50/101/152) actually use 3 layers per skip.
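For those deeper ResNet-50/101/152 models, the paper uses a 3-layer “bottleneck” form of F (1*1, 3*3, 1*1 convolutions). A rough sketch under the same assumptions as above (the class name and channel sizes are mine, not torchvision’s exact implementation):

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """3-layer residual function F(x): 1x1 reduce, 3x3, 1x1 expand."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # H(x) = F(x) + x

# e.g. a block with 256 input/output channels squeezed down to 64 inside
print(BottleneckBlock(256, 64)(torch.randn(1, 256, 8, 8)).shape)  # torch.Size([1, 256, 8, 8])
```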

There is another point to mention. So far we have mostly talked about learning the identity function and seen that adding residual blocks helps the network learn it easily. But being able to learn just the identity function is not our ultimate goal; our purpose is not merely to avoid hurting performance while adding new layers. That alone would not be rational: if the simpler model already gives the same performance, “let’s add new layers without hurting performance” is not much of a goal. Of course, we also want to boost performance when we add new layers, since we are making the network more expressive. This is also where the name “residual network” comes from: we actually learn residuals in the “residual blocks”.

For instance, suppose the layer before a residual block outputs 6 (i.e. the block’s input is x = 6), and with 6 the model reaches 80% accuracy, but the output should really have been 5 (imagine the accuracy would be 82% if it were 5). Then the block’s mission is to learn F(x) = -1 (not zero in this case) and add it to the input: H(x) = -1 + 6 = 5. In this way the residual block can also boost performance. That is another reason they used 2 layers rather than 1 and mentioned it can be more: having multiple layers helps approximate the residual better, and it is worthwhile to approximate it with more layers. And of course, when the block sees that 6 was already perfect, its mission becomes learning the zero function and making H(x) the identity mapping, which is the case we discussed above. The point is that our mission is not always to learn the identity function; the network itself decides. If a residual block finds that the current output is already good, it learns the zero function so as not to hurt performance; if not, it learns a residual to add to the current output and improve it. The residual can be positive, negative, or zero; the example above just happened to decrease the current output.
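As a toy illustration of the arithmetic in that example (the numbers 6, 5 and -1 are made up, just as in the text):

```python
def residual_block_output(x, f_of_x):
    """Toy scalar version of H(x) = F(x) + x."""
    return f_of_x + x

print(residual_block_output(6, -1))  # 5: the block corrects the previous output
print(residual_block_output(6, 0))   # 6: F(x) = 0, the block acts as an identity
```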

Understanding why it is easier to learn the zero function than the identity, with an example

Imagine we have a conv layer with only one 3*3 filter, and in some situation this conv layer has to act as an identity function so as not to spoil performance, because there is actually nothing to do with the corresponding input: it is already good, there is nothing new to learn, and we just have to pass it to the next layer unchanged. In that case the model has to learn an identity filter during training, which means that except for w5 (which has to be 1), the other 8 weights and the bias b have to be zero, or the model has to find some other combination of the bias and weights that makes the filter act as an identity. Now imagine having to do the same thing in more than one layer and for other filters as well. With large numbers of filters, the model has to find such a combination for every filter in every layer where an identity mapping is needed.

Imagine we have a conv layer like this

What the residual network does instead is, in effect, tell the layer: “if you think your filter has to act as an identity mapping, just push all your weights to zero.” As we know from L1/L2 regularization (weight decay), pushing weights toward zero is easy in practice. In the plain case, by contrast, the model has to find, via backpropagation, the particular combination of weights and bias that finally implements the identity function.
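A small NumPy sketch of that contrast with a single 3*3 filter (purely illustrative): the plain layer needs the very specific “identity filter” to pass its input through unchanged, while the residual layer gets the same behaviour with the all-zeros filter that weight decay already pushes toward:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.randn(5, 5)

# Plain layer: to pass x through unchanged, the 3x3 filter must be exactly this
# "identity filter" (center weight 1, all others 0) -- a very specific point in
# weight space that gradient descent has to find.
identity_filter = np.zeros((3, 3))
identity_filter[1, 1] = 1.0
plain_out = convolve2d(x, identity_filter, mode="same")

# Residual layer: with the skip connection, the all-zeros filter (the easy
# solution) gives the same identity behaviour.
zero_filter = np.zeros((3, 3))
residual_out = convolve2d(x, zero_filter, mode="same") + x  # F(x) + x

print(np.allclose(plain_out, x), np.allclose(residual_out, x))  # True True
```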

The picture below is the “residual block” representation from the original paper.

Source: https://www.researchgate.net/publication/330750910_Recognizing_Pornographic_Images_using_Deep_Convolutional_Neural_Networks/figures?lo=1

And here is an example: the 34-layer ResNet model.

34-layer ResNet model. Source: paper
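If you want to experiment with these architectures rather than build them by hand, torchvision ships ready-made ResNets (shown here only as a usage hint; the exact API may differ between versions):

```python
import torchvision.models as models

resnet34 = models.resnet34()  # 34-layer ResNet, randomly initialized
resnet50 = models.resnet50()  # 50-layer ResNet built from 3-layer bottleneck blocks
print(sum(p.numel() for p in resnet34.parameters()))  # roughly 21 million parameters
```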

In the experiments section, the paper shows that with the ResNet architecture they achieved better performance than plain networks while using more layers. The picture below shows their experimental results.

Source: paper
They did better by using ResNet. Source: paper

One more detail

Since H(x) = F(x) + x, the vectors F(x) and x must have the same dimensions (they are added element-wise). To keep the dimensions equal in the convolutional layers, the authors use “same” padding. But when the dimensions do change (for example after downsampling), they modify the formula slightly to H(x) = F(x) + Ws x. So, if F(x) is 256-dimensional and x is 128-dimensional, Ws is a 256*128 matrix that makes Ws x 256-dimensional so that the addition can be performed. Ws can either be a matrix of parameters to be learned or a fixed mapping that simply takes x and zero-pads it.
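A sketch of that projection shortcut, reusing the 256/128 dimensions from the example above (in the actual networks Ws is typically implemented as a 1*1 convolution; this toy version uses plain vectors):

```python
import torch
import torch.nn as nn

x = torch.randn(128)       # shortcut input, 128-dimensional
f_of_x = torch.randn(256)  # residual branch output, 256-dimensional

# Option 1: learned projection Ws (a 256x128 matrix): H(x) = F(x) + Ws x
ws = nn.Linear(128, 256, bias=False)
h_learned = f_of_x + ws(x)

# Option 2: fixed shortcut: zero-pad x with 128 extra zeros to reach 256 dimensions
h_padded = f_of_x + torch.cat([x, torch.zeros(128)])

print(h_learned.shape, h_padded.shape)  # torch.Size([256]) torch.Size([256])
```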

In the residual network diagram from the paper, we can also see that the style of some “shortcut connection” arrows is changed (dotted lines) to show that the dimensions change at those points, so the formula above is used to compute H(x).

The dotted “shortcut connections” indicate the points where the dimensions change, so the formula above is used there. Source: paper

Conclusion

To conclude, being able to get better performance on the training set is crucial, because doing well on the training set is usually a prerequisite for doing well on the development or test set. The “degradation” problem was one of the barriers to that, and ResNets effectively solve it.

I encourage you to read the paper yourself, since this was only an explanation of its core idea; a lot more is covered in the original paper. Paper link: https://arxiv.org/pdf/1512.03385.pdf

Thanks for reading!

References.

https://arxiv.org/pdf/1512.03385.pdf

https://medium.com/@realmichaelye/intuition-for-resnet-deep-residual-learning-for-image-recognition-39d24d173e78

https://medium.com/swlh/resnet-a-simple-understanding-of-the-residual-networks-bfd8a1b4a447

https://www.coursera.org/learn/convolutional-neural-networks/lecture/HAhz9/resnets

https://www.coursera.org/learn/convolutional-neural-networks/lecture/XAKNO/why-resnets-work

https://www.youtube.com/watch?v=jio04YvgraU&t=1667s

https://www.youtube.com/watch?v=GWt6Fu05voI
