ResNet: A Simple Understanding of the Residual Networks

Taha Shahid · Published in The Startup · Sep 1, 2020

A comprehensive guide to understanding the beginning of ResNets (Residual Networks) and how they have helped solve a major complication in the domain of deep neural networks

Image Credits to Jirsak, Shutterstock

Whilst the term deep learning was introduced to a global audience in 1986 by Rina Dechter, one of the many illustrious pioneers of the deep learning community, the idea itself can be traced back to 1943, when Walter Pitts and Warren McCulloch constructed a computer model based on the neural networks of the human brain, a model that undoubtedly became the starting point for many theoretical investigations on the subject.

Needless to say, the deep learning community has since come a long way in learning from and building upon that idea. The field has grown and evolved remarkably and is now used extensively to automate processes, detect patterns, improve performance and solve complex problems of the human world.

The realm of deep learning is not foreign to challenges, unforeseen complications and repercussions that hinder its growth and limit the true potential it could achieve. Years of dedicated research, study and persistence in eliminating these obstacles have led to the discovery of newer concepts, ideas, architectures and models that outshine and outperform their predecessors by a significant margin.

Significance of depth, and the degradation problem in deep neural networks

Deep neural networks are able to extract a large number of interpretable patterns or features from the training data and learn very complex yet meaningful representations.

Extraction, learning and integration of features, Image credits to Jason Toy (Source)

The extraction and discovery of these features or patterns can be credited to the depth of the neural network, as they are more likely to be found in its later layers. As the problems fed to neural networks became increasingly difficult, researchers started building deeper and deeper models to achieve higher accuracy in results, and it was observed that deep neural networks performed better than shallower ones.

Without significant depth, the model will not be able to integrate the different levels of features in a complex enough manner to learn from the training data. This leads to the conclusion that complex problems can be tackled by introducing really deep models (more than 50 layers), and people began experimenting with models as deep as 100 layers to achieve a higher accuracy score on the training data.

A plain deep learning model with 34 hidden layers, Image Credits to the authors of the original ResNet paper (Source)

However, this conclusion about the importance of depth raised an intriguing question: Is learning better networks as easy as stacking more layers?

In theory, as the number of layers in a plain neural network increases, it should get progressively better at recognizing complex functions and features, resulting in better accuracy and learning.
However, contrary to popular belief, it was observed that such models were inefficient at providing the expected results. Furthermore, the training accuracy began to drop after a certain point.

An obstacle to answering the above question and understanding the discrepancy between theory and reality was the notorious problem of vanishing/exploding gradients, which hampers convergence from the beginning and makes a model unstable in its ability to learn accurately and efficiently. This problem has, however, been largely addressed by normalized initialization and intermediate normalization layers (and, in recurrent neural networks, by LSTMs), which has allowed models with a higher count of layers to converge under stochastic gradient descent with backpropagation.
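
To make this concrete, here is a minimal sketch (assuming PyTorch; my own illustration, not code from the article or the ResNet paper) of what normalized initialization and an intermediate normalization layer typically look like in practice: He initialization for the convolution weights and a batch normalization layer between the convolution and the activation.

```python
import torch.nn as nn

def make_plain_block(in_channels, out_channels):
    """A plain conv -> batch norm -> ReLU block (illustrative only)."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
    # Normalized initialization (He initialization): scales the weights by the
    # layer's fan so activations neither vanish nor explode as depth grows.
    nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")
    return nn.Sequential(
        conv,
        nn.BatchNorm2d(out_channels),  # intermediate normalization layer
        nn.ReLU(inplace=True),
    )
```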

Even after resolving the issue of vanishing/exploding gradients, it was observed that training accuracy dropped when the number of layers was increased. This can be seen in the image below.

The 56-layer network shows higher training error than the much shallower 20-layer network, and consequently higher test error as well. Image Credits to the authors of the original ResNet paper (Source)

One might assume that this is the result of overfitting. However, that is not the case here, since the deeper network shows a higher training error, not just a higher test error; overfitting tends to occur when training errors are significantly lower than test errors.

This is called the degradation problem. As network depth increases, the accuracy saturates (the network learns everything it can before reaching the final layer) and then begins to degrade rapidly if more layers are introduced.

To better explain why this result seems surprising and unexpected, let us consider the following example.

Suppose we have a neural network with “n” layers that gives a training error “x”. Now consider a deeper neural network with “m” layers (m > n). When we train this network, we expect it to perform at least as well as the previous (n-layer) model, because the first “n” of its “m” layers can reproduce the same accuracy; if the model requires a more complex representation, the remaining “m-n” layers will learn it, and if no more learning is required, those remaining “m-n” layers can simply behave as an identity function, carrying the output through to the final layer. Hence we could conclude that the neural network with “m” layers should give a training error “y” (y ≤ x).

But this doesn’t happen in practice, and deeper neural networks do not necessarily give lower training errors.

Theory vs Reality on variation in training error with an increase in layers, Image by article author
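
The argument above can be made concrete with a toy sketch (again assuming PyTorch; a hypothetical illustration, not from the article): appending identity layers to a shallower network yields a deeper network that computes exactly the same function, so in principle its training error should be no worse.

```python
import torch
import torch.nn as nn

# An "n-layer" network and a deeper "m-layer" network built by appending
# (m - n) identity layers to it.
shallow = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
extra_identity_layers = [nn.Identity() for _ in range(4)]
deep = nn.Sequential(*shallow, *extra_identity_layers)

x = torch.randn(8, 16)
# The deeper network gives identical outputs, hence an identical training
# error, even though it has more layers.
assert torch.allclose(shallow(x), deep(x))
```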

What are ResNets (Residual Networks) and how do they help solve the degradation problem?

Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun of the Microsoft Research team presented a residual learning framework (ResNets) to ease the training of networks that are substantially deeper than before by eliminating the degradation problem. They showed with evidence that ResNets are easier to optimize and can achieve high accuracy at considerable depths.

As we have seen previously, the latter layers in deeper plain networks are unable to learn the identity function that is required to carry the result to the output. In residual networks, instead of hoping that the layers fit the desired mapping, we let these layers fit a residual mapping.

Initially, the desired mapping is H(x). We instead let the network fit the residual mapping F(x) = H(x) - x, so that the original mapping becomes H(x) = F(x) + x; the network finds it easier to optimize the residual mapping than the original one.

Networks learn the second mapping more easily than the first, Image credits to ML Explained — A.I. Socratic Circles — AISC

This method of bypassing data from one layer to another is called a shortcut connection or skip connection. This approach allows the data to flow easily between the layers without hampering the learning ability of the deep learning model. The advantage of adding this type of skip connection is that if any layer hurts the performance of the model, it can effectively be skipped.

Skip Connection or Shortcut Connection, Image Credits to the authors of the original ResNet paper (Source)

The intuition behind the skip connection is that it is easier for the network to learn to push F(x) towards zero, so that the block behaves like an identity function, than to learn to behave like an identity function entirely on its own by trying to find the right set of weights.
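
As a rough sketch of what such a block can look like in code (assuming PyTorch; the layer sizes are illustrative, not taken from the paper), the stacked layers compute F(x) and the skip connection adds x back, so the block outputs F(x) + x:

```python
import torch.nn as nn
import torch.nn.functional as F

class IdentityBlock(nn.Module):
    """A basic residual block whose shortcut is the identity mapping."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Main path learns the residual mapping F(x).
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Skip connection: output is H(x) = F(x) + x.
        return F.relu(residual + x)
```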

34-layer ResNet model, Image Credits to the authors of the original ResNet paper (Source)

ResNet uses two major building blocks to construct the entire network.

1. The Identity Block (same as the skip connection shown above)
ID Block, Image credits to X Wei (source)

2. The Conv Block

Conv Block, Image credits to X Wei (source)

The conv block modifies and restructures the incoming data so that the output of the shortcut path matches the dimensions of the main path's output, allowing the two to be added.
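
A similarly hedged sketch of the conv block (assuming PyTorch; a 1x1 projection on the shortcut is the standard way to match dimensions, though the exact configuration here is illustrative): when the main path changes the spatial size or channel count, the shortcut applies a 1x1 convolution so the two paths can still be added.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """A residual block whose shortcut projects x to the main path's shape."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut: a 1x1 conv reshapes x to match the main path.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + self.shortcut(x))
```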

These components help achieve better optimization and higher accuracy for deep learning models. The graph below clearly shows the effect of using ResNets over plain layers.

Plain networks vs ResNet: as seen, ResNet performs better than plain neural network models, Image Credits to the authors of the original ResNet paper (Source)

Hence we can easily conclude that ResNet is undoubtedly a milestone in deep learning. With its shortcut connections/skip connections, it has allowed the deep learning community to venture into deeper neural network models, which in turn has given us significantly better results.
