Review of Deep Residual Learning for Image Recognition

John Olafenwa
Apr 21, 2018

Original Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun

Microsoft Research Asia (MSRA)

Review Authors: John Olafenwa and Moses Olafenwa

AI Commons


Deep learning is based on the idea of stacking many layers of neurons together. Such layers include fully connected layers, where every output of the previous layer is connected to every node in the next layer; locally connected convolution layers, with kernels that act as feature detectors; and recurrent layers such as Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM) cells. Over the past years, deep learning researchers have improved the performance of neural networks by stacking more layers. With the growing availability of high performance GPUs and larger datasets, this technique has proven very effective. However, deeper networks are more difficult to train, not because of their computational cost, but because of the difficulty of propagating gradients through so many layers. Neural networks train by computing the derivatives of the training loss function with respect to the parameters. With many layers, these derivatives start to diminish as they are propagated backwards; this is known as the vanishing gradient problem. Normalization techniques such as the highly effective batch normalization greatly alleviate the problem by normalizing the activations of every layer to have zero mean and unit variance with respect to the statistics of each batch during training. However, the problem still resurfaces as networks go very deep.
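To make the vanishing gradient effect concrete, here is a toy sketch (our own illustration, not taken from the paper). It chains scalar sigmoid layers and multiplies their local derivatives together, which is what the chain rule does during backpropagation. The function name, the weight of 1.0 and the chosen depths are all arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_chain(depth, weight=1.0, x=0.5):
    """Return d(output)/d(input) for `depth` stacked scalar layers
    a_next = sigmoid(weight * a_prev). Hypothetical toy model."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: the local derivative of sigmoid(w * a) with
        # respect to a is w * s * (1 - s), where s is the layer output.
        grad *= weight * activation * (1.0 - activation)
    return grad


for depth in (2, 10, 50):
    print(depth, gradient_through_chain(depth))
```

Since s * (1 - s) is never larger than 0.25, every extra layer multiplies the gradient by at most a quarter in this setting, so the printed values shrink rapidly as the depth grows.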

In this paper, the authors analyse the causes and effects of vanishing gradients and devise an effective solution that enables the training of ultra-deep networks.

The authors also present a new family of image recognition networks that formed the basis of their submissions to the ImageNet 2015 and COCO 2015 challenges. On these benchmarks, their networks outperformed the previous state of the art models.

In this review, we examine the key points of the original paper and prior related work, analyse the structure of the network, and conclude with an open source implementation of the original models.

SECTION 1: Key points

1.1 Shallow Vs Deep Models

Key to the performance of artificial neural networks is the depth of the network. Neural networks act as both feature extractors and classifiers at the same time. Their ability to act as automatic feature extractors greatly increases their ability to generalize to new problems, for example, to classify unseen images. With manual feature engineering, it is extremely hard in domains such as image recognition to hand-craft the features necessary for classifying new unseen images. Shallow networks are fully capable of automatic feature extraction. However, due to their low depth of representation, they cannot extract the fine grained features that would ultimately allow the model to generalize properly. Deeper models, on the other hand, are able to extract low-level, mid-level and high-level features. Hence, they are better feature extractors than shallow models, and this has enabled them to perform much better at classification than shallow models. This is evidenced by the fact that all leading models in various domains of deep learning exploit the concept of depth.

However, a key problem with great depth is the loss of information as networks go very deep. A simple intuition is to consider decision making as a function of the history of events. Events happen in sequence, with past events influencing future events. Consider, in this light, a decision maker that can only see the most recent event. Such a decision maker acts on the grand assumption that the most recent event already encodes all we need to know about every previous event. For a very short history, this assumption can hold because there is often a strong correlation between closely successive events. When the history is long, however, information about how the past affects the future is gradually lost at each time step. Eventually we become very short-sighted, relying only on the consequences of past time steps without taking the actual past events into consideration when making decisions.

The depth problem becomes exposed as layers go very deep. The authors trained a 20-layer network and a 56-layer network on the CIFAR10 dataset. Surprisingly, the 20-layer network outperformed the 56-layer network. Thus, it became clear that simply stacking more layers is not sufficient to improve deep neural networks. The 56-layer network also had higher training error than the 20-layer network. This clearly indicates that the problem is not overfitting; hence, well known regularization techniques like dropout cannot solve it.

1.2 Residual Functions

To solve the gradient vanishing problem associated with ultra-deep networks, the authors introduced residual connections into the network. Residual connections are simply connections between a layer and layers after the next.

This idea is clearly illustrated in the diagram below:

In the diagram above, the plain network simply sends information over from one layer to the next. Information about the past state of the image is highly limited, and all activations must be based on the new features. The residual connections, on the other hand, take the feature map from layer t and add it to the output of layer t + 2.

This is equivalent to learning the residual function y = f(x) + x

In direct feedforward networks without residual connections, a layer T relies only on data from layer T - 1, with layer T - 1 encoding the consequence of all the previous layers. Residual connections, on the other hand, look farther into the past, taking into consideration information from layer T - 2.
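As a minimal sketch of y = f(x) + x, the snippet below compares a plain block with a residual block on a small list of numbers. The helper names and the toy functions standing in for f (tiny_update, zero_update) are hypothetical; in a real network f would be a stack of convolution layers.

```python
def residual_block(x, f):
    """Compute y = f(x) + x element-wise."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f(x) and x must have the same shape to be added")
    return [a + b for a, b in zip(fx, x)]


def plain_block(x, f):
    """Compute y = f(x) with no shortcut from the input."""
    return f(x)


def tiny_update(x):
    # Hypothetical stand-in for the stacked layers: a small correction.
    return [0.1 * v for v in x]


def zero_update(x):
    # Extreme case: the stacked layers contribute nothing.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_update))  # the input passes through unchanged
print(plain_block(x, zero_update))     # the input information is lost
print(residual_block(x, tiny_update))  # the input plus a small correction
```

When f outputs zeros, the residual block reduces to the identity, so the original state of the data still reaches the next layer, while the plain block has nothing left of it.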

This very simple but powerful idea enabled the authors to train networks with more than 100 layers with increasing accuracy.

It is noteworthy that while the authors originally considered residual connections as being important for depth, later work has shown that residual connections can improve the performance of both shallow and deep neural networks. This agrees with our illustration of residual functions as improving accuracy by providing sufficient information about the original state of the data.

SECTION 2: Related Prior Work

Adding features from previous time steps has been used in various tasks involving multi-layer fully connected networks as well as convolutional neural networks. Most notable of these are Highway networks, proposed by Srivastava et al. Highway networks feature residual connections; however, unlike ResNet, their residual connections are gated. Hence, information flow from the past is determined by how much of the data the gating mechanism allows to pass through. This idea was primarily inspired by the gating mechanisms in LSTMs.

While residual networks have the form

y = f(x) + x

Highway networks have the form

y = f(x) . sigmoid(Wx + b) + x . (1 - sigmoid(Wx + b))

Note that in the equation for highway networks, the sigmoid function takes the general form 1 / (1 + e^-x) and always outputs values in the range 0 to 1. The parameters W and b are the learned weights and bias that control the output of the sigmoid function. A non-residual network can be viewed as a special case of a highway network in which the output of the sigmoid gate is 1.

Given sigmoid(Wx + b) = 1

y = f(x) . 1 + x . (1 - 1) = f(x)

When the output of the sigmoid gate is 0, a highway network becomes an identity function

Given sigmoid(Wx + b) = 0

y = f(x) . 0 + x . (1 - 0) = x
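The two special cases above can be checked numerically. This is a small scalar sketch: the function names and the example values for x and f(x) are our own assumptions, and in a real highway network the gate would be sigmoid(Wx + b) computed from learned parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_output(fx, x, gate):
    """y = f(x) . gate + x . (1 - gate), with gate in the range 0 to 1."""
    return fx * gate + x * (1.0 - gate)


def residual_output(fx, x):
    """y = f(x) + x, with no gate on either term."""
    return fx + x


x, fx = 2.0, 0.5  # arbitrary illustrative values

print(highway_output(fx, x, gate=1.0))           # gate = 1: plain layer, y = f(x)
print(highway_output(fx, x, gate=0.0))           # gate = 0: identity, y = x
print(highway_output(fx, x, gate=sigmoid(0.0)))  # gate = 0.5: equal mix of both
print(residual_output(fx, x))                    # residual form: both terms in full
```

Unlike the highway form, the residual form never scales down the x term, so information from the past always flows through in full.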

Highway networks enabled information flow from the past but due to the gating function, the flow of information can still be impeded. Hence, a highway network with 19 layers performed better than a highway network with 32 layers.

SECTION 3: Network Structure

ResNets have a very homogeneous structure and are similar in construction to VGG by Simonyan et al. They are made up of many residual modules, which are in turn grouped into residual blocks.

3.1: Resnet Modules

Resnet modules come in two variants. The first is made up of two layers of 3 x 3 convolutions. The other, which is more popular, is called a bottleneck layer because it is composed of a 1 x 1 convolution that reduces the number of channels by a factor of 4, followed by a 3 x 3 convolution and finally a 1 x 1 convolution that expands the number of channels back to C. The motivation for the bottleneck module is to reduce the computational cost of the network: for the same number of input and output channels, a 1 x 1 convolution is 9 times less expensive than a 3 x 3 convolution, so 1 x 1 convolutions are used to minimize the number of channels that come into the 3 x 3 convolution.
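The cost argument can be made concrete by counting convolution weights. The sketch below is our own back-of-the-envelope calculation, not code from the paper: it assumes both variants take in and produce C channels, ignores biases and batch normalization parameters, and uses hypothetical helper names.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a kernel_size x kernel_size convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def basic_module_weights(c):
    """Two 3 x 3 convolutions, each mapping c channels to c channels."""
    return 2 * conv_weights(3, c, c)


def bottleneck_module_weights(c):
    """1 x 1 reduce to c / 4, 3 x 3 at c / 4, then 1 x 1 expand back to c."""
    reduced = c // 4
    return (conv_weights(1, c, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, c))


c = 256
print(basic_module_weights(c))       # two 3 x 3 convs at 256 channels
print(bottleneck_module_weights(c))  # bottleneck at 256 channels
print(conv_weights(3, c, c) // conv_weights(1, c, c))  # 3 x 3 versus 1 x 1 ratio
```

Under these assumptions, the bottleneck module needs roughly 17 times fewer weights than two full-width 3 x 3 convolutions at C = 256, and the multiply-add cost per output position scales the same way.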

The bottleneck module can be constructed in Keras as below:

A resnet module

To have a clear understanding of the above code, consider this picture we carefully drew.

In the diagram above, an input x comes into the module and C represents the number of output channels. We pass the input into a 1 x 1 conv with C / 4 channels, followed by batch normalization and relu. This setting is repeated for the 3 x 3 conv. Finally, we pass the output of the 3 x 3 conv through a 1 x 1 conv with C channels, followed by batch normalization only.

Also, in the beginning we set the residual equal to the input. If we are to pool, which often involves doubling the number of channels, the residual is instead the result of a strided 1 x 1 conv with C channels. If this is not done, the dimensions of the residual and the output would not match.

Finally, we add the residual to the output of the last 1 x 1 Conv-BN layer. We then apply relu to the result of the addition.
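To see why the strided 1 x 1 conv on the residual is needed, the following sketch traces tensor shapes through one bottleneck module without running any real convolutions. It is a hypothetical helper, not the Keras code of this post, and it assumes 'same' padding and, when pooling, a stride of 2 on both the first 1 x 1 conv and the shortcut conv.

```python
import math


def bottleneck_shapes(height, width, in_channels, channels, pool=False):
    """Return (step name, output shape) pairs for one bottleneck module.

    Assumption: 'same' padding everywhere and, when pool is True,
    stride 2 on the first 1 x 1 conv and on the shortcut 1 x 1 conv.
    """
    stride = 2 if pool else 1
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    reduced = channels // 4
    steps = [
        ("1x1 conv + BN + relu", (out_h, out_w, reduced)),
        ("3x3 conv + BN + relu", (out_h, out_w, reduced)),
        ("1x1 conv + BN", (out_h, out_w, channels)),
    ]
    if pool:
        residual = (out_h, out_w, channels)      # strided 1 x 1 conv shortcut
    else:
        residual = (height, width, in_channels)  # identity shortcut
    if residual != steps[-1][1]:
        raise ValueError("residual and output shapes do not match")
    steps.append(("add + relu", residual))
    return steps


for name, shape in bottleneck_shapes(56, 56, 256, 512, pool=True):
    print(name, shape)
```

Calling bottleneck_shapes(56, 56, 256, 512, pool=False) raises the error, since an identity shortcut with 256 channels cannot be added to an output with 512 channels.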

A resnet block is a stack of bottleneck modules.

A resnet block

Note that in the above, we set pool = True only in the first layer of each block; this is clearly defined in the code above.

Finally we define the full resnet

The above code is highly modular to allow scaling to thousands of layers. Here only 50, 101 and 152 layers are supported; if other values are supplied, a value error is raised. That said, the code can be modified to scale to ultra-deep networks.

Resnet is made of 4 blocks. The number of layers and filters for each block is determined by the block_layers and block_filters dictionaries.

block_layers = {50: [3, 4, 6, 3],
                101: [3, 4, 23, 3],
                152: [3, 8, 36, 3]}

block_layers is a dictionary that maps the total number of layers to the number of modules per block. For the 50-layer network, there are 3 modules in the first block, 4 modules in the second, 6 modules in the third and 3 in the fourth.

Note that since each module consists of 3 convolutional layers, if you compute the total number of layers for each configuration, you will find that it is equal to num_layers - 2. Hence, for 50 layers we would have (3 + 4 + 6 + 3) * 3 = 48.

For the 101-layer network we would have 99 layers, while for the 152-layer network we would have 150 layers.

The reason for this is that there is a convolution layer before the first block, and a fully connected layer at the end that maps the feature maps to the class predictions. These two layers are added to the number of layers in the blocks to make 50, 101 or 152 layers, depending on the desired number of layers.
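The layer arithmetic above can be verified in a few lines. The constants CONVS_PER_MODULE and EXTRA_LAYERS and the function total_layers are names we introduce here for illustration only.

```python
block_layers = {50: [3, 4, 6, 3],
                101: [3, 4, 23, 3],
                152: [3, 8, 36, 3]}

CONVS_PER_MODULE = 3  # 1 x 1, 3 x 3 and 1 x 1 convs in each bottleneck module
EXTRA_LAYERS = 2      # the first convolution layer and the final fully connected layer


def total_layers(modules_per_block):
    """Count the weighted layers implied by a list of modules per block."""
    return sum(modules_per_block) * CONVS_PER_MODULE + EXTRA_LAYERS


for num_layers, modules in block_layers.items():
    print(num_layers, sum(modules) * CONVS_PER_MODULE, total_layers(modules))
```

Each row prints the nominal depth, the number of convolution layers inside the blocks, and the total once the two extra layers are added.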

The block_filters dictionary determines the number of filters for each block:


block_filters = {50: [256, 512, 1024, 2048],
                 101: [256, 512, 1024, 2048],
                 152: [256, 512, 1024, 2048]}

For all the different configurations, every module in the first block has 256 filters, every module in the second block has 512, every module in the third has 1024 and every module in the fourth has 2048.
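Putting the two dictionaries together, the sketch below expands a configuration into a flat list of (filters, pool) pairs, one per bottleneck module. The function module_plan is hypothetical and builds no actual layers; it only mirrors the rules described in this section: pool is True only for the first module of each block, and unsupported depths raise a value error.

```python
block_layers = {50: [3, 4, 6, 3],
                101: [3, 4, 23, 3],
                152: [3, 8, 36, 3]}

block_filters = {50: [256, 512, 1024, 2048],
                 101: [256, 512, 1024, 2048],
                 152: [256, 512, 1024, 2048]}


def module_plan(num_layers):
    """Return one (filters, pool) pair per bottleneck module, in order."""
    if num_layers not in block_layers:
        raise ValueError("num_layers must be one of 50, 101 or 152")
    plan = []
    for modules, filters in zip(block_layers[num_layers], block_filters[num_layers]):
        for index in range(modules):
            # Only the first module of each block changes the dimensions.
            plan.append((filters, index == 0))
    return plan


plan = module_plan(50)
print(len(plan))  # number of bottleneck modules in the 50-layer network
print(plan[:4])   # the three modules of block 1 and the first module of block 2
```

A real implementation would loop over this plan and call the module constructor with each pair of arguments.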

Finally, GlobalAveragePooling2D is applied to the output feature maps. This is simply a standard AveragePooling with pool size equal to the width and height of the feature map. Hence, the result is 1 x 1 x Filters, since each filter becomes a 1 x 1 feature map.
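As an illustration of what global average pooling computes, here is a plain Python sketch on nested lists laid out as height x width x channels. It is not the Keras layer itself, and the function name and sample values are our own.

```python
def global_average_pooling_2d(feature_maps):
    """Average each channel over all spatial positions.

    feature_maps is indexed as feature_maps[row][column][channel].
    Returns one mean value per channel.
    """
    height = len(feature_maps)
    width = len(feature_maps[0])
    channels = len(feature_maps[0][0])
    totals = [0.0] * channels
    for row in feature_maps:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [total / (height * width) for total in totals]


# A 2 x 2 feature map with 2 channels.
maps = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
print(global_average_pooling_2d(maps))  # [2.5, 25.0]
```

Each channel collapses to a single number, which is why the output has one value per filter regardless of the spatial size of the feature maps.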

The output of this is passed into a fully connected layer with softmax.


The depth of neural networks is very important for obtaining better performance. However, when neural networks are very deep, they exhibit higher training and validation loss due to information loss leading to vanishing gradients. To allow deeper networks to yield better accuracy than shallow networks, the authors of ResNet proposed the use of residual blocks, which bring in information from the past to compensate for information loss. This technique enables the training of very deep networks, resulting in state of the art accuracy on standard benchmarks.

To reduce computation cost due to very deep layers, the authors use bottleneck layers with 1 x 1 convolutions that reduce the number of feature maps of the input.


This post is a part of the Deep Review project by AI Commons, by Moses Olafenwa and John Olafenwa. The mission of AI Commons is to advance and democratize Artificial Intelligence.

