Microsoft Presents: Deep Residual Networks

Baki Er
Aug 9, 2016

In this post, I will talk about Microsoft's Residual Networks architecture, which achieved a 3.57% top-5 error on the ImageNet dataset and won 1st place in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) classification competition in 2015.

Let's start with a short introduction. In recent years, the application areas of deep neural networks have expanded very quickly. This can be observed in the papers and studies published at major machine learning and computer vision conferences. In particular, convolutional neural network (CNN) architectures have had quite an impact on image classification, object detection, and other tasks. Microsoft's Residual Networks (ResNet) takes CNNs a step further with "small" but "important" changes to the architecture. The main advantages of ResNet can be itemized as follows:

1) It accelerates the training of deep networks.

2) Increasing the depth of the network, instead of widening it, adds fewer extra parameters.

3) It reduces the effect of the vanishing gradient problem.

4) It achieves higher accuracy, especially in image classification.

Jian Sun from the Microsoft Research team:

“We even didn’t believe this single idea could be so significant.”

Let's go a little bit deeper…

It is clear that deep learning is living its golden era. It is not surprising to see breakthrough improvements in this field every day. In addition, the application areas of deep learning are getting wider, from finance to advertising. As a result, big players like Google, Facebook, and Microsoft have organized teams to study deep learning.

For those who would like to do some research on these teams' work, LeNet (1998), AlexNet (2012), GoogLeNet (2014), VGGNet (2014), and ResNet (2015) are worth a look. Each of these network architectures has a unique approach to different problems. For example, AlexNet has two parallel CNN pipelines trained on two GPUs with cross-connections, GoogLeNet has inception modules, and ResNet has residual connections.

One of the main deductions from these studies is that the depth of the network is a crucial parameter and not an easy one to decide. Theoretically, increasing the number of layers should increase the representational capacity of the network, which in turn should improve its accuracy. In practice, however, this is not the case, for two main reasons:

1- Vanishing Gradients:

As the gradient is propagated back through many layers, it can shrink toward zero, so the earliest layers barely learn. In such cases some neurons can "die" during training and become ineffective or useless. This can cause information loss, sometimes of very important information.

2- Optimization Difficulty:

As the number of parameters (weights and biases) grows with increasing depth, training the network becomes very difficult; this can even lead to higher training errors.

Hence the challenge becomes: how do we increase network depth without suffering from these problems?
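To make the vanishing gradient problem a little more concrete, here is a small toy experiment in Python (PyTorch). This is my own illustration, not something from the paper, and the depth and layer width are arbitrary choices. It pushes an input through a deep stack of small fully connected layers and measures how large the gradient at the input is. In the plain stack the gradient all but disappears; adding the input back in at every step, which is essentially the shortcut idea described below, keeps it alive.

```python
# Toy illustration (not from the paper): gradient magnitude at the input of a
# deep stack of small layers, with and without identity shortcuts.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 50, 64
layers = [nn.Linear(width, width) for _ in range(depth)]

def input_grad_norm(use_shortcut):
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = out + h if use_shortcut else out  # shortcut: add the input back in
    h.sum().backward()                        # backprop a dummy loss to the input
    return x.grad.norm().item()

print("plain stack   :", input_grad_norm(use_shortcut=False))  # vanishingly small
print("with shortcuts:", input_grad_norm(use_shortcut=True))   # stays healthy
```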

One of the biggest advantages of ResNet is that it avoids these negative outcomes while increasing network depth. So we can increase the depth and still get fast training and higher accuracy. Pretty neat, right?

Wait, what is a residual connection anyway?

(Figure: a plain CNN block vs. a CNN block with a residual connection)

In the normal case, a block of layers learns an underlying nonlinear mapping H(x) from input to output. Instead of learning H(x) directly, let the layers learn the nonlinear function F(x), defined as H(x) - x. At the output of the second weight layer (on the right-hand path in the figure), we arithmetically add x to F(x), and then pass F(x) + x through a Rectified Linear Unit (ReLU). This enables us to carry important information from the previous layer on to the following layers, and by doing so we can counteract the vanishing gradient problem. Even though this connection looks like a simple addition to the standard CNN approach, it surprisingly speeds up the training of the network.
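To make this concrete in code, below is a minimal sketch of such a residual block in PyTorch. This is my own illustrative version, not the Microsoft team's original implementation; the 3x3 convolutions, batch normalization, channel count, and class name are assumptions. The structure follows the description above: two weight layers compute F(x), the input x is added back in, and the sum passes through a ReLU.

```python
# Minimal residual block sketch (illustrative, not the authors' original code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions learn the residual mapping F(x) = H(x) - x
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first weight layer + ReLU
        out = self.bn2(self.conv2(out))        # second weight layer -> F(x)
        return F.relu(out + x)                 # shortcut: ReLU applied to F(x) + x

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output shape matches the input shape
```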

For those in the deep learning field, this approach may seem familiar. Yes, you are right: it is actually a similar principle to the one introduced with Long Short-Term Memory (LSTM) cells.

The connection carrying the input to the output is called a shortcut connection. In the main paper published by the Microsoft team, you can see a comparison of two CNN networks, one a plain CNN and the other with residual connections, in terms of training time and accuracy on the ImageNet and CIFAR-10 datasets.

According to the paper published in 2015, the 152-layer ResNet was the deepest network trained on ImageNet at that time. And, as promised, it has fewer parameters than VGGNet, even though VGGNet is 8x shallower. This has quite an impact on training speed.

These improvements resulted in winning 1st place in the ILSVRC classification competition on ImageNet with a 3.57% top-5 error.

I hope this post gives you a brief introduction to ResNet. For more detailed information, you can read the "Deep Residual Learning for Image Recognition" paper, which is available on arxiv.org. Links to this paper and other helpful references are given below.

Thanks for reading.

All comments are welcome :)

For my other posts, you can view my profile or my personal website: nurbakier.com.

Farewell…

Links:

1- Deep Residual Learning for Image Recognition
https://arxiv.org/abs/1512.03385

2- ICML 2016 Tutorial on Deep Residual Networks
http://kaiminghe.com/icml16tutorial/index.html

3- How does deep residual learning work?
https://www.quora.com/How-does-deep-residual-learning-work

4- Microsoft researchers win ImageNet computer vision challenge
http://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/#sm.00014e7obdkjserxvch2g7mjm9pzw

5- Study of Residual Networks for Image Recognition
http://cs231n.stanford.edu/reports2016/264_Report.pdf
