ResNet and ResNeXt

Understanding and Implementing Architectures of ResNet and ResNeXt for state-of-the-art Image Classification: From Microsoft to Facebook [Part 1]

In this two-part blog post we will explore residual networks. More specifically, we will discuss three papers released by Microsoft Research and Facebook AI Research on the state-of-the-art image classification networks ResNet and ResNeXt, and try to implement them in PyTorch.

Prakash Jay
Feb 7, 2018 · 6 min read

About the series:

  • Understanding and implementing ResNet Architecture [Part-1]
  • Understanding and implementing ResNeXt Architecture [Part-2]

We will review the following three papers introducing and improving residual networks:

Was ResNet Successful?

  • Won 1st place in the ILSVRC 2015 and COCO 2015 competitions in ImageNet detection, ImageNet localization, COCO detection and COCO segmentation.
  • Replacing the VGG-16 layers in Faster R-CNN with ResNet-101 gave a relative improvement of 28%.
  • Networks with 100 layers, and even 1000 layers, were trained efficiently.

What problem does ResNet solve?


Seeing Degradation in Action:

Worst-case scenario: the deeper model's early layers can be replaced with the shallow network, and the remaining layers can just act as an identity function (output equal to input).

Shallow network and its deeper variant both giving the same output

Rewarding scenario: in the deeper network, the additional layers approximate the mapping better than its shallower counterpart and reduce the error by a significant margin.

Experiment: in the worst-case scenario, the shallow network and its deeper variant should give the same accuracy. In the rewarding scenario, the deeper model should give better accuracy than its shallower counterpart. But experiments with our present solvers reveal that deeper models do not perform well. So using deeper networks degrades the performance of the model. This paper tries to solve this problem using a deep residual learning framework.

How to solve?

The authors' hypothesis is that it is easier to optimize the residual mapping function F(x) than to optimize the original, unreferenced mapping H(x).
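Written out in the paper's notation, the reformulation looks as follows. H(x) is the mapping a few stacked layers are supposed to fit, and F(x) is the residual those layers learn instead. The last line is added here to connect it to the worst-case scenario: an identity mapping only requires the residual to go to zero.

```latex
\begin{aligned}
F(x) &:= H(x) - x \\
H(x) &= F(x) + x \\
H(x) = x &\iff F(x) = 0
\end{aligned}
```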

Intuition behind Residual blocks:

Identity mapping in Residual blocks
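A minimal sketch of that intuition in PyTorch, assuming a toy block (the class name `TinyResidual` is made up for this example, and batch norm and the final ReLU are left out to keep it short). When the weights of the residual branch are pushed to zero, the block simply passes its input through:

```python
import torch
import torch.nn as nn


class TinyResidual(nn.Module):
    """y = F(x) + x, where F is two 3x3 convolutions (hypothetical toy block)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        residual = self.conv2(torch.relu(self.conv1(x)))  # F(x)
        return residual + x                               # F(x) + x


block = TinyResidual(channels=4)
nn.init.zeros_(block.conv2.weight)  # drive the residual branch F(x) to zero

x = torch.randn(2, 4, 8, 8)
with torch.no_grad():
    y = block(x)

# With F(x) == 0 the block is an identity function.
print(torch.equal(y, x))
```

A plain stack of layers would have to learn the identity through its weights; here it only has to learn zeros.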

The authors ran several tests to check their hypothesis. Let's look at each of them now.

Test cases:

Designing the network:

  1. Use 3*3 filters mostly.
  2. Downsample with convolutional layers of stride 2.
  3. End with a global average pooling layer and a 1000-way fully-connected layer with softmax.
Plain VGG and VGG with Residual Blocks
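The three design rules can be sketched as a very small plain network. This is only an illustration: `TinyPlainNet`, the `conv3x3` helper, the channel widths and the number of layers are all assumptions made for this example, not the configuration from the paper.

```python
import torch
import torch.nn as nn


def conv3x3(in_ch, out_ch, stride=1):
    # Rule 1: 3x3 filters; padding=1 keeps the spatial size when stride is 1.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TinyPlainNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            conv3x3(3, 16),
            conv3x3(16, 32, stride=2),  # Rule 2: downsample with a stride-2 convolution
            conv3x3(32, 32),
            conv3x3(32, 64, stride=2),
            conv3x3(64, 64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # Rule 3: global average pooling
        self.fc = nn.Linear(64, num_classes)  # Rule 3: 1000-way fully-connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(self.pool(x), 1)
        return self.fc(x)  # logits; softmax is applied on top of these


model = TinyPlainNet().eval()
with torch.no_grad():
    feats = model.features(torch.randn(1, 3, 32, 32))
    probs = torch.softmax(model(torch.randn(1, 3, 32, 32)), dim=1)

print(feats.shape)  # spatial size halves at each stride-2 layer
print(probs.shape)
```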

There are two kinds of residual connections:

Residual block
  1. The identity shortcuts (x) can be directly used when the input and output are of the same dimensions.
Residual block function when input and output dimensions are the same

  2. When the dimensions change, there are two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions; (B) a projection shortcut (a 1*1 convolution) is used to match the dimensions, using the following formula:

Residual block function when the input and output dimensions are not the same.

The first option adds no extra parameters; the second adds parameters in the form of W_{s}.
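In the paper's notation, the identity shortcut is y = F(x, {W_i}) + x (Equation 1) and the projection shortcut is y = F(x, {W_i}) + W_s x (Equation 2). Below is a rough PyTorch sketch of both, plus the zero-padding option. The names `ResidualBlock`, `zero_pad_shortcut` and `n_params` are made up for this example, and it is not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 conv layers plus a shortcut (a sketch, not the reference code)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Option B: projection shortcut W_s, implemented as a 1x1 convolution.
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        else:
            # Same dimensions: identity shortcut, no parameters.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))


def zero_pad_shortcut(x, out_ch, stride):
    """Option A: subsample spatially, then pad the new channels with zeros."""
    x = x[:, :, ::stride, ::stride]
    extra = out_ch - x.shape[1]
    return F.pad(x, (0, 0, 0, 0, 0, extra))  # pad order: width, height, channels


def n_params(module):
    return sum(p.numel() for p in module.parameters())


same = ResidualBlock(64, 64)
down = ResidualBlock(64, 128, stride=2)
x = torch.randn(1, 64, 16, 16)

print(same(x).shape, n_params(same.shortcut))  # identity shortcut: no parameters
print(down(x).shape, n_params(down.shortcut))  # projection shortcut: 64 * 128 weights
print(zero_pad_shortcut(x, 128, 2).shape)      # option A: no parameters either
```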


Among the plain networks, the 18-layer model still performs better than the 34-layer one, even though its solution space is a subspace of the 34-layer network's. With residual connections, the deeper network outperforms the shallower one by a significant margin.

ResNet models compared with their plain counterparts

Deeper Studies

ResNet Architectures

Each ResNet block is either 2 layers deep (used in small networks like ResNet-18 and ResNet-34) or 3 layers deep (ResNet-50, ResNet-101 and ResNet-152).
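The network names follow directly from this. Here is a quick arithmetic check, using the per-stage block counts from the paper's architecture table (the `CONFIGS` dictionary and the `depth` helper are only for this illustration):

```python
# (layers per block, residual blocks in each of the four stages)
CONFIGS = {
    "ResNet-18": (2, [2, 2, 2, 2]),
    "ResNet-34": (2, [3, 4, 6, 3]),
    "ResNet-50": (3, [3, 4, 6, 3]),
    "ResNet-101": (3, [3, 4, 23, 3]),
    "ResNet-152": (3, [3, 8, 36, 3]),
}


def depth(layers_per_block, blocks_per_stage):
    # +2 accounts for the first 7x7 convolution and the final fully-connected layer.
    return layers_per_block * sum(blocks_per_stage) + 2


for name, (layers_per_block, blocks) in CONFIGS.items():
    print(name, depth(layers_per_block, blocks))
```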

ResNet 2 layer and 3 layer Block
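For the 3-layer block, the paper uses a bottleneck design: a 1x1 convolution that reduces the channels, a 3x3 convolution, and a 1x1 convolution that restores them. A simplified sketch is below. `BottleneckSketch` and its `reduction` parameter are names invented for this example, and only the identity-shortcut case is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckSketch(nn.Module):
    """1x1 reduce, 3x3, 1x1 expand, plus an identity shortcut."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)


block = BottleneckSketch(256)
x = torch.randn(1, 256, 14, 14)
conv_weights = sum(m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d))

print(block(x).shape)
print(conv_weights)  # 256*64 + 64*64*9 + 64*256
```

The 3x3 convolution, the expensive part, runs on 64 channels instead of 256, which keeps the deeper networks affordable.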

The PyTorch implementation can be seen here:

The Bottleneck class implements a 3-layer block and BasicBlock implements a 2-layer block. It also has implementations of all ResNet architectures with pretrained weights trained on ImageNet.


  1. Identity vs. projection shortcuts: using projection shortcuts (Equation-2) in all the layers gives only very small incremental gains. So all ResNet blocks use identity shortcuts, with projection shortcuts used only when the dimensions change.
  2. ResNet-34 achieved a top-5 validation error of 5.71%, better than BN-Inception and VGG. ResNet-152 achieves a top-5 validation error of 4.49%. An ensemble of 6 models with different depths achieves a top-5 error of 3.57% on the test set, winning 1st place in ILSVRC-2015.
ResNet ImageNet Results-2015

Implementation using PyTorch

I wrote a detailed blog post on transfer learning. Though the code there is implemented in Keras, the ideas are more abstract and might be useful to you in prototyping.


Please share this with all your Medium friends and hit that clap button below to spread it around even more. Also add any other tips or tricks that I might have missed below in the comments!
