Residual Networks

Aditya Mangla · Published in DataX Journal · Jul 15, 2020

Deep neural networks are difficult to train because of the vanishing and exploding gradient problems. To overcome this issue, we use ResNets. ResNets use skip connections, which let you take the activation from one layer and feed it directly to a layer much deeper in the network. With them you can train very deep networks, sometimes over 100 layers. When training a neural network, the goal is to make the model match a target function h(x). If you add the input x to the output of the network, which is exactly what a skip connection does, then the network is forced to model f(x) = h(x) - x rather than h(x). This is called residual learning.

Residual Block

Residual block

ResNets are built out of residual blocks, also referred to as residual units. In a residual block we take an activation A[l] and add it back in further along the network.

Skip Connection Vs Main Path

Rather than needing to follow the main path, the information from A[l] can now take a shortcut to reach much deeper into the neural network. So we have:

> Z[l+2] = W[l+2] A[l+1] + b[l+2]

> A[l+2] = g(Z[l+2] + A[l] )

It is the addition of A[l] here that makes this a residual block, or residual unit.
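To make this concrete, here is a minimal sketch of a residual block in Keras, using fully connected layers for simplicity. The layer sizes and the function name are illustrative assumptions, not taken from any particular ResNet.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, units=64):
    """Takes A[l] and returns A[l+2] = g(Z[l+2] + A[l])."""
    # Main path: two layers with a ReLU in between.
    z1 = layers.Dense(units)(a_l)        # Z[l+1] = W[l+1] A[l] + b[l+1]
    a1 = layers.Activation("relu")(z1)   # A[l+1] = g(Z[l+1])
    z2 = layers.Dense(units)(a1)         # Z[l+2] = W[l+2] A[l+1] + b[l+2]
    # Skip connection: add A[l] before the final non-linearity.
    return layers.Activation("relu")(layers.Add()([z2, a_l]))  # A[l+2] = g(Z[l+2] + A[l])

inputs = tf.keras.Input(shape=(64,))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```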

Advantage of Skip Connection

In theory, a plain network (one without skip connections) should do better and better as you increase the number of layers, but in practice this is not the case, as shown below.

Plain Network Vs ResNet

With a ResNet (on the right), the training error can keep going down as the number of layers grows. By taking these activations, whether the input x or an intermediate activation, and allowing them to reach much deeper into the network, skip connections really help with the vanishing and exploding gradient problems.

What makes ResNets Special?

We are using the ReLU activation function throughout this network.

> A[l+2] = g(Z[l+2] + A[l])

> A[l+2] = g(W[l+2] A[l+1] + b[l+2] + A[l])

If W[l+2] = 0 and b[l+2] = 0, then

A[l+2] = g(A[l]) = A[l], because the ReLU function returns its input unchanged when that input is non-negative, and A[l], being the output of a ReLU itself, is non-negative.

This shows that the identity function is easy for a residual block to learn: thanks to the skip connection, A[l+2] = A[l]. That means adding these two layers to your neural network doesn't hurt its ability to do as well as the simpler network without them. What goes wrong in very deep plain networks without the skip connection is that as the network gets deeper and deeper, it becomes very difficult for it to choose parameters that learn even the identity function, which is why adding more layers often makes the result worse rather than better.
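We can check this argument numerically. In the sketch below (again using Dense layers with illustrative sizes), the second layer's weights and biases are initialized to zero, so the block should simply pass A[l] through unchanged:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Build one residual block whose second layer has W[l+2] = 0 and b[l+2] = 0.
a_l = tf.keras.Input(shape=(8,))
a1 = layers.Dense(8, activation="relu")(a_l)
z2 = layers.Dense(8, kernel_initializer="zeros", bias_initializer="zeros")(a1)
a_l2 = layers.Activation("relu")(layers.Add()([z2, a_l]))
block = tf.keras.Model(a_l, a_l2)

# Feed a non-negative activation, as A[l] would be after a ReLU.
x = np.abs(np.random.randn(1, 8)).astype("float32")
print(np.allclose(block.predict(x), x))  # True: A[l+2] == A[l]
```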

ResNet Architecture

Now, let's look at ResNet's architecture. It is actually surprisingly simple. It is very much like GoogLeNet, except that it has no dropout layer and there is a deep stack of residual units in between. Each residual unit is composed of two convolutional layers (and no pooling layer), each with Batch Normalization and ReLU activation, using 3 × 3 kernels and preserving spatial dimensions.

ResNet Architecture

Here, you might notice that the number of feature maps is doubled every few residual units, while at the same time their height and width are halved. Let's look at ResNet-34, the ResNet with 34 layers (counting only the convolutional layers and the fully connected layer). It contains 3 residual units that output 64 feature maps, 4 residual units with 128 maps, 6 residual units with 256 maps, and 3 residual units with 512 feature maps.
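The Keras sketch below follows these counts. It assumes a 1 × 1 convolution with stride 2 on the skip path whenever the feature maps double and the spatial size halves, which is one common way to make the shapes match; other choices exist.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters, strides=1):
    """Two 3x3 conv layers with BN and ReLU, plus a skip connection.
    When strides=2, a 1x1 convolution projects the skip path so shapes match."""
    shortcut = x
    x = layers.Conv2D(filters, 3, strides=strides, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    if strides > 1:
        shortcut = layers.Conv2D(filters, 1, strides=strides, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.Add()([x, shortcut]))

# ResNet-34 body: 3 units with 64 maps, 4 with 128, 6 with 256, 3 with 512.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(64, 7, strides=2, padding="same", use_bias=False)(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    x = residual_unit(x, filters, strides)
    prev_filters = filters

x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1000, activation="softmax")(x)
resnet34 = tf.keras.Model(inputs, outputs)
```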

ResNets deeper than ResNet-34, such as ResNet-152, use somewhat different residual units. They use three convolutional layers: first a 1 × 1 convolutional layer with just 64 feature maps (four times fewer), which acts as a bottleneck layer, then a 3 × 3 layer with 64 feature maps, and finally another 1 × 1 convolutional layer with 256 feature maps that restores the original depth. ResNet-152 contains 3 such residual units that output 256 maps, then 8 units with 512 maps, a whopping 36 units with 1,024 maps, and finally 3 units with 2,048 maps.
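Here is a sketch of such a bottleneck residual unit, in the same hypothetical Keras style as the previous snippet:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_unit(x, filters, strides=1):
    """Bottleneck residual unit used by deeper ResNets:
    a 1x1 conv (filters), a 3x3 conv (filters), then a 1x1 conv (4 * filters)
    that restores the depth. The skip path is projected when shapes differ."""
    shortcut = x
    x = layers.Conv2D(filters, 1, strides=strides, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(4 * filters, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    if strides > 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=strides, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.Activation("relu")(layers.Add()([x, shortcut]))

# Example: a unit that takes 256 input maps and outputs 256 maps (64 * 4).
inp = tf.keras.Input(shape=(56, 56, 256))
out = bottleneck_unit(inp, 64)
# ResNet-152 would stack 3 such units outputting 256 maps, then 8 with 512,
# 36 with 1,024, and finally 3 with 2,048.
```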
