How to initialize weights in a Neural Network?

An intuitive and straightforward tutorial on the three most popular weight initialization methods that helps you pick the right one for your project.

Maciej Balawejder
Nerd For Tech
5 min read · Apr 6, 2022



Introduction

Weight initialization is a model design choice where the wrong decision can slow down or even stall convergence. You can think of it as the starting point on the loss function landscape.

An intuitive first guess would be to start with all weights at 0, but this leads to zero gradients and hence no learning at all.

The other option is to randomly sample values from a distribution. In this blog we will stick to the Gaussian distribution, which has two parameters: mean and variance. The variance defines the spread of the distribution.
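As a minimal sketch of this idea (using NumPy, with made-up layer sizes and a hand-picked standard deviation), sampling a weight matrix from a zero-mean Gaussian looks like this:

```python
import numpy as np

n_in, n_out = 512, 256   # hypothetical layer sizes
std = 0.01               # standard deviation picked by hand for now

# Draw every weight independently from N(0, std**2)
W = np.random.normal(loc=0.0, scale=std, size=(n_out, n_in))

print(W.mean(), W.var())  # close to 0 and std**2
```

The whole question of weight initialization boils down to choosing that standard deviation wisely.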

It’s all about variance

Let’s say we have a simple perceptron that computes a weighted sum of the input x.

Because the output sums many weighted inputs, the variances of the individual terms add up, so the variance of the output grows with the number of inputs and compounds from layer to layer. The summed output therefore has a much wider spread of values than the input.
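For a single output y = w_1·x_1 + … + w_n·x_n with independent, zero-mean weights and inputs, the variance adds up term by term (a standard derivation, restated here for clarity):

```latex
\mathrm{Var}(y)
  = \mathrm{Var}\!\Big(\sum_{i=1}^{n_{in}} w_i x_i\Big)
  = \sum_{i=1}^{n_{in}} \mathrm{Var}(w_i)\,\mathrm{Var}(x_i)
  = n_{in}\,\mathrm{Var}(w)\,\mathrm{Var}(x)
```

So with n_in unit-variance inputs and unit-variance weights, the output variance is already n_in times larger than the input variance.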

Let’s test this on a conceptual 10-layer neural network. In real life, training such a model takes time and compute, so we simply perform ten matrix multiplications in a row, one per layer of our toy “network”, and measure the mean and variance of the values between layers.

Check out my GitHub if you want to see the complete code.
Average mean and variance are shown in each plot title.
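Here is a minimal sketch of such an experiment. This is my own NumPy reconstruction, not the code from the GitHub repo; the depth, layer width, and batch size are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers, width = 10, 512                 # assumed depth and layer width
x = rng.standard_normal((1000, width))    # a batch of inputs ~ N(0, 1)

for layer in range(n_layers):
    # Naive initialization: every weight drawn from N(0, 1)
    W = rng.standard_normal((width, width))
    x = x @ W                             # plain matrix multiplication, no activation
    print(f"layer {layer + 1}: mean={x.mean():.2e}, var={x.var():.2e}")
```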

As you can see, the variance explodes already in the second layer. Plotting the rest of the values would make the graph unreadable.

The main point is that huge values are undesirable in the network. They make training slower and can cause the exploding gradient problem. Thus we want to keep the same distribution across all layers.

LeCun Initialization

W ~ N(0, 1/n_in), where n_in is the number of inputs to the layer

It was the first attempt to keep the same variance throughout the network, and a variant of this scaling is still the default initialization for linear layers in PyTorch today. We simply scale the initialization down according to the number of inputs coming into the layer.
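A sketch of the same experiment with LeCun scaling, again my own NumPy reconstruction with assumed sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers, width = 10, 512
x = rng.standard_normal((1000, width))

for layer in range(n_layers):
    # LeCun initialization: Var(W) = 1 / n_in
    W = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
    x = x @ W
    print(f"layer {layer + 1}: mean={x.mean():.3f}, var={x.var():.3f}")
```

Scaling each weight by sqrt(1/n_in) cancels the factor of n_in in the variance formula above, so the output variance stays close to the input variance.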

Our network performance:

As you can see, we keep fairly similar distributions throughout the model, but we are missing one important factor here.

How is the variance affected by the activation function and backpropagation?

Xavier Initialization

Xavier Glorot’s 2010 paper [1] discusses the influence of the activation function and of backpropagation on the variance throughout the network. The authors found that the variance decreases after Tanh and that the gradients vanish during backpropagation.


To combat this, they introduced a new initialization method designed for activation functions that are symmetric around 0, like Tanh, and they also normalized the variance of the gradients flowing through backpropagation.

Xavier initialization: Var(W) = 2 / (n_in + n_out) [1]
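As a sketch of how this can be implemented (PyTorch also ships it as torch.nn.init.xavier_normal_; the NumPy helper and the layer sizes below are my own illustration):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=np.random.default_rng()):
    # Var(W) = 2 / (n_in + n_out), as proposed by Glorot & Bengio [1]
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

W = xavier_normal(512, 256)   # hypothetical layer sizes
print(W.var())                # close to 2 / (512 + 256)
```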

If you want to dig more deeply into the statistics and how they derived it, I recommend this blog post.

Our network performance:

Left: first 6 layers, Right: last 4 layers

Alright, so the network keeps a similar distribution only when the number of inputs and outputs is equal.

The explanation lies in the network used in Xavier’s paper: their feedforward model has 1000 hidden units in every layer, so the number of inputs and outputs always matches and the results they achieved look nice and smooth.

What about ReLU?

Since ReLU is not symmetric around zero, it performs poorly with Xavier initialization.

Thus, after AlexNet (2012), most models initialized their weights from a Gaussian with 0 mean and 0.01 standard deviation. The shortcomings started to appear with deeper models, which are more prone to vanishing/exploding gradient problems. The perfect example is the VGG19 model, which had to be initialized with weights from a pre-trained, shallower 11-layer network.

He Initialization

In a 2015 paper, Kaiming He analyzed the influence of the ReLU function on the output variance and came up with a new initialization that sets Var(W) = 2 / n_in, compensating for the half of the activations that ReLU zeroes out.
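A sketch of the ReLU experiment with He initialization (PyTorch exposes it as torch.nn.init.kaiming_normal_; the NumPy version and the sizes below are my own reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

n_layers, width = 10, 512                 # assumed depth and layer width
x = rng.standard_normal((1000, width))

for layer in range(n_layers):
    # He initialization: Var(W) = 2 / n_in
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    x = relu(x @ W)
    print(f"layer {layer + 1}: mean={x.mean():.3f}, var={x.var():.3f}")
```

Changing n_layers to 50 reproduces the deeper test discussed below.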

Our network performance:

Since deep learning models have kept going deeper, to 30 layers and beyond, I also expanded our model to 50 layers and tested it.

50-layer network

As you can see, He initialization keeps a relatively similar mean and variance through all the layers. The mean is shifted because of the non-symmetric nature of ReLU, which zeroes out all negative values.

He initialization was used to train the ResNets in the Deep Residual Learning for Image Recognition paper.

Discussion and Conclusions

Since Batch Normalization was introduced in 2015, the significance of weight initialization has decreased. Batch Normalization re-centers and re-scales the activations in each layer, which speeds up training and reduces the problem of exploding gradients. As a result, some of the issues addressed by careful initialization were solved.
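A minimal PyTorch sketch of that effect (my own illustrative example, not from the original post): a BatchNorm layer placed after a linear layer re-normalizes the activations even when the input is badly scaled.

```python
import torch
import torch.nn as nn

# A hypothetical block: linear layer followed by BatchNorm and ReLU
block = nn.Sequential(
    nn.Linear(512, 512),
    nn.BatchNorm1d(512),  # re-centers and re-scales activations per feature
    nn.ReLU(),
)

x = torch.randn(1000, 512) * 10.0  # deliberately badly scaled input
out = block(x)
print(out.mean().item(), out.var().item())  # activations stay in a small, stable range
```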

Nevertheless, Yang and Schoenholz [2] showed that neither Xavier nor He initialization gives the optimal variance for ResNets and that the initialization should depend on depth. This shows that weight initialization is still an active area of research.

The point of this blog was to introduce different initialization techniques and provide more insight into how neural networks work. Hopefully, my explanations and visualizations will help you understand the topic and inspire you to explore it further.

Check out my Medium and GitHub profiles if you want to see my other projects.

References

[1] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010.

[2] G. Yang and S. Schoenholz, “Mean Field Residual Networks: On the Edge of Chaos”, NeurIPS 2017.
