Why Initializing a Neural Network is Important!

Yuanrui Dong
AI³ | Theory, Practice, Business

--

Generally, neural network models rely on stochastic gradient descent for training and parameter updates. The final performance of the network is directly tied to the solution that training converges to, and that convergence result depends to a large extent on how the network parameters are initialized.

A good initialization of the network parameters makes training noticeably easier. A poor initialization scheme, on the other hand, not only hurts convergence but can also lead to vanishing or exploding gradients. That is why initialization matters so much in a neural network.

Here is an example based on matrix multiplications. Take a vector x and a matrix a, both initialized randomly,

>>> import torch
>>> x = torch.randn(512)
>>> a = torch.randn(512,512)

Then multiply them 100 times,

>>> for i in range(100): x = a @ x
>>> x.mean(),x.std()
(tensor(nan), tensor(nan))

To see what happened inside the loop, let's find out at which iteration the values stop being a number.

>>> x = torch.randn(512)
>>> for i in range(100):
...     x = a @ x
...     if x.std() != x.std(): break  # NaN != NaN, so this detects the first NaN
...
>>> i
28

The loop stopped after only 28 iterations. It shows that when you multiply by a matrix many times in a row, the values explode to the point where the computer can no longer represent them.

>>> x = torch.randn(512)
>>> a = torch.randn(512,512) * 0.01
>>> for i in range(100): x = a @ x
>>> x.mean(),x.std()
(tensor(0.), tensor(0.))

Whether we scale the starting weights by 1 (as in the first example) or by 0.01 (as in the second), the network cannot learn anything: the gradients either explode or vanish entirely. The weights need a reasonable starting scale, and the lack of one is a big part of why, for decades, people were unable to train deep neural networks. This is why initializing the network properly is critical.
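
A minimal sketch of such a reasonable starting point, staying with the toy example above: scaling the same random matrix by 1/sqrt(512), the idea behind Xavier- and Kaiming-style schemes, keeps the activations at a usable scale. The exact numbers vary from run to run, but the mean and standard deviation should stay finite and of order one instead of blowing up to NaN or collapsing to zero.

>>> import math
>>> x = torch.randn(512)
>>> a = torch.randn(512,512) / math.sqrt(512)  # scale weights by 1/sqrt(fan_in)
>>> for i in range(100): x = a @ x
>>> x.mean(),x.std()  # finite values of order one (exact numbers vary per run)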

There are many initialization approaches we can use, proposed in a number of papers. These papers are listed in Fig 1.

Fig 1. Initialization approaches
  1. "Understanding the difficulty of training deep feedforward neural networks". Proposes Xavier initialization, which keeps activation values well distributed across every layer of a deep network.
  2. "Delving Deep into Rectifiers". Proposes Kaiming (He) initialization, which adjusts the scaling to account for ReLU/PReLU activations (see the sketch after this list).
  3. "All You Need is a Good Init". Describes an iterative scheme that goes through the network one layer at a time, running a small optimization to find the weight scale that gives unit variance at every point.
  4. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". Proposes orthogonal initialization.
  5. "Fixup Initialization" and "Self-Normalizing Neural Networks". Describe how to choose combinations of activation functions and initializations that guarantee unit variance at any depth; both papers train networks on the order of a thousand layers deep, and in both cases batch norm can be dropped. In practice this is hard to apply, because the derivations are long and the calculations are a substantial amount of work.
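
To make the Xavier/Kaiming idea from items 1 and 2 concrete, here is a small illustrative sketch (not code from those papers; the helper forward_std is invented for this example). It pushes a random input through 50 ReLU layers, once with unscaled standard-normal weights and once with Kaiming-style scaling of sqrt(2/fan_in), and returns the standard deviation of the final activations.

>>> import torch, math
>>> def forward_std(scale, depth=50, width=512):
...     # push a random input through `depth` ReLU layers whose weights use the given scale
...     x = torch.randn(width)
...     for _ in range(depth):
...         w = torch.randn(width, width) * scale
...         x = torch.relu(w @ x)
...     return x.std()
...
>>> forward_std(1.0)                 # unscaled weights: activations overflow (inf/nan)
>>> forward_std(math.sqrt(2 / 512))  # Kaiming scaling: std stays of order one

The extra factor of 2 in Kaiming's formula, compared with Xavier's 1/sqrt(fan_in), compensates for ReLU zeroing out roughly half of the activations at each layer.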

PyTorch provides a variety of parameter initialization functions (the in-place variants end with a trailing underscore), for example:

  • torch.nn.init.constant_(tensor, val)
  • torch.nn.init.normal_(tensor, mean=0.0, std=1.0)
  • torch.nn.init.xavier_uniform_(tensor, gain=1.0)
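
As a brief usage sketch (the layer shape here is arbitrary), applying these functions to an nn.Linear layer looks like this:

>>> import torch.nn as nn
>>> layer = nn.Linear(512, 512)
>>> nn.init.xavier_uniform_(layer.weight, gain=1.0)  # Xavier/Glorot uniform init for the weight matrix
>>> nn.init.constant_(layer.bias, 0.)                # set all biases to zero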
