Weight Initialization in Neural Networks, inspired by Andrew Ng
Deep neural networks commonly suffer from vanishing and exploding gradients. One way to mitigate this issue is to initialize the parameters carefully. In this article I will talk about weight initialization techniques.
Let’s consider the following dataset from sklearn.datasets, where we would like to separate the blue dots from the red dots.
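A toy dataset of this kind can be generated with something like sklearn.datasets.make_circles (an assumption on my part; the exact loader isn't named here):
import sklearn.datasets
# Assumption: a two-class "circles"-style toy set stands in for the plotted data.
X, y = sklearn.datasets.make_circles(n_samples=300, noise=0.05, random_state=1)
X, Y = X.T, y.reshape(1, -1)  # shapes (n_features, m) and (1, m), matching np.dot(W, X) + b below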
To do so, we will implement a three-layer neural network model and compare the results of the following weight initialization methods:
- Zeros initialization
- Random initialization
- He initialization
Zero initialization:
As its name suggests, this method sets all the weights to zero. It serves almost no purpose, because it makes every neuron perform the same calculation in each iteration and produce the same output. How? If all the weights are initialized to zero, the derivatives remain the same for every w in W[l], so the neurons keep learning the same features at every iteration. This problem is known as the network failing to break symmetry. And not only zero: any constant initialization produces a poor result.
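For a concrete picture, here is a minimal sketch of such an initializer for an L-layer network, assuming a layers_dims list that holds the layer sizes (the function name is my own, not from the original code):
import numpy as np

def initialize_parameters_zeros(layers_dims):
    # layers_dims[l] is the number of units in layer l.
    # Every W[l] and b[l] is filled with zeros, so all units start out identical.
    parameters = {}
    for l in range(1, len(layers_dims)):
        parameters["W" + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters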
Applying a 3-layer model with zero initialization on the above dataset for 15000 iterations produces the following result.
W[l] = np.zeros((layers_dims[l], layers_dims[l-1]))
Here, loss = 0.6931471805599453 and accuracy = 50%. As you can see, the performance is as bad as random guessing: zero initialization leaves the network no more powerful than a linear model or logistic regression. To solve this, we have to find a way to break the symmetry. Perhaps initializing the weights W[l] randomly can do the magic. Let’s see!
Random Initialization:
Random initialization is generally used to break the symmetry, and it gives much better accuracy than zero initialization. It prevents the neurons from all learning the same features of their inputs. Remember, a neural network is very sensitive and prone to overfitting, since it can quickly memorize the training data; our goal is to make each neuron learn a different function of its input. A new problem arises, however, if the randomly initialized weights are very high or very low. Why?
Well, if the weights are initialized with high values, the term np.dot(W, X) + b becomes large. The sigmoid function then maps it to a value very close to 1, where the slope of the curve is nearly flat, so the gradients become tiny and gradient descent takes very small steps. As a result, learning takes a lot of time!
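A quick numeric check makes this concrete (a standalone sketch, not part of the article's model): the sigmoid's derivative collapses once its input is large.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 10.0, 50.0]:
    a = sigmoid(z)
    grad = a * (1 - a)  # derivative of the sigmoid at z
    print(f"z={z:5.1f}  sigmoid={a:.6f}  slope={grad:.2e}")
# As z moves away from 0, the slope shrinks toward zero, so the weight updates become tiny.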
A similar thing happens at the other end: when the weights are initialized with large negative values, the sigmoid maps the result to a value close to zero, where the slope is again nearly flat, and that too slows down the optimization. Now, let’s apply the same model with random initialization on the above dataset.
W[l] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
Here, for 15000 iterations, loss = 0.38278397192120406 and accuracy = 86%. It seems the symmetry is broken, and the accuracy is quite satisfying and better this time. However, with this initialization the cost starts at a very high value; see the graph below. Training the network for longer would probably produce a better result, but it still would not give us the fast learning and well-optimized model we are after.
So, randomly initializing the weights to very high or very low values doesn’t really work well. Let’s see if He initialization can do better for us.
He Initialization:
This method is named after Kaiming He, the first author of the well-known 2015 paper “Delving Deep into Rectifiers”. It is almost identical to Xavier initialization, except that it uses a different scaling factor for the weights: Xavier initialization scales by sqrt(1./layers_dims[l-1]), while He initialization scales by sqrt(2./layers_dims[l-1]). In Xavier initialization, we want the variance of the activations to remain the same from layer to layer; this helps keep the signal from exploding to a high value or vanishing to zero. If you want to know more about Xavier initialization, you can read this. Implementing the same model with He initialization produces the result below:
W[l] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2./layers_dims[l-1])
Here, for 15000 iterations, loss = 0.07357895962677369 and accuracy = 96%. As you can see, the model with He initialization separates the blue and the red dots very well. One thing to remember is that He initialization works best with the ReLU activation function.
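Since the only difference between the two schemes is that scaling factor, they can be written as one hedged sketch (again assuming a layers_dims list; the function name and the method switch are mine):
import numpy as np

def initialize_parameters_scaled(layers_dims, method="he"):
    # Xavier scales by sqrt(1/fan_in); He scales by sqrt(2/fan_in), the factor of 2
    # compensating for ReLU zeroing out roughly half of its inputs.
    parameters = {}
    for l in range(1, len(layers_dims)):
        fan_in = layers_dims[l - 1]
        scale = np.sqrt(2.0 / fan_in) if method == "he" else np.sqrt(1.0 / fan_in)
        parameters["W" + str(l)] = np.random.randn(layers_dims[l], fan_in) * scale
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters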
Conclusion:
- Zero initialization causes the neurons to learn the same features in every iteration, so the network fails to break symmetry.
- To break the symmetry, random initialization is a better choice; however, initializing the weights with very high or very low values can result in slower optimization.
- The extra scaling factor in He initialization solves this issue to a good extent, which is why it is the most recommended of these three weight initialization methods.