Weight Initializer in Neural Networks
Why should we use a Weight Initializer when the weights are going to be updated by the optimizer anyway?
In Neural Networks, it is very important to understand how the weights are updated so that the optimizer can find the parameters best suited to the data and converge to the global minimum.
Many problems remain even after trying different types of Optimizers, which makes choosing good initial weights for our Neural Network essential. In most cases, however, the initial weights are chosen at random and the biases are set to zero.
What if we initialize the weights with zero or a random high value?
In such a condition, while updating the weights, the derivative of the loss with respect to every weight will be the same in all subsequent iterations, so every weight receives an identical update and there is effectively no useful change in the weights after each epoch. This leads to the problem of Vanishing Gradient. On the other hand, if a very high random value is chosen as the initial weight, it leads to the problem of Exploding Gradient: the updates become huge, the new and old weights differ drastically, and gradient descent never converges to the global minimum.
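As a rough illustration (a minimal NumPy sketch of my own, with an arbitrary 100-unit layer width), pushing a signal through a stack of linear layers shows how a weight scale that is too small shrinks the activations towards zero, while one that is too large blows them up:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 100))          # one input vector with 100 features (assumed width)

for scale in [0.01, 1.0]:              # deliberately too-small and too-large weight scales
    h = x
    for _ in range(10):                # pass the signal through 10 linear layers
        W = rng.normal(scale=scale, size=(100, 100))
        h = h @ W                      # no activation, to isolate the effect of the weight scale
    print(f"scale={scale}: mean |activation| after 10 layers = {np.abs(h).mean():.3e}")

The same shrinking or blow-up shows up in the gradients during backpropagation, which is exactly the vanishing/exploding gradient problem described above.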
What can we use to overcome the above problems?
This is where the Weight Initializer (Kernel Initializer) comes in. It helps to overcome the problems caused by inappropriate initial weights when creating a Deep Neural Network. In this blog post, we are going to learn about the various weight initializers that can be used to improve the results of our Neural Network.
Before moving forward with the initializer, it is important to know the concept of Fan-in and Fan-out.
The above diagram depicts a 3-layered neural network with 3 and 2 neurons in the 1st and 2nd hidden layers respectively. So Fan-in = 3 for all the neurons in the 1st hidden layer, as each of them receives three inputs, and Fan-out = 2, as each of their outputs is fed as an input to the two neurons of the 2nd hidden layer.
Similarly, Fan-in=3 for the neurons in the 2nd hidden layer and Fan-out=1. In this way, the Fan-in and Fan-out values can be determined for each neuron in the neural network.
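To make the two terms concrete, here is a small Keras sketch of my own with layer sizes mirroring the example above. Note that Keras reads fan-in and fan-out off each Dense layer's kernel, whose shape is (fan-in, fan-out), so fan-out here is the layer's own number of units, which can differ slightly from the per-neuron view in the diagram; this per-kernel view is the one the initializers below actually use.

import tensorflow as tf

# A tiny network following the example: 3 inputs -> 3 neurons -> 2 neurons -> 1 output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(3, activation="sigmoid"),
    tf.keras.layers.Dense(2, activation="sigmoid"),
    tf.keras.layers.Dense(1),
])

for layer in model.layers:
    if not hasattr(layer, "kernel"):       # skip anything without weights
        continue
    fan_in, fan_out = layer.kernel.shape   # Dense kernels are stored as (fan_in, fan_out)
    print(f"{layer.name}: fan_in={fan_in}, fan_out={fan_out}")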
TYPES OF WEIGHT INITIALIZERS :
- Uniform Distribution Initializer
- Xavier (Glorot) Initializer
- He Initializer
Uniform Distribution Initializer :
This helps to identify good initial values for our Neural Network by selecting the weights from the range of values between -1/sqrt(Fan-in) and 1/sqrt(Fan-in).
This initialization technique is typically proposed for use with the Sigmoid activation function, with which it has consistently shown good results in many test cases.
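One way to sketch this rule in Keras (my own example, not from the original post) is with the built-in RandomUniform initializer, setting the limit to 1/sqrt(Fan-in); the fan-in of 64 below is just an assumed value for illustration:

import numpy as np
import tensorflow as tf

fan_in = 64                             # assumed number of inputs feeding this layer
limit = 1.0 / np.sqrt(fan_in)           # weights will be drawn from U[-limit, limit]

layer = tf.keras.layers.Dense(
    32,
    activation="sigmoid",
    kernel_initializer=tf.keras.initializers.RandomUniform(minval=-limit, maxval=limit),
)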
Xavier (Glorot) Initializer :
Under this Initialization technique, there are two types of weight initializers,
1. Xavier Normal: Here the weights are selected from a normal distribution with mean (μ) = 0 and standard deviation (σ) = √(2/(Fan-in + Fan-out)).
Keras code: model.add(Dense(32, kernel_initializer="glorot_normal"))
2. Xavier Uniform: This initializer selects the initial weights from a uniform distribution, W ∼ U[-√(6/(Fan-in + Fan-out)), √(6/(Fan-in + Fan-out))].
Keras code: model.add(Dense(32, kernel_initializer="glorot_uniform"))
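As a quick sanity check (my own sketch, with assumed layer sizes of 128 and 64), sampling Keras' built-in GlorotUniform initializer shows that no weight falls outside the ±√(6/(Fan-in + Fan-out)) limit given above:

import numpy as np
import tensorflow as tf

fan_in, fan_out = 128, 64               # assumed layer sizes for this check
w = tf.keras.initializers.GlorotUniform(seed=0)(shape=(fan_in, fan_out)).numpy()

limit = np.sqrt(6.0 / (fan_in + fan_out))
print(f"theoretical limit sqrt(6/(fan_in+fan_out)) = {limit:.4f}")
print(f"largest |weight| actually sampled          = {np.abs(w).max():.4f}")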
He Initializer :
Glorot and Bengio worked with the Sigmoid activation function, as that was the usual choice when they proposed their weight initialization scheme. However, the ReLU activation function later surpassed the results of Sigmoid, and a technique that balances the variance of the activations for ReLU networks was proposed: the He Initializer.
There are two types of weight initializers under this technique.
1. He Uniform: This initializer selects the initial weights from a uniform distribution, W ∼ U[-√(6/Fan-in), √(6/Fan-in)].
Keras code: model.add(Dense(16, input_dim=self.state_size, activation="relu", kernel_initializer="he_uniform"))
2. He Normal: This initializer selects the weights from a normal distribution with mean (μ) = 0 and standard deviation (σ) = √(2/Fan-in), i.e. W ∼ N(0, σ).
Keras code: initializer = tf.keras.initializers.HeNormal()
model.add(Dense(16, input_dim=self.state_size, activation="relu", kernel_initializer=initializer))
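For completeness, here is a self-contained sketch of my own that wires the He initializers into a small ReLU network end to end; state_size = 8 is just an assumed stand-in for the self.state_size used in the snippets above:

import tensorflow as tf

state_size = 8                          # assumed input dimension, standing in for self.state_size

model = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_initializer=tf.keras.initializers.HeUniform()),
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_initializer=tf.keras.initializers.HeNormal()),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()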
I have published this blog post in order to articulate my learnings and make it useful for the people who have started their journey in the field of Data Science. Please read and give your responses through claps if you like it.
Follow me for more such articles on Data Science. Thanks for reading.