This paper introduced both the PReLU activation and Kaiming initialization. We will discuss Kaiming initialization in this post.
Deep neural networks have difficulty converging when their weights are drawn from a normal distribution with a fixed standard deviation. A fixed value ignores how the variance of the signal changes from layer to layer, so activations become very large or very small, which in turn causes exploding or vanishing gradients during backpropagation. The problem gets worse as the network gets deeper.
To overcome this problem, Xavier initialization was introduced. It tries to keep the variance of the activations (and of the gradients) the same across all layers.
W is the weight matrix between layer j and j+1. U is uniform distribution. nⱼ is the number of inputs in layer j.
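Written out with these symbols, the normalized initialization from the Glorot and Bengio paper is:

```latex
W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]
```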
But this assumes the activation function is linear, which is clearly not the case. Enter Kaiming He initialization, which takes activation function into account. For ReLU activation:
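In symbols, assuming nₖ denotes the number of inputs to each unit of layer k (the fan-in), the rule from the He et al. paper is a zero-mean Gaussian with variance 2/nₖ:

```latex
W^k \sim \mathcal{N}\!\left(0,\; \frac{2}{n_k}\right),
\qquad \text{i.e. } \operatorname{std}(W^k) = \sqrt{\frac{2}{n_k}}
```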
Derivation of Kaiming He initialization
This section is math-heavy, so feel free to skip it.
First, we need to know-
where X and Y are independent random variables.
Derivation of these is given at the end. You can also try to derive them on your own.
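The variance identities that the derivation below uses are, for independent X and Y:

```latex
\begin{aligned}
\operatorname{Var}(X) &= E[X^2] - E[X]^2 \\
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) \\
\operatorname{Var}(XY) &= E[X^2]\,E[Y^2] - E[X]^2\,E[Y]^2
\end{aligned}
```

The last one follows from the first, since independence gives E[X²Y²] = E[X²]E[Y²] and E[XY] = E[X]E[Y].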
Assume, yᵏ=Wᵏxᵏ+bᵏ and xᵏ⁺¹=f(yᵏ); k is layer number and f is activation function. Here y, x and b are column vectors and W is a matrix. This is valid for Feedforward NNs as well as CNNs (as convolution can be represented as matrix multiplication).
Assumptions (valid for each layer k)-
- All elements in Wᵏ share the same distribution and are independent of each other. Similarly for xᵏ and yᵏ.
- Each element of Wᵏ is independent of each element of xᵏ.
- Wᵏ and yᵏ have zero mean and are symmetrical around zero.
- bᵏ is initialized to zero vector as we don’t require any bias initially.
I will drop the layer number for now (Wᵏ -> W), with the understanding that the argument applies to any layer, and reintroduce it when several layers are involved. Now, for each element yᵢ of y-
Remember that E[xⱼ²] ≠ Var(xⱼ) unless E[xⱼ]=0. Here x is the output of a ReLU, which is never negative, so its mean is not zero.
We now want to simplify further. We haven’t yet used the relation between x and y, x=f(y), so we can use it to simplify the E[xⱼ²] term.
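Written out under the assumptions above, with n denoting the number of inputs to the layer (a notation assumed here) and b = 0, this step is:

```latex
\begin{aligned}
\operatorname{Var}(y_i)
  &= \operatorname{Var}\Big(\sum_{j=1}^{n} w_{ij}\, x_j\Big)
   = \sum_{j=1}^{n} \operatorname{Var}(w_{ij}\, x_j) \\
  &= \sum_{j=1}^{n} \big( E[w_{ij}^2]\,E[x_j^2] - E[w_{ij}]^2\,E[x_j]^2 \big) \\
  &= \sum_{j=1}^{n} \operatorname{Var}(w_{ij})\, E[x_j^2]
\end{aligned}
```

The last line uses E[wᵢⱼ] = 0, so that E[wᵢⱼ²] = Var(wᵢⱼ).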
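For ReLU, f(y) = max(0, y). Assuming p is the density of the previous layer's yⱼ (zero mean and symmetric around zero, per the assumptions above), the term works out as:

```latex
\begin{aligned}
E[x_j^2]
  &= \int_{-\infty}^{\infty} \max(0, y_j)^2\, p(y_j)\, dy_j
   = \int_{0}^{\infty} y_j^2\, p(y_j)\, dy_j \\
  &= \frac{1}{2} \int_{-\infty}^{\infty} y_j^2\, p(y_j)\, dy_j
   = \frac{1}{2}\, E[y_j^2]
   = \frac{1}{2} \operatorname{Var}(y_j)
\end{aligned}
```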
Combining it with the previously derived expression of Var(yᵢ), we get-
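With the layer number written back in (the x feeding layer k comes from y of layer k-1) and nₖ assumed to denote the number of inputs to layer k:

```latex
\operatorname{Var}(y_i^{k}) = \frac{1}{2} \sum_{j=1}^{n_k} \operatorname{Var}(w_{ij}^{k})\, \operatorname{Var}(y_j^{k-1})
```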
Since the elements of each vector or matrix are independent and identically distributed, we can now drop the element index and keep only the layer number. So, the above equation becomes-
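In layer-level form, with nₖ assumed to be the number of inputs to layer k, the sum over identical terms collapses to a product:

```latex
\operatorname{Var}(y^{k}) = \frac{1}{2}\, n_k \operatorname{Var}(W^{k})\, \operatorname{Var}(y^{k-1})
```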
Combining layer 1 to L-
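Applying the per-layer relation repeatedly from layer 2 up to layer L (nₖ again assumed to be the number of inputs to layer k) gives:

```latex
\operatorname{Var}(y^{L}) = \operatorname{Var}(y^{1}) \prod_{k=2}^{L} \frac{1}{2}\, n_k \operatorname{Var}(W^{k})
```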
To prevent exploding or vanishing gradients, we want the variance at the input to equal the variance at the output. A sufficient condition is that each term inside the product equals 1, i.e.-
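In symbols, writing nₖ for the number of inputs to layer k (an assumed notation), the condition reads:

```latex
\frac{1}{2}\, n_k \operatorname{Var}(W^{k}) = 1 \quad \text{for every layer } k
```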
Hence, we reach the previously mentioned formula-
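Solving the condition for the weight variance, with nₖ assumed to be the number of inputs to layer k, gives the zero-mean Gaussian stated earlier for ReLU:

```latex
\operatorname{Var}(W^{k}) = \frac{2}{n_k}
\quad\Longrightarrow\quad
W^k \sim \mathcal{N}\!\left(0,\; \frac{2}{n_k}\right)
```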
So far we have derived the initialization from the forward pass only. Running the same argument on the backward pass gives a similar condition, with the number of inputs of a layer replaced by its number of outputs.
This formula is valid only when we use ReLU in each layer. For a different activation function, the initialization can be derived by plugging that function into the integrand of the E[xⱼ²] term. For the PReLU case we obtain-
a is the initial value of the PReLU slope for negative inputs.
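For PReLU, f(y) = y for y ≥ 0 and f(y) = a·y otherwise, so the same integral gives E[xⱼ²] = ½(1 + a²)Var(yⱼ). With nₖ assumed to be the number of inputs to layer k, the condition and the resulting initialization are:

```latex
\frac{1}{2}\,(1 + a^2)\, n_k \operatorname{Var}(W^{k}) = 1
\quad\Longrightarrow\quad
W^k \sim \mathcal{N}\!\left(0,\; \frac{2}{(1 + a^2)\, n_k}\right)
```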
- if a=0, we get ReLU case
- if a=1, we get linear case
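The factor ½(1 + a²) behind these two special cases can be checked numerically. The sketch below is only an illustration in plain Python: the helper names `prelu` and `second_moment_ratio`, the sample count, and the seed are all made up for this example. It estimates E[f(y)²] for y ~ N(0, 1), which should come out near 0.5 for a=0 (ReLU) and near 1 for a=1 (linear).

```python
import random


def prelu(y, a):
    """PReLU with slope a on the negative side (a=0 is ReLU, a=1 is linear)."""
    return y if y >= 0 else a * y


def second_moment_ratio(a, samples=200_000, seed=0):
    """Monte Carlo estimate of E[f(y)^2] / Var(y) for y ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += prelu(rng.gauss(0.0, 1.0), a) ** 2
    # Var(y) is 1 here, so the mean of f(y)^2 is already the ratio.
    return total / samples


if __name__ == "__main__":
    for a in (0.0, 0.25, 1.0):
        predicted = (1 + a * a) / 2
        print(f"a={a}: estimated {second_moment_ratio(a):.3f}, predicted {predicted:.3f}")
```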
If you run the code below, you will see that var(y) is close to 1, which is var(x).
import torch.nn.functional as F
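A minimal sketch of such an experiment, not the original listing: the width `n`, the depth, the batch size, and the helper name `layer_variances` are all choices made for illustration. It feeds unit-variance pre-activations through a stack of ReLU layers whose weights are drawn from N(0, 2/n) and records var(y) after each layer.

```python
import math

import torch
import torch.nn.functional as F


def layer_variances(n=512, depth=5, batch=1024, seed=0):
    """Return var(y) after each of `depth` ReLU layers with N(0, 2/n) weights."""
    torch.manual_seed(seed)
    y = torch.randn(batch, n)  # pre-activations entering the stack, variance ~1
    variances = []
    for _ in range(depth):
        x = F.relu(y)  # x = f(y) with f = ReLU
        w = torch.randn(n, n) * math.sqrt(2.0 / n)  # W ~ N(0, 2/n), bias left at zero
        y = x @ w.t()  # y = W x for every row of the batch
        variances.append(y.var().item())
    return variances


if __name__ == "__main__":
    for k, v in enumerate(layer_variances(), start=1):
        print(f"layer {k}: var(y) = {v:.3f}")
```

Each printed variance should stay near 1 rather than shrinking or growing with depth. Replacing `2.0 / n` with `1.0 / n` should roughly halve the variance at every layer, which is the vanishing behaviour the derivation predicts.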
- Kaiming He initialization- https://arxiv.org/abs/1502.01852
- Xavier Glorot initialization- http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
- Convolution as matrix multiplication- https://stackoverflow.com/a/44039201