Weight initialization for CNNs: A Deep Dive into He Initialization

6 min read · Nov 6, 2018

Today is the first day of my sabbatical (thanks Asana!), so I tried to learn something useful!

I decided to investigate how state-of-the-art Convolutional Neural Networks are initializing their weights.

Why is weight initialization important?

As mentioned in Andy Jones’ post on Xavier Initialization:

If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.

If the signal becomes too small, our gradient updates will be too tiny for the network to learn anything useful. If it becomes too large, we’ll make very big updates, which can lead to unstable training.
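To make the shrinking and growing concrete, here is a minimal sketch that pushes one random vector through a stack of layers and reports the spread of the final activations. It assumes purely linear layers with no activation function, and the names `forward_std`, `depth`, and `width` are made up for this illustration. The "balanced" setting uses a weight variance of 1/width, which is what keeps the variance steady for a linear stack; it is not the He value.

```python
import math
import random


def forward_std(weight_std, depth=10, width=64, seed=0):
    """Push one random vector through `depth` linear layers and return
    the standard deviation of the final activations."""
    rng = random.Random(seed)
    # Input signal: unit-variance Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum over all `width` inputs,
        # with weights drawn i.i.d. from N(0, weight_std^2).
        x = [
            sum(rng.gauss(0.0, weight_std) * xi for xi in x)
            for _ in range(width)
        ]
    mean = sum(x) / width
    return math.sqrt(sum((v - mean) ** 2 for v in x) / width)


too_small = forward_std(weight_std=0.01)                # signal collapses toward 0
too_large = forward_std(weight_std=1.0)                 # signal blows up
balanced = forward_std(weight_std=1.0 / math.sqrt(64))  # weight variance = 1/width

print(f"too small: {too_small:.3e}")
print(f"too large: {too_large:.3e}")
print(f"balanced:  {balanced:.3e}")
```

With only ten layers, the two extreme settings already differ by many orders of magnitude, while the balanced one stays on the same scale as the input.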

What does it mean for the signal to be “just right”?

This is a good question, and there are probably many reasonable answers. In the literature I’ve read, the goal seems to be to set our weights such that the variance of the final output is equal to 1. This seems intuitively reasonable to me. For example, in classification, we’re generally outputting a vector of probabilities that sums to 1, so starting with a variance of 1 on any given output seems like it’s in a reasonable ballpark. If you have a better explanation, let me know!

What are people using today?

When I started this investigation, I expected the answer to be Xavier Initialization, as that’s what I recalled being used in the old fast.ai library about a year ago. Andy’s post linked above is a wonderful explanation of how it works.

However, when I went to see how fast.ai is initializing weights today, I saw it is using nn.init.kaiming_normal_. This left me with two questions: What is kaiming_normal (aka He Initialization) and why would I use it over Xavier Initialization?

Assumptions of Xavier Initialization

In the He paper (which derives He Initialization), they state that the derivation of Xavier initialization “is based on the assumption that the activations are linear”. You may be saying “that seems like a crazy assumption, activation functions are always non-linear!”

As you’ll see when we go through the math, things work out quite similarly if and only if you can assume that the output from your activation function has a mean of 0. This is the case for a few activation functions, one being tanh, which is symmetric about the origin and so has a zero-mean output whenever its inputs are symmetric around 0.

Today, however, ReLU has become the activation function of choice in many architectures, and its output, which is never negative, certainly does not have a mean of 0.
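A quick way to see the difference is to sample some pre-activations and compare the mean after tanh with the mean after ReLU. This is only a sketch: it assumes the pre-activations are standard Gaussian samples, which is an illustrative choice and not something the papers require.

```python
import math
import random

rng = random.Random(0)
# Hypothetical pre-activations: zero-mean, unit-variance Gaussian samples.
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

tanh_out = [math.tanh(y) for y in pre_activations]
relu_out = [max(0.0, y) for y in pre_activations]

tanh_mean = sum(tanh_out) / len(tanh_out)
relu_mean = sum(relu_out) / len(relu_out)

print(f"mean after tanh: {tanh_mean:+.3f}")  # close to 0
print(f"mean after ReLU: {relu_mean:+.3f}")  # close to 1/sqrt(2*pi), about 0.399
```

For a standard Gaussian input, the mean of the ReLU output is 1/sqrt(2π), so it sits well away from 0 no matter how many samples we draw.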

Show me the math! What should we initialize our weights to?

The derivation I will go through is from the He paper titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.

The math they walked through was quite non-obvious to me, so I hope my explanation makes it clearer to the reader!

While trying to come up with the best results on ImageNet (as one studying CNNs does), He and team realized that Xavier initialization wasn’t going to cut it. Their architecture frequently used ReLU-like activation functions (they even invented their own activation function in the ReLU family, called the Parametric Rectified Linear Unit (PReLU)). In order to avoid exploding gradients, they needed a weight initialization scheme that was better suited to their activation functions of choice.

For a convolutional layer, we can write the response as:

$$y_l = W_l x_l + b_l$$

As explained in the paper:

Here, x is a k²c-by-1 vector that represents co-located k×k pixels in c input channels. k is the spatial filter size of the layer. With n = k²c denoting the number of connections of a response, W is a d-by-n matrix, where d is the number of filters and each row of W represents the weights of a filter. b is a vector of biases, and y is the response at a pixel of the output map.
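The shapes in that description are easier to follow with a tiny example. The sketch below computes the response y = W x + b at a single output pixel. The sizes `k`, `c`, and `d` and the random values are made up for illustration and are not taken from the paper.

```python
import random

k, c, d = 3, 4, 2   # filter size, input channels, number of filters (illustrative)
n = k * k * c       # connections per response: n = k^2 * c

rng = random.Random(0)
# Hypothetical input patch: c channels, each holding k x k co-located pixels.
patch = [[[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
         for _ in range(c)]
# Flatten the patch into the n-by-1 vector x.
x = [value for channel in patch for row in channel for value in row]

# W is d-by-n: one row of weights per filter. b holds one bias per filter.
W = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(d)]
b = [0.0] * d

# y = W x + b: one response per filter at this output pixel.
y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
     for row, b_i in zip(W, b)]

print(n, len(x), len(y))
```

Each entry of y is a sum of n products, which is why n shows up as a multiplier once we start taking variances.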

Assuming our network has L layers, we’re interested in how to initialize W such that the variance of the final layer’s response is 1:

$$\mathrm{Var}(y_L) = 1$$

So let’s get started.

Our individual weights in W will all be mutually independent and drawn from the same distribution. This isn’t an assumption: we’re choosing how to set these.

However, we are going to assume that the elements in x are likewise mutually independent and drawn from the same distribution, and that W and x are independent of each other. We’ll also set all biases to 0.

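Let y_l, w_l, and x_l represent the random variables of each element of our previous y_l, W_l, and x_l (yes, it is confusing to overload these, but that’s what they do in the paper). Each response is a sum of n_l independent, identically distributed products, which gives us:

$$\mathrm{Var}(y_l) = n_l\, \mathrm{Var}(w_l x_l)$$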

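Recall that if X and Y are independent:

$$\mathrm{Var}(XY) = E(X^2)\,E(Y^2) - [E(X)]^2\,[E(Y)]^2$$

This gives us:

$$\mathrm{Var}(y_l) = n_l \left( E(w_l^2)\,E(x_l^2) - [E(w_l)]^2\,[E(x_l)]^2 \right)$$

Since we’re choosing how to initialize W, let’s give the weights a mean of 0. This simplifies to:

$$\mathrm{Var}(y_l) = n_l\, E(w_l^2)\, E(x_l^2)$$

Again, taking advantage of the fact that our weights have mean 0:

$$\mathrm{Var}(w_l) = E(w_l^2) - [E(w_l)]^2 = E(w_l^2)$$

Giving us:

$$\mathrm{Var}(y_l) = n_l\, \mathrm{Var}(w_l)\, E(x_l^2)$$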
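The relation Var(y_l) = n_l Var(w_l) E(x_l²) can be sanity-checked with a small Monte Carlo experiment. This is a sketch under stated assumptions: the inputs are ReLU outputs of standard Gaussian samples, the weights are zero-mean Gaussians, and `n`, `w_std`, and `trials` are arbitrary illustrative values. It also computes what we would predict if we wrongly used Var(x_l) in place of E(x_l²).

```python
import random

rng = random.Random(0)
n = 50           # number of connections per response (illustrative)
w_std = 0.2      # weights ~ N(0, w_std^2), so Var(w) = 0.04
trials = 10_000

responses = []
x_sum = 0.0
x_squared_sum = 0.0
for _ in range(trials):
    # Non-negative inputs with non-zero mean, as after a ReLU.
    xs = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(n)]
    ws = [rng.gauss(0.0, w_std) for _ in range(n)]
    responses.append(sum(w * x for w, x in zip(ws, xs)))
    x_sum += sum(xs)
    x_squared_sum += sum(x * x for x in xs)

mean_y = sum(responses) / trials
var_y = sum((y - mean_y) ** 2 for y in responses) / trials  # empirical Var(y)

e_x = x_sum / (trials * n)            # empirical E(x)
e_x2 = x_squared_sum / (trials * n)   # empirical E(x^2)
var_x = e_x2 - e_x ** 2               # empirical Var(x)

predicted = n * w_std ** 2 * e_x2     # n * Var(w) * E(x^2)
naive = n * w_std ** 2 * var_x        # n * Var(w) * Var(x), valid only if E(x) = 0

print(f"empirical Var(y):     {var_y:.3f}")
print(f"n * Var(w) * E(x^2):  {predicted:.3f}")
print(f"n * Var(w) * Var(x):  {naive:.3f}")
```

The first two numbers should agree closely, while the third comes out noticeably smaller because the inputs do not have zero mean.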

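Up until here, things look pretty similar to the Xavier derivation. However, the Xavier paper assumes linear activations, which allows the mean of the inputs to each layer to be 0. Here, x_l is the ReLU of the previous layer’s response:

$$x_l = \max(0, y_{l-1})$$

So x_l is never negative and its mean is not 0. Because Var(x_l) = E(x_l²) − [E(x_l)]², we can’t assume that E(x_l²) = Var(x_l).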

So what is E(x_l²)?

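We can calculate the expectation of a function of a random variable by integrating that function against the variable’s probability density function. Since x_l = max(0, y_{l−1}), the integrand is 0 wherever y_{l−1} is negative:

$$E(x_l^2) = \int_{-\infty}^{\infty} \max(0, y_{l-1})^2\, p(y_{l-1})\, dy_{l-1} = \int_{0}^{\infty} y_{l-1}^2\, p(y_{l-1})\, dy_{l-1}$$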

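If we let w_{l−1} have a symmetric distribution around zero, then y_{l−1} has zero mean and has a symmetric distribution around zero, as our bias is set to 0. The integrand y_{l−1}² p(y_{l−1}) is then an even function, which means that its integral from 0 to infinity is half of the same integral evaluated from negative infinity to positive infinity! Thus we can rewrite it like so:

$$E(x_l^2) = \frac{1}{2} \int_{-\infty}^{\infty} y_{l-1}^2\, p(y_{l-1})\, dy_{l-1}$$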
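The "half of the integral" claim is easy to check numerically: for a zero-mean, symmetric y, the average of max(0, y)² should be about half the average of y². The sketch below assumes Gaussian samples with an arbitrary standard deviation of 1.5; any distribution that is symmetric around zero should behave the same way.

```python
import random

rng = random.Random(0)
# Hypothetical previous-layer responses: zero-mean and symmetric (Gaussian, std 1.5).
ys = [rng.gauss(0.0, 1.5) for _ in range(200_000)]

e_x2 = sum(max(0.0, y) ** 2 for y in ys) / len(ys)  # E(x^2) with x = max(0, y)
e_y2 = sum(y * y for y in ys) / len(ys)             # E(y^2), equal to Var(y) for zero mean

ratio = e_x2 / e_y2
print(f"E(x^2) / E(y^2) = {ratio:.3f}")  # close to 0.5
```

Because y has zero mean here, E(y²) is the same as Var(y), so the ratio also says that E(x_l²) is about half of Var(y_{l−1}).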