Weight initialization for CNNs: A Deep Dive into He Initialization

Why is weight initialization important?

As mentioned in Andy Jones' post on Xavier Initialization, if the weights in a network start too small, the signal shrinks as it passes through each layer until it is too tiny to be useful, and if they start too large, the signal grows until it is too massive to be useful. Initialization aims to make the signal "just right."

What does it mean for the signal to be “just right”?

This is a good question, and there are probably many reasonable answers. In the literature I've read, the goal seems to be to set our weights such that the variance of the final output is equal to 1. That seems intuitively reasonable to me: in classification, for example, we generally output a vector of probabilities that sums to 1, so starting with a variance of roughly 1 on any given output is in a sensible ballpark. If you have a better explanation, let me know!
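
As a concrete illustration (a sketch of my own, not something from the post), the tiny NumPy experiment below pushes unit-variance inputs through a stack of ReLU layers and compares the output variance under a naive unit-variance initialization versus a fan-in-scaled one. The width, depth, and batch size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_variance(std, n=512, depth=10, batch=1024):
    """Push unit-variance inputs through `depth` ReLU layers whose weights are
    drawn from N(0, std^2) and return the variance of the final activations."""
    x = rng.standard_normal((batch, n))
    for _ in range(depth):
        w = rng.normal(0.0, std, size=(n, n))
        x = np.maximum(0.0, x @ w)   # ReLU
    return x.var()

print("naive std = 1        :", forward_variance(std=1.0))             # variance explodes
print("He   std = sqrt(2/n) :", forward_variance(std=np.sqrt(2/512)))  # stays on the order of 1
```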

What are people using today?

When I started this investigation, I expected the answer to be Xavier Initialization, as that's what I recall being used in the old fast.ai library about a year ago. Andy's post, mentioned above, is a wonderful explanation of how it works.
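
For reference, and assuming PyTorch as the framework (the post itself doesn't show code), both schemes ship in torch.nn.init, so applying either one to a layer is a single call:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

# Xavier/Glorot: variance scaled by fan-in and fan-out, derived for roughly linear activations
nn.init.xavier_normal_(conv.weight)

# He/Kaiming: variance scaled by fan-in only, with the extra factor of 2 that accounts for ReLU
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(conv.bias)
```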

Assumptions of Xavier Initialization

In the He paper (which derives He Initialization), the authors state that the derivation of Xavier initialization "is based on the assumption that the activations are linear". You may be saying, "that seems like a crazy assumption, activation functions are always non-linear!" But look at tanh near zero: it is approximately linear (tanh(x) ≈ x for small x) and symmetric about zero, so the assumption is reasonable for the tanh-style activations Xavier was derived for. ReLU, on the other hand, zeroes out the entire negative half of its input, so the linearity assumption breaks down, and that is the gap He initialization addresses.

[Figure: tanh, graphed on Wolfram Alpha]
[Figure: ReLU]
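
A quick numerical check (my own sketch, not from the post) makes this concrete: near zero, tanh is nearly the identity, while ReLU discards the entire negative half of its input.

```python
import numpy as np

x = np.linspace(-0.1, 0.1, 5)
print("x      :", np.round(x, 4))
print("tanh(x):", np.round(np.tanh(x), 4))        # almost identical to x near 0
print("relu(x):", np.round(np.maximum(0, x), 4))  # negative inputs are zeroed out
```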

Show me the math! What should we initialize our weights to?

The derivation I will go through is from the He paper titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.

Equation (6) in the paper: Var[y_l] = n_l · Var[w_l · x_l]

Here y_l, w_l, and x_l stand for a single response, a single weight, and a single input activation of layer l, and n_l is the number of input connections (the fan-in) of that layer. Because the weights are initialized with zero mean and independently of the inputs, this becomes Equation (7) in the He paper: Var[y_l] = n_l · Var[w_l] · E[x_l²]. Note that E[x_l²] is not the same as Var[x_l], because the ReLU activations x_l do not have zero mean; this is exactly where the derivation departs from Xavier's.
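
To make Equation (7) concrete, here is a small Monte Carlo check (my own sketch, not from the paper or the post). It draws ReLU activations from a zero-mean, symmetric pre-activation, draws He-initialized weights with Var[w] = 2/n, and confirms that the measured Var[y] matches n · Var[w] · E[x²], which comes out to 1; the relation E[x²] = ½ · Var[y_prev] used here is the step the He paper takes next. The fan-in and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, samples = 256, 20_000                       # fan-in and number of Monte Carlo draws

# ReLU activations x_l fed by a zero-mean, symmetric pre-activation y_{l-1}
y_prev = rng.standard_normal((samples, n))
x = np.maximum(0.0, y_prev)
ex2 = (x ** 2).mean()
print("E[x^2]          :", ex2)                # ~0.5 = 0.5 * Var[y_prev]

# He-initialized weights: zero mean, Var[w] = 2/n, a fresh draw for every sample
w_std = np.sqrt(2.0 / n)
w = rng.normal(0.0, w_std, size=(samples, n))
y = (x * w).sum(axis=1)                        # one response y_l per sample
print("Var[y]          :", y.var())            # ~1, as Equation (7) predicts
print("n*Var[w]*E[x^2] :", n * w_std**2 * ex2) # right-hand side of Equation (7)
```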
