# Why is weight initialization important?

As mentioned in Andy Jones’ post on Xavier Initialization:

# What does it mean for the signal to be “just right”?

This is a good question, and there are probably many reasonable answers. In the literature I’ve read, the goal seems to be to set our weights such that the variance of the final output is equal to 1. This seems intuitively reasonable to me. For example, in classification, we’re generally outputting a vector of probabilities that sums to 1, so a starting variance of 1 on any given output seems like it’s in a reasonable ballpark. If you have a better explanation, let me know!
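To see why the scale of the weights matters at all, here is a minimal NumPy sketch (the depth, width, and distributions are my own illustrative choices, not from any paper). It pushes a unit-variance signal through a stack of linear layers: with weights drawn from N(0, 1) the variance explodes, while scaling the weights by 1/√fan_in keeps it near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256
x = rng.standard_normal(fan_in)  # unit-variance input signal

# Naive init: weights ~ N(0, 1). Each layer multiplies the signal's
# variance by roughly fan_in, so it blows up exponentially with depth.
h = x
for _ in range(10):
    h = rng.standard_normal((fan_in, fan_in)) @ h
naive_var = h.var()

# Scaled init: weights ~ N(0, 1/fan_in). The per-layer variance
# multiplier is roughly 1, so the signal stays "just right".
h = x
for _ in range(10):
    h = (rng.standard_normal((fan_in, fan_in)) / np.sqrt(fan_in)) @ h
scaled_var = h.var()

print(f"naive variance: {naive_var:.3e}, scaled variance: {scaled_var:.3f}")
```

The same reasoning run backwards applies to gradients, which is why an unscaled init can make a deep network untrainable before the first step.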

# What are people using today?

When I started this investigation, I expected the answer to be Xavier Initialization, as that’s what I recalled being used in the old fast.ai library about a year ago. Andy’s post linked above is a wonderful explanation of how it works.
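For concreteness, here is a sketch of the uniform variant of Xavier (Glorot) Initialization as I understand it from the Glorot & Bengio paper; the function name and shapes are my own. The target is Var(W) = 2 / (fan_in + fan_out), which for a uniform distribution U(−a, a) (whose variance is a²/3) gives a = √(6 / (fan_in + fan_out)).

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Draw a (fan_out, fan_in) weight matrix with Var(W) = 2 / (fan_in + fan_out)."""
    rng = rng or np.random.default_rng(0)
    # Var(U(-a, a)) = a^2 / 3, so a = sqrt(6 / (fan_in + fan_out))
    # yields the target variance 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_uniform(512, 256)
print(W.var())  # ≈ 2 / (512 + 256) ≈ 0.0026
```

The averaging over fan_in and fan_out is a compromise between keeping the forward signal's variance stable (which alone would want 1/fan_in) and keeping the backward gradient's variance stable (1/fan_out).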

# Assumptions of Xavier Initialization

In the He paper (which derives He Initialization), the authors state that the derivation of Xavier Initialization “is based on the assumption that the activations are linear.” You may be saying, “That seems like a crazy assumption; activation functions are always non-linear!”
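You can check numerically how badly ReLU breaks the linearity assumption. In this sketch (sample size is my own choice), a ReLU zeros out the negative half of a zero-mean signal, so the second moment of its output is half that of its input. This halving is exactly what He Initialization's extra factor of 2 compensates for.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # zero-mean, unit-variance pre-activations

# ReLU kills the negative half of a symmetric distribution, so
# E[relu(x)^2] = 0.5 * E[x^2] -- the signal's energy is halved per layer.
second_moment = np.mean(np.maximum(x, 0.0) ** 2)
print(second_moment)  # ≈ 0.5
```

Stack many ReLU layers with plain Xavier scaling and this repeated halving shrinks the signal toward zero, which is the gap the He derivation closes.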

# Show me the math! What should we initialize our weights to?

The derivation I will go through is from the He paper titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.
