Deep Learning Course — Lesson 10.4: Weight Initialization Techniques

Machine Learning in Plain English
2 min read · Jun 10, 2023

--

Proper weight initialization can significantly improve how quickly and how well a neural network trains. If the starting weights are too small, the signal shrinks as it passes through each layer until it is too tiny to be useful. If the weights are too large, the signal grows exponentially from layer to layer until it explodes and training becomes unstable. The choice of weight initialization therefore plays a significant role in how quickly a network learns and whether it converges to a good solution.
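
A quick way to see this numerically is to push a random batch through a deep stack of layers and watch the scale of the activations. The sketch below is my own toy example, not part of the original lesson: 50 ReLU layers of width 256, with the two weight scales chosen as arbitrary stand-ins for "too small" and "too large".

    import numpy as np

    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((512, 256))   # a batch of 512 random inputs, 256 features each

    for scale in (0.01, 1.0):              # "too small" vs "too large" weight standard deviation
        x = x0
        for _ in range(50):                # 50 fully connected ReLU layers of width 256
            W = rng.standard_normal((256, 256)) * scale
            x = np.maximum(x @ W, 0.0)     # ReLU
        print(f"weight std {scale}: activation std after 50 layers = {x.std():.3e}")

With the small scale the printed standard deviation collapses toward zero; with the large one it blows up by dozens of orders of magnitude.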

There are several weight initialization techniques, but I will discuss two of the most commonly used methods: Xavier and He initialization.

  1. Xavier Initialization (also known as Glorot Initialization): This technique is named after Xavier Glorot, first author of the 2010 paper (with Yoshua Bengio) that introduced it. The idea behind Xavier initialization is to keep the variance of a neuron’s outputs equal to the variance of its inputs. The initial weights are drawn from a distribution with zero mean and a specific variance: 1/n in the simplified form, where n is the number of input units (the original paper averages the fan-in and fan-out, giving 2/(n_in + n_out)). The distribution can be either uniform or normal. (Both this rule and the next one are sketched in code after this list.)
  2. He Initialization: This technique was proposed in a 2015 paper by Kaiming He and colleagues, hence the name. It is a variant of Xavier initialization designed specifically for networks with ReLU activation functions. In He initialization, the variance of the distribution is 2/n, where n is the number of input units. The extra factor of 2 compensates for the fact that ReLU zeroes out its negative inputs, which roughly halves the variance of the activations. Like Xavier initialization, the distribution can be either uniform or normal.
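
Here is a minimal NumPy sketch of both rules for a fully connected layer with fan_in inputs and fan_out outputs. The function names and layer sizes are mine, chosen for illustration; they are not part of any library API.

    import numpy as np

    rng = np.random.default_rng(0)

    def xavier_init(fan_in, fan_out, distribution="normal"):
        # Simplified Xavier/Glorot rule from the text: Var(W) = 1 / fan_in.
        # (The original paper averages fan-in and fan-out: Var(W) = 2 / (fan_in + fan_out).)
        var = 1.0 / fan_in
        if distribution == "normal":
            return rng.normal(0.0, np.sqrt(var), size=(fan_in, fan_out))
        # A uniform distribution on [-limit, limit] has variance limit**2 / 3.
        limit = np.sqrt(3.0 * var)
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    def he_init(fan_in, fan_out, distribution="normal"):
        # He/Kaiming rule: Var(W) = 2 / fan_in, compensating for ReLU zeroing half its inputs.
        var = 2.0 / fan_in
        if distribution == "normal":
            return rng.normal(0.0, np.sqrt(var), size=(fan_in, fan_out))
        limit = np.sqrt(3.0 * var)
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    W_tanh = xavier_init(256, 128)     # e.g. a layer followed by tanh
    W_relu = he_init(128, 64)          # e.g. a layer followed by ReLU
    print(W_tanh.std(), W_relu.std())  # roughly sqrt(1/256) ≈ 0.063 and sqrt(2/128) ≈ 0.125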

Both Xavier and He initialization are popular and well-proven methods for initializing the weights of neural networks and often work better in practice than small random numbers or other more naive methods. The choice between Xavier and He usually depends on the activation function used in the neural network. For networks with ReLU (and variants) activation functions, He initialization is preferred, while Xavier initialization is usually used with sigmoid and tanh activation functions.
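
In practice you rarely implement these rules by hand, because most frameworks ship them as built-in initializers. As one example (assuming PyTorch, which the lesson does not specify, and with arbitrary layer sizes), each linear layer can be initialized to match the activation that follows it:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(256, 128), nn.Tanh(),
        nn.Linear(128, 64),  nn.ReLU(),
        nn.Linear(64, 10),
    )

    nn.init.xavier_uniform_(model[0].weight)                       # tanh layer -> Xavier/Glorot
    nn.init.kaiming_normal_(model[2].weight, nonlinearity="relu")  # ReLU layer -> He/Kaiming
    nn.init.xavier_uniform_(model[4].weight)                       # output layer: a common default choice
    for layer in (model[0], model[2], model[4]):
        nn.init.zeros_(layer.bias)                                 # biases are typically zero-initialized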
