Part 2: Selecting the right weight initialization for your deep neural network.

Gideon Mendels
4 min read · Aug 27, 2019


Read part 1 here.

Testing different weight initialization techniques

Modern deep learning libraries like Keras, PyTorch, etc. offer a variety of network initialization methods, all of which essentially initialize the weights with small, random numbers. We’ll review the different methods and show how they affect model performance.

Reminder #1: do not initialize your network with all zeros.

Reminder #2: fan-in refers to the number of units in the previous layer, while fan-out refers to the number of units in the subsequent layer.

  • Standard Normal initialization — this approach samples each weight from a standard normal distribution (mean 0, standard deviation 1)
  • Lecun initialization — this approach samples weights from a distribution whose variance is 1/fan-in
  • Xavier initialization (also called Glorot initialization) — in this approach, weights are sampled from a distribution whose variance is 2/(fan-in + fan-out). For a theoretical justification of the Xavier initialization, you can refer to the deeplearning.ai post on Initialization.
  • He initialization — this approach samples weights from a distribution whose variance is 2/fan-in and is recommended for ReLU activations. See the He et al. 2015 paper here. (The scaling factors for all four schemes are sketched in code just below.)
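
To make these scaling factors concrete, here is a minimal sketch of the standard deviation each scheme implies for a layer (the helper function and the example layer sizes are our own, not from any library):

import math

def init_std(fan_in, fan_out, method):
    # Standard deviation of the normal-distribution variant of each scheme
    if method == "normal":    # standard normal: variance 1
        return 1.0
    if method == "lecun":     # variance = 1 / fan-in
        return math.sqrt(1.0 / fan_in)
    if method == "xavier":    # variance = 2 / (fan-in + fan-out)
        return math.sqrt(2.0 / (fan_in + fan_out))
    if method == "he":        # variance = 2 / fan-in
        return math.sqrt(2.0 / fan_in)
    raise ValueError(f"unknown method: {method}")

# e.g. for a 784 -> 100 layer: init_std(784, 100, "xavier") is about 0.0476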

Different frameworks have different weight initialization methods set as their default. For Keras, the Xavier initialization is the default, but in PyTorch, the Lecun initialization is the default. In the example below, we’ll show you how to implement different initialization methods in PyTorch (beyond the default Lecun method) and compare the differences in performance.
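
For reference, here is a minimal sketch of the torch.nn.init calls that correspond to the methods above, applied to a single linear layer (the layer dimensions are placeholders of our own choosing):

import torch.nn as nn

layer = nn.Linear(784, 100)  # placeholder dimensions

# Pick exactly one of these; each call overwrites the layer's weights in place.

# Standard normal: mean 0, standard deviation 1
nn.init.normal_(layer.weight, mean=0, std=1)

# Xavier/Glorot initialization (normal variant)
nn.init.xavier_normal_(layer.weight)

# He/Kaiming initialization (normal variant), recommended for ReLU
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# Per the post, PyTorch's default nn.Linear initialization is already
# Lecun-style, so no explicit call is needed for that case.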

Let’s run some examples!

We’ll be adapting this tutorial from Deep Learning Wizard to build a fairly straightforward feedforward neural network and swap out the initialization method we’re using.

The most important part of the code will be the ‘Create the Model Class’ section since this is where we define our activation function and weight initialization method:

import torch.nn as nn

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        ### LECUN INITIALIZATION IS DEFAULT

        ### FOR NORMAL INITIALIZATION - uncomment to use
        # nn.init.normal_(self.fc1.weight, mean=0, std=1)
        ### FOR XAVIER INITIALIZATION
        nn.init.xavier_normal_(self.fc1.weight)
        # Non-linearity
        self.tanh = nn.Tanh()
        # Linear function
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        # nn.init.normal_(self.fc2.weight, mean=0, std=1)
        nn.init.xavier_normal_(self.fc2.weight)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.tanh(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out

As a specific example, if we use the Xavier initialization method with tanh activation function, the model class definition looks like this:

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        ### FOR XAVIER INITIALIZATION
        # Linear weight, W, Y = WX + B
        nn.init.xavier_normal_(self.fc1.weight)
        # Non-linearity
        self.tanh = nn.Tanh()
        # Linear function
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        nn.init.xavier_normal_(self.fc2.weight)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.tanh(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out

And here’s the normal initialization method with tanh activation functions:

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        ### FOR NORMAL INITIALIZATION
        # Linear weight, W, Y = WX + B
        nn.init.normal_(self.fc1.weight, mean=0, std=1)
        # Non-linearity
        self.tanh = nn.Tanh()
        # Linear function
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        nn.init.normal_(self.fc2.weight, mean=0, std=1)

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.tanh(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out
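
The code above only shows the tanh-based variants. As a rough sketch of our own (not shown in the original tutorial), the He initialization with ReLU activations would use PyTorch’s kaiming_normal_, which implements He initialization:

import torch.nn as nn

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        ### FOR HE INITIALIZATION (recommended for ReLU)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        # Non-linearity
        self.relu = nn.ReLU()
        # Linear function
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity='relu')

    def forward(self, x):
        # Linear function
        out = self.fc1(x)
        # Non-linearity
        out = self.relu(out)
        # Linear function (readout)
        out = self.fc2(out)
        return out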

Version note: if you are using PyTorch 1.1, the learning rate scheduler (i.e. the scheduler.step() call) needs to be called at the end of an epoch, after the optimizer has stepped! We are using PyTorch 1.0.0 for this example. See more in the PyTorch documentation here.
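
For illustration, here is a minimal, self-contained sketch (the dummy model and data are our own placeholders) of where that scheduler.step() call sits relative to the training loop under the PyTorch 1.1+ convention:

import torch
import torch.nn as nn

# Dummy model and data, purely placeholders to show where scheduler.step() goes.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
loss_fn = nn.CrossEntropyLoss()
train_loader = [(torch.randn(32, 10), torch.randint(0, 2, (32,)))]

for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    # PyTorch 1.1+ convention: step the scheduler after the optimizer,
    # here once per epoch, after the epoch's batches are done.
    scheduler.step()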

We are using Comet.ml to track the experiment details and results for these different initialization methods. Having an automated way of logging this information allows us to iterate quickly with different parameters and visually compare these different experiments.
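
As a rough sketch of what that logging looks like with the comet_ml Python SDK (the project name and parameter names are our own placeholders, and you would substitute your own API key; the accuracy value is the Xavier/tanh result reported below):

from comet_ml import Experiment

# Placeholder API key and project name - substitute your own.
experiment = Experiment(api_key="YOUR_API_KEY", project_name="weight-initialization")

# Log the hyperparameters that distinguish this run
experiment.log_parameters({
    "initialization": "xavier",
    "activation": "tanh",
})

# ... train and evaluate the model ...

# Log the result so runs can be compared in the Comet UI
experiment.log_metric("accuracy", 97.36)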

See the full, interactive Comet.ml project here.

After running our different experiments, we can see that the Xavier initialization method gives us the highest accuracy (97.36%) for tanh activations. Not surprising!

However, the Xavier initialization method also outperforms the He initialization method for ReLU activations in terms of accuracy (97.56% for He vs. 97.68% for Xavier). Interesting…

Project Visualizations and the Experiment Table allow you to build a customized view of your machine learning experiments. Test the interactive project with our different weight initialization methods and activation functions here.

We hope this post helped you gain a deeper understanding of why weight initialization matters, the background and research behind different weight initialization methods, and how to implement your choice of initialization!

Get started with Comet today.


