Generalised Method For Initializing Weights in CNN

Harshit_Babbar · Published in Analytics Vidhya · Sep 10, 2020 · 3 min read


Initialising the parameters with the right values is one of the most important conditions for getting accurate results from a neural network.

Initialisation of weights with all values zero

If all the weights are initialised with zero, the derivative of the loss function with respect to every weight w in W[l], where W[l] are the weights of layer l of the neural net, is the same, so all the weights get the same value in subsequent iterations. This keeps the hidden units symmetric for all n iterations, i.e. setting the weights to zero makes the network no better than a linear model, hence we should not initialise it with zeros.
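As a quick illustration (a minimal sketch, not part of the original article; the network size and data are made up), the snippet below zero-initialises a small network, trains it briefly, and shows that the hidden units stay identical to each other, so the network can never use more than one effective hidden unit:

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.Sigmoid(), nn.Linear(3, 1))

# initialise every weight and bias with zero
for p in net.parameters():
    nn.init.zeros_(p)

x, y = torch.randn(64, 4), torch.randn(64, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

# all three hidden units end up with identical incoming weights:
# the rows of this matrix are the same, so the units stay symmetric
print(net[0].weight)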

Initialisation of weights with too large or too small values (random weights)

If the weights are too large, we run into the exploding gradient problem, where the gradients grow from layer to layer and the loss diverges away from the minimum.

If the weights are too small, we run into the vanishing gradient problem, where the gradients shrink towards zero and training stalls before reaching the minimum.
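To see this effect (a minimal sketch, not from the original article; the layer width and scales are made up), we can push a random batch through a stack of linear maps whose weights are drawn at different scales and watch the standard deviation of the activations vanish or explode:

import torch

torch.manual_seed(0)
x = torch.randn(512, 256)

# for fan_in = 256 the "right" scale is roughly 1/sqrt(256) ≈ 0.0625;
# 0.03 is too small and 0.1 is too large
for scale in (0.03, 0.1):
    a = x
    for _ in range(50):
        a = a @ (torch.randn(256, 256) * scale)
    print(f"scale={scale}: activation std after 50 layers = {a.std().item():.3e}")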


How to initialise the weights

Weights must be initialised such that:

  • The Mean of activations at each layer is close to zero.
  • The Variance of activations at each layer is close to one.

One way to do this is to attach a PyTorch hook at each layer of the neural net. A PyTorch hook is basically a function with a very specific signature; when we say a hook is executed, we really mean that this function is executed. The hook receives the layer, its input and its output as arguments, so we can grab the mean and standard deviation of the activations to debug the model and see why it is not working correctly.

from functools import partial

class Hook():
    def __init__(self, m, f):
        # register a forward hook on module m; partial binds this Hook
        # instance as the first argument of f
        self.hook = m.register_forward_hook(partial(f, self))
    def remove(self):
        self.hook.remove()
    def __del__(self):
        self.remove()

def append_stat(hook, mod, inp, outp):
    # store the mean and std of the layer's output activations on the hook
    d = outp.data
    hook.mean, hook.std = d.mean().item(), d.std().item()

The code snippet above creates a Hook, which attaches the function f to the layer m passed as an initialisation argument. The function append_stat calculates the mean and standard deviation of the activations at that layer.
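For instance (a small usage sketch with a made-up layer, reusing the Hook class and append_stat defined above), we can attach a hook to a single linear layer and read its statistics after one forward pass:

import torch
import torch.nn as nn

layer = nn.Linear(256, 128)        # hypothetical example layer
h = Hook(layer, append_stat)       # attach the hook defined above

layer(torch.randn(64, 256))        # one forward pass fires the hook
print(h.mean, h.std)               # mean and std of this layer's output
h.remove()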


def children(m):
    # return the immediate child modules (layers) of the model m
    return list(m.children())

hooks = [Hook(l, append_stat) for l in children(learn.model)]

The code snippet above attaches a hook to each layer of the neural net (learn.model here is the model we want to initialise).

Hence, when we do a forward pass on the neural network, the mean and standard deviation of the activations at each layer are calculated.
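For example (a small sketch reusing the hooks list from above; xb stands for a batch of input data, as in the article's later snippet), a single forward pass fills in the statistics of every hooked layer:

learn.model(xb)                    # one forward pass on a batch
for i, h in enumerate(hooks):
    print(f"layer {i}: mean = {h.mean:.3f}, std = {h.std:.3f}")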

def lsuv_module(m, xb):
    h = Hook(m, append_stat)

    # shift the bias until the mean of the activations is close to zero
    while learn.model(xb) is not None and abs(h.mean) > 1e-3:
        m.bias.data -= h.mean
    # scale the weights until the std of the activations is close to one
    while learn.model(xb) is not None and abs(h.std - 1) > 1e-3:
        m.weight.data /= h.std

    h.remove()
    return "mean :" + str(h.mean), "std :" + str(h.std)

for m in models:              # models is the list of layers to initialise
    print(lsuv_module(m, xb)) # xb is a batch of input data

The code snippet above traverses the layers of the model one by one.

For each layer m in models:

  • Calculate mean and standard deviation of all the activations.
  • If the absolute value of the mean of the activations at that layer is greater than 1e-3, we subtract that mean from the biases of all the neurons in the layer, so that the new mean comes down close to zero after this step.
  • Similarly, while the standard deviation of the activations differs from one by more than 1e-3, we divide all the weights of the layer by that standard deviation, so that the new standard deviation comes out close to one (see the self-contained sketch after this list).
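Putting the pieces together without the fastai Learner (a minimal self-contained sketch that reuses the Hook class and append_stat from above; the model, layer sizes and data are made up for illustration), the same loop can be run on a plain nn.Sequential model:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                      nn.Linear(64, 32), nn.ReLU(),
                      nn.Linear(32, 10))
xb = torch.randn(256, 128)                       # a batch of input data
linears = [m for m in model if isinstance(m, nn.Linear)]

def lsuv_init(m, xb):
    h = Hook(m, append_stat)
    with torch.no_grad():
        # the forward pass inside the condition triggers the hook,
        # refreshing h.mean and h.std on every check
        while model(xb) is not None and abs(h.mean) > 1e-3:
            m.bias.data -= h.mean                # centre the activations
        while model(xb) is not None and abs(h.std - 1) > 1e-3:
            m.weight.data /= h.std               # normalise their spread
    h.remove()
    return h.mean, h.std

for m in linears:
    print(lsuv_init(m, xb))                      # means ≈ 0, stds ≈ 1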

Results

[Figure: Mean and std at each layer of the 4-layer network using Kaiming uniform initialisation]

These are the results obtained using Kaiming uniform initialisation, which tends to perform much worse as the network gets deeper. Here the mean and standard deviation are far away from zero and one respectively, and hence the results are less accurate.

The results obtained after using the generalised method are as follows:

[Figure: Mean and std at each layer of the 4-layer network using the generalised initialisation]

As we can see, the mean and std at each layer are close to zero and one respectively, so we can prevent the vanishing and exploding gradient problems. Hence this initialisation works better for deep neural nets.
