Day 10: All You Need Is A Good Init

Francisco Ingham
A paper a day avoids neuron decay
5 min read · Apr 22, 2019

[Nov 19, 2015] How to initialize your deep networks for any activation function

If your init you trust, to anything you adjust

TLDR

Universal initialization for deep nets.

This approach achieves SOTA or near-SOTA results, in the same training time as standard methods, across many different activation functions, architectures and datasets.

The Algorithm

The authors propose a data-driven weight initialization. Their algorithm builds on Saxe et al. (2014); the pre-initialization is done in two steps:

  1. Fill the weights with Gaussian noise with unit variance
  2. Decompose them to an orthonormal basis with a QR or SVD decomposition and replace the weights with one of the orthonormal components (sketched below)
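
A minimal NumPy sketch of these two pre-initialization steps (the function name and layer shapes are my own choices, not the paper’s code):

```python
import numpy as np

# A sketch of the two pre-initialization steps (function name and shapes are my own choices).
def orthonormal_init(rows, cols, rng=np.random.default_rng(0)):
    # Step 1: fill with Gaussian noise of unit variance.
    noise = rng.standard_normal((max(rows, cols), min(rows, cols)))
    # Step 2: decompose to an orthonormal basis with QR and keep the Q component.
    q, _ = np.linalg.qr(noise)
    return q[:rows, :cols] if rows >= cols else q.T[:rows, :cols]

# An inner-product layer and a convolution layer (kernels flattened to 2-D for the QR step).
W_fc   = orthonormal_init(256, 512)
W_conv = orthonormal_init(64, 3 * 3 * 3).reshape(64, 3, 3, 3)
```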

The next step is to estimate the output variance of each convolution and inner-product layer and scale the weights to make the variance equal to one.

This combined approach keeps the output variance at one, while the initial orthonormalization ensures that layer activations are de-correlated.
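
As a quick sanity check of the de-correlation point, here is a toy NumPy comparison between an orthonormal matrix and a plain Gaussian matrix rescaled to the same output variance (the sizes are arbitrary):

```python
import numpy as np

# Toy check: an orthonormal matrix keeps the outputs de-correlated, a plain Gaussian one does not.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # orthonormal
G = rng.standard_normal((64, 64)) / np.sqrt(64)      # Gaussian, rescaled to the same output variance

X = rng.standard_normal((64, 10_000))                # whitened input
off_diag = lambda C: np.abs(C - np.diag(np.diag(C))).mean()

print(off_diag(np.cov(W @ X)))   # ~0: output covariance stays close to the identity
print(off_diag(np.cov(G @ X)))   # an order of magnitude larger off-diagonal correlations
```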

The algorithm is as follows. After the network is pre-initialized with orthonormal matrices, each layer’s output variance is estimated and, as long as it is farther away from 1 than the tolerated margin, the weights are divided by the square root of the variance of the current batch’s output.

The algorithm for this initialization method
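
Putting it together, here is a rough Python sketch of the procedure for a single ReLU layer (the function names, the tolerance and the iteration cap are my own placeholders, not the authors’ implementation):

```python
import numpy as np

def lsuv_scale(weight, forward, batch, tol=0.1, max_iter=10):
    """Rescale `weight` until the layer's output variance on `batch` is within `tol` of 1."""
    for _ in range(max_iter):
        var = forward(weight, batch).var()
        if abs(var - 1.0) < tol:
            break
        weight = weight / np.sqrt(var)          # pull the output variance toward 1
    return weight

# Usage: orthonormal pre-init (QR of Gaussian noise), then variance scaling of a ReLU layer.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((256, 256)))
X = rng.standard_normal((256, 64))              # one representative minibatch (size >= 16)
relu_layer = lambda w, x: np.maximum(w @ x, 0.0)
W = lsuv_scale(W, relu_layer, X)
print(relu_layer(W, X).var())                   # ~1 after scaling
```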

Why does this work?

Let’s start with our basic feed-forward equation, where X_i is our minibatch for iteration i, W_i our weights and B_i our output for this iteration: B_i = W_i X_i.

Let’s say we take the variance of the output, Var(B_i), and it turns out we are not yet within our tolerated range around 1.

That means we need to continue iterating. The next batch comes along and we divide the weights by the square root of the variance of the previous iteration’s output: W_(i+1) = W_i / sqrt(Var(B_i)).

How does this impact Var(B_(i+1))? Let’s see.

So, by updating the weights, we have: Var(B_(i+1)) = Var(W_(i+1) X_(i+1)) = Var(W_i X_(i+1)) / Var(B_i).

If we had not updated the weights we would have: Var(B^(noscale)_(i+1)) = Var(W_i X_(i+1)).

The difference is the scaling by 1/Var(B_i). Notice that: Var(B_(i+1)) = Var(B^(noscale)_(i+1)) / Var(B_i).

This is the key equation here. If Var(B_i) and Var(B^(noscale)_(i+1)) are similar, then the scaled output variance tends to one. These two values will be similar when the output variance across different mini-batches is similar. Intuitively, for this to happen, mini-batches should be large enough. According to empirical work done by the authors, the minimum size is 16.
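
As a quick numerical check of this key equation, here is a small NumPy experiment (the layer width, batch size and initial weight scale are my own choices):

```python
import numpy as np

# Numerical check of the key equation (layer width, batch size and weight scale are my own choices).
rng = np.random.default_rng(0)
dim, batch = 256, 64                           # batch size well above the suggested minimum of 16

W    = 0.05 * rng.standard_normal((dim, dim))  # some pre-initialized weights
X_i  = rng.standard_normal((dim, batch))       # minibatch i
X_i1 = rng.standard_normal((dim, batch))       # minibatch i+1

B_i    = W @ X_i                               # output at iteration i
W_next = W / np.sqrt(B_i.var())                # the update: W_(i+1) = W_i / sqrt(Var(B_i))

var_scaled  = (W_next @ X_i1).var()            # Var(B_(i+1))
var_noscale = (W @ X_i1).var()                 # Var(B^(noscale)_(i+1))

print(var_scaled)                              # ~1, because the two mini-batch variances are similar
print(var_noscale / B_i.var())                 # equals var_scaled: the key equation
```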

Experimental Validation

Accuracy and speed

FitNets (thin, deep nets) were chosen for most of the experiments since they are accurate and inference-time efficient.

A FitNet with the proposed initialization method achieved SOTA on CIFAR-10 with common data augmentation and on MNIST without data augmentation.

Accuracy on CIFAR 10–100 and error on MNIST

Performance of orthonormal-based methods is superior to the scaled Gaussian-noise approaches for all tested types of activation functions, except tanh. (…) Proposed LSUV strategy outperforms orthonormal initialization by smaller margin, but still consistently. All the methods failed to train sigmoid-based very deep network.

Compatibility of activation functions and initialization
Convergence times for different initialization schemes

LSUV is the only initialization algorithm which leads nets to convergence with all tested non-linearities without any additional tuning, except, again, sigmoid.

Compatibility of activation functions and initialization with ResNet

LSUV vs BN

LSUV procedure could be viewed as batch normalization of layer output done only before the start of training. Therefore, it is natural to compare LSUV against a batch-normalized network, initialized with the standard method.
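
To make the analogy concrete, here is a toy sketch contrasting the two (a single linear layer with per-layer scalar statistics; real batch normalization works per channel and has learned affine parameters, and all names here are mine):

```python
import numpy as np

# Toy contrast: batch norm normalizes every forward pass; LSUV folds one normalization
# into the weights before training (per-layer scalar statistics only, no learned affine).
rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((128, 128))

def bn_forward(W, X, eps=1e-5):
    out = W @ X
    return (out - out.mean()) / np.sqrt(out.var() + eps)   # normalize the output, leave W alone

X0 = rng.standard_normal((128, 64))
W_lsuv = W / np.sqrt((W @ X0).var())                        # normalize once, at init, into W

X_new = rng.standard_normal((128, 64))
print(bn_forward(W, X_new).var())    # ~1 on every batch, at every training step
print((W_lsuv @ X_new).var())        # ~1 only because batch statistics stay similar
```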

LSUV-initialized network is as good as batch-normalized network.

However, we are not claiming that batch normalization can always be replaced by proper initialization, especially in large datasets like ImageNet.

Speed to converge with batch norm vs LSUV

LSUV initialization reduces the starting flat-loss time from 0.5 epochs to 0.05 for CaffeNet, and starts to converge faster, but it is overtaken by a standard CaffeNet at the 30-th epoch (see Figure 4) and its final precision is 1.3% lower. We have no explanation for this empirical phenomenon.

LSUV in CaffeNet

On the contrary, the LSUV-initialized GoogLeNet learns faster than the original one and shows better test accuracy all the time. The final accuracy is 0.680 vs. 0.672, respectively.

LSUV in GoogLeNet

References

Mishkin, Dmytro and Matas, Jiří. All You Need Is A Good Init. Center for Machine Perception, 2015

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of ICLR, 2014

Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. FitNets: Hints for Thin Deep Nets. In Proceedings of ICLR, 2015

Image source: https://en.wikipedia.org/wiki/Jeremy_Wariner
