Day 8: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (Kaiming initialization)

Francisco Ingham
A paper a day avoids neuron decay
8 min read · Mar 29, 2019

[Feb 6, 2015] How to initialize your deep ReLU-activated networks to avoid vanishing and exploding gradients

We need to Kaiming initialize, now!

TL;DR

Initialization is no joke and depends greatly on the activation function you use. The wrong initialization can lead to vanishing or exploding gradients which, in turn, can slow down or even stop convergence. This paper proposes an initialization heuristic that makes it possible to train deep ReLU-activated networks in a stable manner.

Introduction

This paper presented a series of improvements that allowed the authors to surpass human-level performance on ImageNet classification: the PReLU activation function and a new initialization scheme for deep NNs with rectifier non-linearities.

Why is initialization important?

Intuitively, initialization is important because a network acts as a variance amplifier, both during forward propagation and backward propagation. If each layer scales the signal up or down, the effect compounds, so the last layers of each computation (the last layers in forward propagation, which runs 1 → n, and the first layers in backward propagation, which runs n → 1) end up with values that tend to infinity or to zero and effectively ‘kill’ neurons. This is what is commonly referred to as ‘exploding’ or ‘vanishing’ gradients (or activations, on the forward pass).

The activation values of deeper layers tend to go to 0. This is the forward-propagation analogue of vanishing gradients.

If the forward/backward signal is inappropriately scaled by a factor β in each layer, then the final propagated signal will be rescaled by a factor of β^L after L layers, where L can represent some or all of the layers. When L is large, β > 1 leads to extremely amplified signals and an algorithm output of infinity, while β < 1 leads to diminishing signals. In either case the algorithm does not converge: it diverges in the former case and stalls in the latter.
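
To make the compounding concrete, here is a small worked example with numbers chosen purely for illustration: take L = 50 layers.

```latex
0.9^{50} \approx 5.2 \times 10^{-3}, \qquad 1.1^{50} \approx 117
```

A per-layer scaling only 10% away from 1 already changes the signal by roughly two orders of magnitude after 50 layers.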

Forward Propagation

Remember that for a convolutional layer the forward propagation step is a matrix multiplication plus a bias, like so:

Forward propagation in one equation

x represents the input seen by one filter position: a vector containing the co-located k×k pixels across the c input channels of the image. As such, its length is n = k²*c, where k is the spatial size of the filter (assumed square) and c is the number of channels. In turn, W represents the convolutional weights in matrix form, with dimensions d by n, where d is the number of filters and n matches the length of x, k²*c. y is the d-length vector that results from applying W to x and adding b, the bias vector. Finally, we use l to index a layer.
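
For reference, the equation behind the caption above, written out with these definitions (this is the forward-propagation formula from the paper):

```latex
\mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l,
\qquad
\mathbf{x}_l \in \mathbb{R}^{n_l},\;\;
W_l \in \mathbb{R}^{d_l \times n_l},\;\;
\mathbf{y}_l,\, \mathbf{b}_l \in \mathbb{R}^{d_l},\;\;
n_l = k_l^2\, c_l
```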

Note that x_l = f(y_(l-1)), where f is the activation function of the previous layer. Also, c_l = d_(l-1): the number of input channels of a layer equals the number of filters of the previous layer.

We assume that:

  1. The activation function is ReLU for every layer
  2. The elements of W_l are mutually independent and share the same distribution, and the same applies to the elements of x_l
  3. x_l and W_l are independent of each other
  4. The elements in W_l (we call them w_l) have zero mean

If we assume this, after some derivation (doing math in Medium is impossible) we get:

The recurrence that defines the relationship between the variance of different layer’s weights in forward propagation
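
For reference, the recurrence in that image is, in the paper’s notation:

```latex
\operatorname{Var}[y_L] \;=\; \operatorname{Var}[y_1] \prod_{l=2}^{L} \tfrac{1}{2}\, n_l \operatorname{Var}[w_l]
```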

This equation is really important. We can clearly see that if the per-layer factors are smaller than 1 the product tends to zero, and if they are greater than 1 it tends to infinity. We therefore need to keep the variance steady across layers. How can we do this? Let’s solve the equation! We are trying to understand what we should initialize each of the w_l’s to, so we have to solve for w_l:

We need to keep this constant across layers

To allow for this, we need the variance of w_l to compensate for the 1/2*n_l factor (1):

Kaiming’s big result #1: The variance of the w_l’s compensates the factor
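
In other words, the condition is 1/2*n_l*Var[w_l] = 1 for every layer, which gives the now-standard rule: draw w_l from a zero-mean Gaussian with standard deviation sqrt(2/n_l). Below is a minimal numpy sketch (my own illustration, not the paper’s code) that pushes random data through a stack of ReLU layers and prints the standard deviation of the activations under a naive fixed-scale scheme versus this rule; it uses fully-connected layers instead of convolutions for simplicity, so n_l is just the layer width.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(std_fn, n_layers=30, width=512, batch=256):
    """Propagate random data through a ReLU MLP and return the activation std per layer."""
    x = rng.standard_normal((batch, width))
    stds = []
    for _ in range(n_layers):
        # std_fn(fan_in) is the standard deviation used to draw this layer's weights
        W = rng.standard_normal((width, width)) * std_fn(width)
        x = np.maximum(x @ W.T, 0.0)  # linear layer followed by ReLU
        stds.append(x.std())
    return stds

naive   = forward_stats(lambda fan_in: 0.01)                   # fixed small std
kaiming = forward_stats(lambda fan_in: np.sqrt(2.0 / fan_in))  # std = sqrt(2 / n_l)

print("naive   last-layer activation std:", naive[-1])    # collapses towards zero
print("kaiming last-layer activation std:", kaiming[-1])  # stays roughly constant
```

With the naive scheme the activations collapse within a few dozen layers; with the sqrt(2/n_l) rule they stay on the same order of magnitude all the way through.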

Backward Propagation

We also want to avoid vanishing or exploding gradients when going backwards, while computing the weight updates. Backward propagation is defined by the following equation, where delta(x_l) is the gradient with respect to the input of layer l and delta(y_l) is the gradient with respect to its pre-activation output:

Back-propagation in one equation

delta(y) is a vector of length k²*d. For the back-prop case we define a quantity analogous to n_l but different: we call it n_hat, and it is defined as delta(y)’s length, k²*d. W_hat, in turn, is a c by n_hat matrix in which the filters are rearranged relative to W, in the way required by back-propagation (note that W_hat and W contain the same weights, just arranged differently). delta(x) is a vector of length c which contains the gradient at a single pixel of this layer.
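
For reference, the back-propagation equation behind the caption above, written out with these definitions (it mirrors the forward formula):

```latex
\Delta \mathbf{x}_l = \hat{W}_l\, \Delta \mathbf{y}_l,
\qquad
\Delta \mathbf{y}_l \in \mathbb{R}^{\hat{n}_l},\;\;
\hat{W}_l \in \mathbb{R}^{c_l \times \hat{n}_l},\;\;
\Delta \mathbf{x}_l \in \mathbb{R}^{c_l},\;\;
\hat{n}_l = k_l^2\, d_l
```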

Note that delta(y_l) = f’(y_l)*delta(x_(l+1)), where f is the activation function of layer l. Also, in the ReLU case, f’(y_l) is either zero or one with equal probability (because y_l is symmetric around 0).

We assume again that:

  1. w_l and delta(y_l) are independent of each other
  2. delta(x_l) has zero mean for all l when w_l is initialized as a symmetric distribution around 0
  3. f’(y_l) and delta(x_(l+1)) are independent of each other

n returns as n_hat

As you will see, the equations that keep backward propagation well behaved are the same as in forward propagation, but with n_hat_l replacing n_l.

After some more derivation we get:

The recurrence that defines the relationship between the variance of different layer’s delta(x)’s in back-propagation

Again a recurrence, and again we need to be careful that the product does not explode or vanish as we go through the layers:

We need to keep this constant across layers, part 2

Again, we need the variance of w_l to compensate for the 1/2*n_hat_l factor (1):

Kaiming’s big result #2: The variance of the w_l’s compensates the factor
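
Putting the backward pieces together, the recurrence, the per-layer condition, and the resulting initialization are, in the paper’s notation:

```latex
\operatorname{Var}[\Delta x_2] = \operatorname{Var}[\Delta x_{L+1}] \prod_{l=2}^{L} \tfrac{1}{2}\, \hat{n}_l \operatorname{Var}[w_l],
\qquad
\tfrac{1}{2}\, \hat{n}_l \operatorname{Var}[w_l] = 1 \;\;\forall l
\;\Longrightarrow\;
\operatorname{Var}[w_l] = \frac{2}{\hat{n}_l}
```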

We have two but we can only initialize with one

Which one? Long story short: it does not matter. You can use either one. Why?

Let’s say we use the one we derived in the backwards propagation section. Then the equation for forward propagation would be:

Instead of 1, the constant for forward propagation would be c_2/d_L

c_2 is the number of channels in layer 2 and d_L is the number of filters in the last layer. You can see how this result is derived if you replace n_l and n_hat_l with their definitions and use the fact that c_l = d_(l-1).
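
Spelling that derivation out (with Var[w_l] = 2/n_hat_l from the backward result, the per-layer factor becomes n_l/n_hat_l and the product telescopes):

```latex
\prod_{l=2}^{L} \tfrac{1}{2}\, n_l \operatorname{Var}[w_l]
= \prod_{l=2}^{L} \frac{n_l}{\hat{n}_l}
= \prod_{l=2}^{L} \frac{c_l}{d_l}
= \prod_{l=2}^{L} \frac{d_{l-1}}{d_l}
= \frac{d_1}{d_L}
= \frac{c_2}{d_L}
```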

Remember that this equation holds for Var[y_L]. We can generalize by saying that:

Equation generalized for an arbitrary layer for forward propagation

So basically, for any given layer this constant is the number of filters of the first layer divided by the number of filters of that layer. In common network architectures these counts stay within a small range, so the quotient will not be extremely high or low, and, crucially, it does not grow or shrink exponentially with depth.

Conversely, if we use the weight initialization technique we derived in the forward propagation section, in backward propagation we would have:

Equation generalized for an arbitrary layer for backwards propagation

And the same reasoning applies.
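
In practice, deep-learning libraries expose both choices. For example, assuming PyTorch is available, its kaiming_normal_ initializer lets you pick the forward-derived rule (fan_in, i.e. n_l = k²*c) or the backward-derived rule (fan_out, i.e. n_hat_l = k²*d):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

# Forward-derived rule: Var[w] = 2 / n_l, with n_l = k^2 * c (fan_in)
nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')

# Backward-derived rule: Var[w] = 2 / n_hat_l, with n_hat_l = k^2 * d (fan_out)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
```

Either mode keeps one of the two signals exactly scaled and the other within a benign constant factor, which is exactly the argument above.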

A word on Xavier

CONVERGE ALREADY

Xavier initialization is a method from an earlier paper (Glorot & Bengio, 2010), released about five years before this one, whose derivation assumes a linear activation function (with derivative equal to 1 at 0). Their equation was:

The Xavier equation

And their solution misses the 1/2 factor we have for ReLU in Kaiming:

Xavier initialization for forward propagation
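
Written side by side in the forward form used in this post (assuming Gaussian weights and the fan-in variant of Xavier; the original Glorot & Bengio formulation averages fan-in and fan-out):

```latex
\text{Xavier: } n_l \operatorname{Var}[w_l] = 1 \;\Rightarrow\; \operatorname{Var}[w_l] = \frac{1}{n_l}
\qquad
\text{Kaiming: } \tfrac{1}{2}\, n_l \operatorname{Var}[w_l] = 1 \;\Rightarrow\; \operatorname{Var}[w_l] = \frac{2}{n_l}
```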

PReLU

The Parametric Rectified Linear Unit (PReLU) is a variant of the ReLU activation function in which the slope of the negative part is not fixed at zero but is instead learned.

The negative part is not constrained to have 0 slope
Function defining PReLU

The authors argue that, compared to ReLU, PReLU improves accuracy at a negligible computational cost (the number of extra parameters equals the total number of channels in the network, which is negligible compared to the total number of weights).
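
A minimal numpy sketch of the PReLU forward pass (my own illustration, not the paper’s code; in the paper the coefficients a are learned jointly with the weights by backpropagation, with one coefficient per channel in the channel-wise variant or one per layer in the channel-shared variant, both initialized to 0.25):

```python
import numpy as np

def prelu(y, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs.

    y: pre-activations with shape (batch, channels, height, width)
    a: learnable negative-slope coefficients, shape (channels,) for the
       channel-wise variant or a scalar for the channel-shared variant.
    """
    a = np.reshape(a, (1, -1, 1, 1)) if np.ndim(a) == 1 else a
    return np.maximum(y, 0.0) + a * np.minimum(y, 0.0)

y = np.random.randn(8, 64, 32, 32)      # a batch of pre-activations
a_channelwise = np.full(64, 0.25)       # one coefficient per channel
a_shared = 0.25                         # one coefficient per layer
out = prelu(y, a_channelwise)
```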

Results

Initialization

The authors tried their initialization method in a deep, ReLU activated convolutional network and compared their results with the Xavier method.

They ascertained that:

  1. Their method converges faster (Figure 1)
  2. This effect is amplified the deeper the network is, and for very deep networks Xavier stalls and does not even converge (Figure 2)
  3. If they both converge, the methods do not differ in accuracy
Figure 1: Kaiming initialization leads to convergence faster in a 22-layer model
Figure 2: Kaiming initialization converges in a 30-layer model, whereas Xavier initialization does not

PReLU

PReLU improved the baseline error by 1.2%. The difference between channel-shared and channel-wise refers to whether the PReLU parameter is tied across the channels of a layer or each channel gets its own parameter. The channel-shared version adds only 13 parameters, yet these are enough to improve performance by 1.1% (a significant improvement without a significant increase in computational cost).

Difference in error by using ReLU or PReLU on a 14-layer model

Notes

(1) For the first layer the correct coefficient would be 1/n_l (or 1/n_hat_l in the case of back-prop), because there is no ReLU applied to the input. But a single factor of 1/2 on just one layer does not matter (a constant factor on one layer does not compound), so for simplicity the same initialization factor is used in all layers.
