Rectified Linear Unit (ReLU) and Kaiming Initialization

Marvin Wang, Min
AI³ | Theory, Practice, Business
3 min read · Sep 2, 2019

Use ReLU as the Default Activation Function

For a long time, the default activation to use was the sigmoid activation function. Later, it was the tanh activation function.

For modern deep learning neural networks, the default activation function is the rectified linear activation function. Most papers that achieve state-of-the-art results will describe a network using ReLU. If in doubt, start with ReLU in your neural network, then perhaps try other piecewise linear activation functions.

ReLU stands for Rectified Linear Unit.

R(x) = max(0, x), i.e., if x < 0, R(x) = 0, and if x >= 0, R(x) = x, where x is the pre-activation output of the hidden layer.

The ReLU function is shown below. It gives an output of x if x is positive and 0 otherwise.

[Figure: the ReLU activation function]
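As a quick illustration, here is a minimal NumPy sketch of the function above (the name relu is just for illustration):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative inputs become 0, positive inputs pass through.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
# -> [0.  0.  0.  1.5 3. ]
```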

Due to the nature of ReLU, some neurons in the network will not activate, making the activations sparse and efficient. Imagine a network with randomly initialized (or normalized) weights: almost 50% of the network yields 0 activations because of the characteristic of ReLU (output 0 for negative values of x). This means fewer neurons are firing (sparse activation) and the network is lighter.
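A small sketch of this effect, assuming a single fully connected layer with zero-centered random inputs and weights (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected layer with small, zero-centered random weights.
x = rng.standard_normal((256, 100))         # batch of 256 inputs, 100 features
W = rng.standard_normal((100, 50)) * 0.01   # small random weights centered on zero
pre_activation = x @ W                      # zero-centered pre-activations
activation = np.maximum(0, pre_activation)  # ReLU

# Roughly half of the pre-activations are negative, so roughly half of
# the ReLU outputs are exactly zero.
print("fraction of zero activations:", np.mean(activation == 0))
# -> approximately 0.5
```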

Also, because of the horizontal line in ReLU (for negative x), the gradient can go to 0. For activations in that region of ReLU, the gradient is 0, so the weights feeding those units will not get adjusted during gradient descent. That means neurons that go into that state stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem. It can cause several neurons to simply die and stop responding, making a substantial part of the network passive.
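To make the dying-ReLU mechanism concrete, here is a minimal sketch of the ReLU gradient in plain NumPy (no autograd; the names are illustrative):

```python
import numpy as np

def relu_grad(x):
    # Derivative of ReLU with respect to its input: 1 where x > 0, 0 where x <= 0.
    return (x > 0).astype(float)

pre_activation = np.array([-3.0, -0.1, 0.5, 2.0])
upstream_grad = np.ones_like(pre_activation)  # gradient flowing back from the loss

# The gradient passed back through ReLU is zero wherever the unit was inactive,
# so the weights that produced those negative pre-activations receive no update.
print(upstream_grad * relu_grad(pre_activation))
# -> [0. 0. 1. 1.]
```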

There are other variants of ReLU that address this problem.

One variation of ReLU mitigates this issue by replacing the horizontal line with a slightly sloped one: for example, y = 0.01x for x < 0 makes it a slightly inclined line rather than a horizontal line. This is leaky ReLU. There are other variations too. The main idea is to keep the gradient non-zero so that affected units can eventually recover during training. Unlike ReLU, leaky ReLU is more “balanced,” and may therefore learn faster.

R(x) = max(x, αx), where α is a small positive constant (e.g., 0.01)
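A minimal NumPy sketch of leaky ReLU next to plain ReLU, using the 0.01 slope from the example above:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Keep a small, non-zero slope for negative inputs so the gradient
    # never becomes exactly zero.
    return np.where(x > 0, x, alpha * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))        # -> [ 0.  0.  0.  1. 10.]
print(leaky_relu(x))  # -> [-0.1  -0.01  0.    1.   10. ]
```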

There are other alternatives, but both practitioners and researchers have generally found an insufficient benefit to justify using anything other than ReLU.

If in doubt, start with ReLU in your neural network, then perhaps try other piecewise linear activation functions to see how their performance compares.

Use “Kaiming Initialization”

Before training a neural network, the weights of the network must be initialized to small random values.

When using ReLU in your network and initializing weights to small random values centered on zero, by default about half of the units in the network will output a zero value.

There are many heuristic methods to initialize the weights for a neural network, yet there is no single best weight initialization scheme, and little more than general guidelines exist for mapping weight initialization schemes to the choice of activation function.

Kaiming He, et al. in their 2015 paper titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” suggested that Xavier initialization and other schemes were not appropriate for ReLU and extensions.

They proposed a small modification of Xavier initialization to make it suitable for use with ReLU, now commonly referred to as “Kaiming initialization”: weights are drawn as random values with a standard deviation of sqrt(2/n), where n is the number of nodes in the prior layer, known as the fan-in. In practice, both Gaussian and uniform versions of the scheme can be used.
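A minimal NumPy sketch of the Gaussian (“He normal”) version for a single weight matrix; the layer sizes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_normal(fan_in, fan_out, rng):
    # Gaussian weights with standard deviation sqrt(2 / fan_in), which keeps
    # the variance of ReLU activations roughly constant from layer to layer.
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

W = kaiming_normal(fan_in=512, fan_out=256, rng=rng)
print(W.std())  # close to sqrt(2 / 512) = 0.0625
```

Deep learning frameworks ship this scheme directly; for example, PyTorch provides torch.nn.init.kaiming_normal_ and torch.nn.init.kaiming_uniform_.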
