Why ReLU? Tips for using ReLU. Comparison between ReLU, Leaky ReLU, and ReLU-6.

Chinesh Doshi
6 min read · Jun 29, 2019


A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks. Today we will discuss the most commonly used activation function in neural networks: ReLU, which stands for Rectified Linear Unit.

A(x) = max(0, x), where x is the pre-activation output of a hidden unit.

The ReLU function is shown above: it gives an output of x if x is positive and 0 otherwise.

ReLU

At first glance, this looks like it would have the same problems as a linear function, since it is linear on the positive axis. But ReLU is non-linear in nature, and combinations of ReLU are also non-linear! (In fact, ReLU is a good approximator: any function can be approximated with combinations of ReLU.) Great, so this means we can stack layers. ReLU is not bounded, though. Its range is [0, inf), which means it can blow up the activation.
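As a quick illustration, here is a minimal NumPy sketch (the toy weights are made up purely for illustration) of ReLU and of a tiny two-layer stack, showing that the composition is piecewise linear rather than a single straight line:

    import numpy as np

    def relu(x):
        # ReLU: element-wise max(0, x)
        return np.maximum(0.0, x)

    x = np.linspace(-3, 3, 7)
    print(relu(x))  # negative inputs are zeroed, positive inputs pass through unchanged

    # A tiny two-layer "network" with hand-picked weights (purely illustrative):
    # relu(1.0 * relu(x) - 1.0) is piecewise linear with a kink at x = 1,
    # so stacking ReLU layers gives a non-linear function overall.
    print(relu(1.0 * relu(x) - 1.0))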

Another point I would like to discuss here is the sparsity of the activations. Imagine a big neural network with a lot of neurons. Using sigmoid or tanh will cause almost all neurons to fire in an analog way, which means almost all activations will be processed to describe the output of the network. In other words, the activation is dense, and that is costly. Ideally, we would want some neurons in the network not to activate, thereby making the activations sparse and efficient.

A general problem with both the sigmoid and tanh functions is that they saturate: large values snap to 1.0, and small values snap to -1 or 0 for tanh and sigmoid respectively. Further, the functions are only really sensitive to changes around the mid-point of their input (where the sigmoid output is 0.5 and the tanh output is 0.0).

The limited sensitivity and saturation of the function happen regardless of whether the summed activation from the node provided as input contains useful information or not. Once saturated, it becomes challenging for the learning algorithm to continue to adapt the weights to improve the performance of the model.

  • (+) It was found to greatly accelerate (e.g. a factor of 6 in Krizhevsky et al.) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
  • (+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
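A quick numerical check of both points, sketched in plain NumPy rather than any particular framework: the derivatives of sigmoid and tanh shrink toward zero for large |x|, while the ReLU forward pass is just a threshold and its local gradient is a cheap 0/1 mask.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

    # Saturation: sigmoid/tanh gradients vanish for large |x|.
    print(sigmoid(x) * (1 - sigmoid(x)))  # ~0 at x = -10 and x = 10
    print(1 - np.tanh(x) ** 2)            # ~0 at x = -10 and x = 10

    # ReLU forward pass: just thresholding a matrix of activations at zero.
    A = np.random.randn(4, 5)
    relu_out = np.maximum(A, 0.0)

    # ReLU backward pass: the local gradient is a 0/1 mask -- no exponentials needed.
    relu_grad = (A > 0).astype(A.dtype)
    print(relu_grad)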

Because of the horizontal line in ReLU (for negative x), the gradient can go to 0. For activations in that region of ReLU, the gradient will be 0, so the weights will not get adjusted during gradient descent. That means the neurons which go into that state will stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem. It can cause several neurons to simply die and stop responding, making a substantial part of the network passive.

For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
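Here is a minimal sketch of the effect with toy numbers (made up for illustration): once a unit's pre-activation is negative for every input, its ReLU output and local gradient are zero everywhere, so gradient descent never updates its weights again.

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 3)        # 100 toy inputs with 3 features
    w = np.array([0.1, -0.2, 0.3])
    b = -50.0                          # a large negative bias "kills" the unit

    z = X @ w + b                      # pre-activation is negative for every sample
    a = np.maximum(z, 0.0)             # so the ReLU output is identically zero ...
    grad_mask = (z > 0).astype(float)  # ... and the local gradient is zero everywhere

    print(a.max(), grad_mask.sum())    # 0.0 0.0 -> no signal flows back; the unit is dead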

There are variations of ReLU that mitigate this issue by replacing the horizontal line with a slightly inclined one. For example, y = 0.01x for x < 0 turns the flat segment into a gently sloped line; this is Leaky ReLU. The main idea is to keep the gradient non-zero so the neuron can eventually recover during training. Leaky ReLUs are one attempt to fix the “dying ReLU” problem: instead of the function being zero when x < 0, a leaky ReLU has a small negative-side slope (0.01 or so). That is, the function computes f(x) = αx for x < 0 and f(x) = x for x >= 0, where α is a small constant. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made a parameter of each neuron, as in PReLU neurons, introduced in Delving Deep into Rectifiers by Kaiming He et al., 2015. However, the consistency of the benefit across tasks is presently unclear.

Leaky ReLU
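A minimal NumPy sketch of Leaky ReLU, together with the parametric PReLU form in which the negative-side slope alpha would be learned rather than fixed (alpha = 0.01 follows the value mentioned above):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # x for x >= 0, alpha * x for x < 0 -- the negative side keeps a small gradient
        return np.where(x >= 0, x, alpha * x)

    def prelu(x, alpha):
        # Same form, but alpha is a learnable parameter (per neuron/channel) in PReLU
        return np.where(x >= 0, x, alpha * x)

    x = np.array([-3.0, -0.5, 0.0, 2.0])
    print(leaky_relu(x))  # negative inputs are scaled by alpha instead of being zeroed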

ReLU-6

From the paper: cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf

The range of ReLU is [0, inf), which means it can blow up the activation, and that is not a favorable condition for a network either. If you multiply many terms that are greater than 1, the product tends toward infinity; likewise, gradients can explode as you move further from the output layer when the activation functions have a slope greater than 1.

One solution to this problem is to clip the gradient at some value n; another is to cap the activation itself, which is exactly what ReLU-6 does.
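Here is a sketch of the gradient-clipping idea in plain NumPy (deep learning frameworks ship their own utilities for this): a gradient whose L2 norm exceeds the threshold n is rescaled back onto that norm.

    import numpy as np

    def clip_gradient_by_norm(grad, n=5.0):
        # If the gradient's L2 norm exceeds n, rescale it so its norm equals n.
        norm = np.linalg.norm(grad)
        if norm > n:
            grad = grad * (n / norm)
        return grad

    g = np.array([30.0, 40.0])            # norm = 50
    print(clip_gradient_by_norm(g, n=5))  # [3. 4.] -> norm clipped to 5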

From the paper (cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf): “First, we cap the units at 6, so our ReLU activation function is y = min(max(x, 0), 6). In our tests, this encourages the model to learn sparse features earlier. In the formulation, this is equivalent to imagining that each ReLU unit consists of only 6 replicated bias-shifted Bernoulli units, rather than an infinite amount. We will refer to ReLU units capped at n as ReLU-n units.” Note that ReLU-6, like plain ReLU, still has no activation for x < 0. I am still wondering what would happen if we combined Leaky ReLU and ReLU-6 to overcome all the disadvantages noted so far.

ReLU-6
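A minimal NumPy sketch of ReLU-6, plus a purely hypothetical “Leaky ReLU-6” combining the two ideas as speculated above (the combined function is my own illustrative construction, not something from the paper):

    import numpy as np

    def relu6(x):
        # ReLU capped at 6: y = min(max(x, 0), 6)
        return np.minimum(np.maximum(x, 0.0), 6.0)

    def leaky_relu6(x, alpha=0.01):
        # Hypothetical combination: small slope for x < 0, capped at 6 for large x
        return np.minimum(np.where(x >= 0, x, alpha * x), 6.0)

    x = np.array([-4.0, 2.0, 10.0])
    print(relu6(x))        # [0. 2. 6.]
    print(leaky_relu6(x))  # [-0.04  2.  6.]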

I have implemented image classification on the MNIST dataset using the different types of ReLU activation functions. In my experiments, reported at the end of this post, Leaky ReLU gave the best accuracy.

Tips for Using the Rectified Linear Activation

In this section, we'll take a look at some tips for using the rectified linear activation function in your own deep learning neural networks.

Use ReLU as the Default Activation Function

For a long time, the default activation to use was the sigmoid activation function. Later, it was the tanh activation function.

For modern deep learning neural networks, the default activation function is the rectified linear activation function. Most papers that achieve state-of-the-art results describe a network using ReLU. If in doubt, start with ReLU in your neural network, then perhaps try other piecewise-linear activation functions.
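For instance, assuming a Keras-style workflow (a sketch, not the exact network used for the experiments in this post), starting with ReLU is as simple as passing it as the layer activation:

    import tensorflow as tf

    # A small MLP that uses ReLU as the default hidden-layer activation.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()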

Use ReLU with MLPs, CNNs, but Probably Not RNNs

The ReLU can be used with most types of neural networks. It is recommended as the default for both Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). Traditionally, LSTMs use the tanh activation function for the cell state and the sigmoid activation function for the node output. Given their careful design, ReLUs were thought not to be appropriate for Recurrent Neural Networks (RNNs) such as the Long Short-Term Memory network (LSTM) by default.

At first sight, ReLUs seem inappropriate for RNNs because they can have very large outputs, so they might be expected to be far more likely to explode than units with bounded values. Nevertheless, there has been some work on using ReLU as the output activation in LSTMs; the key result is that careful initialization of the network weights is needed to ensure the network is stable prior to training.

Code: https://github.com/chinesh/mnist-relu-classificatin

Training accuracy on the MNIST dataset (please find the network in the GitHub link):

  • ReLU: 67.59%
  • Leaky ReLU: 72.10%
  • ReLU-6: 71.55%
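For reference, here is a sketch of how such a comparison could be set up in Keras. This is not the exact network from the linked repository; the architecture, epochs, and hyperparameters below are illustrative assumptions, so the accuracies you get will not match the numbers above.

    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784) / 255.0
    x_test = x_test.reshape(-1, 784) / 255.0

    activations = {
        "relu": tf.nn.relu,
        "leaky_relu": lambda x: tf.nn.leaky_relu(x, alpha=0.01),
        "relu6": tf.nn.relu6,
    }

    for name, act in activations.items():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation=act, input_shape=(784,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x_train, y_train, epochs=2, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        print(f"{name}: test accuracy = {acc:.4f}")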

Though Leaky ReLU gave the best accuracy after trying ReLU, Leaky ReLU, and ReLU-6 as the activation function, I am still puzzled as to why standard/benchmark networks such as DGN (Deep Graph Network), RetinaNet, VGG-16, and ResNet use ReLU in place of Leaky ReLU. Try running the code on your system, and do post your questions and suggestions in the comments. I would be happy to answer your questions.


Chinesh Doshi

AI/ML Engineer. Working in the field of Computer Vision and Deep Learning, with the ultimate goal of contributing to self-driving cars and passenger safety services.