Activation Functions in Neural Networks

Kshitij Khurana
7 min read · Nov 3, 2019


Common activation functions in neural networks

In the process of training a neural network, one of the important hyper-parameters is the activation function, and we need to choose which activation function to use in the hidden layers as well as in the output layer.

The activation function decides whether a neuron should fire or not: the neuron computes the weighted sum of its inputs plus a bias term, and the activation function, which can be a linear or non-linear transformation, is applied to this value. The output of the activation function is then fed to the neurons of the next layer as input.

Basic building blocks of neural network training

Before moving to the different activation functions and their derivatives, it is important to mention the basic building blocks of neural network training. These blocks can be divided into the forward pass, the measurement of the network's output error, and the backward pass.

First, in the forward pass, the training instances are fed to the neural network. This results in a forward cascade of computations across the layers (using the current set of weights), and the network's output prediction is calculated. Second, the network's output error is measured, that is, the difference between the desired output and the actual predicted output. Third, in the backward pass, we back-propagate through each layer to measure the error contribution from each connection, and finally tweak the connection weights to reduce the error.
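
To make these three blocks concrete, here is a minimal sketch, assuming a single neuron with a sigmoid activation, a squared-error loss, and plain NumPy; the names and numbers are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])   # one training instance
y = 1.0                     # desired output
w = np.array([0.1, 0.4])    # current weights
b = 0.0                     # bias
lr = 0.1                    # learning rate

# 1. Forward pass: compute the prediction with the current weights
z = np.dot(w, x) + b
y_hat = sigmoid(z)

# 2. Measure the output error (predicted output vs. desired output)
error = y_hat - y

# 3. Backward pass: propagate the error and tweak the weights
grad_z = error * y_hat * (1.0 - y_hat)   # chain rule through the sigmoid
w -= lr * grad_z * x
b -= lr * grad_z
```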

Why are we interested in the derivatives of activation functions?

When we implement back-propagation for a neural network, we need to compute the gradient, i.e. the derivative, of the activation functions. Therefore, one of the primary considerations when choosing an activation function is that it should be differentiable.
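
As a hedged illustration of where the derivative enters, the generic back-propagation step multiplies the incoming gradient by the derivative of the activation evaluated at the pre-activation values (the function below is illustrative, not from any library):

```python
import numpy as np

def backprop_through_activation(upstream_grad, z, activation_derivative):
    # chain rule: dL/dz = dL/da * da/dz, where a = activation(z)
    return upstream_grad * activation_derivative(z)

# Example with tanh, whose derivative is 1 - tanh(z)**2
z = np.array([-2.0, 0.0, 2.0])
grad = backprop_through_activation(np.ones_like(z), z,
                                   lambda z: 1.0 - np.tanh(z) ** 2)
print(grad)   # approximately [0.0707 1.     0.0707]
```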

Commonly used activation functions and their derivatives

1. Sigmoid Activation Function

Sigmoid function and its derivative

The sigmoid function has a characteristic "S"-shaped (sigmoid) curve; it is continuous and differentiable, has a non-zero derivative everywhere, takes any real value as input, and gives an output between 0 and 1. However, as shown in the plots of the sigmoid (or logistic) activation function, when the input becomes large (negative or positive) the function saturates at 0 or 1, with a derivative extremely close to zero. Moreover, the sigmoid activation function never has a gradient of more than 0.25.

Thus, when back-propagation kicks in, with the sigmoid function we have virtually no gradient to propagate back through the network, and whatever little gradient exists keeps getting diluted as back-propagation progresses from the top layers down to the lower layers. Therefore, the sigmoid activation function is very prone to the vanishing gradients problem and belongs to the class of saturating activation functions.
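
A minimal NumPy sketch of the sigmoid and its derivative (the numbers in the comments are approximate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))    # 0.25, the largest gradient the sigmoid can give
print(sigmoid_derivative(10.0))   # ~4.5e-05, effectively vanished (saturation)
```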

2. Hyperbolic Tangent (Tanh) Activation Function

Hyperbolic tangent (tanh) function and its derivative

The hyperbolic tangent (tanh) function is also S-shaped, continuous, and differentiable, but its output ranges from -1 to +1, which tends to make each layer's output more or less centered around 0. The tanh function is better than the sigmoid function because its gradient is 1 near the origin. However, similar to the sigmoid activation function, when the input becomes large (negative or positive) the tanh function saturates at -1 or +1, with a derivative extremely close to zero. Thus, like the sigmoid function, the tanh activation function suffers from the vanishing gradients problem and belongs to the class of saturating activation functions.
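
A corresponding NumPy sketch for tanh and its derivative:

```python
import numpy as np

def tanh_derivative(z):
    # derivative of tanh(z) is 1 - tanh(z)^2
    return 1.0 - np.tanh(z) ** 2

print(np.tanh(0.0), tanh_derivative(0.0))   # 0.0 1.0 -> gradient of 1 at the origin
print(np.tanh(5.0), tanh_derivative(5.0))   # ~0.9999 ~1.8e-04 -> saturated
```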

3. Rectified Linear Unit (ReLU) Activation Function

ReLU function and its derivative

The output of the ReLU is linear (equal to its input) if the input is positive, and zero otherwise, so its range is [0, ∞). The ReLU activation function is continuous but not differentiable at x = 0. In practice, ReLU works very well and has several advantages over other activation functions: it is faster to compute than other activation functions (it only uses a max operation); it provides sparsity in the model (since it can output a true zero); it does not saturate for large positive input values; and its derivative has a constant value of 1 when x > 0, which reduces the likelihood of the vanishing gradients problem.
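
A minimal NumPy sketch of ReLU and the gradient used in practice (at x = 0 the function is not differentiable, so implementations simply pick a value, here 0):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 for x > 0, 0 otherwise (the value chosen at x = 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-3.0, 0.0, 2.0])
print(relu(x))             # [0. 0. 2.]
print(relu_derivative(x))  # [0. 0. 1.]
```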

However, the ReLU activation function is not perfect: it suffers from a problem known as dying ReLUs. During training, some neurons effectively die, that is, they stop outputting anything other than 0. This happens when a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative for every instance in the training set. When this happens, the neuron is unlikely to come back to life, since the gradient of the ReLU function is 0 when its input is negative. To solve this problem there is a variant of ReLU known as leaky ReLU, which is explained next.

4. Leaky Rectified Linear Unit (leaky ReLU) Activation Function

Leaky ReLU with α = 0.01 and its derivative

Leaky ReLU is defined as max(αx, x). The hyper-parameter alpha (α) defines how much the function "leaks": it is the slope of the function for x < 0 and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they always have a chance to eventually wake up. Further, the derivative of leaky ReLU (with α = 0.01) has a constant value of 1 when x > 0 and 0.01 when x < 0. Thus leaky ReLU avoids the dying ReLUs problem, and its gradient never vanishes for negative inputs.
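
A NumPy sketch of leaky ReLU, max(αx, x), and its derivative with α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def leaky_relu_derivative(x, alpha=0.01):
    # 1 for x > 0, alpha for x <= 0: the gradient is never exactly zero
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, 2.0])
print(leaky_relu(x))             # [-0.03  2.  ]
print(leaky_relu_derivative(x))  # [0.01 1.  ]
```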

5. Exponential Linear Unit (ELU) activation function

Exponential Linear Unit (ELU) with a = 1 and its derivative

The exponential linear unit (ELU) takes on negative values when x < 0, which allows the neuron's average output to be closer to 0. The hyper-parameter a (sometimes also denoted as α) defines the value that the ELU function approaches when x is a large negative number. It is usually set to 1, but it can be tuned like any other hyper-parameter. Further, the ELU activation function has a non-zero gradient for x < 0, which solves the dying neurons problem. Moreover, when a = 1 the function is smooth everywhere, including around x = 0, which helps speed up learning.
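
A NumPy sketch of ELU with a = 1 and its derivative (numbers in the comments are approximate):

```python
import numpy as np

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_derivative(x, a=1.0):
    # 1 for x > 0, a * exp(x) for x <= 0: non-zero everywhere, so neurons do not die
    return np.where(x > 0, 1.0, a * np.exp(x))

x = np.array([-5.0, -1.0, 2.0])
print(elu(x))             # [-0.9933 -0.6321  2.    ]
print(elu_derivative(x))  # [0.0067  0.3679  1.    ]
```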

The major drawback of ELU is that it is slower to compute than ReLU and its variants, due to the computationally expensive exponential function it uses. During training this is compensated for by a faster convergence rate, but at test time an ELU network will be slower than a ReLU network.

6. Scaled Exponential Linear Unit (SELU) activation function

SELU activation function with α = 1.6733, λ = 1.0507 and its derivative

SELU is a scaled variant of the ELU activation function. It uses two fixed parameters, α and λ, chosen so that the activations preserve a mean of 0 and a standard deviation of 1 across layers; for standardized inputs (mean of 0 and standard deviation of 1) the suggested values are α ≈ 1.6733 and λ ≈ 1.0507.
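
A NumPy sketch of SELU using the constants above (λ is written as SCALE here):

```python
import numpy as np

ALPHA = 1.6733   # α
SCALE = 1.0507   # λ

def selu(x):
    # SELU is simply λ * ELU(x, α) with fixed α and λ
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

def selu_derivative(x):
    return SCALE * np.where(x > 0, 1.0, ALPHA * np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(selu(x))   # approximately [-1.5202  0.      2.1014]
```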

The major advantage of SELU is that it provides self-normalization (the outputs of the SELU activations preserve a mean of 0 and a standard deviation of 1), which solves the vanishing/exploding gradients problem. SELU provides this self-normalization only if: (a) the neural network consists only of a stack of dense layers, (b) all the hidden layers use the SELU activation function, (c) the input features are standardized (mean of 0 and standard deviation of 1), (d) the hidden layers' weights are initialized with LeCun normal initialization, and (e) the network is sequential. A minimal sketch satisfying these conditions is shown below.
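
As an illustration only, here is a hedged tf.keras sketch that satisfies these conditions (assuming TensorFlow 2.x, standardized input features, and an illustrative input size of 20):

```python
import tensorflow as tf

# A sequential stack of dense layers, all hidden layers with SELU and
# LeCun normal initialization; inputs are assumed to be standardized.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(64, activation="selu", kernel_initializer="lecun_normal"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer
])
```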

Which activation function should you use in hidden layers?

There is no single answer to this question; it depends on the problem at hand. However, a reasonable order of preference to start with is SELU > ELU > leaky ReLU > ReLU > tanh > sigmoid. If you care about run-time performance, you may prefer leaky ReLU. And if your network architecture prevents you from meeting SELU's self-normalizing conditions, then ELU might give better results than SELU.

There are several other activation functions one can experiment with, such as parametric leaky ReLU (PReLU), in which α (the degree to which the function leaks) is learned as a parameter and can be modified during back-propagation like any other model parameter (for example, weights and biases). Another variant of ReLU is randomized leaky ReLU (RReLU), in which α is chosen randomly during training and fixed to an average value during testing.
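
A hedged NumPy sketch of the PReLU idea, with a single scalar α treated as a trainable parameter and updated by one gradient-descent step (names and values are illustrative, not from any library):

```python
import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # d(prelu)/d(alpha) is x for x <= 0 and 0 for x > 0
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 1.0, 3.0])
alpha = 0.25                      # a common initial value
upstream_grad = np.ones_like(x)   # gradient arriving from the next layer
lr = 0.01

# One gradient-descent step on alpha during back-propagation
alpha -= lr * np.sum(upstream_grad * prelu_grad_alpha(x))
print(alpha)            # 0.275: alpha is learned, just like a weight
print(prelu(x, alpha))
```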

Note: For writing this blog I have referenced deep learning books and research papers.
