Activation functions and their advantages & disadvantages

Santosh Singh
5 min read · Mar 19, 2023

In this article, I am going to cover the types of activation functions, what they are used for, and their advantages & disadvantages.

Use of Activation function:- An activation function is used in a neural network to restrict, filter & normalise the input to a neuron. Its main purpose is to introduce non-linearity so the network can model non-linear data sets.

Types of Activation functions

a. Linear function (i.e. y = mx + c):- This is not preferred, because a stack of linear functions is still linear, so it cannot handle complex use cases/problems.

b. Binary or Step function:- These functions have a limited output (ON/OFF) and do not help during backpropagation learning, since dy/dx = 0 almost everywhere for these functions.

c. Non-Linear function (i.e. Sigmoid, Tanh, ReLU etc.):- These functions help capture the complexity of the data, and we can also take their derivatives, which lets us adjust the weights of the neural network during backpropagation/learning. A quick sketch of all three types is shown below.
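Here is a rough Python/NumPy sketch of a linear, a step and one non-linear (Sigmoid) function; the function names are just illustrative, not from any particular library:

```python
import numpy as np

def linear(x, m=1.0, c=0.0):
    # Linear: y = m*x + c. Stacking only linear layers still gives a linear model.
    return m * x + c

def step(x):
    # Binary/step: output is 0 or 1 and the derivative is 0 almost everywhere,
    # so backpropagation gets no useful gradient from it.
    return (x >= 0).astype(float)

def sigmoid(x):
    # Non-linear and differentiable, so it can be used for backpropagation.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(linear(x), step(x), sigmoid(x), sep="\n")
```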

List of Some Activation functions

  1. Sigmoid or Logistic activation function :- The output of the Sigmoid function ranges between 0 and 1, and its first derivative ranges between 0 and 0.25.
f(x) = 1 / (1 + exp(-x)) , result ranges between 0 and 1

f'(x) = f(x)(1 - f(x)) , result ranges between 0 and 0.25

This is the output graph of the Sigmoid function; we can see the result ranges between 0 and 1.
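Below is a minimal NumPy sketch of the Sigmoid and its derivative (the function names are illustrative, not from any library):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + exp(-x)); output lies between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); output lies between 0 and 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(x))             # sigmoid(0) = 0.5
print(sigmoid_derivative(x))  # peaks at 0.25 when x = 0
```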

Advantage :-

a) Since the output of the Sigmoid function ranges between 0 and 1, it helps to normalise (scale) the data and introduces non-linearity into the data set.

b) It is differentiable, which helps the NN to learn during backpropagation.

Disadvantage :-

a) The Sigmoid function is not zero-centred (it does not cross the origin), so its output is always +ve and the weight gradients all share the same sign, which makes it harder for the optimiser to learn/adjust the weights of the NN.

b) Vanishing gradient issue: the derivative of the Sigmoid function ranges between 0 and 0.25, so the gradient shrinks layer by layer as it is propagated backwards; this hurts learning and leaves very little impact on the first layers of the network.

c) The Sigmoid function involves the exponential exp(-x), so it requires a lot of computational power, which makes it more expensive.

2. Tanh or Hyperbolic Tangent Act. function :- This is a zero-centred function (its output lies on both sides of the origin).

f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)) , result ranges between -1 and 1

f'(x) = 1 - (f(x))^2 , result ranges between 0 and 1

We can also represent the Tanh function in terms of the Sigmoid, as below:

tanh(x) = 2 * Sigmoid(2x) - 1
Tanh output graph; the result ranges between -1 and 1
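A small NumPy sketch of Tanh, its derivative, and a quick check of the Sigmoid identity above (names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)); output lies between -1 and 1
    return np.tanh(x)

def tanh_derivative(x):
    # f'(x) = 1 - f(x)^2; output lies between 0 and 1
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 7)
# Tanh can be written in terms of Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
print(tanh_derivative(x))
```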

Advantage :-

a) It is a zero-centred function, so it gives gradients in both the +ve and -ve directions.

b) It is effectively a rescaled version of the Sigmoid and performs comparatively better than the Sigmoid function.

Disadvantage:

a) In networks with many layers this function also faces the vanishing gradient issue, since its derivative is still at most 1.

b) As it is based on the same exponential calculations as the Sigmoid, it needs a bit more computational power.

3. ReLU (Rectified Linear Unit) Act. function :-

ReLU function and graph representation
ReLU graph (in blue) and derivative of the ReLU graph (in green)
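The standard ReLU definition is f(x) = max(0, x), with derivative 1 for positive inputs and 0 otherwise. A rough NumPy sketch (function names are mine):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): positive inputs pass through, negative inputs become 0
    return np.maximum(0.0, x)

def relu_derivative(x):
    # f'(x) = 1 for x > 0 and 0 for x < 0 (at x = 0 it is usually taken as 0)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```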

Advantage :-

a. It overcomes the vanishing gradient issue, because the derivative is exactly 1 for positive inputs, so the gradient does not shrink as it passes backwards through the layers.

b. The main advantage of ReLU is that it does not activate all neurons at the same time: if the output of a neuron is zero, that neuron is effectively deactivated, which makes the network sparse and efficient.

Note:- Whether a neuron is activated or deactivated does not depend simply on whether the individual input is +ve or -ve, because the input is multiplied by a weight, which can itself be +ve or -ve.

Disadvantage :-

a. It is not a zero-centred function, so it only gives gradients in the +ve direction.

b. For -ve inputs it outputs nothing (zero), so nothing is learned from them, which is not good for the learning of the neural network.

c. The dead neuron problem is the biggest issue with ReLU: neurons that keep receiving -ve inputs output zero, receive zero gradient, and stop learning in the -ve direction altogether.

4. Leaky ReLU Act. function :-

This is a modified version of the ReLU function with a small constant slope for negative inputs, typically alpha = 0.01: f(x) = x for x > 0 and f(x) = alpha * x otherwise.

Leaky ReLU function and graph representation
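A rough NumPy sketch of Leaky ReLU and its derivative, assuming the usual fixed slope alpha = 0.01 (names are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = x for x > 0, alpha * x otherwise: a small fixed slope for negatives
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # f'(x) = 1 for x > 0, alpha otherwise, so the gradient never becomes exactly 0
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(leaky_relu(x))             # [-0.03 -0.01  0.5   2.  ]
print(leaky_relu_derivative(x))  # [0.01 0.01 1.   1.  ]
```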

Advantage :-

a. It overcomes the dead neuron issue by adding a small constant slope alpha for negative inputs.

b. It is less computationally expensive compared to the Sigmoid and Tanh functions.

Disadvantage : -

a. It does not provide much learning for -ve data/inputs, since the alpha value is a small constant.

5. P ReLU (Parametric ReLu) Act. function :-

This is a modified version of Leaky ReLU in which negative inputs are multiplied by alpha,

where alpha is not a constant but a learnable parameter that can be adjusted during training, which makes this function better than Leaky ReLU.
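A rough NumPy sketch of PReLU, where alpha is treated as a learnable parameter; the helper prelu_grad_alpha is my own illustrative name, not a library function:

```python
import numpy as np

def prelu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is a learnable parameter, not a constant
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # Gradient of the output with respect to alpha: x for x <= 0 and 0 otherwise.
    # This is the quantity backpropagation uses to update alpha during training.
    return np.where(x > 0, 0.0, x)

alpha = 0.25  # initial value; in a real network it is updated by the optimiser
x = np.array([-2.0, -0.5, 1.0, 3.0])
print(prelu(x, alpha))      # [-0.5   -0.125  1.     3.   ]
print(prelu_grad_alpha(x))  # [-2.  -0.5  0.   0. ]
```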

Advantage : -

a. It has a slight advantage over the Leaky ReLU function, as the learnable parameter alpha adapts during training.

b. It is less expensive compared to exponential activation functions.

Disadvantage :-

a. It still does not provide much learning for the -ve part of the data set.

6. ELU (Exponential Linear Unit) Act. function :-

This is a zero-centred function that combines ReLU behaviour for positive inputs with an exponential curve for negative inputs: f(x) = x for x > 0 and f(x) = alpha * (exp(x) - 1) otherwise.

ELU & its derivative graph
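A rough NumPy sketch of ELU and its derivative, assuming the common default alpha = 1.0 (names are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    # f'(x) = 1 for x > 0, alpha * exp(x) otherwise
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))             # negatives saturate towards -alpha instead of being cut to 0
print(elu_derivative(x))
```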

Advantage :-

a. It is a zero-centred function, so we get gradients in both directions, which makes learning better for the NN.

b. It addresses the dying neuron issue, since negative inputs still produce a non-zero output and gradient.

Disadvantage :-

a. It is more computationally expensive for -ve inputs, as an exponential calculation is involved.

Note :- No activation function is universally good or bad; it depends on the use case, the data set and the desired output. Which activation function to use is therefore largely an experimental decision.

Happy learning !!

Santosh Singh

Artificial intelligence | Data scientist | Machine learning engineer | Deep learning & NLP