Activation functions — why is there more than 1?
In neural networks, activation functions add nonlinearity to the output of a neuron, and there are many of them. Before we answer why so many activation functions are used in the units of a feed-forward neural network, we should ask ourselves why we need activation functions in the first place.
Why do we need an activation function?
For the network above, with no activation function, the feed-forward equations are as follows:
h1 = W1x + b1
h2 = W2h1 + b2
output = W3h2 + b3
If we write the output in terms of the input, we get
output = W3(W2(W1x + b1) + b2) + b3
and expanding, in simpler terms it becomes
output = W3W2W1x + W3W2b1 + W3b2 + b3
output = Wbigx + Bbig, where
Wbig = W3W2W1
Bbig = W3W2b1 + W3b2 + b3
As we can conclude, a neural network with many layers but no activation functions is the same as a neural network with just one linear output layer. The activation function adds non-linearity to the network so it can capture complex, non-linear relationships in the dataset.
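The collapse above can be checked numerically. Here is a minimal NumPy sketch (the layer sizes are made up for illustration) showing that three stacked linear layers produce exactly the same output as the single collapsed layer with Wbig and Bbig:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for illustration: 4 -> 5 -> 3 -> 2
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Three linear layers applied in sequence, no activation functions
out_deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# Collapsed single layer: Wbig = W3W2W1, Bbig = W3W2b1 + W3b2 + b3
W_big = W3 @ W2 @ W1
B_big = W3 @ W2 @ b1 + W3 @ b2 + b3
out_single = W_big @ x + B_big

print(np.allclose(out_deep, out_single))  # True
```

The two outputs agree to floating-point precision, confirming that depth without nonlinearity buys nothing.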
Sigmoid
Sigmoid is one of the first functions used in machine learning, in the logistic regression technique, and it is used as an activation function for capturing nonlinearities. It has the mathematical form σ(x) = 1/(1 + e^−x) and its derivative is σ'(x) = σ(x)(1 − σ(x)).
Pros
- Squashes values into the range 0–1, so large positive values map to 1 and large negative values to 0, which gives a nice interpretation as the firing rate of the neuron, from 0 (not firing) to 1 (fully firing)
Cons
- Vanishing gradient problem at the extremes (very large positive or negative inputs), where the derivative is nearly zero
- Outputs are not zero-centered, which means the gradient updates for a layer's weights will be either all positive or all negative, causing zigzagging weight updates
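The saturation problem is easy to see numerically. A small sketch of the sigmoid and its derivative σ(x)(1 − σ(x)), evaluated at a few illustrative points:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

# At the extremes the gradient vanishes, stalling learning
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient at x = ±10 is on the order of 1e-5, which is what "vanishing gradient at the extremes" means in practice.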
So there is another activation function that avoids one of sigmoid's drawbacks: tanh (the hyperbolic tangent).
Hyperbolic Tangent activation function-tanh
The hyperbolic tangent function squashes the input into the range [-1, 1], and it is usually preferred to the sigmoid.
Pros
- Zero-centered, unlike sigmoid
Cons
- Still suffers from the vanishing gradient problem
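A quick sketch with NumPy's built-in tanh illustrates both points: the outputs are symmetric around zero, but the derivative 1 − tanh²(x) still collapses at the extremes:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = np.tanh(x)
grad = 1.0 - y**2  # d/dx tanh(x) = 1 - tanh^2(x)

print(y)     # outputs lie in (-1, 1), centered on 0
print(grad)  # near-zero gradient for large |x|, just like sigmoid
```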
Rectified Linear Unit -ReLU
ReLU computes the function f(x) = max(0, x), and it is one of the most widely used activation functions, as it mitigates the vanishing gradient problem (for positive inputs), so people can train deeper neural networks.
Pros
- Very simple to implement
- Mitigates the vanishing gradient problem, since the gradient is 1 for all positive inputs
Cons
- Typically used only in hidden layers
- Some units can "die" and never activate: if a unit's input is always negative its output is zero and no gradient flows through it, especially if the learning rate is too high (a properly set learning rate reduces this issue)
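A minimal sketch of ReLU and its gradient, which makes both the "gradient of 1 for positive inputs" behavior and the dead-zone for negative inputs visible:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0: no saturation on the
    # positive side, but no gradient at all on the negative side
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```

The zero gradient for all negative inputs is exactly what causes dying units: a neuron stuck on the negative side receives no updates.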
Leaky ReLU
Leaky ReLU solves the problem of dying ReLU by giving the line a small slope (e.g. 0.01) when x is negative. Sometimes this slope is a learnable parameter, as in PReLU, which stands for Parametric ReLU.
Pros
- Solves the “dying ReLU” problem
Cons
- Unlike in PReLU, the slope is fixed and not learned by the network
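A minimal sketch of Leaky ReLU; the slope value 0.01 is just a common illustrative choice, and in PReLU this alpha would be a learnable parameter rather than a constant:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a fixed small slope for x < 0; in PReLU it would be
    # learned during training instead of being a hyperparameter
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))  # negative inputs keep a small nonzero output and gradient
```

Because negative inputs now produce a small nonzero slope, gradient always flows and units cannot permanently die.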
To sum up, the reason for having more than one activation function is that each one tries to overcome the shortcomings of the others.