An activation function, also called a transfer function, plays a vital role in neural networks. Its aim is to introduce a non-linear transformation so the network can learn the complex underlying patterns in the data. It should be differentiable and computationally inexpensive. Ideally, its output should also be zero-centered, which helps keep the calculated gradients moving in consistent directions during training.
The activation function is represented as f(x), where x = (input * weights) + bias. Now, let's look into the commonly used activation functions.
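As a minimal sketch (with made-up numbers), the pre-activation x = (input * weights) + bias can be computed like this, with ReLU standing in for f:

```python
import numpy as np

# Hypothetical inputs, weights, and bias for illustration only
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.2, -0.1])
bias = 0.1

z = np.dot(inputs, weights) + bias   # x = (input * weights) + bias
output = np.maximum(0.0, z)          # f(x), here ReLU as an example
```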
Sigmoid Function
The sigmoid function can be defined as f(x) = 1 / (1 + e^(-x)).
- It scales the value between 0 and 1.
- It has an S-shaped curve.
- It is centered on 0.5, so its output is not zero-centered.
- It is differentiable and monotonic.
- It is also known as the logistic function.
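A minimal Python sketch of the sigmoid, f(x) = 1 / (1 + e^(-x)):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```

At x = 0 it returns 0.5, the midpoint of its output range.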
Tanh Function
The tanh function can be defined as f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
- It scales values between -1 and +1.
- It also resembles an S-shaped curve.
- It is centered on 0.
- It is differentiable and monotonic.
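A direct sketch of the definition (in practice np.tanh computes the same thing with better numerical stability):

```python
import numpy as np

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)): zero-centered, range (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
```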
Rectified Linear Unit Function
The ReLU function is expressed as f(x) = max(0, x).
- It is a piecewise function.
- It returns zero when the value of x is less than zero; otherwise it returns x itself.
- Its snag of being zero for all negative values means the gradient there is also zero, so affected neurons can stop learning; this is called the Dying ReLU problem.
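A one-line sketch of ReLU, f(x) = max(0, x):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): passes positive values through, zeroes the rest
    return np.maximum(0.0, x)
```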
Leaky ReLU Function
This function is expressed as f(x) = x for x > 0, and f(x) = αx for x ≤ 0.
- It is a variant of the ReLU function.
- It has a small slope for all negative values (α).
- α is usually set to 0.01.
- Parametric ReLU Function: Here, α is treated as a parameter of the neural network, and the network learns its optimal value during training.
- Randomized ReLU Function: Here, α is set to a random value.
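A minimal sketch of Leaky ReLU with the usual default slope (in Parametric ReLU, alpha would be a learned parameter rather than a constant):

```python
def leaky_relu(x, alpha=0.01):
    # f(x) = x for x > 0, alpha * x otherwise; alpha = 0.01 is the common default
    return x if x > 0 else alpha * x
```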
Exponential Linear Unit Function
This function can be expressed as f(x) = x for x > 0, and f(x) = α(e^x - 1) for x ≤ 0.
- It is similar to the Leaky ReLU function.
- It has a small slope for negative values.
- Its mean activation is closer to zero, which can speed up learning.
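A minimal sketch of ELU, f(x) = x for x > 0 and α(e^x - 1) otherwise:

```python
import math

def elu(x, alpha=1.0):
    # f(x) = x for x > 0, alpha * (e^x - 1) otherwise
    # unlike Leaky ReLU, the negative side saturates smoothly toward -alpha
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```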
Swish Function
The swish function is expressed as f(x) = x · σ(x).
- It was introduced by Google.
- It has been reported to perform better than ReLU on many deep models.
- It is non-monotonic.
- σ(x) is the sigmoid function.
- It can be reparametrized with a learnable parameter β as f(x) = x · σ(βx).
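A minimal sketch of swish, f(x) = x · σ(βx), using the identity x · σ(βx) = x / (1 + e^(-βx)):

```python
import math

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x); beta = 1 gives the basic swish
    return x / (1.0 + math.exp(-beta * x))
```

For large positive x the output approaches x, while for large negative x it approaches zero, and it dips slightly below zero in between, which is why it is non-monotonic.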
Softmax Function
This function can be defined as softmax(x_i) = e^(x_i) / Σ_j e^(x_j).
- It is a generalization of the sigmoid function.
- It is mostly applied to the final layer of the network in multi-class classification tasks.
- The sum of softmax values is always one.
- It converts its inputs into probabilities.
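A minimal sketch of softmax; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(x):
    # shift by the max for numerical stability, then normalize the exponentials
    e = np.exp(x - np.max(x))
    return e / e.sum()
```

For example, softmax of [0, 0] is [0.5, 0.5], and the outputs always sum to one.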