Activation Functions in Neural Networks
What is an Activation Function?
An activation function decides whether a neuron's input is important to the network's output prediction. Its main role is to transform the node's summed input into an output value that is fed to the next layer.
Why is an Activation Function needed?
An activation function adds non-linearity to the neural network. Without one, each neuron performs only a linear transformation of its inputs using the weights and bias, so the whole model reduces to a linear regression model and cannot solve complex problems.
What are the different activation functions?
Linear activation function
The linear activation function, also called the identity function or "no activation", multiplies the weighted sum of inputs by 1. It therefore does not transform the input: the output is identical to the input.
Limitations
1. Backpropagation is not useful, as the derivative of the function is a constant that carries no information about the input.
2. All layers of the network collapse into one. Even with 100 layers, the last layer is a linear function of the first, essentially turning the network into a single layer.
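The collapse described above can be sketched in a few lines of NumPy. The layer sizes here are arbitrary; the point is that two linear layers compose into one matrix:

```python
import numpy as np

# Hypothetical 2-layer network with linear (identity) activations.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # layer 1 weights
W2 = rng.normal(size=(2, 4))  # layer 2 weights
x = rng.normal(size=3)        # an input vector

# Forward pass with the identity activation f(z) = z
h = W1 @ x
y = W2 @ h

# The same mapping, computed by a single collapsed layer
W_collapsed = W2 @ W1
y_single = W_collapsed @ x

# The two-layer network and the one-layer network agree exactly
assert np.allclose(y, y_single)
```

Biases behave the same way: they fold into a single effective bias, so depth adds no expressive power without a non-linearity.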
Binary step function
The binary step function outputs either 0 or 1 based on a threshold value: if the weighted sum is greater than the threshold, it outputs 1, otherwise 0.
Limitations
1. Cannot be used for multi-class classification problems, since it produces only two output values
2. Hinders backpropagation, as the gradient is zero everywhere (and undefined at the threshold)
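A minimal sketch of the step rule, using the conventional threshold of 0 as the default:

```python
def binary_step(x, threshold=0.0):
    # Fires (outputs 1) only when the weighted sum exceeds the threshold.
    return 1 if x > threshold else 0

# The output jumps abruptly between the two values,
# so the derivative is 0 everywhere it is defined.
print(binary_step(2.3))   # 1
print(binary_step(-0.7))  # 0
```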
Logistic Activation function
The logistic (sigmoid) activation function transforms the input into a value between 0 and 1. The larger the input (more positive), the closer the output is to 1; the smaller the input (more negative), the closer the output is to 0.
Advantages
1. Well suited to models where the output is a probability, since its range is (0, 1).
2. Prevents jumps in output values since the function has a smooth gradient.
Limitations
1. Suffers from the vanishing gradient problem: for strongly positive or negative inputs the gradient approaches zero, so the network cannot backpropagate useful information
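The vanishing gradient is easy to see numerically. A sketch of the logistic function and its derivative, which peaks at 0.25 and shrinks toward zero for large inputs:

```python
import numpy as np

def logistic(x):
    # Squashes any real input into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    # Derivative: sigma(x) * (1 - sigma(x)); maximum 0.25 at x = 0.
    s = logistic(x)
    return s * (1.0 - s)

print(logistic_grad(0.0))   # 0.25, the largest the gradient ever gets
print(logistic_grad(10.0))  # ~4.5e-05, effectively vanished
```

Stacking several sigmoid layers multiplies these small derivatives together, which is why deep sigmoid networks train slowly.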
Tanh Function
The tanh function, or hyperbolic tangent function, is similar to the logistic function, the main difference being that its output lies between -1 and 1. The larger the input (more positive), the closer the output is to 1; the smaller the input (more negative), the closer the output is to -1.
Advantages
1. Output can be mapped as strongly negative, neutral and strongly positive
2. Has a maximum gradient of 1, four times that of the logistic function, giving rise to bigger learning steps when training.
3. Symmetric around 0, leading to faster convergence.
Limitations
1. Suffers from the vanishing gradient problem: for strongly positive or negative inputs the gradient approaches zero, so the network cannot backpropagate useful information
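The gradient claim can be checked directly: tanh's derivative is 1 − tanh²(x), which peaks at 1 (versus 0.25 for the logistic function) but still vanishes for large inputs:

```python
import numpy as np

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(0.0))   # 1.0, four times the logistic's peak of 0.25
print(tanh_grad(10.0))  # ~8e-09, vanished just like the sigmoid's
```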
Rectified Linear Unit (ReLU)
The Rectified Linear Unit outputs 0 if the input is negative and returns the input itself if it is positive. Although it looks linear, ReLU is only piecewise linear, which is enough non-linearity for complex relationships in the data to be learned.
Advantages
1. Doesn't activate neurons with negative inputs, making activations sparse and computation efficient.
2. Tends to show better convergence
3. Faster to compute than some other activation functions like logistic function
Limitations
1. Activations can blow up, since the output is unbounded for positive inputs.
2. Dying ReLU problem: if too many inputs fall below zero, most neurons in a layer output zero, creating dead neurons whose weights and biases are never updated.
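A sketch of ReLU and its gradient, showing why neurons stuck in the negative region stop learning:

```python
import numpy as np

def relu(x):
    # Passes positive inputs through unchanged, zeroes out negatives.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative ones:
    # a neuron whose input is always negative receives no weight updates.
    return np.where(x > 0, 1.0, 0.0)

print(relu(3.0))        # 3.0
print(relu(-2.0))       # 0.0
print(relu_grad(-2.0))  # 0.0 -- a "dead" region with no gradient signal
```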
Leaky ReLU function
Leaky ReLU is an improved version of ReLU designed to solve the dying ReLU problem. Instead of mapping negative values to 0 as ReLU does, Leaky ReLU multiplies them by a small, non-zero constant parameter a (normally 0.01).
Advantages
1. Prevents the dying ReLU problem by allowing a small gradient for negative inputs
2. Faster to compute than some other activation functions like logistic function
Limitations
1. Sensitive to the parameter a: a value that is too small may result in slow convergence, while one that is too large may cause unstable behaviour
2. Prediction for negative input values may not be consistent
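The fix is a one-line change from ReLU; negative inputs keep a small slope a, so the gradient there is a rather than 0:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # Positive inputs pass through; negatives are scaled by the
    # small slope a instead of being zeroed out.
    return np.where(x > 0, x, a * x)

print(leaky_relu(5.0))   # 5.0, identical to ReLU for positive inputs
print(leaky_relu(-4.0))  # -0.04, a small but non-zero output
```

The default a=0.01 here is the commonly cited value; in practice it is a tunable hyperparameter, per the sensitivity noted above.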
How to choose the right activation function?
Choosing the right activation function is an important decision in the design of a neural network, as it can significantly impact the network's performance. Some general guidelines for choosing an activation function are:
- The characteristics of the data and the requirements of the task: The logistic function may be more suitable for tasks that involve binary classification, while the ReLU function may be more suitable for tasks that involve large, positive input values.
- The computational complexity: Activation functions with higher computational complexity may require more time and resources to compute, which can impact the overall performance of the network.
- The type of layer: ReLU activation function is mostly used in the hidden layers whereas Logistic and Tanh functions are mostly used in output layers.
- Trial and error: It is often a good idea to try out different activation functions and compare their performance on the specific task at hand. This can help to identify the best activation function for the task.