Neural Network Activation Functions
This post will help you understand the most common activation functions used in machine learning, including deep learning.
Prerequisite: It is advisable to know a few machine learning algorithms and have a basic understanding of artificial neural networks; no deep dive into the mathematics is required.
What is an activation function in a neural network?
An activation function helps decide whether a neuron should fire or not, and if it fires, what the strength of its signal should be.
It is the mechanism by which neurons process information and pass it through the neural network.
Why do we need an activation function in a neural network?
In a neural network, z is the sum of the products of the inputs and their weights, plus the bias. The equation for z looks very similar to a linear equation, and its value can range from -infinity to +infinity.
If the neuron's value can range from -infinity to +infinity, we cannot decide whether to fire the neuron or not. This is where an activation function helps.
If z stays linear in nature, the network cannot solve complex problems. This is another reason why we use activation functions.
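To make the pre-activation value concrete, here is a minimal sketch of computing z for a single neuron; the input, weight, and bias values are made up purely for illustration:

```python
import numpy as np

# z = w1*x1 + w2*x2 + ... + wn*xn + b  (weighted sum of inputs plus bias)
x = np.array([0.5, -1.2, 3.0])   # example inputs (hypothetical values)
w = np.array([0.4, 0.7, -0.2])   # example weights (hypothetical values)
b = 0.1                          # bias

z = np.dot(w, x) + b
print(z)  # z is unbounded: it can be any real number
```

Because z is unbounded, every activation function below is essentially a rule for turning this raw value into a decision or a bounded signal.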
There are different types of activation functions, and some very common and popular ones are:
- Threshold or binary step
- Sigmoid
- Softmax
- Tanh or hyperbolic tangent
- ReLU and
- Leaky ReLU
Why do we need so many different activation functions, and how do I decide which one to use?
Let's go over each activation function and understand where it is best used and why. This will help us decide which one to pick in different scenarios.
Threshold or Binary Step Function
This is the simplest activation function and can be thought of as a yes-or-no function.
If the value of z is above the threshold, activation is set to 1 (yes) and the neuron fires.
If the value of z is below the threshold, activation is set to 0 (no) and the neuron does not fire.
It is useful for binary classification.
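A minimal sketch of the step function; the source does not specify what happens exactly at the threshold, so this version assumes ties go to 1:

```python
def binary_step(z, threshold=0.0):
    # fire (1) if z reaches the threshold, otherwise don't fire (0);
    # the tie-at-threshold behavior is an assumption, not fixed by convention
    return 1 if z >= threshold else 0

print(binary_step(2.5))   # a positive z fires the neuron
print(binary_step(-0.3))  # a negative z does not
```

Note that the output carries no information about confidence; that is exactly the gap sigmoid fills next.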
Sigmoid Activation Function
The sigmoid function is a smooth, nonlinear, S-shaped curve with no kinks.
It predicts the probability of an output, and hence is used in the output layer of a neural network and in logistic regression.
Since probability ranges from 0 to 1, the sigmoid's output also lies between 0 and 1.
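A minimal NumPy sketch of the sigmoid, showing how it squashes any real z into (0, 1):

```python
import numpy as np

def sigmoid(z):
    # maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # exactly 0.5 at z = 0
print(sigmoid(10.0))   # large positive z -> close to 1
print(sigmoid(-10.0))  # large negative z -> close to 0
```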
But what if we want to classify more than a yes or no? What if I want to predict multiple classes, like weather that can be sunny, rainy, or cloudy?
Softmax activation helps with multiclass classification.
Softmax Activation Function
The sigmoid activation function is used for two-class (binary) classification, whereas softmax is used for multiclass classification and is a generalization of the sigmoid function.
Softmax gives us the probability of each class, and these probabilities sum to 1. When the probability of one class increases, the probabilities of the other classes decrease, and the class with the highest probability is the output class.
Example: when predicting the weather, we might get output probabilities of 0.68 for sunny, 0.22 for cloudy, and 0.10 for rainy. We take the output with the maximum probability as our final output, so in this case we predict sunny weather.
Softmax calculates the probability of each target class over the probability of all possible target classes.
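A minimal softmax sketch; the raw scores below are hypothetical logits for the sunny/cloudy/rainy example, not values from a trained model:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability, then normalize the exponentials
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for sunny, cloudy, rainy
probs = softmax(scores)
print(probs)           # per-class probabilities that sum to 1
print(probs.argmax())  # index of the highest-probability class
```

Subtracting the maximum before exponentiating does not change the result but prevents overflow for large scores.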
Hyperbolic Tangent or Tanh Activation Function
For the hyperbolic tangent function, the output is centered at 0 and its range is between -1 and +1.
It looks very similar to sigmoid; in fact, tanh is a scaled sigmoid function. Gradients are stronger (steeper) for tanh than for sigmoid, and hence tanh is often preferred.
An advantage of tanh is that negative inputs are mapped to strongly negative outputs and zero inputs are mapped near zero, which does not happen with sigmoid since its range is between 0 and 1.
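The "scaled sigmoid" claim can be checked numerically: tanh(z) equals 2·sigmoid(2z) − 1. A quick sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

print(np.tanh(0.0))   # zero input maps to exactly 0
print(np.tanh(-2.0))  # negative input maps strongly negative
```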
Rectified Linear Unit (ReLU)
ReLU is nonlinear in nature, which means its slope is not constant. ReLU is nonlinear only around zero; everywhere else the slope is either 0 or 1, so it has limited nonlinearity.
Its range is from 0 to infinity.
ReLU gives an output equal to the input when z is positive. When z is zero or negative, it outputs 0. Thus ReLU shuts off the neuron when the input is zero or below.
Most deep learning models use ReLU, but typically only in hidden layers, as it induces sparsity. Sparsity here refers to the proportion of activations that are zero.
When the hidden layers are exposed to a range of input values, the rectifier function produces many zeros, so fewer neurons are activated, which means fewer interactions across the neural network.
ReLU turns neurons on or off more aggressively than sigmoid or tanh.
The challenge with ReLU is that negative values become zero, which can decrease the model's ability to train properly. To solve this problem we have Leaky ReLU.
In Leaky ReLU we introduce a small slope for negative inputs so the function never has a zero slope there. This helps speed up training.
The range of Leaky ReLU is from -infinity to +infinity.
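Both variants are one-liners in NumPy. A minimal sketch, with the leaky slope `alpha` set to 0.01 as a common but arbitrary choice:

```python
import numpy as np

def relu(z):
    # passes positive values through unchanged, zeroes out everything else
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # same as ReLU for positive z, but keeps a small slope (alpha) for negative z
    return np.where(z > 0, z, alpha * z)

print(relu(np.array([-2.0, 0.0, 3.0])))        # negatives and zero become 0
print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # negatives shrink but stay nonzero
```

Because `leaky_relu` never outputs a flat zero for negative inputs, its gradient there is `alpha` rather than 0, which is what keeps those neurons trainable.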
Hopefully this gives you a good understanding of the different activation functions and when to use each one.