Activation Functions and Their Types
Before diving into activation functions, let's talk a little bit about Artificial Neural Networks (ANNs) and why we use activation functions in them.
Q. What is an Artificial Neural Network?
Ans. An ANN is a very powerful Machine Learning technique that tries to mimic the human brain and how it functions. An ANN consists of multiple layers, with each layer containing multiple neurons.


As you can see in the image above, an activation function is used with each and every neuron in an ANN. The neuron takes the sum of the products of weights and inputs, plus a bias, and feeds it to the activation function; the activation function then produces the output, which is used as an input by the neurons in the next layer of the stack. In the example above, w0x0 + w1x1 + w2x2 + b is given as the input to the activation function, and the activation function produces an output.
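To make this concrete, here is a minimal NumPy sketch of a single neuron: it computes the weighted sum w0x0 + w1x1 + w2x2 + b and passes it through an activation function. The sigmoid is used here only as an example, and the input, weight, and bias values are made up.

```python
import numpy as np

def neuron_output(x, w, b, activation):
    z = np.dot(w, x) + b      # weighted sum: w0*x0 + w1*x1 + w2*x2 + b
    return activation(z)      # activation function produces the neuron's output

x = np.array([0.5, -1.2, 3.0])   # example inputs x0, x1, x2
w = np.array([0.4, 0.7, -0.2])   # example weights w0, w1, w2
b = 0.1                          # example bias

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron_output(x, w, b, sigmoid))
```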

Q. What does the activation function do with the input given to it?
Ans. The activation function adds non-linearity to the linear combination given to it, which helps the network learn the relationship between the training data and its labels. Intuitively, it lets the network transform the inputs into a richer, non-linear representation in which that relationship can be found.
Q. Why do we use an activation function?
Ans. If we do not apply an activation function, the output signal is simply a linear function, i.e. a polynomial of degree one. Linear equations are easy to solve, but they are limited in complexity and have less power to learn complex functional mappings from data. A neural network without an activation function is simply a linear regression model, which has limited power and does not perform well most of the time. We want our neural network to learn and compute not just a linear function but something more complicated than that. This is why we use artificial neural network techniques such as deep learning on complicated, high-dimensional, non-linear big datasets: the model has many hidden layers and a complicated architecture, which helps us make sense of and extract knowledge from such data.
Q. Why is non-linearity required?
Ans. Non-linear functions are functions with degree greater than one, and they produce a curve when plotted. We want a neural network to learn and represent almost anything, including complex functions that map inputs to outputs. Applying a non-linear activation function f(x) makes the network more powerful and gives it the ability to learn complex, arbitrary, non-linear functional mappings between inputs and outputs. Without it, stacking linear layers still gives a linear model, as the sketch below shows.
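This collapse to a linear model can be checked directly: stacking linear layers with no activation in between is mathematically the same as one linear layer. A small NumPy sketch (with made-up weight shapes and random values) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two "layers" with no activation in between...
out_two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to one linear layer with combined weights.
W_combined, b_combined = W2 @ W1, W2 @ b1 + b2
out_one_layer = W_combined @ x + b_combined

print(np.allclose(out_two_layers, out_one_layer))  # True
```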
Non-Linear Activation Functions
There are various types of non-linear activation functions; some of them are:
- Sigmoid Activation Function
- tanh — Hyperbolic tangent function
- ReLU — Rectified Linear Unit
- Softmax
Sigmoid Activation Function
The sigmoid activation function is of the form f(x) = 1 / (1 + exp(-x)). Its output ranges between 0 and 1, and it is an S-shaped curve. It is mainly used in models where we need to predict a probability as the output, since probabilities exist in the range of 0 to 1.
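As a quick illustration, here is the sigmoid formula written out as a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + exp(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # values squashed into (0, 1)
```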

Problems with the Sigmoid Function
1. The gradient saturates: for very large or very small inputs the curve is almost flat, so the gradient is close to zero and learning slows down (the vanishing gradient problem).
2. The output is not zero-centred, which causes the gradient updates to move too far in different directions and makes optimisation harder.
tanh — Hyperbolic Tangent Function
The tanh activation function is of the form f(x) = (2 / (1 + exp(-2x))) - 1. Its output is zero-centred, since its range is between -1 and 1, and it has an S-shaped curve. Optimisation with this activation function is easier than with the sigmoid function, so in practice it is preferred over the sigmoid function.
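Here is the same formula as a small NumPy sketch; it matches NumPy's built-in np.tanh:

```python
import numpy as np

def tanh(x):
    # (2 / (1 + exp(-2x))) - 1, applied element-wise
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))       # outputs lie in (-1, 1) and are zero-centred
print(np.tanh(x))    # same values from the built-in implementation
```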



Benefit of the tanh function over the sigmoid activation function
The optimisation problem of the sigmoid function (its non-zero-centred output) is solved by the tanh function.
Problem with the tanh function
Like the sigmoid, tanh saturates for large positive or negative inputs, so it still suffers from the vanishing gradient problem.
ReLU — Rectified Linear Unit
The ReLU activation function is of the form f(x) = max(0, x). It outputs 0 if x < 0 and outputs x if x ≥ 0, so its range is 0 to +∞.

This function passes positive values through unchanged during forward propagation and sets negative values to 0, as shown in the graph above.
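A minimal NumPy sketch of ReLU:

```python
import numpy as np

def relu(x):
    # max(0, x): negative inputs become 0, positive inputs pass through
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [0. 0. 0. 2.]
```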
Benefit of ReLU
It mitigates the vanishing gradient problem: for positive inputs the gradient is a constant 1, so it does not saturate the way sigmoid and tanh do.
The drawbacks of ReLU
Because the gradient is zero for negative values, a neuron that keeps receiving negative inputs stops updating during back-propagation and never converges towards the minimum, resulting in a dead neuron. This problem can be solved by modifying ReLU into Leaky ReLU, in which, instead of being 0 for all negative values, the function has a small constant slope (less than 1), as in the sketch below.
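A minimal NumPy sketch of Leaky ReLU (the slope 0.01 below is just a common illustrative choice, not a fixed value):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # negative inputs keep a small constant slope instead of becoming 0
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [-0.03  -0.005  0.  2.]
```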
Softmax
This activation function converts the inputs to a neuron into values in the range of 0 to 1, in such a way that the total sum of the outputs is equal to 1. It is mostly used for multi-class classification, where it gives the output as the probability of occurrence of each class, and it is typically used in the final layer of the neural network.
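A small NumPy sketch of softmax (subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    # exponentiate the inputs and normalise so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores from the final layer
probs = softmax(scores)
print(probs, probs.sum())            # class probabilities that sum to 1
```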


Conclusion
The activation function is probably one of the strongest weapons in a neural network's arsenal when it comes to learning capability.
