ACTIVATION FUNCTION IN DEEP LEARNING
Fig: Activation function for neural network
Artificial Neural Network (ANN):-
An artificial neural network is a powerful machine learning technique loosely modeled on the human brain and how it works. A neuron operates by receiving signals from other neurons through connections called synapses. Signals are fired across synapses from one neuron to another, which is what allows humans to learn and remember things in daily life.
Use of Activation function:-
Activation functions play a significant role in learning complex functional mappings between the input and output variables of an artificial neural network. They introduce non-linear properties into the network. In an ANN we generally compute the sum of products of the inputs (A) and their corresponding weights (W), apply an activation function f(x) to that sum to get the output of the layer, and pass it as an input to the next layer.
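As a rough illustration of the computation described above, the sketch below forms the weighted sum for a single neuron and applies an activation to it. The function name `neuron_output`, the optional bias term, the choice of sigmoid as the example activation, and the numbers are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs followed by an activation function."""
    z = sum(a * w for a, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical inputs A and weights W for one neuron.
A = [0.5, -1.0, 2.0]
W = [0.4, 0.3, 0.1]

# Weighted sum is 0.2 - 0.3 + 0.2 = 0.1; the activation squashes it into (0, 1).
out = neuron_output(A, W)
print(out)  # this value would be passed as an input to the next layer
```

A full layer repeats this for every neuron, each with its own weight vector.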
Need of Activation Function:-
Neural networks have to implement complex mapping functions, so they need non-linear activation functions to provide the non-linearity that enables them to approximate a very wide class of functions. A neuron without an activation function is equivalent to a neuron with a linear activation function given by f(x) = x.
Such an activation function adds no non-linearity, so the whole network would be equivalent to a single linear function. That is to say, a multi-layer linear network is equivalent to one linear node.
Thus it makes no sense to build a neural network with linear activation functions; it is better to have a single node do the job. To make matters worse, a single linear node cannot deal with data that is not linearly separable. That means that no matter how large a multi-layer linear network is, it can never solve the classic XOR problem or any other non-linear problem.
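The claim that stacked linear layers collapse into a single linear map can be checked numerically. The sketch below uses two small layers with hand-picked weights (all values and helper names are hypothetical) and compares them with one layer whose weights are W = W2 W1 and whose bias is b = W2 b1 + b2.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Two linear layers with no activation function (hypothetical weights).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 3 hidden units, 2 inputs
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, -1.0, 0.5]]                      # 1 output, 3 hidden units
b2 = [0.5]

def two_layer_linear(x):
    h = [z + b for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, h), b2)]

# The equivalent single linear layer.
W = matmul(W2, W1)
b = [z + c for z, c in zip(matvec(W2, b1), b2)]

def single_linear(x):
    return [z + c for z, c in zip(matvec(W, x), b)]

x = [1.0, 2.0]
print(two_layer_linear(x), single_linear(x))  # both give [16.0]
```

The same argument extends to any number of layers, since a product of matrices is still a single matrix.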
Activation functions are also decision functions, and the ideal decision function is the Heaviside step function. It is not differentiable at the step, and its derivative is zero everywhere else, so smoother versions such as the sigmoid function have been used instead. They are differentiable, which makes them well suited to gradient-based optimization algorithms.
Why do we need non-linearity?
We need a neural network model to learn and represent almost any arbitrarily complex function that maps inputs to outputs. With non-linear activations and enough hidden units, neural networks can approximate such functions to a chosen accuracy, which is why they are also called universal function approximators.
It all comes down to this: we apply an activation function f(x) to make the network more powerful and flexible, so that it can learn complicated datasets and represent complex non-linear functional mappings between inputs and outputs. Using a non-linear activation, we can generate non-linear mappings from inputs to outputs.
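To make the point concrete, here is a minimal sketch of a non-linear mapping that a single linear node cannot represent: the XOR of two binary inputs. The tiny network below has two hidden units with a max(0, x) activation, and its weights are picked by hand for illustration rather than learned. The names `relu` and `xor_net` are hypothetical.

```python
def relu(x):
    """Non-linear activation: max(0, x)."""
    return max(0.0, x)

def xor_net(x1, x2):
    """Two hidden units with a non-linear activation, then a linear output."""
    h1 = relu(x1 + x2)          # active when at least one input is 1
    h2 = relu(x1 + x2 - 1.0)    # active only when both inputs are 1
    return h1 - 2.0 * h2        # linear combination of the hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

Without the activation, `h1 - 2.0 * h2` would reduce to a linear function of `x1 + x2`, which cannot be 0 for the inputs (0, 0) and (1, 1) and 1 for (0, 1) and (1, 0) at the same time.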
Commonly used activation functions:-
1. Sigmoid
2. Tanh
3. Rectified linear units (ReLU)
Sigmoid : It is an activation function of the form f(x) = 1 / (1 + exp(-x)). Its range is between 0 and 1.
Sigmoid is used in neural networks as follows:-
1. It is an activation function that transforms linear inputs into non-linear outputs.
2. It bounds the output between 0 and 1 so that it can be interpreted as a probability.
3. It makes computation easier than arbitrary activation functions.
However, it suffers from the vanishing gradient problem.
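A small sketch can show where the vanishing gradient comes from. The derivative of the sigmoid is f(x) * (1 - f(x)), which is at most 0.25 and shrinks towards zero for large |x|. Backpropagation multiplies such factors layer by layer, so the product can become very small. The helper names and the 10-layer example are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))

# Best case for a chain of 10 sigmoid layers: every factor is 0.25.
chained = sigmoid_grad(0.0) ** 10
print(chained)  # already below 1e-6
```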
Tanh function: tanh(x) = (exp(2x) - 1) / (exp(2x) + 1)
As we can see in the above figure, its output is zero-centered because its range is between -1 and 1, i.e. -1 < output < 1. This makes optimization easier, so in practice it is usually preferred over the sigmoid function. It still suffers from the vanishing gradient problem, however.
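The sketch below evaluates tanh from the usual definition tanh(x) = (exp(2x) - 1) / (exp(2x) + 1) and compares it with Python's built-in `math.tanh`. It also shows that tanh is a rescaled sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, which is why it keeps the same saturating shape while being centered on zero. The helper names are hypothetical.

```python
import math

def tanh_from_formula(x):
    """tanh via exp(2x); fine for moderate x, but exp overflows for very large x."""
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    direct = tanh_from_formula(x)
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print(x, direct, math.tanh(x), rescaled)

# Zero-centered: tanh(0) is 0, whereas sigmoid(0) is 0.5.
print(tanh_from_formula(0.0), sigmoid(0.0))
```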
ReLU (Rectified Linear Unit): It has been reported to give about a 6 times improvement in convergence over the tanh and sigmoid functions. It is simply R(x) = max(0, x), i.e. if x < 0, R(x) = 0 and if x >= 0, R(x) = x. From its mathematical form we can see that it is very simple and cheap to compute. Because its gradient is 1 for every positive input, it does not saturate there, which mitigates the vanishing gradient problem.
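A minimal sketch of R(x) = max(0, x) and its gradient follows. The gradient is 1 for positive inputs and 0 for negative inputs. At exactly x = 0 the derivative is undefined, and the value 0 used here is a convention chosen for this example.

```python
def relu(x):
    """R(x) = max(0, x)."""
    return x if x >= 0 else 0.0

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 for x < 0 (0 at x == 0 by convention here)."""
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, relu(x), relu_grad(x))
```

Unlike the sigmoid, the gradient does not shrink as x grows, so positive activations pass gradients back unchanged.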
Most deep learning models use ReLU as the activation function in their hidden layers. One limitation is that it should only be used within the hidden layers of a neural network model.
Hence, for the output layer of a classification problem we generally use the softmax function to compute the probability of each class.
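For reference, softmax maps a list of raw scores to probabilities via exp(z_i) / sum_j exp(z_j). The sketch below subtracts the maximum score before exponentiating, a common trick that avoids overflow without changing the result. The function name and the example scores are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores for a 3-class problem.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 up to floating-point rounding
```

The class with the largest score gets the largest probability, and the outputs can be read as the predicted class distribution.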
Another problem with ReLU is that some units can be fragile during training and can die. A large gradient can cause a weight update after which the unit never activates on any data point again. Its output and gradient are then zero, so it stops learning.
To fix the problem of dying neurons, a modification called Leaky ReLU was introduced. It gives the function a small slope for negative inputs to keep the updates alive.
Fig:-Leaky Relu function
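A minimal sketch of Leaky ReLU is given below. The slope value `alpha = 0.01` is a commonly used default and is an assumption here, as are the function names. For negative inputs the gradient is `alpha` rather than 0, so a unit that ends up in the negative region can still receive updates.

```python
def leaky_relu(x, alpha=0.01):
    """x for x >= 0, alpha * x otherwise (alpha is a small positive slope)."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x > 0, alpha otherwise, so it is never exactly zero."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, -2.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```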