Activation Functions in Neural Networks
What are Activation Functions?
Activation functions are mathematical equations that determine the output of a neural network's neurons. The function is attached to each neuron in the network: after the neuron calculates a weighted sum of its inputs (xi) using the weights (wi) and adds a bias, the activation function determines whether the neuron should be activated (“fired”), based on whether that neuron's input is relevant for the model's prediction.
An important use of any activation function is to introduce non-linear properties into the network.
All the inputs xi are multiplied by the weights wi assigned to each link and summed together along with a bias b, giving z = w1*x1 + w2*x2 + … + wn*xn + b.
Types of Activation Functions:
· Linear Function
· Non-Linear Function
Linear Functions
A linear activation function takes the form y = mx + c. It takes the inputs (xi), multiplied by the weights (wi) for each neuron, and produces an output proportional to the input. The output is not confined to any range. Because the function follows a linear pattern, it is used in regression problems.
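As a tiny illustration, a linear activation simply passes a scaled version of the input through; the slope and intercept below are illustrative assumptions.

def linear(x, m=1.0, c=0.0):
    # output is proportional to the input and unbounded
    return m * x + c

print(linear(2.5))   # 2.5: the output simply scales with the input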
Non-Linear Functions
Non-linear activation functions are the most widely used activation functions. They make it easy for the model to generalize or adapt to a variety of data and to differentiate between outputs. They allow the model to create complex mappings between the network's inputs and outputs, which is essential for neurons to learn and to solve complex business problems.
Types of Non-Linear Functions:
· Sigmoid
· TanH
· ReLU
· Leaky ReLU
· ELU
· PReLU
· Swish
Sigmoid:
· The output of the sigmoid function, y = 1 / (1 + e^-x), is always between 0 and 1.
· Used in hidden layers and in the output layer of shallow neural networks.
· The derivative of the sigmoid function is always between 0 and 0.25.
· Suffers from the vanishing gradient problem.
· The output is not zero-centered, which makes convergence harder, and the exponential makes it computationally heavy.
Used in classification problems.
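A minimal NumPy sketch of the sigmoid and its derivative (the input values are illustrative) shows why the gradient can vanish: the derivative never exceeds 0.25.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 when x = 0

x = np.array([-5.0, 0.0, 5.0])      # illustrative inputs
print(sigmoid(x))                   # outputs squashed into (0, 1)
print(sigmoid_derivative(x))        # gradients between 0 and 0.25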
TanH:
· Tanh transforms the output y to lie between -1 and 1.
· The derivative of the tanh function is always between 0 and 1.
· The output is zero-centered.
· Convergence is easier.
· Faces the vanishing gradient problem in deep neural networks.
Tanh is generally preferred over sigmoid.
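A similar sketch for tanh and its derivative (NumPy, illustrative inputs):

import numpy as np

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2    # between 0 and 1, peaks at 1 when x = 0

x = np.array([-2.0, 0.0, 2.0])      # illustrative inputs
print(np.tanh(x))                   # zero-centered outputs in (-1, 1)
print(tanh_derivative(x))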
ReLU: (Rectified Linear Unit)
ReLU is a non-linear activation function that has gained wide popularity.
· ReLU outputs max(0, x): positive values pass through unchanged, and negative values are transformed to 0.
· The derivative of the function is either 0 or 1.
· Helps solve the vanishing gradient problem.
· It is computationally faster and converges easily.
· Faces the dead neuron problem. (During backpropagation, if a neuron's input is always negative its output is 0 and its gradient is 0, so w_old = w_new, the weights never update, and the neuron becomes a dead neuron.)
· The output is not zero-centered.
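A minimal NumPy sketch of ReLU and its derivative (illustrative inputs):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # negative values are clamped to 0

def relu_derivative(x):
    return (x > 0).astype(float)     # 0 for x <= 0, 1 for x > 0

x = np.array([-3.0, 0.0, 2.0])       # illustrative inputs
print(relu(x))                       # [0. 0. 2.]
print(relu_derivative(x))            # [0. 0. 1.]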
Leaky ReLU:
The Leaky ReLU function is an improved version of the ReLU function. Instead of defining the function as 0 for x less than 0, we define it as a small linear component of x; usually the negative values of x are multiplied by a small slope such as 0.01.
· It solves the dead neuron problem.
· The output is max(0.01*x, x).
· The derivative is either 0.01 or 1, so it stays between 0 and 1.
· It is computationally cheap.
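A minimal NumPy sketch of Leaky ReLU (illustrative inputs; the 0.01 slope is the commonly used default):

import numpy as np

def leaky_relu(x, slope=0.01):
    # negative inputs keep a small gradient instead of being zeroed out
    return np.where(x > 0, x, slope * x)

x = np.array([-3.0, 0.0, 2.0])   # illustrative inputs
print(leaky_relu(x))             # [-0.03  0.    2.  ]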
ELU: (Exponential Linear Unit)
The Exponential Linear Unit (ELU) has an extra alpha constant, which should be a positive number. For x greater than or equal to 0 the output is x; for x less than 0 the output is alpha * (e^x - 1), so negative inputs smoothly saturate toward -alpha.
· Solves the problem of dead neurons.
· The output is close to zero-centered.
· Computationally expensive because of the exponential.
· Slower convergence due to the alpha/exponential term.
· The derivative never drops to exactly 0.
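A minimal NumPy sketch of ELU (illustrative inputs; alpha = 1.0 is a common choice):

import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, 0.0, 2.0])   # illustrative inputs
print(elu(x))                    # negative values saturate toward -alpha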
PReLU (Parametric ReLU):
Instead of multiplying x by a fixed constant when x < 0, in PReLU we multiply x by an alpha that is a trainable parameter (its value can differ from layer to layer).
If alpha is 0.01 it becomes Leaky ReLU; if alpha is 0 it becomes ReLU. In PReLU the alpha value is learned during training, which makes it different from the other ReLU variants.
· The value of alpha is generally between 0 and 1.
· Solves the dead neuron problem.
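A minimal sketch of the PReLU forward pass (NumPy, illustrative inputs); here alpha is passed in as a plain number, whereas in a real framework it would be a trainable parameter updated during backpropagation:

import numpy as np

def prelu(x, alpha):
    # alpha would be learned per layer (or per channel) in practice
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, 0.0, 2.0])   # illustrative inputs
print(prelu(x, alpha=0.25))      # [-0.75  0.    2.  ]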
Swish: (A Self-Gated Function)
y = x * sigmoid(x)
· Works well with very deep neural networks (e.g., more than 40 layers).
· Its self-gating is inspired by the gating used in LSTMs.
· It is computationally expensive.
· Solves the dead neuron problem.
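A minimal NumPy sketch of Swish (illustrative inputs):

import numpy as np

def swish(x):
    # self-gated: the input is scaled by its own sigmoid
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.array([-3.0, 0.0, 2.0])   # illustrative inputs
print(swish(x))                  # small negative outputs instead of a hard 0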
How to Choose an Activation Function?
Choosing an activation function is a hyperparameter decision and is usually made by trial and error, keeping the guidelines below in mind:
· A linear activation function is used for regression problems.
· Sigmoid is used for classification problems in shallow neural networks.
· A combination of tanh and sigmoid is used in deep neural networks.
· ReLU is widely used in the hidden layers of most neural networks. The various ReLU variants can then be used to solve ReLU's dead neuron problem. A short Keras sketch of these guidelines follows this list.
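To tie these guidelines together, here is a minimal sketch of a binary classifier, assuming TensorFlow/Keras is available; the layer sizes, input dimension, and optimizer are illustrative assumptions, not taken from the article.

import tensorflow as tf

# ReLU in the hidden layers; sigmoid in the output layer for binary classification
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # illustrative input dimension
    tf.keras.layers.Dense(64, activation='relu'),    # hidden layer with ReLU
    tf.keras.layers.Dense(32, activation='relu'),    # hidden layer with ReLU
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output layer with sigmoid
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()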
I hope this has given you a basic understanding of activation functions.
Let's connect on:
LinkedIn : https://www.linkedin.com/in/prerna-nichani
Thank you for reading!