Activation functions — why is there more than 1?
In neural networks, activation functions add nonlinearity to the output of a neuron, and there are many of them. Before we answer why so many activation functions are used in the units of a feed-forward neural network, we should ask ourselves why we need activation functions in the first place.
Why do we need an activation function?
For the network above, with no activation function, the feed-forward equations are as follows:
h1 = W1x + b1
h2 = W2h1 + b2
output = W3h2 + b3
If we write the output in terms of the input, we get
output = W3(W2(W1x + b1) + b2) + b3
and expanding, in simpler terms it becomes
output = W3W2W1x + W3W2b1 + W3b2 + b3
output = Wbigx + Bbig, where
Wbig = W3W2W1
Bbig = W3W2b1 + W3b2 + b3
As we can conclude, a neural network with many layers but no activation functions is the same as a neural network with just one linear output layer. The activation function adds non-linearity to the network so it can capture complex, non-linear relationships in the dataset.
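The collapse above can be checked numerically. Here is a minimal NumPy sketch (the layer sizes are made up for illustration) showing that three stacked linear layers produce exactly the same output as the single collapsed layer with Wbig and Bbig:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes for illustration: 4 -> 5 -> 3 -> 2
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Three linear layers applied in sequence, no activation functions
out_deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# Collapsed single layer: Wbig = W3W2W1, Bbig = W3W2b1 + W3b2 + b3
W_big = W3 @ W2 @ W1
B_big = W3 @ W2 @ b1 + W3 @ b2 + b3
out_single = W_big @ x + B_big

print(np.allclose(out_deep, out_single))  # True
```

The two outputs agree to floating-point precision, confirming that depth without nonlinearity buys nothing.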
Sigmoid
Sigmoid is one of the first functions used in machine learning, in the logistic regression technique, and it is used as an activation function for capturing nonlinearities. It has the mathematical form σ(x) = 1/(1 + e^−x) and its derivative is σ'(x) = σ(x)(1 − σ(x)).
Pros
- Squashes values into the range 0–1, so large positive values map to 1 and large negative values to 0, which gives a nice interpretation as the firing rate of the neuron, from 0 (not firing) to 1 (fully firing)
Cons
- Vanishing gradient problem at the extremes (very large positive or negative inputs), where the derivative is nearly zero
- Outputs are not zero-centered, which means the gradient updates for a layer's weights will be either all positive or all negative, causing zigzagging weight updates
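The saturation problem is easy to see numerically. A small sketch of the sigmoid and its derivative σ(x)(1 − σ(x)), evaluated at a few illustrative points:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

# At the extremes the gradient vanishes, stalling learning
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient at x = ±10 is on the order of 1e-5, which is what "vanishing gradient at the extremes" means in practice.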
So there is another activation function that avoids one of sigmoid's drawbacks: tanh (the hyperbolic tangent).
Hyperbolic Tangent activation function-tanh
The hyperbolic tangent function squashes the input into the range [-1, 1], and it is usually preferred to the sigmoid.
Pros
- Zero-centered, unlike sigmoid
Cons
- Still suffers from the vanishing gradient problem
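A quick sketch with NumPy's built-in tanh illustrates both points: the outputs are symmetric around zero, but the derivative 1 − tanh²(x) still collapses at the extremes:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = np.tanh(x)
grad = 1.0 - y**2  # d/dx tanh(x) = 1 - tanh^2(x)

print(y)     # outputs lie in (-1, 1), centered on 0
print(grad)  # near-zero gradient for large |x|, just like sigmoid
```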
Rectified Linear Unit -ReLU
ReLU computes the function f(x) = max(0, x), and it is one of the most widely used activation functions, as it mitigates the vanishing gradient problem (for positive inputs), so people can train deeper neural networks.
Pros
- Very simple to implement
- Mitigates the vanishing gradient problem, since the gradient is 1 for all positive inputs
Cons
- Typically used only in hidden layers
- Some units can "die" and never activate: if a unit's input is always negative its output is zero and no gradient flows through it, especially if the learning rate is too high (a properly set learning rate reduces this issue)
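A minimal sketch of ReLU and its gradient, which makes both the "gradient of 1 for positive inputs" behavior and the dead-zone for negative inputs visible:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0: no saturation on the
    # positive side, but no gradient at all on the negative side
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```

The zero gradient for all negative inputs is exactly what causes dying units: a neuron stuck on the negative side receives no updates.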
Leaky ReLU
Leaky ReLU solves the problem of dying ReLU by giving the line a small slope (e.g. 0.01) when x is negative. Sometimes this slope is a learnable parameter, as in PReLU, which stands for Parametric ReLU.
Pros
- Solves the “dying ReLU” problem
Cons
- Unlike in PReLU, the slope is fixed and not learned by the network
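A minimal sketch of Leaky ReLU; the slope value 0.01 is just a common illustrative choice, and in PReLU this alpha would be a learnable parameter rather than a constant:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a fixed small slope for x < 0; in PReLU it would be
    # learned during training instead of being a hyperparameter
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))  # negative inputs keep a small nonzero output and gradient
```

Because negative inputs now produce a small nonzero slope, gradient always flows and units cannot permanently die.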
To sum up, the reason for having more than one activation function is that each one tries to overcome the shortcomings of the others.