Activation functions — why is there more than 1?

Mohamed Shawky
Udacity PyTorch Challengers
3 min read · Jan 3, 2019

In neural networks, activation functions add nonlinearity to the output of each neuron, and there are many of them to choose from. Before we answer why there are so many activation functions used in the units of a feed-forward neural network, we should ask ourselves why we need activation functions in the first place.

Why do we need an activation function?

Feed Forward Neural Network Source: cs231n

For the network above, the feed-forward equations with no activation function are as follows:
h1 = W1 x + b1
h2 = W2 h1 + b2
output = W3 h2 + b3
If we write the output in terms of the input, it becomes
output = W3(W2(W1 x + b1) + b2) + b3
and expanding it gives
output = W3 W2 W1 x + W3 W2 b1 + W3 b2 + b3
output = Wbig x + Bbig, where
Wbig = W3 W2 W1
Bbig = W3 W2 b1 + W3 b2 + b3
So a neural network with many layers but no activation functions is the same as a neural network with just a single linear layer. The activation function adds non-linearity to the network so it can capture complex, non-linear relationships in the dataset.
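To make this concrete, here is a quick numerical check (a minimal sketch in PyTorch, with made-up layer sizes and random weights) showing that three stacked linear layers give exactly the same output as the single collapsed layer:

import torch

torch.manual_seed(0)

# three linear layers with no activation in between
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(5, 4), torch.randn(5)
W3, b3 = torch.randn(2, 5), torch.randn(2)

x = torch.randn(3)

# forward pass, layer by layer
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2
output = W3 @ h2 + b3

# the same computation collapsed into one linear layer
Wbig = W3 @ W2 @ W1
Bbig = W3 @ W2 @ b1 + W3 @ b2 + b3
output_collapsed = Wbig @ x + Bbig

print(torch.allclose(output, output_collapsed))  # True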

Sigmoid

Sigmoid Activation function

Sigmoid is one of the earliest functions used in machine learning, most notably in logistic regression, and it is used as an activation function to capture non-linearities. It has the mathematical form σ(x) = 1/(1 + e^−x) and its derivative is σ(x) * (1 − σ(x)).
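As a small sketch (the input values here are just picked for illustration), we can compute the sigmoid and its derivative directly and see how the derivative shrinks towards zero at the extremes:

import torch

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + torch.exp(-x))

def sigmoid_grad(x):
    # derivative: σ(x) * (1 - σ(x)), which peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # ≈ [0.0000, 0.1192, 0.5000, 0.8808, 1.0000]
print(sigmoid_grad(x))  # ≈ [0.0000, 0.1050, 0.2500, 0.1050, 0.0000]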

Pros

  • Squishes the input into the range 0–1, so large positive values map to ~1 and large negative values map to ~0, which gives a nice interpretation of the neuron's firing rate: 0 means no firing, 1 means fully firing

Cons

  • Vanishing gradient problem at the extremes (very large positive or very large negative inputs), where the derivative is close to zero
  • Outputs are not zero-centered, which means the gradient updates for a layer's weights will be either all positive or all negative, causing zigzagging gradient updates

So there is another activation function that avoids some of the drawbacks of the sigmoid: tanh (hyperbolic tangent).

Hyperbolic Tangent activation function-tanh

Hyperbolic tangent activation function-tanh Source: www.cs.cmu.edu

The hyperbolic tangent function squishes the input into the range [-1, 1], and it is usually preferred over the sigmoid.
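As a rough illustration of the zero-centering point raised above (a minimal sketch with random, zero-mean inputs), the average tanh output stays near zero while the average sigmoid output sits around 0.5:

import torch

x = torch.randn(10000)          # zero-mean inputs

sigmoid_out = torch.sigmoid(x)  # outputs in (0, 1)
tanh_out = torch.tanh(x)        # outputs in (-1, 1)

print(sigmoid_out.mean())  # ≈ 0.5, always positive, not zero-centered
print(tanh_out.mean())     # ≈ 0.0, zero-centered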

Pros

  • zero centered unlike sigmoid

Cons

  • Still suffers from the vanishing gradient problem

Rectified Linear Unit -ReLU

ReLU activation function Source: www.casact.org

ReLU computes the function f(x) = max(0, x), and it is one of the most widely used activation functions because it avoids the vanishing gradient problem for positive inputs (its gradient there is exactly 1), which lets people train deeper neural networks.
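As a minimal sketch of how this plays out (input values picked just for illustration), we can run a few inputs through torch.relu and look at the gradients: they are exactly 1 wherever the input is positive, so they never shrink, and exactly 0 wherever the input is negative, which is what the "dying ReLU" issue below refers to:

import torch

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0], requires_grad=True)

y = torch.relu(x)   # f(x) = max(0, x)
y.sum().backward()

print(y.detach())   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 3.0000])
print(x.grad)       # tensor([0., 0., 0., 1., 1.]), 1 where x > 0, 0 elsewhere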

Pros

  • Very simple to implement
  • Avoids the vanishing gradient problem for positive inputs

Cons

  • Only used in hidden layers
  • Some units can "die" and never activate again, since negative inputs produce zero output and zero gradient, especially if the learning rate is too high (a properly set learning rate reduces this issue)

Leaky ReLU

Leaky ReLU Source: Andrew Ng’s coursera deep learning course

Leaky ReLU solves the dying ReLU problem by giving the line a small slope when x is negative, and sometimes this slope is a learnable parameter, as in PReLU, which stands for Parametric ReLU.
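Both variants ship as built-ins in PyTorch; here is a minimal sketch (input values chosen just for illustration) showing the fixed-slope Leaky ReLU next to PReLU, whose negative slope is a learnable parameter:

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.5, 3.0])

# Leaky ReLU: fixed small slope for negative inputs (0.01 here)
print(F.leaky_relu(x, negative_slope=0.01))
# tensor([-0.0300, -0.0050,  0.5000,  3.0000])

# PReLU: the negative slope is a learnable parameter (initialized to 0.25)
prelu = torch.nn.PReLU()
print(prelu(x))
# tensor([-0.7500, -0.1250,  0.5000,  3.0000], grad_fn=...)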

Pros

  • Solves the “dying ReLU” problem

Cons

  • Unlike PReLU, the slope is fixed and not learnable by the neural network

To sum up, the reason for having more than one activation function is that each one tries to overcome the shortcomings of the others.
