# Activation Function for Multi-layer Neural Networks

Recently I was talking with a Senior Data Scientist at a well-funded e-commerce firm about the intuition behind activation functions and whether there is a generic approach to choosing one. I was quite surprised to learn that the approach used was trial and error: try several and see which one yields the better result.

And just yesterday I had an interesting conversation with the founder of an Edge Computer Vision company about how intuition, formulating the problem statement, and street smartness play an important role in developing AI models. This prompted me to restart the article I had wanted to complete on the intuition behind activation functions and how to choose the right one.

Let’s build some intuition about what happens in each layer of a Neural Network (NN). Each neuron in the NN can be thought of as two units:

1) Aggregator Unit: aggregates (sums) all the weighted inputs plus the bias

2) Activation Unit: the activation function transforms the aggregated signal into a linear or non-linear output
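The two units can be sketched in a few lines of plain Python. This is only an illustration: the function names (`aggregate`, `neuron`, `identity`, `logistic`) and the example weights, inputs, and bias are assumptions made for this sketch, not taken from any particular library.

```python
import math

def aggregate(inputs, weights, bias):
    """Aggregator unit: weighted sum of the inputs plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def neuron(inputs, weights, bias, activation):
    """Activation unit: apply the activation to the aggregated net input."""
    return activation(aggregate(inputs, weights, bias))

def identity(z):
    """Linear activation: passes the net input through unchanged."""
    return z

def logistic(z):
    """Non-linear activation: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.1

z = aggregate(inputs, weights, bias)           # 0.5*1.0 + (-0.25)*2.0 + 0.1
linear_out = neuron(inputs, weights, bias, identity)
nonlinear_out = neuron(inputs, weights, bias, logistic)

print("net input:", z)
print("linear output:", linear_out)
print("logistic output:", nonlinear_out)
```

The same aggregated value feeds both activations; only the second unit decides whether the neuron's output is a linear or a non-linear function of its inputs.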

Technically, we can use any function as an activation function in a multi-layer NN as long as it is differentiable (or at least differentiable almost everywhere). This matters because most of the optimization algorithms used today are gradient-based, so differentiability is key.

We could use a linear activation, but stacking (composing) linear functions yields just another linear function, so a network built this way is of limited help for the complex data sets found in present-day use cases. Intuitively, we need to introduce non-linearity so that the network can better fit complex input data.
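To see why linear activations add nothing, here is a minimal sketch with two scalar "layers". The weights and biases are arbitrary values assumed for the example; the point is that the two layers can always be replaced by one equivalent linear layer.

```python
def linear_layer(x, w, b):
    """A one-input layer with a linear (identity) activation."""
    return w * x + b

# Arbitrary example parameters (assumptions for this sketch).
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear_layers(x):
    """Layer 2 applied to the output of layer 1."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# w2 * (w1 * x + b1) + b2  ==  (w2 * w1) * x + (w2 * b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2

def single_layer(x):
    """One linear layer that reproduces the two-layer network."""
    return linear_layer(x, w_eq, b_eq)

for x in [-2.0, 0.0, 1.5]:
    print(x, two_linear_layers(x), single_layer(x))
```

However many linear layers are stacked, the same collapse applies, which is why depth only pays off once a non-linear activation sits between the layers.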

With this as context, let us delve deeper into the various non-linear activation functions.

1) Logistic Activation Function:

• It is a special case of the sigmoid function and is used to model the probability that a sample x belongs to the positive class in a binary classification task.
• For multi-class tasks we can’t interpret the outputs of this activation function as the probabilities of each class, since they will not necessarily add up to 1.
• For highly negative inputs the output will be close to zero. The intuition is that in this saturated region the gradient is also close to zero, so the neural network learns very slowly and is more likely to get trapped in local minima during the training phase.
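The points above can be checked numerically with a small sketch. The helper names and the three example net inputs are assumptions made for illustration; the formula itself is the standard logistic function 1 / (1 + e^(-z)), whose derivative is s(z) * (1 - s(z)).

```python
import math

def logistic(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_grad(z):
    """Derivative of the logistic function with respect to z."""
    s = logistic(z)
    return s * (1.0 - s)

# Output and gradient both approach zero for highly negative inputs.
for z in [-10.0, -2.0, 0.0, 2.0]:
    print(f"z={z:6.1f}  output={logistic(z):.5f}  gradient={logistic_grad(z):.5f}")

# Hypothetical net inputs for three classes, each passed through
# its own logistic unit: the outputs do not sum to 1.
net_inputs = [2.0, 1.0, 0.1]
outputs = [logistic(z) for z in net_inputs]
print("sum of independent logistic outputs:", sum(outputs))
```

Because each unit is squashed independently, the three outputs here add up to more than 1, so they cannot be read as a probability distribution over the classes.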

2) Softmax Activation Function:

• The limitation of the logistic activation function in multi-class classification tasks can be overcome by normalizing: each exponentiated net input is divided by the sum of the exponentiated net inputs of all classes (the normalization term in the denominator). This is the intuition behind the softmax function.
• The softmax function is a soft form of the argmax function, and this softness (having a smooth gradient) helps in computing meaningful class probabilities in multi-class settings.
• If we calculate the net inputs and use softmax as the activation function, the values we get can be interpreted directly as class probabilities, since the predicted class probabilities sum up to 1.
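A minimal softmax sketch, assuming three hypothetical net inputs, shows the normalization at work. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(net_inputs):
    """Exponentiate each net input and divide by the sum over all classes."""
    m = max(net_inputs)                        # shift for numerical stability
    exps = [math.exp(z - m) for z in net_inputs]
    total = sum(exps)                          # the normalizing denominator
    return [e / total for e in exps]

# Hypothetical net inputs for three classes.
net_inputs = [2.0, 1.0, 0.1]
probs = softmax(net_inputs)

# Index of the largest probability, i.e. what a hard argmax would pick.
predicted = max(range(len(probs)), key=lambda i: probs[i])

print("class probabilities:", probs)
print("sum:", sum(probs))
print("predicted class index:", predicted)
```

The ordering of the classes is the same as argmax would give, but every class keeps a non-zero share of the probability mass, which is what makes the function "soft".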

3) Hyperbolic Tangent Activation Function:

• Commonly known as tanh, it is widely used in hidden layers because of its broader output range, the open interval (-1, 1), which is centered on zero. This helps improve the convergence of the backpropagation algorithm.
• Intuitively, tanh is similar to the logistic function: the curve has the same S shape, but its output range is twice as wide as that of the logistic function.
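The similarity is exact rather than approximate: tanh(z) = 2 * logistic(2z) - 1, so tanh is a rescaled and shifted logistic curve. The sketch below checks this identity numerically; the helper names are chosen for this example.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_logistic(z):
    """tanh written as a stretched and shifted logistic function."""
    return 2.0 * logistic(2.0 * z) - 1.0

for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"z={z:5.1f}  tanh={math.tanh(z): .6f}  "
          f"via logistic={tanh_via_logistic(z): .6f}  "
          f"logistic={logistic(z):.6f}")
```

The factor of 2 on the output is what doubles the range from (0, 1) to (-1, 1), and the shift by 1 is what centers it on zero.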

All of the above activation functions belong, in one way or another, to the sigmoid family. This family of activation functions faces the vanishing gradient problem.

As the output of the aggregation unit grows in magnitude, the derivative of the activation with respect to its input diminishes. The implication is that the training phase becomes very slow, because the gradient terms are very close to zero and the network eventually learns little or nothing.
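A toy illustration of how this compounds with depth: the sketch below chains logistic units and multiplies their derivatives, as the chain rule does during backpropagation. It assumes every weight is 1 and every bias is 0, a simplification made only for this example; real networks also multiply by the weights.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(x, depth):
    """d(output)/d(input) for `depth` stacked logistic units.

    Assumes weight 1 and bias 0 at every unit (illustrative only).
    """
    grad = 1.0
    a = x
    for _ in range(depth):
        a = logistic(a)          # forward pass through one unit
        grad *= a * (1.0 - a)    # this unit's derivative, at most 0.25
    return grad

for depth in [1, 2, 5, 10]:
    print(f"depth={depth:2d}  gradient at x=0: {chain_gradient(0.0, depth):.3e}")

# A large net input saturates even a single unit.
print("depth=1, x=10:", chain_gradient(10.0, 1))
```

Since the logistic derivative never exceeds 0.25, each extra layer shrinks the gradient by at least a factor of four in this setup, even before saturation is taken into account.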

The Rectified Linear Unit (ReLU) activation function addresses this issue, and it has become the de facto choice for many deep neural networks.

4) Rectified Linear Unit (ReLU):

• It is still a non-linear function
• Its derivative with respect to its input is always 1 for positive input values, so the gradient does not shrink as the input grows; this addresses the vanishing gradient problem for positive inputs (for negative inputs the derivative is 0)
• It is more suitable for learning complex functions in the deep learning domain
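A short sketch of ReLU, defined as max(0, z), and its derivative. The derivative at exactly z = 0 is undefined; returning 0 there is a convention assumed for this example.

```python
def relu(z):
    """Rectified Linear Unit: passes positive inputs, zeroes the rest."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; the value at z == 0 is a convention (0 here)."""
    return 1.0 if z > 0 else 0.0

for z in [-5.0, -1.0, 0.5, 5.0, 500.0]:
    print(f"z={z:7.1f}  relu={relu(z):7.1f}  gradient={relu_grad(z):.1f}")

# Non-linearity check: a linear function f would satisfy f(a + b) == f(a) + f(b).
a, b = -1.0, 1.0
print("relu(a + b):", relu(a + b))
print("relu(a) + relu(b):", relu(a) + relu(b))
```

The gradient stays at 1 no matter how large the positive input gets, while the additivity check fails, which confirms the function is non-linear despite being built from two straight pieces.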

I hope this article summarizes why we need non-linear activation functions and the utility of the different types of activation functions.

Finally, a quick cheat sheet for reference