Introduction to Different Activation Functions for Deep Learning

The Idea of Neural Networks was first introduced way back in 1950s, but it wasn’t until 2012 that they come to action. Even application of Optimization Algorithm(Gradient Descent) in 2006 by Hinton, wasn’t giving good results, it was introduction and usage of Activation functions, which revolutionized Deep Learning Research.

There are various kind of Activation Functions that exists, and Some Researchers are still working on finding better functions, which can help networks to converge faster or use less layers etc. Lets go through each of them:

1. Sigmoid Activation Function: Sigmoid Function is one the special function in Deep Learning Field, thanks to its simplification during Back Propagation. As we can see in this image, it:

(a.) Range from [0,1].

(b.) Not Zero Centered.

(c.) Have Exponential Operation (Its Computationally Expensive!!!)

The Main Problem we face is because of Saturated Gradients, as the Function ranges between 0 to 1, the values might remain constant, thus the gradients will have very less values. Therefore, no change during gradient descent.

2. Hyperbolic Tangent Activation Function(tanh): Hyperbolic Tangent also have the following properties:

(a.) Ranges Between [-1,1]

(b.) Zero Centered

tanh can be considered as a good example in case when input>0, so the gradients we will obtain will either be all positive or negative, which can led to explosion or vanishing issue, thus usage of tanh can be a good thing.but this still faces the problem of Saturated Gradients.

3. Rectified Linear Unit Activation Function (ReLU): ReLU is the most commonly used Activation Functions, because of its simplicity during backpropagation and its not computationally expensive. It has following properties:

(a.) It doesn’t Saturate.

(b.) It converges faster than some other activation functions.

But we can face an issue of dead ReLU, for Example if:

w>0, x<0. So, ReLU(w*x)=0, Always.

4. Leaky ReLU: Leaky ReLU can be used as improvement over ReLU Activation function. It has all properties of ReLU, plus it will never have dead ReLU problem.

We can consider different multiplication factor to form different variations of Leaky ReLU.

5. ELU(Exponential Linear Units): ELU is also a variation of ReLU, with better value for x<0. It also have same properties as ReLU along with:

(b.) Closer to Zero mean Outputs than Leaky ReLU.

(c.) More Computation because of Exponential Function.

6. Maxout: Maxout has been introduced in 2013. It has property of Linearity in it. So, it never saturates or die. But is Expensive as it doubles the parameters.

7. KAFNETS: Most neural networks work by interleaving linear projections and simple (fixed) activation functions, like the ReLU function. A KAF is instead a non-parametric activation function defined as a one-dimensional kernel approximator:

where:

• The dictionary of the kernel elements is fixed by sampling the x-axis with a uniform step around 0.

KAFNETS (Link) gives promising Results, we have tested them for One-shot Learning as mentioned in Article.

Mostly, Neural Networks go for different variations of RELU for its simplicity and easy computation both during forward and backward. But, in certain Cases Other Activation Functions gives us better results, Like Sigmoid is used at end layer, when we want our outputs to be squashed between [0,1], or tanh is being used in RNNs and LSTMs.