Introduction to Different Activation Functions for Deep Learning

The idea of neural networks was first introduced back in the 1950s, but it wasn't until 2012 that they came into wide use. Even the application of an optimization algorithm (gradient descent) by Hinton in 2006 did not give good results; it was the introduction and use of better activation functions that revolutionized deep learning research.

There are various kinds of activation functions, and researchers are still working on finding better ones that can help networks converge faster or use fewer layers. Let's go through each of them:

Different Activation Functions and their Graphs

1. Sigmoid Activation Function: Sigmoid has the following properties:

(a.) Ranges between (0, 1).

(b.) Not zero centered.

(c.) Has an exponential operation (it is computationally expensive).

The main problem we face is saturated gradients: for inputs of large magnitude the function flattens out near 0 or 1, so its output barely changes and the gradient becomes very small. As a result, the weights hardly change during gradient descent.
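To make the saturation concrete, here is a minimal sketch in plain Python (standard library only). The function names `sigmoid` and `sigmoid_grad` are illustrative choices, not taken from any particular framework. It prints the sigmoid output and its derivative for a few inputs:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) stays below 1
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

The derivative is largest at x=0, where it equals 0.25, and shrinks toward zero as |x| grows, which is the saturation described above.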

2. Hyperbolic Tangent Activation Function (tanh): Hyperbolic tangent has the following properties:

(a.) Ranges between (-1, 1).

(b.) Zero centered.

Zero centering is the advantage tanh has over sigmoid. When every input to a neuron is positive (input > 0), as happens with sigmoid outputs, the gradients for that neuron's weights are either all positive or all negative, which leads to inefficient zig-zag updates. Using tanh avoids this, but it still faces the problem of saturated gradients.
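As a rough illustration of both points, the sketch below compares the mean output of tanh and sigmoid over a symmetric set of inputs, and evaluates the tanh derivative far from zero. The input values and helper names are arbitrary choices for the example:

```python
import math

def sigmoid(x: float) -> float:
    """Plain logistic sigmoid (fine for the small inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]  # symmetric around zero

# tanh outputs average out to roughly 0, sigmoid outputs to roughly 0.5
mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)

print("mean tanh output:   ", mean_tanh)
print("mean sigmoid output:", mean_sigmoid)
print("tanh gradient at x=5:", tanh_grad(5.0))  # small: tanh saturates too
```

For inputs symmetric around zero, the tanh outputs cancel out while the sigmoid outputs stay positive, and the tanh gradient at x=5 is already close to zero.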

3. Rectified Linear Unit Activation Function (ReLU): ReLU is the most commonly used activation function because it is simple to differentiate during backpropagation and not computationally expensive. It has the following properties:

(a.) It doesn't saturate for positive inputs.

(b.) It converges faster than some other activation functions.

But we can face the issue of dead ReLU. For example, if:

w>0, x<0. So, ReLU(w*x)=0, always.

In that case the gradient is also 0, so the weight never updates and the neuron stays inactive.
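The dead ReLU case can be sketched with a single weight and no bias. The weight value and the inputs below are made up for illustration; the only thing that matters is that w>0 and every x<0:

```python
def relu(z: float) -> float:
    """ReLU: pass positive values through, clamp the rest to 0."""
    return max(0.0, z)

def relu_grad(z: float) -> float:
    """Derivative of ReLU with respect to its input (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

w = 1.5                            # hypothetical positive weight
inputs = [-0.5, -1.0, -2.0, -4.0]  # every input is negative

pre_activations = [w * x for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Chain rule: d relu(w*x) / dw = relu_grad(w*x) * x
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print("outputs:          ", outputs)
print("gradients w.r.t. w:", grads_w)
```

Every output and every gradient is zero, so gradient descent has no signal with which to move w out of this state.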

4. Leaky ReLU: Leaky ReLU can be used as an improvement over the ReLU activation function. It has all the properties of ReLU, and because it keeps a small non-zero slope for x<0, it avoids the dead ReLU problem.

We can choose different multiplication factors (the slope for x<0) to form different variations of Leaky ReLU.
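A minimal sketch of Leaky ReLU follows, with the multiplication factor exposed as a parameter. The name `alpha` and the default of 0.01 are common conventions assumed here, not values given in this article:

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, a small slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for x > 0, alpha otherwise (never 0 if alpha != 0)."""
    return 1.0 if x > 0 else alpha

for alpha in (0.01, 0.1, 0.2):  # different factors give different variants
    print(f"alpha={alpha}: f(-2)={leaky_relu(-2.0, alpha)}, f'(-2)={leaky_relu_grad(-2.0, alpha)}")
```

Because the gradient for negative inputs is alpha rather than 0, the weight in the dead ReLU example would still receive a small update.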

5. ELU (Exponential Linear Units): ELU is also a variation of ReLU, with a smoother output for x<0. It has the same properties as ReLU, along with:

(a.) No Dead ReLU Situation.

(b.) Closer to Zero mean Outputs than Leaky ReLU.

(c.) More Computation because of Exponential Function.
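The properties above can be checked with a small sketch. It assumes the usual ELU definition, x for x>0 and alpha*(exp(x)-1) otherwise, with alpha=1.0 as an assumed default, and compares mean outputs over an arbitrary symmetric set of inputs:

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)  # expm1(x) = exp(x) - 1

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU, included only for comparison."""
    return x if x > 0 else alpha * x

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]  # symmetric around zero

mean_elu = sum(elu(x) for x in inputs) / len(inputs)
mean_leaky = sum(leaky_relu(x) for x in inputs) / len(inputs)

print("mean ELU output:       ", mean_elu)
print("mean Leaky ReLU output:", mean_leaky)
```

On these inputs the ELU mean is closer to zero than the Leaky ReLU mean, at the cost of one exponential per negative input.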

6. Maxout: Maxout was introduced in 2013. Its output is the maximum of several linear functions, so it is piecewise linear and never saturates or dies. However, it is expensive, as it doubles the number of parameters per neuron.
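A single Maxout unit can be sketched as the maximum over a few linear pieces. The helper name `maxout` and the weight values are illustrative assumptions. With two pieces, the unit stores two weight vectors and two biases instead of one, which is where the doubling of parameters comes from:

```python
from typing import List

def maxout(x: List[float], weights: List[List[float]], biases: List[float]) -> float:
    """Maxout unit: the maximum over k linear functions w_j . x + b_j."""
    return max(
        sum(wi * xi for wi, xi in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )

# Two linear pieces (k = 2) on a one-dimensional input.
relu_like = dict(weights=[[1.0], [0.0]], biases=[0.0, 0.0])   # max(x, 0)
abs_like = dict(weights=[[1.0], [-1.0]], biases=[0.0, 0.0])   # max(x, -x)

for x in (-2.0, 3.0):
    print(x, maxout([x], **relu_like), maxout([x], **abs_like))
```

With suitable weights the same unit reproduces ReLU or the absolute value function, which shows how Maxout learns the shape of its activation.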

7. KAFNETS: Most neural networks work by interleaving linear projections and simple (fixed) activation functions, like the ReLU function. A KAF is instead a non-parametric activation function defined as a one-dimensional kernel approximator.

KAFNETS give promising results; we have tested them for one-shot learning, as mentioned in the article.
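As a rough sketch of what a one-dimensional kernel approximator looks like, assume the form f(s) = sum_i alpha_i * k(s, d_i) with a Gaussian kernel k(s, d_i) = exp(-gamma * (s - d_i)^2), a fixed dictionary of points d_i spaced evenly around zero, and mixing coefficients alpha_i that would be learned during training. The helper names, the dictionary size, and the value of gamma below are assumptions made for illustration, not the reference implementation:

```python
import math
from typing import List

def make_dictionary(num_points: int, boundary: float) -> List[float]:
    """Fixed dictionary: num_points values spaced evenly in [-boundary, boundary]."""
    step = 2.0 * boundary / (num_points - 1)
    return [-boundary + i * step for i in range(num_points)]

def kaf(s: float, alphas: List[float], dictionary: List[float], gamma: float = 1.0) -> float:
    """Kernel activation: weighted sum of Gaussian kernels centred on the dictionary."""
    return sum(
        a * math.exp(-gamma * (s - d) ** 2)
        for a, d in zip(alphas, dictionary)
    )

dictionary = make_dictionary(num_points=5, boundary=2.0)  # [-2, -1, 0, 1, 2]
alphas = [0.0, 0.0, 1.0, 0.0, 0.0]  # hypothetical coefficients; learned in practice

for s in (-1.0, 0.0, 1.0):
    print(s, kaf(s, alphas, dictionary))
```

Changing the coefficients changes the shape of the activation, so each neuron can adapt its own non-linearity instead of using a fixed one.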

Mostly, neural networks use different variations of ReLU for their simplicity and cheap computation in both the forward and backward pass. But in certain cases other activation functions give better results: sigmoid is used at the final layer when we want outputs squashed between 0 and 1, and tanh is used in RNNs and LSTMs.

If you found this article useful, please consider citing us:

title={Introduction to different activation functions for deep learning},
author={Jadon, Shruti},
journal={Medium, Augmenting Humanity},

Written by

BookAuthor@Packt(One-Shot Learning). Visiting Researcher@Brown University. CS grad@UMass Amherst. Website:
