Know your Activation Functions

Alec Morgan · Published in Analytics Vidhya · Sep 23, 2019 · 6 min read

When building neural networks, one of the most confusing parts for beginners can be simply knowing which activation functions to use and why. It is also one of the most important choices, since activation functions quite literally shape the information you get out of your neural network, and the wrong choice can in some cases even “kill” neurons.

The choice of activation function is an important one, but first: why do we even use activation functions? Here’s a brief history lesson.

Why do we even use activation functions?

Before there were neural networks there were perceptrons: a binary classifier that followed the same Wx + b form as modern neural networks. That is to say, each input x would be multiplied by a weight w and then a bias b would be added as it passed through each layer. Where perceptrons and neural networks differ is that a perceptron passes this weighted sum through a step function, whereas a neural network passes it through an activation function.

In essence, a step function is simply the most basic activation function you can have: each value becomes 1 if it is greater than some arbitrary threshold, else 0. Neural networks and their activation functions, by contrast, allow the values coming out of each layer to take on a continuous range of values, albeit typically constrained between 0 and 1. Why would having a continuous range of possible outputs be better? Having outputs that are simply 0 or 1 is much simpler, so what does the extra complexity do for us? The answer is that it makes backpropagation, the algorithm for training and optimizing neural networks, possible.
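To make the contrast concrete, here is a minimal NumPy sketch (the inputs, weights, and bias are made up purely for illustration) of the same w·x + b pre-activation passed through a step function versus a sigmoid:

```python
import numpy as np

def step(z, threshold=0.0):
    # Perceptron-style activation: 1 if above the threshold, else 0.
    return float(z > threshold)

def sigmoid(z):
    # Squishes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# A single "neuron": a weighted sum of the inputs plus a bias (w·x + b).
x = np.array([0.5, -1.2, 3.0])   # made-up inputs
w = np.array([0.8, 0.1, -0.4])   # made-up weights
b = 0.2
z = np.dot(w, x) + b

print(step(z))     # a hard 0-or-1 decision: 0.0
print(sigmoid(z))  # a graded value between 0 and 1: ~0.33
```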

And of course, understanding why step functions make backpropagation impossible also requires knowing how backpropagation works. 3blue1brown has an excellent series explaining backpropagation in juicy mathematical detail, but the short version is that the function’s output must slope gradually in order to make it obvious how the algorithm should adjust each weight.

Neural networks are updated using an algorithm called gradient descent, which quite literally descends (error) gradients. So imagine that for classifying a particular image (or whatever the task may be), the “right” value for a neuron to have is 0. In that case, we’d step to the left of this plot so that the neuron’s output would decrease and our neural network’s prediction would become slightly more correct.

Outputs (y) given inputs (x) with a sigmoid activation function. Blue lines are tangent lines representing error gradients.

Of course, in reality the error landscape will have as many dimensions as we have neurons, so it won’t look like the plot above (or anything we can humanly imagine, for that matter). But the gradient with respect to this one singular neuron will still follow this shape, and that’s what matters.

So that’s the basic idea: we’re standing on a hill of incorrectness and we’re going to step into the valley of slightly-more-correct-ness, and we know which way to step by measuring the slope of the hill where we’re standing. But what if instead of on a hill, we’re standing on a plateau next to a cliff? This is the problem of the step function.

With a step function, our error gradients are useless.

The value simply jumps straight up at some arbitrary point, and everywhere else the slope is exactly zero, so the gradient tells us nothing about which way to step. Even the tangent line that seems to sit right at the jump isn’t quite on it, which is why it has remained flat. This simple detail completely breaks the backpropagation algorithm, which is the only algorithm fast enough to find the right combination of weights and biases out of a massive combinatorial explosion of options.
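Here is a tiny numerical sketch of why that matters (the single weight, input, and learning rate are invented for illustration): with a sigmoid, the chain rule gives a nonzero gradient that nudges the weight downhill on every iteration; with a step function, the same expression would contain the step function’s slope, which is zero everywhere away from the jump, so the weight would never move.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # slope of the sigmoid at z

# One neuron with one weight; the "right" output is 0, as in the example above.
x, target, lr = 1.5, 0.0, 1.0
w = 0.9

for _ in range(5):
    out = sigmoid(w * x)
    error = 0.5 * (out - target) ** 2
    # Chain rule: dE/dw = (out - target) * sigmoid'(w * x) * x
    dE_dw = (out - target) * sigmoid_grad(w * x) * x
    w -= lr * dE_dw                        # step downhill along the error gradient
    print(round(out, 3), round(error, 4))  # output and error shrink each iteration

# Swap sigmoid_grad for the step function's slope (0 away from the jump) and
# dE_dw becomes 0: the weight stops updating no matter how wrong the output is.
```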

With that background out of the way, let’s now move on to the specific use cases (and failure modes) of the most commonly used activation functions.

Sigmoid

Sigmoid is one of the most commonly used activation functions of all. It can accept any real number as input and will return a version squished between 0 and 1 as output. This is useful for binary classification, in which 0 represents choosing one class and 1 represents choosing the other. Sigmoid, like most activation functions, is sometimes colloquially called a “squishification function” because it “squishes” a full gamut of inputs into a much smaller range of outputs.
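As a small sketch (the raw “logit” values here are invented), the output of a sigmoid on the final layer can be read as the probability of the positive class and thresholded at 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw outputs ("logits") from the final layer of a binary classifier.
logits = np.array([-3.0, -0.4, 0.0, 2.2])
probs = sigmoid(logits)

print(probs.round(3))   # [0.047 0.401 0.5   0.9  ]
print(probs >= 0.5)     # [False False  True  True] -> predicted class per example
```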

Softmax

Softmax is essentially a generalization of sigmoid with the unique quality that the values it outputs will sum to 1 no matter how many classes you have, making it useful for multiclass classification. For example, if you have an animal image classifier with classes for horse, dog, cat, and rabbit, then softmax might give you probabilities of 0.13, 0.09, 0.23, and 0.55. In this case you would classify the image as containing a rabbit, since that is the class the neural network is most confident in, at 55%.
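A minimal sketch of softmax over four made-up raw scores; the specific probabilities in the example above are illustrative, but the mechanics are the same, and the outputs always sum to 1:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

classes = ["horse", "dog", "cat", "rabbit"]
logits = np.array([1.0, 0.6, 1.5, 2.4])      # made-up raw scores
probs = softmax(logits)

print(dict(zip(classes, probs.round(2))))    # rabbit gets the most probability mass
print(probs.sum())                           # sums to 1 (up to floating-point rounding)
print(classes[int(np.argmax(probs))])        # "rabbit": the highest-confidence class wins
```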

ReLU (Rectified Linear Unit)

ReLU is arguably the simplest activation function of all: it simply returns a 0 for any input less than zero, and otherwise the input itself. In other words, it rectifies negative inputs into zeros. ReLU is useful for the training stability of hidden layers because it circumvents the vanishing gradient problem.
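A one-line NumPy sketch of ReLU (the sample inputs are arbitrary):

```python
import numpy as np

def relu(z):
    # Negative inputs are "rectified" to 0; positive inputs pass through unchanged.
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))   # [0.  0.  0.  0.5 3. ]
```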

Look back at that sigmoid function from earlier and notice how at the far left and far right, it plateaus. Uh oh. Remember, plateaus like that break backpropagation. What actually happens is that, since computers represent real numbers with finite precision, a gradient that is already very close to zero eventually becomes exactly zero. At that point a neuron “dies”: its error gradient is 0, so it stops moving in either direction (and even before that, such tiny gradients make learning painfully slow).

ReLU suffers from this too of course since its error gradients for the entire negative spectrum of inputs are simply 0. However, in practice it still tends to exhibit far better stability than activation functions such as sigmoid. So if for example you are building a neural network to do classification, you would want the activation function on the final layer to be sigmoid or softmax, but for the hidden layers you should probably prefer ReLU for its training stability.
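A small numerical illustration of the difference (the inputs are chosen arbitrarily): with 64-bit floats the sigmoid’s gradient at a large input rounds all the way down to zero, while ReLU’s gradient for any positive input stays at 1.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # 1 for positive inputs, 0 for negative inputs (the "dead" region).
    return (z > 0).astype(float)

z = np.array([2.0, 20.0, 40.0])
print(sigmoid_grad(z))  # [1.05e-01 2.06e-09 0.00e+00]: shrinks, then underflows to 0
print(relu_grad(z))     # [1. 1. 1.]: still useful no matter how large the input
```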

Leaky ReLU

Leaky ReLU is the same as ReLU except that it keeps negative values slightly negative, which lets it avoid the dead-neuron problem ReLU has on negative inputs. This is done by simply multiplying negative values by a small constant such as 0.01.
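The corresponding sketch for leaky ReLU, with the commonly used (but tunable) negative-side slope of 0.01:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Positive inputs pass through unchanged; negative inputs are scaled by a small
    # slope instead of being zeroed, so their gradient (alpha) never hits exactly 0.
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))   # [-0.03  -0.005  0.     2.   ]
```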

Hyperbolic Tangent (Tanh)

Tanh is similar to sigmoid, except its outputs range between -1 and 1 instead of 0 and 1. What’s more, its slope is steeper (its maximum slope is 1, versus 0.25 for sigmoid), which makes the error gradients larger and can accelerate training. The flip side, however, is that its plateaus are more pronounced as well, making it more susceptible to the vanishing gradient problem. The hyperbolic tangent is useful as an alternative to sigmoid that trains faster, when it works.
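A quick comparison of the two (inputs chosen arbitrarily), showing tanh’s wider output range and its steeper maximum slope at zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z).round(3))   # [0.119 0.5   0.881]: squashed into (0, 1)
print(np.tanh(z).round(3))   # [-0.964  0.     0.964]: squashed into (-1, 1)

# Maximum slope, at z = 0: tanh'(0) = 1 versus sigmoid'(0) = 0.25.
print(1 - np.tanh(0.0) ** 2)              # 1.0
print(sigmoid(0.0) * (1 - sigmoid(0.0)))  # 0.25
```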

Identity Function

The identity function, f(x) = x, amounts to using no activation function at all. This is useful for regression: if we want our outputs to be able to take on any real value, there isn’t much point in warping or constraining them. In that case we simply attach no activation function to our final layer and take the outputs as they are.
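In Keras, for instance, this just means leaving the activation off the final Dense layer (a minimal sketch; the layer sizes and input shape are placeholders, not a recommendation):

```python
import tensorflow as tf

# A tiny regression model: the hidden layers use ReLU, but the final layer has
# no activation (Keras defaults to a linear/identity activation), so the output
# can be any real number.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),   # no activation: the output is taken as-is
])
model.compile(optimizer="adam", loss="mse")
```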

Conclusion

This is a non-exhaustive list of activation functions and their uses, and there are many other nuances to activation functions, backpropagation, and gradient descent still to be learned. If you want to dig deeper, I recommend Andrej Karpathy’s blog post Yes you should understand backprop (which covers issues such as vanishing gradients) and the Efficient BackProp paper by LeCun et al. However, this brief introduction should be more than sufficient to begin building some simple neural networks or fine-tuning pretrained models using libraries such as Keras, PyTorch, or TensorFlow. Thanks for reading!
