This is your activation function cheatsheet for deep learning. Practical usage of activation functions is not hard. You just have to remember a few important details: rules of thumb and tricks.
- What is the input?
- What is the output?
- What is the range of the potential output? Is it [-1, 1] or [0, 1], for example?
- What does the activation function look like if it is graphed?
- What does the derivative of the activation function look like if it is graphed? (It can hint at a vanishing gradient problem.)
- The trick is to graph the activation function if it is hard to understand.
Sigmoid, ReLU, and Softmax are three well-known activation functions used in deep learning and machine learning. They are actually very easy to understand. This article introduces the intuition behind using each of the three.
In general, an activation function lends non-linearity to our model. You can see that none of the activation function graphs is a straight line. As you will see in this article, activation functions are easy for beginners to understand. Sometimes one can be implemented with just one line of code!
No matter how many hidden layers of complexity we add to our model, linear combinations of linear combinations are still linear. Adding activation functions allows our model to handle non-linear data.
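To make the point about linear combinations concrete, here is a minimal sketch with two tiny hand-picked weight matrices. The names W1, W2 and x are made up for illustration and are not taken from any real model.

```python
import numpy as np

# Two hypothetical "layers" with no activation function: just weight matrices.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first layer: 2 inputs -> 2 hidden units
W2 = np.array([[1.0, 1.0]])    # second layer: 2 hidden units -> 1 output
x = np.array([1.0, 2.0])

# Passing x through both layers...
two_layers = W2 @ (W1 @ x)

# ...gives the same result as one layer whose weights are W2 @ W1.
one_layer = (W2 @ W1) @ x
layers_collapse = np.allclose(two_layers, one_layer)

# With a non-linearity (here ReLU) between the layers, the collapse no longer holds.
with_relu = W2 @ np.maximum(0, W1 @ x)
relu_differs = not np.allclose(with_relu, one_layer)
```

Without the activation, the two layers behave exactly like a single linear layer. With ReLU in between, the output changes, which is what lets the network model non-linear data.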
Here are some pro tips to understand activation functions:
You will want to ask the following important questions as you choose which activation function to use.
- What are its strengths and weaknesses?
- What is the range of output value?
- What is the type of output? Numbers, probabilities?
- What is its gradient or slope? In other words, how does the output of the activation function change as the input changes?
Intuition and Implementation for Sigmoid
Sigmoid is easy to understand and implement. The range of sigmoid output is always between 0 and 1.
Here’s how to implement the sigmoid function and its gradient using numpy (needed for e to the power of x).
import numpy as np

# sigmoid(x) = 1 / (1 + e^-x)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# gradient (slope) of sigmoid at x
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# scroll to the bottom to see math proof derivation of sigmoid_prime
Looking at the graph of a sigmoid function, we can see the vanishing gradient problem intuitively: as x increases toward positive infinity or decreases toward negative infinity, the curve approaches its asymptotes and its slope gets closer and closer to zero. The gradient vanishes.
Sigmoid squashes outputs between 0 and 1.
One more advanced use of Sigmoid: output data can usually be normalized to lie between 0 and 1, and sigmoid generates outputs between 0 and 1. Technically, Sigmoid can therefore be used as the activation function to predict any normalized data. For example, pixel values in [0, 255] (the minimum and maximum value of a pixel) can be rescaled to [0, 1]. This makes Sigmoid versatile.
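As a rough sketch of the pixel example, the snippet below rescales pixel values to [0, 1] and maps sigmoid outputs back to the pixel range. The arrays pixels and raw_outputs are made-up illustration values, not data from a real model.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical pixel values in the usual 0..255 range.
pixels = np.array([0, 64, 128, 255], dtype=float)

# Normalize to [0, 1] so they can serve as targets for a sigmoid output.
targets = pixels / 255.0

# Hypothetical raw outputs of a final layer, before the activation.
raw_outputs = np.array([-4.0, 0.0, 4.0])
predicted = sigmoid(raw_outputs)                  # values between 0 and 1

# Map predictions back to the pixel range for display.
predicted_pixels = np.round(predicted * 255).astype(int)
```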
Vanishing Gradient Problem: Notice that Sigmoid becomes very flat as x goes toward positive infinity or negative infinity. This behavior of the graph is called asymptotic. You can see the slope of the graph gets very small, essentially zero. That’s the intuition for the vanishing gradient problem. Even at its steepest point (x = 0), the slope of Sigmoid is only 0.25. Back propagation multiplies these small slopes together, one per layer, so in a deep neural network the resulting weight changes become very, very small. It will be hard for optimizers to make updates.
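The effect can be checked numerically. This is a simplified sketch: it ignores the weights and only multiplies the sigmoid slopes, and the 10-layer depth is an arbitrary assumption chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# The slope peaks at x = 0 and flattens quickly on both sides.
peak_slope = sigmoid_prime(0.0)    # the largest slope sigmoid ever has
far_slope = sigmoid_prime(10.0)    # already very close to zero

# Back propagation multiplies one such slope per layer.
# Even in the best case (every layer at its peak slope), the product shrinks fast.
n_layers = 10                      # assumed depth, for illustration only
best_case_product = peak_slope ** n_layers
```

Here peak_slope is 0.25, and after ten layers the best-case product is already below one millionth, which is why early layers in a deep sigmoid network receive very small updates.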
Intuition and Implementation for ReLU
ReLU sounds hard but is actually the easiest to implement.
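A minimal numpy sketch of ReLU and its slope might look like this. The helper name relu_prime and the choice of slope 0 at exactly x = 0 are assumptions made for this example.

```python
import numpy as np

def relu(x):
    # Return whichever is higher: zero or the input x (element-wise for arrays).
    return np.maximum(0, x)

def relu_prime(x):
    # Slope is 1 for positive inputs and 0 for negative inputs.
    # The slope at exactly x = 0 is undefined; returning 0 there is a common convention.
    return (np.asarray(x) > 0).astype(float)

sample = np.array([-2.0, 0.0, 3.0])
relu_out = relu(sample)          # negative input becomes 0, positive input is unchanged
relu_slope = relu_prime(sample)
```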
The implementation of ReLU is worth memorizing because it is so simple: return whichever is higher, zero or the input x. If x is negative, zero is returned. ReLU will never return a negative value. Zero is the smallest possible output.
ReLU is often applied to the outputs of hidden layers, sometimes to the outputs of every hidden layer. Its role is to turn each output into zero or a positive number, so that hidden-layer outputs are consistently non-negative.
ReLU can have nice convergence properties. There’s also a variation called Leaky ReLU, which turns negative inputs into small negative numbers close to zero instead of exactly zero (the classical ReLU turns all negative numbers into zero). Because the slope for negative inputs is small but not zero, the gradient does not vanish completely there. The small positive coefficient that multiplies negative inputs is called the negative slope.
A leaky ReLU is like a normal ReLU, except that there is a small non-zero output for negative input values. — Udacity
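Leaky ReLU can be sketched in a couple of lines. The default negative_slope of 0.01 is a commonly used value, chosen here as an assumption rather than taken from this article.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    x = np.asarray(x, dtype=float)
    # Positive inputs pass through unchanged; other inputs are scaled by negative_slope.
    return np.where(x > 0, x, negative_slope * x)

leaky_out = leaky_relu([-10.0, 0.0, 5.0])
```

An input of -10 comes out as roughly -0.1: small and still negative, rather than exactly zero as with the classical ReLU.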
Intuition and Implementation for Softmax
For a detailed explanation and implementation of the Softmax function, see our blog post Understand Softmax in Minutes.
Sigmoid squashes each output between 0 and 1. Softmax also squashes each output between 0 and 1, but in addition it outputs a vector of probabilities and makes sure the entire vector sums to 1.
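For a quick feel of that behavior, here is a minimal softmax sketch for a single vector of inputs. Subtracting the maximum before exponentiating is a standard numerical-stability trick added for this example, and the input values are made up.

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

probs = softmax([1.0, 2.0, 3.0])
total = probs.sum()   # the probabilities add up to 1 (within floating-point error)
```

Each entry of probs lies between 0 and 1, and the largest input receives the largest probability.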
Softmax isn’t hard to graph, but it isn’t trivial either. Depending on the range of the input, the graph can look very different zoomed in versus zoomed out. How to graph softmax is something we can cover in another post. Below is an example of what softmax can look like.
A commonly used activation function that is not yet covered in this article is tanh. It is very useful in LSTM recurrent neural networks.
The tanh function squashes input values to outputs between -1 and 1, rather than between 0 and 1. Tanh is used in several steps of an LSTM.
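A short sketch of the tanh range using numpy's built-in np.tanh follows. The comparison with sigmoid uses the standard identity tanh(x) = 2 * sigmoid(2x) - 1, which is added here for illustration and is not part of the original article.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
tanh_out = np.tanh(x)   # close to -1, exactly 0, close to 1

# tanh is a rescaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
same_as_scaled_sigmoid = np.allclose(tanh_out, 2 * sigmoid(2 * x) - 1)
```

Unlike sigmoid, tanh is centered at zero, so negative inputs give negative outputs.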