A neural network is a network of artificial neurons connected to each other in a specific way. The job of a neural network is to learn from given data. The prediction function the network must learn can be highly non-linear, so the activation functions of the artificial neurons are chosen to capture this underlying non-linearity.
An artificial neuron (generally) first computes a linear combination of its inputs, u = wᵀx + b, where w is the weight vector, x is a single input vector, and b is the bias; it then applies a non-linear transformation f(u).
There are a multitude of options available for choosing the non-linear transformation. Some of the prominent ones are as follows.
1. Sigmoid Activation Function
The sigmoid function is f(u) = 1 / (1 + e⁻ᵘ). It takes a real-valued number and “squeezes” it into the range between 0 and 1: large negative numbers become ≈ 0 and large positive numbers become ≈ 1.
For binary classification problems it is used as the activation of the output layer of a neural network. Its drawbacks:
- Can saturate and kill gradients: when the neuron’s activation saturates at 1 or 0, the gradient becomes almost zero, which makes learning difficult.
- Outputs are not zero-centered: since outputs lie in the range 0 to 1, neurons in the next layer receive data that is not zero-centered. Hence, the gradients of the weights w during backpropagation will be either all positive or all negative, which can cause undesirable zig-zagging dynamics in the weight updates. When gradients are summed over all training data in a batch, this problem is less severe than the “saturate and kill gradients” problem.
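The saturation behaviour is easy to see numerically. The sketch below (my own illustration, not from the original text) uses the standard identity f′(u) = f(u)(1 − f(u)) for the sigmoid derivative:

```python
import numpy as np

def sigmoid(u):
    """Squash a real-valued input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_grad(u):
    """Derivative f'(u) = f(u) * (1 - f(u)); nearly zero when |u| is large."""
    s = sigmoid(u)
    return s * (1.0 - s)

u = np.array([-10.0, 0.0, 10.0])
out = sigmoid(u)        # large negatives -> ~0, large positives -> ~1
grad = sigmoid_grad(u)  # gradient vanishes at both tails
```

At u = ±10 the gradient is below 5 × 10⁻⁵, which is the “killed gradient” the bullet above describes.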
2. Tanh Activation Function
The tanh function is f(u) = sinh(u)/cosh(u). It takes a real-valued number and “squeezes” it into the range between -1 and 1: large negative numbers become ≈ −1 and large positive numbers become ≈ 1.
It is preferred over sigmoid because its outputs are zero-centered.
- Can saturate and kill gradients: when the neuron’s activation saturates at 1 or -1, the gradient becomes almost zero, which makes learning difficult.
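A quick numerical sketch (my own, using the identity f′(u) = 1 − tanh²(u)) shows both the zero-centered range and the same tail saturation as sigmoid:

```python
import numpy as np

u = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
out = np.tanh(u)       # zero-centered outputs in (-1, 1)
grad = 1.0 - out ** 2  # derivative 1 - tanh(u)^2, ~0 when |u| is large
```

Outputs are symmetric around 0, but at u = ±10 the gradient has already collapsed to nearly zero.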
3. ReLU Activation Function
The Rectified Linear Unit (ReLU) is f(u) = max(0, u). Its pros and cons:
- Greatly increases training speed compared to tanh and sigmoid
- Computationally cheaper than tanh and sigmoid
- Reduces the likelihood of vanishing gradients, since for u > 0 the gradient has a constant value
- Sparsity: the more inputs with u ≤ 0, the sparser the activations f(u)
- Tends to blow up activations: there is no mechanism to constrain the output of the neuron, as u itself is the output.
- Dead (or closed) ReLU: if inputs tend to make u ≤ 0, then most of the neurons will always receive 0 gradient updates and hence stay closed or dead.
4. Leaky ReLU:
It solves the dead ReLU problem by giving a small non-zero slope for u ≤ 0. Leaky ReLU is as follows: f(u) = u if u > 0, and f(u) = 0.01u otherwise, where 0.01 is the coefficient of leakage.
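A one-line sketch of the definition above (my own illustration), with the leakage coefficient fixed at 0.01:

```python
import numpy as np

def leaky_relu(u, alpha=0.01):
    # Small slope alpha for u <= 0 keeps the gradient non-zero,
    # so the neuron can never go fully "dead".
    return np.where(u > 0, u, alpha * u)

out = leaky_relu(np.array([-2.0, 0.0, 3.0]))  # negatives scaled, not clipped
```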
5. Parameterized ReLU Or PReLU:
It makes the coefficient of leakage α in Leaky ReLU a learnable parameter: f(u) = u if u > 0, and f(u) = αu otherwise.
6. Maxout:
Maxout is a generalization of ReLU, Leaky ReLU and PReLU. It does not have the functional form f(wᵀx + b); instead it computes max(w′ᵀx + b′, wᵀx + b).
Maxout has the pros of ReLU but does not have the dead ReLU issue.
However, it has twice the number of weight parameters to learn (w′ and w).
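A sketch of the two-piece maxout from the definition above (my own illustration; the variable names are assumptions):

```python
import numpy as np

def maxout(x, w1, b1, w2, b2):
    """Two-piece maxout: max(w1.T x + b1, w2.T x + b2).
    With w2 = 0 and b2 = 0 it reduces to ReLU applied to w1.T x + b1."""
    return max(w1 @ x + b1, w2 @ x + b2)

# Recovering ReLU as a special case: second linear piece fixed at zero.
w1, b1 = np.array([1.0]), 0.0
w2, b2 = np.array([0.0]), 0.0
pos = maxout(np.array([3.0]), w1, b1, w2, b2)   # behaves like ReLU: 3.0
neg = maxout(np.array([-3.0]), w1, b1, w2, b2)  # behaves like ReLU: 0.0
```

Note that w′ and b′ are learned alongside w and b, which is exactly where the doubled parameter count comes from.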
7. Softmax:
The softmax function is a generalization of the sigmoid function: sigmoid is used for 2-class (binary) classification whereas softmax is used for multi-class classification. Given logits u₁, …, uₖ, softmax computes f(uᵢ) = e^(uᵢ) / Σⱼ e^(uⱼ). For example, softmax turns the logits [2.0, 1.0, 0.1] into the probabilities [0.7, 0.2, 0.1].
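The worked example above can be reproduced with a short sketch (my own illustration; subtracting the max before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(u):
    # Subtract the max logit for numerical stability; the result
    # is a probability vector that sums to 1.
    e = np.exp(u - np.max(u))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is approximately [0.66, 0.24, 0.10], i.e. ~[0.7, 0.2, 0.1] when rounded
```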
What Activation Function Should I Use?
- For the output layer, use sigmoid or softmax in classification tasks
- For the output layer, use no activation or the purelin (identity) function f(u) = u in regression tasks
- For hidden layers, use the ReLU non-linearity if you carefully set learning rates and monitor the fraction of dead ReLUs in the network
- Otherwise, try Leaky ReLU or Maxout
- Or try tanh, although it may perform worse than ReLU
- Avoid sigmoid in hidden layers
Given f(x) and g(x) are differentiable functions (the derivative exists), and c and n are any real numbers: