Activation functions are the decision-making units of neural networks. They compute the net output of a neural node.
The Heaviside step function is one of the simplest and best-known activation functions in neural networks. It produces a binary output, which is why it is also called the binary step function: it returns 1 (true) when the input passes the threshold and 0 (false) when it does not. That makes it very useful for binary classification.
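As a minimal sketch of the step function described above: the function name, the default threshold of 0, and the choice to treat an input exactly at the threshold as "passing" are assumptions made for illustration.

```python
def heaviside_step(z, threshold=0.0):
    """Return 1 when the input reaches the threshold, otherwise 0."""
    # Assumed convention: an input exactly at the threshold counts as 1.
    return 1 if z >= threshold else 0


# Inputs below the threshold give 0, inputs at or above it give 1.
for z in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(z, heaviside_step(z))
```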
In fact, all traditional logic functions can be implemented by neural networks. Linearly separable ones, such as AND and OR, need only a step function in a primitive network without a hidden layer, widely known as a single-layer perceptron, which is why step functions are commonly used there.
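To make the single-layer case concrete, here is a small sketch of a perceptron with a step activation implementing AND, OR, and NOT. The weights and biases are hand-picked for illustration rather than learned, and the helper names are hypothetical.

```python
def step(z):
    # Binary step activation; an input of exactly 0 is treated as "on".
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    # Weighted sum of the inputs plus bias, passed through the step function.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


# Hand-picked (not learned) parameters, one set per gate.
def and_gate(x1, x2):
    return perceptron((x1, x2), (1, 1), -1.5)


def or_gate(x1, x2):
    return perceptron((x1, x2), (1, 1), -0.5)


def not_gate(x):
    return perceptron((x,), (-1,), 0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", and_gate(x1, x2), "OR:", or_gate(x1, x2))
```

Each gate corresponds to a single straight line in the input plane, with the bias deciding where that line sits.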
Conversely, non-linearly separable problems (e.g. the XOR gate, where XOR = eXclusive OR) require a different approach:
Non-linear simply means that the neuron's output, which is the dot product of the inputs x (x1, x2, …, xm) and the weights w (w1, w2, …, wm) plus a bias, passed through a function (e.g. sigmoid), cannot be written as a linear combination of the inputs x (x1, x2, …, xm).
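One way to see this is to check additivity numerically. In the sketch below, the weights, bias, and input vectors are arbitrary illustrative values, and `neuron` is a hypothetical helper: without an activation the output for a summed input equals the sum of the outputs, while with a sigmoid it does not.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def neuron(x, w, b, activation=sigmoid):
    # Dot product of inputs and weights, plus bias, passed through the activation.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


# Illustrative values; the bias is 0 so the pre-activation is exactly linear.
w, b = (1.0, 1.0), 0.0
x_a, x_b, x_sum = (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)

# Linear case: the output for x_a + x_b equals the sum of the two outputs.
print(neuron(x_sum, w, b, identity), neuron(x_a, w, b, identity) + neuron(x_b, w, b, identity))

# Sigmoid case: the two numbers differ, so the output is not linear in x.
print(neuron(x_sum, w, b), neuron(x_a, w, b) + neuron(x_b, w, b))
```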
In the XOR problem, the two classes (0 and 1) cannot be separated by a single straight line. A multilayer neural network is therefore needed to reproduce the XOR gate's results.
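The following sketch illustrates both halves of that claim under stated assumptions. A brute-force search over a small, arbitrarily chosen grid of weights and biases finds no single step unit that reproduces XOR, while a two-layer network with hand-picked weights (an OR unit and a NAND unit feeding an AND unit) does. The grid search is only an illustration, not a proof.

```python
from itertools import product

XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    return 1 if z >= 0 else 0


def unit(inputs, weights, bias):
    # One step-activated neuron.
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Single unit: try every (w1, w2, b) on a coarse grid from -2.0 to 2.0.
grid = [i * 0.5 for i in range(-4, 5)]
solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(unit(x, (w1, w2), b) == t for x, t in XOR_TABLE.items())
]
print("single-unit solutions found:", len(solutions))


def xor_gate(x1, x2):
    # Hidden layer: OR and NAND, with hand-picked weights.
    h_or = unit((x1, x2), (1, 1), -0.5)
    h_nand = unit((x1, x2), (-1, -1), 1.5)
    # Output layer: AND of the two hidden units.
    return unit((h_or, h_nand), (1, 1), -1.5)


for x, target in XOR_TABLE.items():
    print(x, xor_gate(*x), "expected", target)
```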
In practice, the backpropagation algorithm is used to train multilayer neural networks by updating their weights. It requires a differentiable activation function, because the algorithm uses the derivative of the activation function as a multiplier. But the derivative of the classic step function is 0 everywhere except at the threshold, where it is undefined. Gradient descent therefore cannot make progress in updating the weights, and backpropagation fails. That is why the sigmoid and hyperbolic tangent functions are common activation functions in practice: they are differentiable, and their derivatives are easy to compute.
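A quick numerical check makes the problem visible. This sketch estimates derivatives with a central difference (the helper name and the step size `h` are assumptions) at a few points away from the threshold: the step function's slope is zero there, so any weight update multiplied by it vanishes, while the sigmoid's slope is always positive.

```python
import math


def step(z):
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def numerical_derivative(f, z, h=1e-6):
    # Central-difference estimate of f'(z).
    return (f(z + h) - f(z - h)) / (2 * h)


for z in (-2.0, -0.5, 0.5, 2.0):
    d_step = numerical_derivative(step, z)
    d_sigmoid = numerical_derivative(sigmoid, z)
    print(f"z={z:+.1f}  step'={d_step:.3f}  sigmoid'={d_sigmoid:.3f}")
```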
- Sigmoid Function → With a step activation, a small change in any weight in the input layer of a multilayer perceptron could cause one neuron to flip suddenly from 0 to 1, which could in turn affect the hidden layer’s behavior and then the final outcome. Ideally, we would like a learning algorithm that improves the neural network by changing the weights gradually, with neither a flat non-response nor sudden jumps.
A sigmoid function produces results similar to the step function in that its output lies between 0 and 1. The curve crosses 0.5 at z=0, so we can set up a rule for the activation: if the sigmoid neuron’s output is greater than or equal to 0.5, it outputs 1; if it is smaller than 0.5, it outputs 0. Unlike the step function, the sigmoid has no jump: it is a smooth curve that is differentiable everywhere, and it has a very nice and simple derivative, σ(z) * (1-σ(z)). Last but not least, if z is very negative, the output is approximately 0; if z is very positive, the output is approximately 1; but around z=0, where z is neither too large nor too small (between the two outer vertical dotted grid lines), the output changes comparatively quickly as z changes.
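These properties are easy to check in code. The sketch below assumes the standard logistic definition σ(z) = 1 / (1 + e^(-z)); the branch on the sign of z is only there to avoid overflow for very negative inputs, and the function names are illustrative.

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_derivative(z):
    # The simple closed form: sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)


def classify(z):
    # Thresholding rule: output 1 when the sigmoid is at least 0.5.
    return 1 if sigmoid(z) >= 0.5 else 0


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:+5.1f}  sigma={sigmoid(z):.5f}  slope={sigmoid_derivative(z):.5f}  class={classify(z)}")
```

The slope is largest at z=0 and shrinks toward 0 at both extremes, which matches the flat tails of the curve.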
What a non-linear activation function does, when used by each neuron in a multilayer neural network, is therefore to help produce a new “representation” of the original data, which ultimately allows for a non-linear decision boundary. Hence, in the case of XOR, if we add two sigmoid neurons in a hidden layer, we could, in another space, reshape the 2D graph into something like a 3D surface on which the two classes can be separated, which is equivalent to drawing a more complex decision boundary in the original input space.
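As a rough sketch of this idea, the network below uses two sigmoid hidden neurons and one sigmoid output neuron to compute XOR. The weights are hand-picked for illustration (a trained network would find its own values), and the names `hidden` and `xor_net` are hypothetical. Printing the hidden activations shows the new representation: the two class-1 inputs land on the same point, and a single straight line in hidden space separates the classes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden(x1, x2):
    # Two sigmoid hidden neurons with hand-picked weights:
    # h1 behaves like a soft OR, h2 like a soft NAND.
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)
    h2 = sigmoid(-20 * x1 - 20 * x2 + 30)
    return h1, h2


def xor_net(x1, x2):
    h1, h2 = hidden(x1, x2)
    # Output neuron: a soft AND of the two hidden activations.
    return sigmoid(20 * h1 + 20 * h2 - 30)


for x1 in (0, 1):
    for x2 in (0, 1):
        h1, h2 = hidden(x1, x2)
        out = xor_net(x1, x2)
        print(f"x=({x1},{x2})  hidden=({h1:.3f},{h2:.3f})  output={out:.3f}  class={round(out)}")
```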