Glossary of Deep Learning: Activation Function
Fundamentally, neural networks are layered collections of nodes, each of which receives a set of inputs and decides individually whether to fire and propagate information downstream to subsequent layers.
The inputs are combined with weights and biases local to each node, which are updated by a learning algorithm in response to the observed error on training examples. This enables the neural network to learn patterns in data, where the magnitude of a weight reflects how much its input matters to the result. For an example of how weights contribute to problem solving, see this interactive demo, or this one too.
The weighted input data from all sources is summed to produce a single value (called the linear combination), which is then fed into an activation function that turns it into an output signal.
One of the original designs for an artificial neuron, the perceptron, had a binary output. A perceptron would compare its weighted inputs to a threshold; if the threshold was exceeded, the perceptron would activate and output a 1 (otherwise a 0). But this output is a step function, which is discontinuous at the threshold and flat everywhere else, so its derivative is zero wherever it is defined. That’s not as useful, because differentiation (finding how the error changes as each weight changes) is what makes gradient descent possible.
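As a minimal sketch of the thresholding behaviour described above, the following shows a single perceptron. The function name `perceptron`, the default threshold of 0 and the example weights are assumptions made for illustration, not part of any particular library.

```python
import numpy as np

def perceptron(inputs, weights, bias, threshold=0.0):
    """Output 1 if the weighted sum exceeds the threshold, otherwise 0."""
    linear_combination = np.dot(weights, inputs) + bias
    return 1 if linear_combination > threshold else 0

# Hypothetical weights and bias, chosen only for illustration.
weights = np.array([0.5, -0.25])
bias = 0.0

print(perceptron(np.array([1.0, 1.0]), weights, bias))  # weighted sum 0.25 -> 1
print(perceptron(np.array([1.0, 3.0]), weights, bias))  # weighted sum -0.25 -> 0
```

The output jumps straight from 0 to 1 as the weighted sum crosses the threshold, with nothing in between.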
A better choice would be something like a sigmoid function, which replaces the step thresholding with an S-shaped curve. This activates like a perceptron, firing in response to the sum of its inputs, but the sigmoid function makes the output smooth: continuous and differentiable everywhere. Conceptually, the activation function is what makes decisions: when given weighted features from some data, it indicates whether or not the features are important enough to contribute to a classification.
The sigmoid function also has the very useful property that small changes in the weights and bias cause only a small change in output. That’s crucial to allowing a network of neurons to learn.
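To make the "small change in, small change out" property concrete, here is a rough comparison of a step activation and a sigmoid activation when one weight is nudged slightly. The input values, the weights and the helper names `step` and `sigmoid` are assumptions chosen for the example.

```python
import numpy as np

def step(z):
    """Perceptron-style activation: 1 above zero, otherwise 0."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth S-shaped activation with outputs between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs and bias; only the second weight changes, by 0.02.
inputs = np.array([1.0, 1.0])
bias = 0.0
w_before = np.array([0.5, -0.51])  # weighted sum is about -0.01
w_after = np.array([0.5, -0.49])   # weighted sum is about +0.01

z_before = np.dot(w_before, inputs) + bias
z_after = np.dot(w_after, inputs) + bias

step_change = step(z_after) - step(z_before)
sigmoid_change = sigmoid(z_after) - sigmoid(z_before)

print(step_change)     # the step output flips from 0 to 1
print(sigmoid_change)  # the sigmoid output moves by roughly 0.005
```

With the step function, the same tiny nudge either changes nothing or flips the output completely, which gives a learning algorithm no gradual signal to follow.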
Hence the purpose of the activation function is to introduce non-linearity into the neural network.
This allows us to model a response variable that varies non-linearly with its explanatory variables (one where the output couldn’t be reproduced from a linear combination of inputs). Without a non-linear activation function, the neural network would behave just like a single linear layer, because no matter how many layers it had, the composition of linear functions is always just a linear function (so fitting it with a squared-error loss is effectively a least-squares calculation).
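The collapse of stacked linear layers can be checked numerically. The sketch below assumes two layers with arbitrary random weights (the shapes and the seed are illustrative choices) and shows that, with no activation in between, they are equivalent to one layer whose weights and bias are computed from the originals.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two hypothetical linear layers: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Pass the input through both layers with no activation function.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

Placing a non-linear function between the two layers is what prevents this algebraic simplification.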
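In Python, an activation function looks like this; notice how it’s a non-linear function that’s applied to the dot product of the weights and inputs, plus the bias:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# weights and inputs are NumPy arrays of equal length; bias is a scalar
output = sigmoid(np.dot(weights, inputs) + bias)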
- Which Activation Function should I use? — Siraj Raval video
- DL Book Chapter 6 — detailed walk-through of neural net maths
- List of activation functions — on Wikipedia
- CS231n explains Activation Functions
- “What’s the role of the activation function in a neural network?” — Quora