Mathematical foundation for Activation Functions in Artificial Neural Networks

An Artificial Neural Network (ANN) is based on copying and simplifying the structure of the brain. Like the brain, an ANN is made of many nodes, called neurons, which are connected to one another with varying strengths, similar to synapses.

[If you are just beginning, I have explained the basic premise of how ANNs work at a very high level in the previous post]

The connections between neurons each have a weight that represents the strength of the connection. The weights model the synapses that link neurons in the real brain.

Positive weights are used to excite other neurons in the network and negative weights are used to inhibit other neurons.

As the illustration shows, a simple Feed Forward Neural Network (FFNN) has a layer of input neurons, a layer of hidden neurons and a layer of output neurons. It is called Feed Forward because of the direction of the arrows: the connections and their weights run one way only, from the input layer to the hidden layer to the output layer.

The structure of an ANN can be broken down into its Architecture (the topology of how the network is structured), its Activities (how one neuron responds to another to produce complex behavior) and its Learning Rule (how the weights of the connections change over time with respect to the input, output and error).

The Architecture of an ANN can have many hidden layers, which makes it deep. Such a network is called a Deep Neural Network and is the premise of all things Deep Learning.

Activities — Activation Function of ANN

The most important aspect of an ANN lies in its Activities, also called its Activation Function. The objective of an Activation Function is to introduce non-linearity into the network, which is why the hidden layers of an ANN use non-linear activation functions. Without non-linearity, a Neural Net cannot produce complex behavior. The output of a linear activation function is itself linear, and a composition of linear functions is still a linear function. A linear activation function therefore cancels the benefit of a deep network topology: the whole network reduces to the equivalent of a single layer, even if the topology has a deep architecture.
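To make the collapse to a single layer concrete, here is a minimal sketch in plain Python. The weight matrices W1 and W2, the input x and the helper names matvec and matmul are made up for illustration and do not come from the post's illustrations. In this sketch each row of a matrix holds the weights going into one receiving unit.

```python
# Sketch: two layers with a linear (identity) activation collapse into one.
# W1, W2 and x are illustrative values only (an assumption, not from the post).

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, -2.0, 0.5]]                     # 3 hidden units -> 1 output
x = [2.0, -1.0]

# Pass the input through both layers, with no non-linearity in between.
two_layer = matvec(W2, matvec(W1, x))

# Pre-multiply the weights: one layer with weights W2 * W1 does the same job.
one_layer = matvec(matmul(W2, W1), x)

print(two_layer, one_layer)
```

Both computations give the same output, so the hidden layer adds nothing. Placing a non-linear function between the two matrix products is what breaks this equivalence.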

The basic idea of connectionism is to use simple neuron units that interconnect with each other to produce complex behavior. Without non-linearity, this complexity cannot be achieved.

Consider the above illustration, in which there are many input neurons {x1, x2, … xn} that are all connected to different hidden units {y1, y2, … yn}. Each input neuron is connected to every hidden unit, and the connection from an input unit to a hidden unit has a connection weight Wij, where ‘i’ is the input unit and ‘j’ is the hidden unit.

Let’s zoom in and expand the relationship a bit further to understand how the activation function is applied.

The above illustration provides a view of a single hidden unit, which gets its inputs from multiple input units. Three specific functions are introduced:

  1. Transfer Potential : Aggregates the inputs and their weights.
  2. Activation Function : Applies a non-linear transfer function, or “activation”, to the transfer potential.
  3. Threshold Function : Depending on the activation function, decides whether or not to “activate” the neuron.

The transfer potential can be a simple summation function: the inner (dot) product of the inputs with the weights of the connections.
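As a minimal sketch of this weighted sum, the following plain Python follows the Wij convention used earlier, where W[i][j] is the weight from input unit i to hidden unit j. The function name transfer_potential and the numbers are assumptions made for illustration, and no bias term is included because the post does not mention one.

```python
def transfer_potential(x, W, j):
    """Weighted sum of all inputs feeding hidden unit j: sum_i W[i][j] * x[i]."""
    return sum(W[i][j] * x[i] for i in range(len(x)))

# Illustrative values only: 3 input units feeding 2 hidden units.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0],
     [0.25, 0.0],
     [-1.0, 2.0]]

theta_0 = transfer_potential(x, W, 0)  # 0.5*1 + 0.25*2 + (-1.0)*3
theta_1 = transfer_potential(x, W, 1)  # (-1.0)*1 + 0.0*2 + 2.0*3
print(theta_0, theta_1)
```

The negative weights pull the potential of hidden unit 0 down, which is the inhibiting effect described earlier, while the large positive weight into hidden unit 1 excites it.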

Typically, the transfer potential is an inner (dot) product as illustrated above, but it can be anything. For example, it can be a Radial Basis Function (RBF) such as a Gaussian. It can also be a Multiquadric or Inverse Multiquadric function.
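For comparison with the dot product, here is a hedged sketch of the three radial alternatives in their commonly used forms. Each depends only on the distance r between the input vector and a center. The parameter names center, width and shape are my own labels, and the exact parameterization (for example the factor of 2 in the Gaussian) varies between texts.

```python
import math

def distance(x, center):
    """Euclidean distance r between the input vector and the unit's center."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, center)))

def gaussian_rbf(x, center, width=1.0):
    """Gaussian: exp(-r^2 / (2 * width^2)), largest when x sits on the center."""
    r = distance(x, center)
    return math.exp(-(r ** 2) / (2.0 * width ** 2))

def multiquadric(x, center, shape=1.0):
    """Multiquadric: sqrt(r^2 + shape^2), grows with distance from the center."""
    r = distance(x, center)
    return math.sqrt(r ** 2 + shape ** 2)

def inverse_multiquadric(x, center, shape=1.0):
    """Inverse multiquadric: 1 / sqrt(r^2 + shape^2), decays with distance."""
    return 1.0 / multiquadric(x, center, shape)

print(gaussian_rbf([1.0, 2.0], [1.0, 2.0]))          # input exactly on the center
print(multiquadric([3.0, 4.0], [0.0, 0.0]))          # r = 5
print(inverse_multiquadric([3.0, 4.0], [0.0, 0.0]))
```

Unlike the dot product, these potentials respond to how close the input is to a stored center rather than to how strongly it aligns with a weight vector.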

The activation function should be a differentiable (at least almost everywhere), non-linear function. It needs to be differentiable so that the learning functions can find an error gradient (I will explain this in later posts), and it has to be non-linear to gain complex behavior from the neural net.
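To illustrate what differentiability buys, the sketch below takes the logistic function as one example of a smooth activation and compares its known closed-form derivative, s(x)(1 − s(x)), against a finite-difference estimate. The helper numerical_derivative and the step size h are assumptions chosen for illustration.

```python
import math

def logistic(x):
    """Logistic function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    """Closed-form derivative of the logistic function: s * (1 - s)."""
    s = logistic(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x); h is an illustrative step size."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, logistic_derivative(x), numerical_derivative(logistic, x))
```

Because the slope is available at every point, a learning rule can tell in which direction, and roughly by how much, a small change in the transfer potential changes the output.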

Typically the activation function used is a logistic sigmoid:

σ(θ) = 1 / (1 + e^(−θ))

where θ is the “logit”, which is equal to the transfer potential:

θj = Σi Wij · xi
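Putting the three functions of a single hidden unit together, a minimal sketch might look like this. The function names, the example weights and the 0.5 threshold are assumptions made for illustration; the post does not fix a particular threshold.

```python
import math

def sigmoid(theta):
    """Logistic sigmoid: squashes any real logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))

def hidden_unit(x, weights, threshold=0.5):
    """Run one unit: transfer potential -> activation -> threshold decision."""
    theta = sum(w * xi for w, xi in zip(weights, x))  # transfer potential (logit)
    activation = sigmoid(theta)                       # non-linear activation
    fires = activation >= threshold                   # threshold function
    return theta, activation, fires

# Illustrative inputs and weights only.
print(hidden_unit([1.0, 2.0], [0.5, -0.25]))  # logit of exactly 0
print(hidden_unit([1.0, 2.0], [-1.0, -1.0]))  # strongly inhibited unit
```

A logit of zero sits exactly on the midpoint of the sigmoid, so with this assumed threshold the unit is right at the edge of firing.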

The illustration shows the overall network, with multiple layers feeding the non-linearity forward so that complex behavior emerges from simpler neural units.
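The layer-by-layer feed-forward computation can be sketched in a few lines of plain Python. This is only a sketch under stated assumptions: every layer uses a sigmoid, there are no bias terms, and the weight values and the names layer_forward and feed_forward are hypothetical. W[i][j] is the weight from unit i of one layer to unit j of the next.

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def layer_forward(inputs, W):
    """Compute one layer: transfer potential per unit, then the sigmoid."""
    n_out = len(W[0])
    potentials = [sum(W[i][j] * inputs[i] for i in range(len(inputs)))
                  for j in range(n_out)]
    return [sigmoid(t) for t in potentials]

def feed_forward(x, layers):
    """Feed the activations of each layer into the next, input to output."""
    activations = x
    for W in layers:
        activations = layer_forward(activations, W)
    return activations

# Illustrative weights: 2 inputs -> 3 hidden units -> 1 output unit.
W_hidden = [[0.0, 1.0, -1.0],
            [0.0, -1.0, 1.0]]
W_output = [[1.0], [0.0], [0.0]]

y = feed_forward([1.0, 1.0], [W_hidden, W_output])
print(y)
```

Adding more weight matrices to the list makes the network deeper without changing the loop, which is all that "feeding forward" means here.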

The idea of modeling the network this way is a direct reflection of the activities of the brain, where neurons communicate by firing, or activating each other, through their action potentials. The activation function simulates the “spike train” of the brain’s action potential. This is quite mesmerizing, intuitive and simple to understand.

Activation functions are NOT limited to a logistic sigmoid. They can be any of the following:

(Note that the variable x in the following functions does not represent the inputs but the transfer potential)

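Binary Step function : f(x) = 0 for x < 0, and 1 for x ≥ 0

Hyperbolic Tangent (tanh) : f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Rectified Linear Units (ReLU) : f(x) = max(0, x)

Exponential Linear Units (ELU) : f(x) = x for x > 0, and α(e^x − 1) for x ≤ 0

SoftExponential : f(α, x) = −ln(1 − α(x + α)) / α for α < 0, x for α = 0, and (e^(αx) − 1) / α + α for α > 0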
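The functions listed above can be sketched in plain Python using their commonly cited definitions. The function names and the default values of alpha are my own choices for illustration, and x is the transfer potential, as noted earlier.

```python
import math

def binary_step(x):
    """1 when the potential is non-negative, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def tanh(x):
    """Hyperbolic tangent: like the sigmoid, but with outputs in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified Linear Unit: passes positive potentials, zeroes the rest."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """Exponential Linear Unit: linear for x > 0, smooth curve toward -alpha below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def soft_exponential(x, alpha=0.0):
    """Soft exponential: logarithmic for alpha < 0, linear at 0, exponential above.

    For alpha < 0 the logarithm is only defined while 1 - alpha * (x + alpha) > 0.
    """
    if alpha < 0:
        return -math.log(1.0 - alpha * (x + alpha)) / alpha
    if alpha == 0:
        return x
    return (math.exp(alpha * x) - 1.0) / alpha + alpha

for f in (binary_step, tanh, relu, elu, soft_exponential):
    print(f.__name__, [round(f(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Printing each function at a negative, zero and positive potential shows the main difference between them: how they treat negative potentials.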

As each layer feeds the next, the final output activates one or several of the output neurons to denote the end of the “spike train”. The output neurons that are activated then propose the final answer to a pattern recognition problem, classification problem, anomaly detection problem, etc. Typically, during training, the output is checked for accuracy, and the error is converted into a delta to the weights and fed back.

The error is measured by the cost function of the network, and feeding it back to adjust the weights is called backpropagation. I will introduce these concepts in the next set of posts.

In conclusion, the choice of which activation function to use is completely dependent on the network architecture, the topological structure of the input feature vector, learning functions, cost functions and how the learning is optimized in the neural network.