Perceptron and the Journey towards Artificial Neural Networks (ANN)
Hello dear Readers….we are going to discuss neural networks now!
Before we actually get into neural networks, we should know how they made their journey and where it all began. Let’s get to it!
The idea of the Perceptron was inspired by the biological neuron present in our brains, which transfers information through electrical impulses. All the sweet little things and bitter experiences that we remember are because of interconnected neurons.
So, a “Perceptron is a building block of a Neural Network.”
An adult human brain consists of roughly 100 billion neurons and about 1,000 trillion synaptic connections.
After studying many animals, birds and other mammals, scientists say that we humans have more neurons than we strictly require.
Unfortunately, we don’t utilize them to the fullest compared with other creatures, given that those creatures have a smaller neuron-to-body ratio than we do.
Observe the image to the left: we have Inputs, Weights, a Transfer function, an Activation function and a Loss function (discussed in detail below).
These were formulated taking inspiration from the biological neuron, and the term Perceptron was coined.
Let’s talk simple: whenever we perform any task, before or while performing it, we recollect or try to understand the procedure (or set of instructions) needed to perform the task. Each instruction is given a priority/importance, and we try to avoid overdoing or underperforming the task, so that we get the results we expect.
To correlate between Neuron and Perceptron:
- Instructions = Inputs
- Priority/Importance = Weights
- Executing the instructions = Applying the Transfer function
- Avoiding overdoing and underperforming = Threshold (the Activation function’s role).
Simple, isn’t it? Now, there is a whole lot of calculation and variety available at each and every level, right from assigning weights to the inputs up to the last step, the activation function.
And….instead of applying machine learning algorithms and calculating so much, we can use the simple Perceptron model and classify the following problem easily:
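To make this concrete, here is a minimal Perceptron sketch in Python using the classic Perceptron learning rule (the AND-gate data, learning rate and epoch count below are made up for illustration):

```python
import numpy as np

def step(z):
    # Threshold / activation: fire (1) if the weighted sum crosses 0, else 0
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, epochs=20, lr=0.1):
    # One weight per input plus a bias term (the Θ in the equation later on)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)   # transfer function + activation
            error = target - pred
            w += lr * error * xi             # nudge weights towards the target
            b += lr * error
    return w, b

# A tiny linearly separable toy problem (the AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(step(X @ w + b))   # -> [0 0 0 1]
```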
Why did the Perceptron fail?
The answer is pretty intuitive: A Perceptron fails when the complexity of the task increases.
In the following problem:
A simple Perceptron model fails to classify properly in non-linearly separable spaces.
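A classic illustration is the XOR problem: no single straight line separates its classes, so a lone Perceptron can never get all four points right. A small self-contained sketch (the learning rate and epoch count are arbitrary):

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, 0)

# XOR: output is 1 only when exactly one input is 1 — not linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):                      # far more epochs than AND needed
    for xi, target in zip(X, y):
        error = target - step(np.dot(w, xi) + b)
        w += lr * error * xi
        b += lr * error

preds = step(X @ w + b)
print(preds, "accuracy:", (preds == y).mean())   # accuracy never reaches 1.0
```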
But what if we build a network by connecting these Perceptrons? Will we be able to classify better? The answer is a big YES, and we will be able to classify a lot better!
This thought led to the birth of Artificial Neural Networks (ANNs), which are fully connected feed-forward networks. There are other neural networks, such as CNNs and RNNs, which will be discussed later, not in this article.
Single-layered / Multi-layered Perceptron Model: a fully connected feed-forward network.
Fully connected means each input is connected to each neuron in the following layer.
Feed-forward means that the connections run only in the forward direction, from the inputs towards the outputs.
ANN models are single- and multi-layered Perceptron models. Any network with more than 2 layers is a deep neural network; networks with up to 2 layers are shallow networks.
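As a rough sketch of what “fully connected feed-forward” looks like in code (the layer sizes, random weights and sigmoid activation below are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A tiny network with one hidden layer: 3 inputs -> 4 hidden units -> 2 outputs.
# Every input feeds every hidden unit, and every hidden unit feeds every
# output — that is what "fully connected" means.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    h = sigmoid(x @ W1 + b1)      # hidden layer: weighted sum + activation
    return sigmoid(h @ W2 + b2)   # output layer — data only flows forward

print(forward(np.array([0.5, -1.0, 2.0])))
```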
As promised above, let’s talk about the following:
Inputs, Weights, Transfer function(s), Activation function(s) and Loss function(s).
Let us consider the equation: O = Ψ(w·x + Θ)
Inputs (x): Inputs are the variables/columns in our data set which we use to train our model for prediction.
Weights/Filters (w): Weights for training a model can be assigned equally across all the inputs, or chosen so that some inputs get more or less importance than others.
What happens after the weights are given?
A dot product of the inputs and weights is computed, and this is passed on to the Transfer function.
Transfer function: Here, our transfer function is the summation of the products of each input and its weight. The function then adds the Bias Θ to this sum.
Bias is just like the intercept added in a linear equation. It is an additional parameter in the Neural Network which is used to adjust the output along with the weighted sum of the inputs to the neuron. Moreover, the bias value allows you to shift the activation function to the right or left.
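A quick worked example of O = Ψ(w·x + Θ), taking Ψ to be a sigmoid and showing how changing Θ shifts the neuron’s output (all numbers here are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])      # inputs
w = np.array([0.5, -0.25])    # weights, so w·x = 0.0 for this example

# Transfer function: w·x + Θ, then squashed by the activation Ψ
for theta in (-1.0, 0.0, 1.0):
    z = np.dot(w, x) + theta
    print(f"Θ = {theta:+.1f}  ->  w·x + Θ = {z:+.2f},  O = {sigmoid(z):.3f}")
```

The same inputs produce outputs of roughly 0.27, 0.50 and 0.73 as Θ changes, which is exactly the “shift the activation” effect described above.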
Ψ: Ψ can be explained by understanding the Squashing / Activation / Threshold functions and Loss functions.
As per the above figure, Ψ can be replaced by any of the following functions.
Sigmoid function: We use this function while dealing with binary classification problems i.e., for two classes.
Softmax function: We use this function while dealing with multi-class classification problems i.e., for more than two classes.
Linear function: This function can be used for regression problems.
Tanh function: This function helps the network learn better (its outputs are zero-centered) and can be used while dealing with multi-layered Perceptron models.
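Here is a minimal NumPy sketch of these four activation functions (the test input is arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1) — handy for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns a vector of scores into probabilities that sum to 1 (multi-class)
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def linear(z):
    # Identity — used for regression outputs
    return z

def tanh(z):
    # Squashes into (-1, 1); zero-centered, often used in hidden layers
    return np.tanh(z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), softmax(z), linear(z), tanh(z), sep="\n")
```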
Loss functions:
Squared loss, Absolute loss, Huber loss, Exponential loss, Logistic loss, Hinge loss, Cross entropy loss, multi-class hinge loss.
But above all, how do you select an appropriate loss function?
An ideal loss function should be:
Robust: should be robust to outliers, so that the loss does not explode.
Non-ambiguous: multiple coefficient values should not give the same error.
Sparse: should use as little data as possible.
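To see what “robust to outliers” means in practice, here is a small sketch comparing Squared loss and Huber loss on the same residuals, one of which is an outlier (the numbers are made up):

```python
import numpy as np

def squared_loss(residual):
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    # Quadratic near zero, linear for large residuals — so outliers don't explode
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

residuals = np.array([0.5, 1.0, 10.0])       # the last one is an "outlier"
print("squared:", squared_loss(residuals))   # -> [ 0.125  0.5  50. ]
print("huber:  ", huber_loss(residuals))     # -> [ 0.125  0.5   9.5]
```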
Exponential loss function: L(y, f(x)) = exp(−y · f(x)), for labels y ∈ {−1, +1}.
Logistic loss function: L(y, f(x)) = log(1 + exp(−y · f(x))), for labels y ∈ {−1, +1}.
Hinge loss function: L(y, f(x)) = max(0, 1 − y · f(x)), for labels y ∈ {−1, +1}.
Multi-class Hinge loss: L_i = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1), where s_j is the score for class j and y_i is the correct class.
Cross entropy loss of a Softmax Classifier:
Cross entropy between a true distribution p and a predicted distribution q is defined as: H(p, q) = −Σ_x p(x) · log q(x).
In cross entropy, the predicted scores are converted to probabilities by exponentiating and normalizing, and the true probability is 1 for the right class (and 0 elsewhere). So, the cross entropy loss for example i is written as: L_i = −log( exp(s_{y_i}) / Σ_j exp(s_j) ).
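A minimal NumPy sketch of this Softmax cross-entropy loss (the scores and class indices are made up for illustration):

```python
import numpy as np

def softmax_cross_entropy(scores, true_class):
    # Convert raw scores to probabilities, then take -log of the true class's probability
    e = np.exp(scores - np.max(scores))   # shift for numerical stability
    probs = e / e.sum()
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])   # raw class scores from the network
print(softmax_cross_entropy(scores, true_class=0))   # small loss: class 0 is favoured
print(softmax_cross_entropy(scores, true_class=2))   # larger loss: class 2 has a low score
```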
Optimizers and Error Back Propagation:
Optimizers are used to reduce the loss.
Optimizers and Error Back Propagation will be covered separately as there are a lot of them.
Some of them are Gradient Descent, AdaGrad, AdaDelta, Adam, RAdam, Batch Gradient Descent, Mini-Batch Gradient Descent :)
Error Back Propagation is how the weights get updated, layer by layer, until the error is as small as possible.
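As a tiny preview of how an optimizer reduces the loss by updating weights (a single linear neuron with squared loss; the data, learning rate and step count are made up for illustration):

```python
import numpy as np

# One neuron with a linear output and squared loss: L = 0.5 * (w·x + b - y)^2
x, y = np.array([1.0, 2.0]), 3.0
w, b, lr = np.array([0.1, 0.1]), 0.0, 0.05

for _ in range(100):
    pred = np.dot(w, x) + b
    error = pred - y
    w -= lr * error * x     # dL/dw = (pred - y) * x
    b -= lr * error         # dL/db = (pred - y)

print(np.dot(w, x) + b)     # close to the target 3.0 after training
```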
Single-layered Perceptron Model:
Multi-layered Perceptron Model:
Geometric Pyramid rule for hidden layers:
When the input has m nodes and the output has n nodes, the hidden layer should have about √(m × n) nodes.
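For example (hypothetical sizes), with m = 6 inputs and n = 2 outputs, the rule suggests √(6 × 2) ≈ 3.46, i.e. about 3 or 4 hidden nodes:

```python
import math

m, n = 6, 2                      # input and output nodes (example values)
hidden = math.sqrt(m * n)        # geometric pyramid rule
print(round(hidden, 2), "->", round(hidden))   # 3.46 -> 3
```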
Network Topology:
For a network with a single hidden layer, the number of weights (including the biases) is: H * (I + O) + H + O, where H = number of units in the hidden layer, I = number of input features, O = number of output nodes.
Example: with 6 inputs, 10 units in the hidden layer and 2 output nodes, the network will have 10 * (6 + 2) + 10 + 2 = 92 weights.
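A quick sanity check of that count, simply by summing the sizes of the weight and bias arrays for the example’s layer sizes:

```python
import numpy as np

I, H, O = 6, 10, 2   # inputs, hidden units, outputs

W1, b1 = np.zeros((I, H)), np.zeros(H)   # input -> hidden
W2, b2 = np.zeros((H, O)), np.zeros(O)   # hidden -> output

total = sum(a.size for a in (W1, b1, W2, b2))
print(total, "==", H * (I + O) + H + O)   # 92 == 92
```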
Thank you for reading this article this far….see you in the next one!