An Introduction to the Mathematics Behind Neural Networks

Gautham S · Published in Analytics Vidhya · 9 min read · Aug 3, 2020

Machines have been at our service since the advent of the Industrial Revolution. Not only do they boost our productivity, they also form a cardinal factor in determining a nation’s economy. They have undergone a range of improvements over the past few centuries and have taken a wide variety of forms since their first appearance. But one thing has remained more or less the same: their explicit dependency on human minds for commands. Over the last couple of years, research has been carried out to endow machines with intelligence, and thus the terms “Artificial Intelligence” and “Machine Learning” came into existence. My previous blog, “A Brief Introduction to the Term Machine Learning”, gives a clearer insight.

Neural Networks have evolved into an important branch of Artificial Intelligence. The definition and working of Neural Networks were given in the blog “Neural Network: An Art to Mimic Human Brain”. In this section, I will provide a detailed explanation of the mathematics behind these networks. From my perspective, it is just as important to know how these networks work as to know what they are, and for that one should have a clear idea of the maths behind them. That understanding gives you the flexibility to play around with different things to obtain the results you want.


A Neural Network is basically a dense interconnection of layers, which are further made up of basic units called perceptrons. A perceptron consists of input terminals, the processing unit and the output terminals. The input terminals of a perceptron are connected to the output terminals of the preceding perceptrons.

A single perceptron receives a set of n input values from the previous layer. It calculates a weighted sum of the input vector, based on a weight vector w, and adds a bias to the result. This result is passed through a non-linear activation function, which forms the output of the unit. The following figure illustrates the summation of the n inputs after multiplication with the corresponding weights. Prior to taking the output ŷ, the weighted sum is passed through an activation function; in some cases, a bias is added to the weighted sum before the activation phase.

Single Perceptron (Source: Internet)
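To make this concrete, here is a minimal sketch of a single perceptron in Python with NumPy; the sigmoid activation and the example numbers are arbitrary illustrative choices:

import numpy as np

def sigmoid(z):
    # Non-linear activation squashing the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias, passed through the activation
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # example input vector (n = 3)
w = np.array([0.4, 0.1, -0.6])   # example weight vector
b = 0.2                          # example bias
y_hat = perceptron(x, w, b)      # output of the unit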

Now let us extend the scenario from a single perceptron to a dense stack of multiple perceptrons, which collectively constitute a layer. A network having three layers is shown in the figure below.

In this figure, L1, L2 and L3 represent a cascade of three layers, where each layer is a stack of perceptrons piled one upon another. For mathematical modelling, let us vectorize all values, including the input, the output and the intermediate weights. Keeping L2 as the reference layer, the input vector arises from the output of layer L1, whereas the output vector of L2 is fed as the input of L3.

Moving to the general notation for a layer, we use the vector X to denote the input, where X = [x1, x2, x3]. The output vector ŷ comprises the respective outputs of the four perceptron units in layer L2, where ŷ = [y1, y2, y3, y4]. Moving on to the weights: a weight is a parameter which transforms the input data within the network. Each layer is characterised by a unique weight matrix W. Here, L2 has a weight matrix W, and each element of the matrix is represented as W[r,q], where r and q denote the row and column respectively.

The components of W are the weights connecting the input elements to the corresponding perceptrons in a given layer. The first index, r, represents the element of the input vector X that enters the layer L2. The second index, q, represents the perceptron in layer L2 that the input enters.

The weighted sum of the input vector with the respective weights can be mathematically modelled as the dot product of the corresponding vectors. Using the dot product, we multiply the input vector X by the transpose of the weight matrix W; transposing is done in order to match the dimensions of W and X for the dot product. After this, we add the bias vector using matrix addition. These steps are collectively termed forward propagation. The mathematical equation turns out to be:
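z = Wᵀ·X + b

where z denotes the weighted sum (the pre-activation output of the layer) and b is the bias vector.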

Now, there is one last step in the forward propagation process, which is the non-linear transformation by an activation function. Let us understand the concept of non-linear transformation and its role in the forward propagation process.

Traditional machine learning algorithms were largely based on the assumption that the relation connecting the input and the output labels is linear, and they emphasise linearity while deriving their equations. But reality turns out to be different: most real-world phenomena are in fact non-linear in nature, and a linear transformation alone cannot capture such complex relationships. Thus, we introduce a new component into the network which brings non-linearity to the data. This addition to the architecture is called the activation function.

Activation functions affect the accuracy of a deep learning model and the computational efficiency of training it. They also have a major effect on the neural network’s ability to converge, that is, to find the optimal weights and biases. Without them, our neural network would reduce to a composition of linear functions, which is itself just a linear function. Different activation functions are employed, of which ReLU, sigmoid, tanh and softmax are the most extensively used.
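As a sketch of how one such layer could be computed (the 3-input, 4-perceptron shapes follow the example above, and ReLU and sigmoid are shown purely as two common choices):

import numpy as np

def relu(z):
    # Zeroes out negative values, leaves positive values unchanged
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(X, W, b, activation=relu):
    # z = W transpose . X + b, then the activation applied element-wise
    z = np.dot(W.T, X) + b
    return activation(z)

X = np.array([1.0, 0.5, -0.3])   # 3 inputs coming from the previous layer
W = np.random.randn(3, 4)        # W[r, q]: weight from input r to perceptron q
b = np.zeros(4)                  # one bias per perceptron
y_hat = layer_forward(X, W, b)   # output vector of the 4 perceptrons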


After forward propagation, each layer gives out an output vector ŷ, which is passed on to the next layer as its input vector, and this process continues till the last layer, where we obtain the output of the whole network. During the initial forward propagation (i.e. when iterating through the network for the first time), we randomly initialise the weights and biases. These values are the parameters of the neural network. We then have to tune these parameters in accordance with the dataset, or, more generally, the problem statement. This tuning of the weights and biases is done with the help of another algorithm called backward propagation.
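To make the layer-to-layer hand-off concrete, here is a minimal sketch of a forward pass through a small, randomly initialised network; the 3 → 4 → 2 layer sizes and the sigmoid activation are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised weights and biases for a 3 -> 4 -> 2 network
layers = [(np.random.randn(3, 4), np.zeros(4)),
          (np.random.randn(4, 2), np.zeros(2))]

x = np.array([0.8, -0.1, 0.4])   # input vector fed to the first layer
for W, b in layers:
    # The output of each layer becomes the input of the next
    x = sigmoid(np.dot(W.T, x) + b)

y_hat = x                        # output vector of the whole network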

We have the output vector ŷ at the end of the network. The error of this output is calculated with respect to the actual expected output, since this is supervised learning after all. The errors over all the data points are summed up to obtain the total error, and the end goal is to keep this total error as low as possible by tuning the weights and biases of the network. In order to build a mathematical model, we construct the loss function, a mathematical equation connecting the total error with the weights and biases of the network. A loss function maps a set of parameters onto a scalar value which signifies how well these parameters accomplish the desired results: if your predictions are poor, the loss function outputs a higher number; if they are good, it outputs a lower number. A generalised equation of the loss function, taking the mean squared error as a representative example, is given by:
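Loss(y, ŷ) = (1/n) · Σᵢ (yᵢ − ŷᵢ)²,  for i = 1, …, n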

y : the actual output

ŷ : the predicted output

n : the total number of data points

where ŷ is given by:
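ŷ = f(z) = f(Wᵀ·X + b)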

f(z) in the above equation is the activation function applied on the weighted sum of the input vector. From these equations, we can see that the loss function relates the total loss incurred by a model to its set of parameters, the weights and biases. We are still left with the task of finding the optimal parameters for which the loss function yields the minimum value. This task is entrusted to an optimisation algorithm called gradient descent.

“Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point,” says Wikipedia, which is a succinct definition.
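As a tiny illustration of this definition, here is a sketch of gradient descent finding the minimum of a simple differentiable function, f(w) = (w − 3)²; the function, starting point, learning rate and step count are all arbitrary choices:

# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 10.0        # arbitrary starting point
alpha = 0.1     # learning rate (step size)

for step in range(100):
    gradient = 2.0 * (w - 3.0)
    w = w - alpha * gradient   # step proportional to the negative gradient

# After the loop, w is very close to 3.0, the minimum of the function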

In the backward propagation process, the model tries to update the parameters such that the overall predictions are more accurate. The forward propagation traverses from “left-to-right” or from the beginning of the network to the end, whereas the backward propagation moves from “right-to-left” or from the end of the network to the beginning.

How do neural networks learn?

The learning process is nothing but changing the values of the W (weights) and b (biases) parameters so as to minimise the loss function. Each point from the dataset is forward propagated and its loss is obtained. The information about the error is then propagated backwards through the network, so that the parameters can be altered accordingly.

For each iteration, we obtain the values of the partial derivatives of the loss function with respect to each of the parameters of our network, which can be decomposed as below using the chain rule.
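Using the same notation as before, where z = Wᵀ·X + b is the weighted sum and ŷ = f(z) is the prediction:

∂Loss(y,ŷ)/∂W = (∂Loss(y,ŷ)/∂ŷ) · (∂ŷ/∂z) · (∂z/∂W)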

The detailed definition of each term in the above equation is as follows.
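∂Loss(y,ŷ)/∂ŷ : the derivative of the loss function with respect to the prediction ŷ

∂ŷ/∂z : the derivative of the activation function f, evaluated at the weighted sum z

∂z/∂W : the derivative of the weighted sum with respect to the weights, which is simply the input vector X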

In order to use the computed gradients to learn the model coefficients, we simply update the weights W by taking a step in the direction opposite to the gradient on each pass. The algorithm terminates when we reach a minimum, a point with (approximately) zero gradient. The equation for the weight update is:

W = W − α · (∂Loss(y,ŷ)/∂W), where α is the learning rate and ∂Loss(y,ŷ)/∂W is the partial derivative of the loss function with respect to the weights.
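A minimal sketch of this update rule, using the single-layer, sigmoid-plus-mean-squared-error setup assumed earlier (the toy data, learning rate and iteration count are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: 5 samples with 3 features each, and their expected outputs
X = np.random.randn(5, 3)
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

W = np.random.randn(3)   # randomly initialised weights
b = 0.0                  # randomly initialised bias
alpha = 0.1              # learning rate

for step in range(1000):
    # Forward propagation: predictions for every sample
    y_hat = sigmoid(X.dot(W) + b)

    # Mean squared error over the n data points
    loss = np.mean((y - y_hat) ** 2)

    # Backward propagation via the chain rule:
    # dLoss/dy_hat * dy_hat/dz gives the error signal at the weighted sum z
    dz = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat) / len(y)
    dW = X.T.dot(dz)       # dz/dW is the input, so the gradient is X^T . dz
    db = np.sum(dz)

    # Step in the direction opposite to the gradients
    W = W - alpha * dW
    b = b - alpha * db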

Training is an iterative process: to fetch better results, we perform multiple training iterations, each of which reduces the total error of the neural network. As the weights are updated, the network produces more desirable outputs, so the total error should fall as it is trained. A graphical representation of gradient descent trying to find an optimal minimum of the loss function is shown below. The weights and biases corresponding to this optimal minimum are taken as the final parameters of the network.

Source: Gradient Descent 3D — Visualization by Christopher Gondek

A wide range of optimisation techniques has evolved in order to further improve the gradient descent algorithm. We are also left with the flexibility to fine-tune the speed at which the algorithm operates. Along with that, we have to make sure that the descent does not get stuck in a local minimum instead of reaching the global minimum. I will plunge into all these topics in my upcoming blogs.


In this blog, I have presented the mathematics that takes place inside Neural Networks and how exactly they work. Comprehending the basics of this process can be very helpful, and I hope it has galvanised you to dive deeper into the topic.
