The mathematics of Neural Networks

Phani Raj Chinnalingu
10 min read · Apr 30, 2018


It has been a long time since neural networks and deep learning shook the world of machine learning and AI as a whole, yet very few people are actually aware of the mathematics that happens behind the scenes as neural networks are trained and executed.

With machine learning frameworks such as TensorFlow and PyTorch becoming more popular with each passing day, the barrier to entry into deep learning is shrinking rapidly. However, there is a worrying trend: an increasing number of people are entering the domain with little to no mathematical knowledge of it. These are the people who, when asked to explain how a neural network trains, answer something along the lines of "it's backpropagation". I agree that abstractions are useful in practical scenarios where you do not care how the algorithm gives you the result, but backpropagation is a leaky abstraction, and one should understand how the network trains (even if we still don't fully understand what the network is doing when crunching data under the hood).

In this article, I will attempt to explain backpropagation assuming only a high-school background in basic algebra, vectors and calculus. So without further ado, let's dive right in.

The Taylor series

Assume you have a function f(x), and you want to write it as an infinite sum of polynomial terms of increasing degree:
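f(x) = b₀ + b₁(x − a) + b₂(x − a)² + b₃(x − a)³ + b₄(x − a)⁴ + …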

Since any function f(x) can be approximated arbitrarily well by an nth-degree power series, the above equation is valid. What we want to find are the values of the b's when f(x) is approximated near the point a (i.e., when we want to approximate the value of f(x) at a point with some finite-degree polynomial).
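Setting x = a makes every term except the first vanish, and differentiating both sides repeatedly before setting x = a isolates the later coefficients one at a time:

f(a) = b₀
f′(a) = b₁
f″(a) = 2b₂
f‴(a) = (2 · 3)b₃
…
f⁽ⁿ⁾(a) = n! bₙ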

Thus, from these above equations you can see that:
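bₙ = f⁽ⁿ⁾(a) / n!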

From this we can reformulate our first equation as:
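f(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)² + (f‴(a)/3!)(x − a)³ + …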

The above power series is called the Taylor series of a function f(x) that is infinitely differentiable at the point a. This equation forms the backbone of some of the most frequently used optimization techniques in machine learning, like gradient descent, which we will look at in just a moment.

Functions in higher dimensions

In machine learning and deep learning we often have more than one input parameter, so the function is generally a multi-dimensional function (e.g., an n-dimensional function); thus rather than a scalar input we have a vector input:
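x = (x₁, x₂, …, xₙ)ᵀ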

What will the Taylor series of f(x) be when x is a vector? For the sake of simplicity let us approximate f(x) at point a with only the first two terms:
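f(x) ≈ f(a) + f′(a)(x − a)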

Note here that a is also a vector, and f(x) is a function that takes in an input vector x and outputs a scalar. Now the real question is: what is f′(a)(x − a)? Here (x − a) and f′(a) (I will get to that in a moment) are both vectors, so the first thing one should realize is that the equation above, read as an ordinary product, makes no sense in a general n-dimensional scenario. But both vectors have the same dimension (again, I will get to that), so a dot product of the two will in fact give you a scalar. Thus, using this hand-waving:
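f(x) ≈ f(a) + f′(a) · (x − a)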

Now, what really is f′(a)? Well, it is the function differentiated with respect to the vector x, evaluated at the point a (which is also a vector). When you differentiate a scalar function with respect to a vector, you end up with another vector of the same dimension; more specifically:
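∂f/∂x = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ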

This is also called the gradient of f(x) with respect to x, denoted by ∇f. Therefore, the gradient of the function f(x) at point a is:
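g = ∇f(a) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ, with each derivative evaluated at x = a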

Here g is the gradient of the function f(x) at point a, and therefore:
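f(x) ≈ f(a) + g · (x − a)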

Ok, this should be all the hand-waving arguments that we will need; now let's get into the interesting parts.

The Gradient Descent Algorithm

How do we approximate a local minimum of f(x)? Let us define g(n), w(n+1) and Δw(n) as:
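g(n) = ∇f(w(n))
w(n+1) = w(n) − η g(n)
Δw(n) = w(n+1) − w(n) = −η g(n)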

where η is a positive constant called the step size, or learning rate, and g(n) is the gradient vector evaluated at the point w(n). Now, by Taylor series expansion, if we approximate f(w(n+1)) at the point w(n) with only the first two terms of the Taylor series, we get:
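f(w(n+1)) ≈ f(w(n)) + g(n) · Δw(n) = f(w(n)) − η ‖g(n)‖²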

which shows that, for a positive learning-rate parameter η, f(w(n+1)) ≤ f(w(n)); thus the gradient descent algorithm is simply to update

w(n+1) ← w(n) - η ∇f(w(n))

till the gradient converges to zero, which would imply that w(n) has converged to a local minimum of f(x).

Please note: the above inequality may not always hold. Here I am assuming that f(x) is approximated fairly well by the first two terms of its Taylor series; if this is not true, the algorithm may not work.
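To make this concrete, here is a minimal sketch of gradient descent in Python; the function, starting point and learning rate are arbitrary illustrative choices of mine:

```python
import numpy as np

def f(w):
    # An arbitrary smooth function to minimize: f(w) = (w1 - 3)^2 + (w2 + 1)^2
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_f(w):
    # Gradient of f, computed analytically
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

eta = 0.1                      # learning rate (step size)
w = np.array([0.0, 0.0])       # arbitrary starting point

for n in range(100):
    g = grad_f(w)              # g(n): gradient evaluated at w(n)
    if np.linalg.norm(g) < 1e-8:
        break                  # gradient ~ 0: (approximately) converged
    w = w - eta * g            # w(n+1) <- w(n) - eta * g(n)

print(w)  # should approach the minimum at (3, -1)
```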

Neural Network

Finally, we are at a point where we can talk about neural networks:

Ain’t this a cliché

Now I am sure you have seen the above picture before (unless you are living under a rock). Let's just mathematically define the above picture, starting with the simplest thing, the input: it is basically just x, with each circle denoting a specific value. For the above picture that would be x₁, x₂ and x₃, which together form the input vector x. All the other neurons except for the input are individually called perceptrons:

Basically, what a perceptron does is multiply each input (denoted by a's in the above picture) that it gets with a specific weight (the weights are denoted by the w arrows, and each arrow represents a different weight), add them all up, add another term called the bias (denoted as b in the above picture), pass the sum through some function (denoted as σ(.) in the above picture), and output the result.
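In symbols, if a perceptron receives inputs a₁, a₂, …, aₙ with weights w₁, w₂, …, wₙ and bias b, it outputs:

σ(w₁a₁ + w₂a₂ + … + wₙaₙ + b) = σ(Σᵢ wᵢaᵢ + b)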

What we want is for the neural network to give us the desired outputs in the output layer. But how do we make the neural network understand what we want? We give it a lot of examples (i.e., we give it access to the corresponding output y for some given input x).

It is easy to see that the neural network as a whole is just a function that takes in an input x and outputs f(x). So, mathematically, what we want is that for a given input x, the error between f(x) and y be as small as possible. For this purpose we shall define a loss function ℒ(f(x), y); thus, mathematically, training a neural network is quite simply:
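w* = arg min over w of Σ ℒ(f(x), y), the sum running over all training pairs (x, y) in D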

Where w is the vector of all the weights (including the biases) of each node in the neural network, and w* is the ideal weight vector such that the sum of the loss function values over all training pairs x and y coming from the training data D is as small as possible. Because we want to solve this equation, it was necessary to tell you about the gradient descent algorithm before going directly into neural networks. Given fixed values of the x's and y's, the loss function depends entirely on f(x), and when the x's are fixed, f(x) depends entirely on the weight vector w. Thus, the above equation is just a function of w on which we can perform gradient descent to arrive at a solution (i.e., train the network).

The Backpropagation algorithm

Ok, so finally we are here at the crux of the matter: the backpropagation algorithm. So what does it do? It basically gives you a very convenient way of calculating the gradient with respect to the weight vector so that you can perform gradient descent. In order to properly explain how the algorithm gives us the gradient vector, let's concentrate on a sub-part of the network:

As in the above diagram, let us consider three hidden layers of the neural network, l, i and j respectively. What we want to find are the partial derivatives of the loss function ℒ(w) with respect to each and every entry of the weight vector w; mathematically, what we want is:
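∂ℒ/∂wᵢⱼ, for every weight wᵢⱼ in w

Here wᵢⱼ denotes the weight connecting neuron i in one layer to neuron j in the next.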

Please note that this formulation also covers the input layer. If we also want to take into account the bias term of each neuron, the simplest thing one can do is add to each layer an extra neuron that always outputs 1. The weight connecting that extra neuron to each neuron in the next layer then behaves like that neuron's bias.

Let us initialize the weight vector w to some random values.

Now, let us define the constructs that appear in the above diagram:
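Assuming the usual notation, let oᵢ denote the output of neuron i, and netⱼ the weighted sum entering neuron j of the next layer:

netⱼ = Σᵢ wᵢⱼ oᵢ
oⱼ = σ(netⱼ)

We will also write δᵢ for ∂ℒ/∂oᵢ, the sensitivity of the loss to the output of neuron i.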

Now, let's begin the derivation of this algorithm. I hope you remember the chain rule of derivatives:
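If z depends on y, and y in turn depends on x, then:

dz/dx = (dz/dy) · (dy/dx)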

Similarly, you would expect:
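∂ℒ/∂oᵢ = (∂ℒ/∂oⱼ) · (∂oⱼ/∂oᵢ) = δⱼ · (∂oⱼ/∂oᵢ)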

However, by writing this you are implying that a change in a single neuron in layer i will only affect a single neuron in layer j. But we know from the neural network architecture that a change in the output of a single neuron in a hidden layer will affect all the neurons in the next layer; thus, the correct formulation is:
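δᵢ = ∂ℒ/∂oᵢ = Σⱼ (∂ℒ/∂oⱼ) · (∂oⱼ/∂oᵢ) = Σⱼ δⱼ · (∂oⱼ/∂oᵢ)

where the sum runs over all neurons j in the next layer.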

Similarly, the second part of the derivative can be written as:
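∂oⱼ/∂oᵢ = (∂oⱼ/∂netⱼ) · (∂netⱼ/∂oᵢ) = σ′(netⱼ) · wᵢⱼ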

Combining everything we get:
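δᵢ = Σⱼ δⱼ · σ′(netⱼ) · wᵢⱼ
∂ℒ/∂wᵢⱼ = δⱼ · σ′(netⱼ) · oᵢ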

So we can see that the 𝛿 of each layer is dependent on the 𝛿 of the next layer. Now if we can only find the 𝛿 of the very last layer (the output layer), the algorithm will be complete.

Let's assume for the sake of simplicity that the output layer has only one neuron (it is left to the reader to extend this to multiple output neurons); let this be layer k. Then we can write:
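δₖ = ∂ℒ/∂oₖ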

But remember that the loss metric ℒ(f(x), y) is a function of f(x), which is nothing but the output of the neural network for the input vector x; thus:
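δₖ = ∂ℒ(f(x), y)/∂f(x), since oₖ = f(x)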

And this completes the derivation of the backpropagation algorithm. All the reader needs to do now is pick an appropriate loss function ℒ(f(x), y) and activation function σ, and replace the derivatives in the algorithm with the derivatives of those functions. You can now also see why this is called the backpropagation algorithm: the error is propagated backwards from the very last layer to the very first layer, and in the process the gradient vector of w is calculated.

Demo

Now, I am well aware that this blog post is already quite long, so I am not going to go through the implementation of the code here. However, if you want to try out the algorithm for yourself, here is the link. In that blog the author uses the least mean squares (LMS) loss function along with the sigmoid activation function. Please note that in the code you will find the author using '+' when updating the weights with gradient descent when it should actually be '-'; the code still works correctly because he is using the negative of the derivative of the loss function.
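For the LMS loss ℒ(f(x), y) = ½(y − f(x))², the negative of the derivative with respect to the output is exactly the error term:

−∂ℒ/∂f(x) = y − f(x)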

It is up to the reader to prove the above equation.
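If you would rather see the whole derivation in runnable form, here is a minimal sketch of my own (not the code linked above), using the LMS loss and sigmoid activation on a toy XOR problem; the layer sizes, learning rate, epoch count and data are illustrative choices:

```python
import numpy as np

# Minimal sketch: one hidden layer, sigmoid activations, LMS loss.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z)(1 - sigma(z))

# Toy data: learn XOR (4 examples, 2 inputs, 1 output)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initialization of the weights and biases
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
eta = 1.0  # learning rate

for epoch in range(5000):
    # Forward pass: net_j = sum_i w_ij * o_i + b_j, o_j = sigmoid(net_j)
    net1 = X @ W1 + b1; o1 = sigmoid(net1)
    net2 = o1 @ W2 + b2; o2 = sigmoid(net2)

    # Output layer: delta_k = dL/do_k = -(y - f(x)) for the LMS loss
    delta2 = -(Y - o2)

    # Backpropagate: delta_i = sum_j delta_j * sigma'(net_j) * w_ij
    delta1 = (delta2 * sigmoid_prime(net2)) @ W2.T

    # Gradients: dL/dw_ij = delta_j * sigma'(net_j) * o_i
    gW2 = o1.T @ (delta2 * sigmoid_prime(net2))
    gb2 = np.sum(delta2 * sigmoid_prime(net2), axis=0)
    gW1 = X.T @ (delta1 * sigmoid_prime(net1))
    gb1 = np.sum(delta1 * sigmoid_prime(net1), axis=0)

    # Gradient descent step: w <- w - eta * dL/dw
    W1 -= eta * gW1; b1 -= eta * gb1
    W2 -= eta * gW2; b2 -= eta * gb2

out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(out, 2))  # should approach [0, 1, 1, 0]
```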
