Neural Networks : The Math

Shashank Ravi
6 min read · Nov 24, 2022

Neural Networks stand out as one of the biggest computational achievements of our time. These algorithms are often treated as the holy grail of Artificial Intelligence research. The loose resemblance of a Neural Net’s architecture to the human brain leads many to believe that Artificial General Intelligence (AGI), if achieved, will be a series of interlinked Neural Nets. This article tries to condense the complex mathematics of Neural Networks into a few simple equations. I assume here that the reader is familiar with the basics of Linear Algebra and Differential Calculus.

Image Source : V7 labs

Start Small : Perceptrons !

No ! These are not Transformers from an alien planet. The Perceptron builds on the McCulloch-Pitts Neuron, the simplest model of an artificial neuron proposed by McCulloch and Pitts in 1943, and was later implemented by Frank Rosenblatt at the Cornell Aeronautical Laboratory in a project funded by the United States Department of Defense. (Groovy, I know !)

Perceptrons are simple Neural Nets that consist of n inputs, a single neuron and only one output. The number of inputs here refers to the number of features in our data set.

The process by which our data is passed through a Neural Network is known as Forward Propagation. This is how a Perceptron performs forward propagation :

Consider an input value xᵢ and a weight value wᵢ . Weights are nothing more than values by which we multiply our inputs; they determine the strength of the connection between two neurons and decide how much influence an input has on the output.

STEP 1: Multiply the input values xᵢ with the weight values wᵢ and sum them up. That should look something like this :
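x₁w₁ + x₂w₂ + x₃w₃ + … + xₙwₙ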

Here, x = [x₁, x₂, x₃, …, xₙ] and w = [w₁, w₂, w₃, …, wₙ] are called the row vectors of the inputs and weights respectively. The dot product of the two vectors x and w is mathematically equal to the summation above, i.e. :
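x · w = Σᵢ xᵢwᵢ = x₁w₁ + x₂w₂ + … + xₙwₙ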

STEP 2: Adding Bias

A bias is a constant that ensures that even when all inputs are 0, there will be some activation in the neuron. (Don’t worry about this too much for now)

Add a bias b to the summation we obtained from Step 1. Let’s call this value z. This equation should now look like :
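z = x · w + b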

STEP 3: Pass the value of z obtained from Step 2 into a non-linear activation function.

An Activation function is a function that decides whether a neuron should be activated or not, i.e. it decides how strongly the input influences the output. This is done using a couple of simple mathematical operations. The activation function needs to be non-linear so that our Neural Network can adapt to a wide variety of data and model relationships that are not simply straight lines.

We will use the Binary Sigmoid Function for the purposes of this article, though the classic Perceptron used the Binary Step Function as its activation function :
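ŷ = σ(z) = 1 / (1 + e^(−z))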

Here, y-hat denotes the output obtained after forward propagation and Sigma denotes the sigmoid activation function.
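To make these three steps concrete, here is a minimal NumPy sketch of forward propagation for a single perceptron. The function and variable names are my own, purely for illustration, not a reference implementation :

```python
import numpy as np

def sigmoid(z):
    # Binary sigmoid activation: squashes z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    # Step 1 + Step 2: weighted sum of the inputs plus the bias
    z = np.dot(x, w) + b
    # Step 3: pass z through the non-linear activation
    return sigmoid(z)

x = np.array([0.5, 0.2, 0.1])   # one sample with 3 features
w = np.array([0.4, 0.7, -0.2])  # one weight per feature
b = 0.1                         # bias
y_hat = forward(x, w, b)
print(y_hat)
```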

LEARNING ALGORITHMS

The learning algorithm of a Neural Net usually consists of two parts : Back-Propagation and Optimization.

Backward Propagation, or back-propagation of errors, is the algorithm used to compute the gradient (slope) of a loss function. A loss function is a function that compares the predicted values with the target values, i.e. it measures how well the neural network is performing.

Back-Propagation consists of two steps :

STEP 1: Selecting the Loss function. Loss functions are generally selected with respect to the problem the Neural Net is looking to solve. For prediction/regression problems, Mean Squared Error (MSE) is the commonly used Loss Function. For a single sample, the loss is the square of the difference between the actual value (yᵢ) and the predicted value ( ŷᵢ ) :
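Loss = ( yᵢ − ŷᵢ )²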

The average of the loss, calculated over the entire data set, is called the Cost Function ( C ) :
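C = (1/N) Σᵢ ( yᵢ − ŷᵢ )²

where N is the number of samples in the data set.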

STEP 2: In mathematics, when we need to figure out how one quantity changes relative to another quantity, we compute gradients. In order to determine the ideal weights and bias for our Neural Net, we shall compute the gradient of our cost function with respect to the weights and the bias. This is done with the help of Partial Derivatives.

Calculate the partial derivative of the Cost Function ( C ) w.r.t. the weights w.

By the Chain Rule (written here for a single training example; the 1/N averaging factor simply carries through) :
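∂C/∂w = (∂C/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)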

Now compute each of these gradients individually :

Gradient of C w.r.t. the predicted value ( ŷ ) :
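∂C/∂ŷ = ∂/∂ŷ [ ( y − ŷ )² ]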

Carrying out the differentiation, the above equation now becomes :
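∂C/∂ŷ = −2 ( y − ŷ ) = 2 ( ŷ − y )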

Now, the partial derivative of C w.r.t. z. Since ŷ = σ(z), this brings in the derivative of the sigmoid :
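∂ŷ/∂z = σ(z) ( 1 − σ(z) ) = ŷ ( 1 − ŷ )

∂C/∂z = (∂C/∂ŷ) · (∂ŷ/∂z) = 2 ( ŷ − y ) · ŷ ( 1 − ŷ )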

Finally, the derivative of z w.r.t. w. Since z = x · w + b :
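∂z/∂w = x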

Let’s now put all of it together :
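∂C/∂w = (∂C/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w) = 2 ( ŷ − y ) · ŷ ( 1 − ŷ ) · x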

You might be wondering, What happened to our bias ?

The bias can be treated as a weight attached to a constant input of 1, so its derivative ∂z/∂b is simply 1. Therefore our equation will look something like :
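∂C/∂b = 2 ( ŷ − y ) · ŷ ( 1 − ŷ ) · 1 = 2 ( ŷ − y ) · ŷ ( 1 − ŷ )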

OPTIMIZATION

Optimization refers to the process of finding the weights and bias that make our neural network perform as well as possible.

There are a number of optimization algorithms available. Here, we will look at Gradient Descent. Gradient Descent changes the weights and bias in the direction opposite to the gradient of our cost function C, by an amount proportional to that gradient. The size of these changes is controlled by a hyper-parameter called the Learning Rate, denoted by alpha (α). The update rule looks like :
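w = w − α · ∂C/∂w

b = b − α · ∂C/∂b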

The weights and bias are updated repeatedly until the Gradient Descent algorithm converges.
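Putting back-propagation and Gradient Descent together, a toy training loop for our single-neuron network might look like the sketch below. The names and the choice of hyper-parameters are mine, purely for illustration, not a reference implementation :

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=1000):
    # X: (n_samples, n_features), y: (n_samples,) with targets in [0, 1]
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Forward propagation for every sample at once
        z = X @ w + b
        y_hat = sigmoid(z)
        # Chain rule: dC/dz = dC/dy_hat * dy_hat/dz, per sample
        dC_dz = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
        # dz/dw = x and dz/db = 1, averaged over the data set
        dC_dw = X.T @ dC_dz / n_samples
        dC_db = dC_dz.mean()
        # Gradient Descent update
        w -= alpha * dC_dw
        b -= alpha * dC_db
    return w, b
```

For instance, calling train on a small, linearly separable data set (such as the truth table of logical AND) should push the predictions towards the correct labels over the course of training.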

For a closer look at Gradient Descent, do check out my article linked below.

Also do check out my simple code tutorial to build your own Neural Net in Python :
