Neural Networks: A Newbie’s Guide, by a Newbie

Deepthi Tabitha Bennet
7 min read · Feb 6, 2024


One of the most important concepts everyone (including me 😬) struggles with in Deep Learning, is the intense math behind how a Neural Network actually works. I finally figured it out, and have simplified it for you! 😅✨

What are we up against?

This complex monstrosity 😈

Don’t worry. That’s exactly how I felt when I was a newbie! 😅

Let’s start with some Basics:

Input Layer - The layer where you provide all your inputs to the Neural Network. Typically, the number of neurons in the Input layer is equal to the number of Inputs to the Neural Network.

Hidden Layers - The layers between the Input layer and the Output layer. This is where all the intense computations happen. For simplicity, I have assumed our Neural Network has only one Hidden layer, but in reality, this is almost never the case.

Output Layer - The layer where you receive the outputs generated by your Neural Network. Typically, the number of neurons in the Output layer is equal to the number of Outputs expected from the Neural Network.

Forward Pass - The forward pass computes the network’s output by passing the input data through successive layers. Each layer multiplies its inputs by weights, adds a bias, and applies an activation function, propagating the result forward until the output layer produces a prediction.

Backward Pass - The backward pass, or backpropagation, is the process of computing gradients of the loss with respect to the model’s parameters. It involves propagating the error backward through the network, applying the chain rule to calculate the gradients, and using those gradients to update the parameters. This is what improves the neural network’s performance during training.

Input Layer Neurons: x₁ & x₂

Hidden Layer Neurons: h₁ & h₂

Output Layer Neurons: y₁ & y₂

Bias: b₁ & b₂

Weights: w₁, w₂, w₃, w₄, w₅, w₆, w₇ & w₈

Activation Function: A function that converts the weighted sum of a neuron’s inputs into that neuron’s output. Here, I’ve set the activation function of every Hidden and Output layer neuron to the sigmoid activation function, for simplicity.
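If you prefer code to symbols, here’s a tiny Python sketch of the sigmoid and its derivative that I’ll reuse in the snippets below (the function names are just my own choices, not anything official):

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(out):
    # Derivative of the sigmoid, written in terms of its *output*:
    # if out = sigmoid(z), then d(out)/dz = out * (1 - out)
    return out * (1.0 - out)
```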

Forward Pass - Hidden Layer:

Calculate the inputs and outputs of each neuron in the hidden layer. The input to each neuron is the sum of the products of the input-neuron values with their corresponding weights, plus the bias. The output of each neuron is simply its input passed through the Activation Function. Easy Peasy!

I’m starting with h₁.

The inputs to h₁ are the neurons x₁ (with weight w₁) and x₂ (with weight w₂), plus the bias b₁. Plug the Input of h₁ into the activation function, and you have the Output of h₁!

Repeat the same for h₂, whose inputs are the neurons x₁ (with weight w₃) and x₂ (with weight w₄), plus the bias b₁.
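As a rough sketch in Python (continuing from the sigmoid snippet above; the numbers are placeholder values I made up, not values from the diagram):

```python
# Placeholder inputs, weights and bias (made-up values, just for illustration)
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
b1 = 0.35

# Hidden layer forward pass
in_h1 = w1 * x1 + w2 * x2 + b1   # weighted sum flowing into h1
out_h1 = sigmoid(in_h1)          # activation: the Output of h1

in_h2 = w3 * x1 + w4 * x2 + b1   # weighted sum flowing into h2
out_h2 = sigmoid(in_h2)          # activation: the Output of h2
```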

Forward Pass - Output Layer:

The procedure is exactly the same as what we did in the Hidden layer. The important thing is that the Outputs of h₁ and h₂ are now the inputs to y₁ and y₂.

Therefore, the inputs to y₁ are the Outputs of neurons h₁ (with weight w₅) and h₂ (with weight w₆), plus the bias b₂.

Repeat the same for y₂, whose inputs are the Outputs of neurons h₁ (with weight w₇) and h₂ (with weight w₈), plus the bias b₂.
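In the same sketch (again, the weights and bias here are placeholder values):

```python
# Placeholder weights and bias for the output layer
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b2 = 0.60

# Output layer forward pass: the hidden outputs are now the inputs
in_y1 = w5 * out_h1 + w6 * out_h2 + b2
out_y1 = sigmoid(in_y1)          # the network's first prediction

in_y2 = w7 * out_h1 + w8 * out_h2 + b2
out_y2 = sigmoid(in_y2)          # the network's second prediction
```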

Total Error:

Now, we need to calculate the Total Error of our neural network. The metric I’m using is the Mean Squared Error (MSE).
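I’m assuming the common ½ · (target − output)² form of the squared error here; the ½ isn’t essential, it just cancels the 2 that appears when we differentiate later. Continuing the sketch, with made-up targets:

```python
# Placeholder targets (made-up values)
target_y1, target_y2 = 0.01, 0.99

# Squared error of each output neuron, and the total error
E_y1 = 0.5 * (target_y1 - out_y1) ** 2
E_y2 = 0.5 * (target_y2 - out_y2) ** 2
E_total = E_y1 + E_y2
```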

Yaay! We’ve finished the Forward Pass!! 🥳✨

Backward Pass - Output Layer:

I’m starting with w₅.

From the diagram, we can see that w₅ affects the Output of y₁. We can also see that w₅ is linked to h₁.

Therefore, we can expect the involvement of h₁, w₅, y₁, in our final equation.

As a result, the contribution of w₅, to the Total Error, depends on
1. The contribution of the Output of y₁, to the Total Error.
2. The contribution of the Input of y₁, to the Output of y₁.
3. The contribution of w₅, to the Input of y₁.

Representing this mathematically, with the help of partial derivatives and the Chain Rule, we have
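Written out (my notation: “in” and “out” are a neuron’s values before and after the activation function), that chain of contributions is:

```latex
\frac{\partial E_{\mathrm{total}}}{\partial w_5}
  = \frac{\partial E_{\mathrm{total}}}{\partial \mathrm{out}_{y_1}}
  \times \frac{\partial \mathrm{out}_{y_1}}{\partial \mathrm{in}_{y_1}}
  \times \frac{\partial \mathrm{in}_{y_1}}{\partial w_5}
```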

Let’s deal with this, one term at a time.

Term 1:

Differentiate the Total Error, with respect to the Output of y₁. All terms unrelated to y₁ become zero.

Term 2:

Differentiate the Output of y₁ with respect to the Input of y₁. More details on how to differentiate the sigmoid activation function can be found here!

Term 3:

Differentiate the Input of y₁ with respect to w₅. All terms unrelated to w₅ become zero.

Putting it all together:
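As a sketch, continuing the Python example (each line matches one of the three terms above; the ½ convention from the error section keeps Term 1 simple):

```python
# Term 1: derivative of the Total Error with respect to the Output of y1
dE_dout_y1 = out_y1 - target_y1           # from E_y1 = 0.5 * (target_y1 - out_y1)^2

# Term 2: derivative of the Output of y1 with respect to the Input of y1
dout_din_y1 = sigmoid_derivative(out_y1)  # out_y1 * (1 - out_y1)

# Term 3: derivative of the Input of y1 with respect to w5
din_dw5 = out_h1                          # since in_y1 = w5*out_h1 + w6*out_h2 + b2

# Chain rule: how much w5 contributes to the Total Error
dE_dw5 = dE_dout_y1 * dout_din_y1 * din_dw5
```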

Updating the Weight:

The Updated w₅ is obtained by subtracting the partial derivative of the Total Error with respect to w₅, multiplied by a Learning Rate, from the old value of w₅.
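In the sketch (the learning rate value is a placeholder; it’s a hyperparameter you choose):

```python
learning_rate = 0.5   # placeholder value

# Gradient descent update for w5
w5_new = w5 - learning_rate * dE_dw5
```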

TL;DR - Backward Pass - Output Layer:

Backward Pass - Hidden Layer:

I’m starting with w₁.

From the diagram, we can see that w₁ affects the Output of h₁, which in turn, affects the Output of y₁ and y₂. We can also see that w₁ is linked to x₁.

Therefore, we can expect the involvement of x₁, w₁, h₁, w₅, y₁, w₇, y₂, in our final equation.

As a result, the contribution of w₁, to the Total Error, depends on
1. The contribution of the Output of h₁, to the Total Error.
2. The contribution of the Input of h₁, to the Output of h₁.
3. The contribution of w₁, to the Input of h₁.

Representing this mathematically, with the help of partial derivatives and the Chain Rule, we have
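Written out in the same notation as before:

```latex
\frac{\partial E_{\mathrm{total}}}{\partial w_1}
  = \frac{\partial E_{\mathrm{total}}}{\partial \mathrm{out}_{h_1}}
  \times \frac{\partial \mathrm{out}_{h_1}}{\partial \mathrm{in}_{h_1}}
  \times \frac{\partial \mathrm{in}_{h_1}}{\partial w_1}
```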

Let’s deal with this, one term at a time.

Term 1:

Since w₁ affects the Output of h₁, which in turn affects the Outputs of y₁ and y₂, the derivative of the Total Error with respect to the Output of h₁ is the derivative of the Error of y₁ with respect to the Output of h₁, plus the derivative of the Error of y₂ with respect to the Output of h₁.
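In symbols (the two pieces on the right are what I’m calling Term 4 and Term 5 below):

```latex
\frac{\partial E_{\mathrm{total}}}{\partial \mathrm{out}_{h_1}}
  = \frac{\partial E_{y_1}}{\partial \mathrm{out}_{h_1}}
  + \frac{\partial E_{y_2}}{\partial \mathrm{out}_{h_1}}
```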

Term 4:

I’m splitting the Error of y₁ with respect to the Output of h₁, into Term 6 and Term 7, using the Chain Rule.

Term 6 = (Term 1 from Backward Pass - Output Layer) x (Term 2 from Backward Pass - Output Layer)

Term 7 is the derivative of the Input of y₁ with respect to the Output of h₁. All terms unrelated to the Output of h₁ become zero.

Put it all together, and we get the Error of y₁ with respect to the Output of h₁.
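Continuing my Python sketch (reusing the variables computed in the Output layer pass):

```python
# Term 6: reuse Terms 1 and 2 from the Output layer pass
term6 = dE_dout_y1 * dout_din_y1

# Term 7: in_y1 = w5*out_h1 + w6*out_h2 + b2, so its derivative
# with respect to out_h1 is just w5
term7 = w5

# Term 4: derivative of the Error of y1 with respect to the Output of h1
dEy1_dout_h1 = term6 * term7
```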

Term 5:

Repeating the same procedure as in Term 4, we get
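Continuing the sketch, the same steps but through y₂ this time:

```python
# Analogue of Terms 6 and 7, but through y2
dE_dout_y2 = out_y2 - target_y2           # like Term 1, for y2
dout_din_y2 = sigmoid_derivative(out_y2)  # like Term 2, for y2

# in_y2 = w7*out_h1 + w8*out_h2 + b2, so its derivative
# with respect to out_h1 is w7
dEy2_dout_h1 = dE_dout_y2 * dout_din_y2 * w7   # Term 5
```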

Term 2:

Differentiate the Output of h₁ with respect to the Input of h₁. More details on how to differentiate the sigmoid activation function can be found here!

Term 3:

Differentiate the Input of h₁ with respect to w₁. All terms unrelated to w₁ become zero.

Putting it all together:
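As a rough sketch, assembling the pieces from above:

```python
# Term 1: total effect of the Output of h1 on the error, through both y1 and y2
dE_dout_h1 = dEy1_dout_h1 + dEy2_dout_h1

# Term 2: derivative of the Output of h1 with respect to the Input of h1
dout_din_h1 = sigmoid_derivative(out_h1)

# Term 3: in_h1 = w1*x1 + w2*x2 + b1, so its derivative with respect to w1 is x1
din_dw1 = x1

# Chain rule: how much w1 contributes to the Total Error
dE_dw1 = dE_dout_h1 * dout_din_h1 * din_dw1
```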

Updating the Weight:

The Updated w₁ is obtained by subtracting the partial derivative of the Total Error with respect to w₁, multiplied by a Learning Rate, from the old value of w₁.
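And in the sketch, just like for w₅ (same placeholder learning rate):

```python
# Gradient descent update for w1
w1_new = w1 - learning_rate * dE_dw1
```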

TL;DR - Backward Pass - Hidden Layer:

Woohoo! We’ve finished the Backward Pass!! 🥳🔥

Aaandd that’s it! We’re done!! 👏🏼✨

Hey! I’m Deepthi Tabitha Bennet, a Data Science student.

LinkedIn: linkedin.com/in/deepthi-tabitha-bennet
GitHub: github.com/DeepthiTabithaBennet
StackOverflow: stackoverflow.com/users/17112163/deepthi-tabitha-bennet

