Back-Propagation is very simple. Who made it complicated?

Learning Outcome: You will be able to build your own neural network on paper.

Feed Forward Neural Network

Why is it important to understand backprop?

Andrej Karpathy wrote a blog post on this, and I found it useful.


  • Build a small neural network as defined in the architecture below.
  • Initialize the weights and biases randomly.
  • Fix the input and output.
  • Forward-pass the inputs and calculate the cost.
  • Compute the gradients and errors.
  • Backpropagate and adjust the weights and biases accordingly.


  • Build a feed-forward neural network with 2 hidden layers. Every layer has 3 neurons.
  • The 1st and 2nd hidden layers use ReLU and sigmoid, respectively, as activation functions. The final layer uses softmax.
  • The error is calculated using cross-entropy.
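To make the architecture concrete, here is a minimal forward-pass sketch in plain Python. The input, weights, and biases below are placeholders I made up for illustration (they are not the numbers used in the figures of this post), and I am assuming 3 inputs and the convention that `W[i][j]` connects input `i` to neuron `j`.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, W, b):
    """Weighted sum per neuron; W[i][j] connects input i to neuron j."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

# Placeholder values (assumptions, not the article's numbers).
x = [1.0, 0.5, -1.0]
W1 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.4]]
b1 = [0.1, 0.1, 0.1]
W2 = [[0.3, 0.2, -0.1], [-0.2, 0.4, 0.1], [0.1, -0.3, 0.5]]
b2 = [0.0, 0.0, 0.0]
W3 = [[0.4, -0.1, 0.2], [0.1, 0.3, -0.2], [-0.3, 0.2, 0.1]]
b3 = [0.0, 0.0, 0.0]

h1 = relu(dense(x, W1, b1))        # hidden layer 1: ReLU
h2 = sigmoid(dense(h1, W2, b2))    # hidden layer 2: sigmoid
out = softmax(dense(h2, W3, b3))   # output layer: softmax probabilities
print(out)
```

The three printed values are probabilities, so they should add up to 1.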

Initializing the network

I have chosen the inputs, weights, and biases randomly.

Initializing the network


Neural Network Layer-1
Layer-1 Matrix Operation
Layer-1 ReLU Operation
Layer-1 Example


Layer-2 Neural Network
Layer-2 Matrix Operation
Sigmoid Operation
Layer-2 Example


Layer-3 Neural Network
Layer-3 Matrix Operation
Softmax formula
Layer-3 Output Example
  • The actual output should be [1.0, 0.0, 0.0], but we got [0.2698, 0.3223, 0.4078].
  • To calculate the error, let's use cross-entropy.



Cross-Entropy Formula
Cross-Entropy calculation
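As a quick numeric check, here is a small sketch that plugs the target and the network output quoted above into cross-entropy. I am assuming the standard categorical form E = -Σ yᵢ ln(oᵢ); if the formula in the figure uses a different variant (for example, one that adds (1 - yᵢ) ln(1 - oᵢ) terms), the number will differ. The function name `cross_entropy` is mine.

```python
import math

def cross_entropy(target, predicted):
    """Categorical cross-entropy: -sum(t * ln(p)) over the classes."""
    # Terms with t == 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

target = [1.0, 0.0, 0.0]
predicted = [0.2698, 0.3223, 0.4078]

loss = cross_entropy(target, predicted)
print(round(loss, 4))  # 1.3101
```

Only the probability assigned to the correct class matters here: the loss is just -ln(0.2698).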

Important Derivatives:


Derivative of Sigmoid


Derivative of ReLU


Derivative of Softmax
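The three derivatives can be sketched in a few lines of plain Python. These are the standard textbook forms: sigmoid'(x) = s(x)(1 - s(x)), ReLU'(x) = 1 for x > 0 and 0 otherwise, and for softmax ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ - pⱼ). The function names and the test point 0.3 are my own choices, and the finite-difference comparison is only there as a sanity check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_relu(x):
    # Undefined at exactly 0; returning 0 there is a common convention.
    return 1.0 if x > 0 else 0.0

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(v):
    """J[i][j] = d softmax_i / d v_j = p_i * (delta_ij - p_j)."""
    p = softmax(v)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

# Sanity check: compare d_sigmoid with a central finite difference.
eps = 1e-6
numeric = (sigmoid(0.3 + eps) - sigmoid(0.3 - eps)) / (2 * eps)
print(abs(numeric - d_sigmoid(0.3)) < 1e-8)  # True

J = softmax_jacobian([1.0, 2.0, 3.0])
print([round(sum(row), 12) for row in J])  # each row sums to (almost exactly) 0
```

Note that softmax needs a full matrix of derivatives, because every output depends on every input of the layer, while sigmoid and ReLU act on each neuron independently.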

Backpropagating the Error: Hidden Layer 2 to Output Layer Weights

Backpropagating Layer-3 weights
Example: Derivative of cross-entropy
Matrix of cross-entropy derivatives wrt output
Values of the derivative of cross-entropy wrt output
Example: Derivative of softmax wrt output layer input
Matrix of the derivative of softmax wrt output layer input
Values of the derivative of softmax wrt output layer input
Example: Derivative of input to output layer wrt weight
Values of the derivative of input to output layer wrt weights
Weight from k1 to l1 neuron
Derivative of error wrt weight
Chain rule breakdown of Error derivative
Matrix Form of all derivatives in layer-3
Example: Calculation of all the values
Modified weights of kl neurons after backprop
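The chain rule for the output-layer weights can be sketched as follows. The output and target are the values quoted earlier in this post; the hidden-layer-2 outputs `h2`, the weights `W3`, and the learning rate `lr` are placeholders I invented. I am also assuming categorical cross-entropy E = -Σ yᵢ ln(Oᵢ); with a different cross-entropy variant, step 1 changes but the structure stays the same.

```python
out = [0.2698, 0.3223, 0.4078]   # network output quoted in this post
target = [1.0, 0.0, 0.0]         # desired output quoted in this post

# Placeholder values (assumptions, not the article's numbers).
h2 = [0.9, 0.8, 0.7]             # outputs of hidden layer 2
W3 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
lr = 0.01                        # learning rate

n = len(out)

# Step 1: dE/dO_i for categorical cross-entropy.
dE_dO = [-target[i] / out[i] for i in range(n)]

# Step 2: dO_i/dz_l for softmax, written in terms of the outputs.
dO_dz = [[out[i] * ((1.0 if i == l else 0.0) - out[l]) for l in range(n)]
         for i in range(n)]

# Step 3: chain rule, summing over every output i that depends on z_l.
delta3 = [sum(dE_dO[i] * dO_dz[i][l] for i in range(n)) for l in range(n)]

# Step 4: dz_l/dW3[k][l] = h2[k], so the gradient is an outer product.
grad_W3 = [[h2[k] * delta3[l] for l in range(n)] for k in range(n)]

# Gradient-descent update.
W3_new = [[W3[k][l] - lr * grad_W3[k][l] for l in range(n)] for k in range(n)]

print([round(d, 4) for d in delta3])  # [-0.7302, 0.3223, 0.4078]
```

Under this assumption, steps 1 to 3 collapse to `out - target`, which is why softmax and cross-entropy are so often paired.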

Backpropagating the Error: Hidden Layer 1 to Hidden Layer 2 Weights

Backpropagating errors to 2nd layer
Example: Derivative of sigmoid output wrt layer-2 input
Values of the derivative of layer-2 output wrt layer-2 input
Derivative of layer-2 input wrt weight
Values of the derivative of layer-2 input wrt weight
Weight from j3 to k1
Derivative of error wrt weight j3-k1
Chain rule of derivative of error wrt weight
Final matrix of derivatives of weights_{jk}
Breakdown of error
Breakdown of each error derivative
Derivative of error wrt output of hidden layer 2
Derivative of input of output layer wrt output of hidden layer 2
Final matrix of derivative of total error wrt output of hidden layer 2
Calculations using an example
Our final matrix of derivatives of weights connecting hidden layer 1 and hidden layer 2
Calculations from our examples
Final modified matrix of W_{jk}
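Here is a rough sketch of the same step in code. The key point is that each neuron in hidden layer 2 feeds all three output neurons, so its error derivative is a sum over them. The values of `h1`, `h2`, `W2`, `W3`, and `lr` are placeholders I made up; `delta3` is `out - target` from the numbers quoted earlier, which assumes softmax with categorical cross-entropy.

```python
# Placeholder values (assumptions, not the article's numbers).
h1 = [1.2, 0.0, 0.5]             # ReLU outputs of hidden layer 1
h2 = [0.9, 0.8, 0.7]             # sigmoid outputs of hidden layer 2
W2 = [[0.2, 0.3, 0.5], [0.3, 0.5, 0.7], [0.6, 0.4, 0.8]]
W3 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
lr = 0.01                        # learning rate

# dE/dz at the output layer: out - target for the quoted values.
delta3 = [-0.7302, 0.3223, 0.4078]

n = 3

# dE/dh2[k]: neuron k feeds every output neuron l, so sum over l.
dE_dh2 = [sum(W3[k][l] * delta3[l] for l in range(n)) for k in range(n)]

# Sigmoid derivative expressed through its output: h * (1 - h).
dh2_dz2 = [h * (1.0 - h) for h in h2]

# dE/dz2[k] by the chain rule.
delta2 = [dE_dh2[k] * dh2_dz2[k] for k in range(n)]

# dz2[k]/dW2[j][k] = h1[j], so the gradient is again an outer product.
grad_W2 = [[h1[j] * delta2[k] for k in range(n)] for j in range(n)]

# Gradient-descent update.
W2_new = [[W2[j][k] - lr * grad_W2[j][k] for k in range(n)] for j in range(n)]

print(round(delta2[0], 7))  # 0.0102402
```

Since `h1[1]` is 0 in this made-up example, the whole second row of `grad_W2` is 0: a ReLU neuron that outputs 0 contributes nothing to the weights leaving it.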

Backpropagating the Error: Input Layer to Hidden Layer 1 Weights

Edit 1: The calculations from here on are wrong. I took only wj1k1 and ignored wj1k2 and wj1k3. This was pointed out by a user in the comments. I would like someone to edit the Jupyter notebook attached at the end. Please refer to other implementations if you still don't understand backprop from this post.

Backpropagating errors to 1st layer
Derivative of hidden layer 1 output wrt its input
Derivative of input to hidden layer wrt weights
Final derivative calculations
Weight connecting i2 neuron to j1
Derivative of error wrt weight
Chain rule for calculating error
Final matrix using symmetry
Looking at the first term
Calculations in our example
Final matrix
Calculations in our example
Using the learning rate, we get the final matrix
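Given the edit note above, here is a sketch of how the first-layer gradient looks when every path is included: each hidden-layer-1 neuron j feeds all three neurons k of hidden layer 2, so the error derivative sums over wj-k1, wj-k2, and wj-k3. All numbers are placeholders I made up, the loss is assumed to be categorical cross-entropy, and the result is compared with a finite-difference estimate rather than with the figures in this post.

```python
import math

def dense(x, W, b):
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W1, b1, W2, b2, W3, b3):
    z1 = dense(x, W1, b1)
    h1 = [max(0.0, z) for z in z1]                           # ReLU
    h2 = [1.0 / (1.0 + math.exp(-z)) for z in dense(h1, W2, b2)]  # sigmoid
    out = softmax(dense(h2, W3, b3))
    return z1, h2, out

def loss(x, y, params):
    out = forward(x, *params)[-1]
    return -sum(t * math.log(p) for t, p in zip(y, out))

def grad_W1(x, y, params):
    W1, b1, W2, b2, W3, b3 = params
    z1, h2, out = forward(x, *params)
    n = 3
    delta3 = [out[l] - y[l] for l in range(n)]  # softmax + cross-entropy
    delta2 = [sum(W3[k][l] * delta3[l] for l in range(n)) * h2[k] * (1.0 - h2[k])
              for k in range(n)]
    # Sum over ALL k: neuron j influences the error through every k.
    delta1 = [sum(W2[j][k] * delta2[k] for k in range(n))
              * (1.0 if z1[j] > 0 else 0.0)
              for j in range(n)]
    return [[x[i] * delta1[j] for j in range(n)] for i in range(n)]

# Placeholder values (assumptions, not the article's numbers).
x, y = [1.0, 0.5, -1.0], [1.0, 0.0, 0.0]
W1 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.4]]
W2 = [[0.3, 0.2, -0.1], [-0.2, 0.4, 0.1], [0.1, -0.3, 0.5]]
W3 = [[0.4, -0.1, 0.2], [0.1, 0.3, -0.2], [-0.3, 0.2, 0.1]]
params = (W1, [0.1] * 3, W2, [0.0] * 3, W3, [0.0] * 3)

analytic = grad_W1(x, y, params)

# Finite-difference check: nudge each weight and watch the loss.
eps, max_diff = 1e-6, 0.0
for i in range(3):
    for j in range(3):
        W1[i][j] += eps
        up = loss(x, y, params)
        W1[i][j] -= 2 * eps
        down = loss(x, y, params)
        W1[i][j] += eps  # restore the original weight
        max_diff = max(max_diff, abs((up - down) / (2 * eps) - analytic[i][j]))

print(max_diff < 1e-6)  # True
```

A check like this is a cheap way to catch a dropped term: if `delta1` used only the `k = 0` term, the analytic and numeric gradients would generally disagree.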

The End of Calculations

Our Initial Weights:

Our Final Weights:

Important Notes:

  • I have left out the bias completely when differentiating. Do you know why?
  • Backprop for the bias should be straightforward. Try it on your own.
  • I have used only one example. What happens if we take a batch of examples?
  • I have not mentioned vanishing gradients directly. Do you see why they occur?
  • What would happen if all the weights were the same number instead of random?

References Used:

Code available on GitHub:


