Demystifying the Math Behind Neural Nets | Towards AI

One LEGO at a Time: Explaining the Math of how Neural Networks Learn with Implementation from Scratch

Omar U. Florez
Jun 1 · 9 min read

Neural Networks as a Composition of Pieces

  • Weights W1 map the input X to the first hidden layer h1. W1 therefore acts as a linear kernel
  • A Sigmoid function keeps the numbers in the hidden layer within range by scaling them to 0–1. The result is an array of neural activations h1 = Sigmoid(X W1)
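
As a rough illustration of these two pieces, here is a minimal NumPy sketch. It assumes X stores one observation per row, so the product is written X @ W1. The sample values, the seed, and the helper name sigmoid are made up for this example.

```python
import numpy as np

def sigmoid(z):
    # Squashes every entry into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)      # fixed seed, illustrative only
X = np.array([[0.0, 1.0],
              [5.0, -3.0]])         # two made-up observations, two features each
W1 = rng.normal(size=(2, 3))        # linear map from 2 inputs to 3 hidden units

z1 = X @ W1                         # unbounded weighted sums
h1 = sigmoid(z1)                    # activations, all between 0 and 1
print(h1.shape)                     # (2, 3)
```

However large or small the entries of z1 become, the entries of h1 stay between 0 and 1, which is the "scaling" role described above.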

Why should I read this?

  • If you observe NaN predictions, the algorithm may have received large gradients that overflow the floating-point range. Think of this as consecutive matrix multiplications that explode after many iterations. Decreasing the learning rate scales these values down, reducing the number of layers decreases the number of multiplications, and clipping gradients controls the problem explicitly
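
Gradient clipping can be sketched in a few lines. This is one common variant (clipping by L2 norm), not necessarily the one used in the article's code. The function name clip_by_norm, the parameter max_norm, and the gradient values are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    # Rescale grad so its L2 norm never exceeds max_norm; the direction is kept.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

big_grad = np.array([[3e8, -4e8],
                     [0.0, 0.0]])                 # made-up exploding gradient
clipped = clip_by_norm(big_grad, max_norm=5.0)
print(np.linalg.norm(clipped))                    # approximately 5.0
```

Gradients whose norm is already below max_norm pass through unchanged, so clipping only intervenes when values start to blow up.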

Concrete Example: Learning the XOR Function

  • Weights W1 is a 2x3 matrix with randomly initialized values
  • The hidden layer h1 consists of three neurons. Each neuron receives as input a weighted sum of the observations. This is the inner product highlighted in green in the figure below: z1 = [x1, x2] · [w1, w2]
  • Weights W2 is a 3x2 matrix with randomly initialized values
  • Output layer h2 consists of two neurons, since the XOR function returns either 0 (y1=[0,1]) or 1 (y2 = [1,0])
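
The shapes listed above can be checked with a short NumPy sketch. It follows the one-hot encoding from the last bullet (XOR = 0 becomes [0, 1], XOR = 1 becomes [1, 0]). The Sigmoid on the output layer, the seed, and the name z1_first are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The four XOR observations, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# One-hot targets: XOR = 0 -> [0, 1], XOR = 1 -> [1, 0].
Y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)

rng = np.random.default_rng(42)     # seed chosen arbitrarily
W1 = rng.normal(size=(2, 3))        # input (2) -> hidden (3)
W2 = rng.normal(size=(3, 2))        # hidden (3) -> output (2)

# Weighted sum feeding the first hidden neuron for the observation [0, 1]:
# the inner product of the input row with the first column of W1.
z1_first = X[1] @ W1[:, 0]

h1 = sigmoid(X @ W1)                # shape (4, 3): three activations per observation
h2 = sigmoid(h1 @ W2)               # shape (4, 2): assumed Sigmoid output layer
print(h1.shape, h2.shape)
```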

Network Initialization

Forward Step:

Computing the Total Loss

Backward step:


The chain rule says that we can decompose the computation of gradients of a neural network into differentiable pieces:
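
As a sketch of that decomposition for the second layer, write z2 = h1 W2 for the weighted sums entering the output layer and L for the loss. The symbol z2 and the Sigmoid output are assumptions for this illustration, and matrix-shape bookkeeping is omitted.

```latex
\frac{\partial L}{\partial W_2}
  = \frac{\partial L}{\partial h_2}\,
    \frac{\partial h_2}{\partial z_2}\,
    \frac{\partial z_2}{\partial W_2},
\qquad z_2 = h_1 W_2,\quad h_2 = \mathrm{Sigmoid}(z_2)
```

Each factor is the derivative of one differentiable piece: the loss with respect to the prediction, the Sigmoid with respect to its input, and the linear map with respect to its weights.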


Computing the chain rule for updating the weights of the first hidden layer W1 exhibits the possibility of reusing existing computations.
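
The reuse can be made concrete with a short sketch. It assumes a squared-error loss and Sigmoid activations in both layers, which may differ from the article's exact choices. The names loss_and_grads, delta1, and delta2 are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, Y, W1, W2):
    # Forward step.
    h1 = sigmoid(X @ W1)
    h2 = sigmoid(h1 @ W2)
    loss = 0.5 * np.sum((h2 - Y) ** 2)          # assumed squared-error loss

    # Backward step for W2.
    delta2 = (h2 - Y) * h2 * (1.0 - h2)         # dL/dz2, computed once
    dL_dw2 = h1.T @ delta2

    # Backward step for W1 reuses delta2 instead of recomputing it.
    delta1 = (delta2 @ W2.T) * h1 * (1.0 - h1)  # dL/dz1
    dL_dw1 = X.T @ delta1
    return loss, dL_dw1, dL_dw2
```

The term delta2 appears in both gradients, so it is computed once for W2 and then propagated back through W2 to obtain the gradient for W1.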


Let’s translate the above mathematical equations to code using only Numpy as our linear algebra engine. Neural networks are trained in a loop in which each iteration presents already calibrated input data to the network. In this small example, let’s just consider the entire dataset in each iteration. Repeating the Forward step, Loss, and Backward step progressively reduces the loss, since in every cycle we update the trainable parameters (matrices w1 and w2 in the code) with their corresponding gradients (matrices dL_dw1 and dL_dw2). Code is stored in this repository:
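
For readers without the repository at hand, here is a minimal sketch of such a loop. It is not the repository's code. The squared-error loss, the Sigmoid output layer, the learning rate, the seed, and the iteration count are all assumptions. Only the names w1, w2, dL_dw1, and dL_dw2 come from the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)   # one-hot XOR labels

rng = np.random.default_rng(0)      # arbitrary seed
w1 = rng.normal(size=(2, 3))
w2 = rng.normal(size=(3, 2))
learning_rate = 0.1                 # assumed value
losses = []

for _ in range(2000):               # assumed iteration count
    # Forward step on the entire dataset.
    h1 = sigmoid(X @ w1)
    h2 = sigmoid(h1 @ w2)
    # Loss (assumed squared error).
    losses.append(0.5 * np.sum((h2 - Y) ** 2))
    # Backward step.
    delta2 = (h2 - Y) * h2 * (1.0 - h2)
    dL_dw2 = h1.T @ delta2
    dL_dw1 = X.T @ ((delta2 @ w2.T) * h1 * (1.0 - h1))
    # Gradient-descent update of the trainable parameters.
    w1 -= learning_rate * dL_dw1
    w2 -= learning_rate * dL_dw2

print(losses[0], losses[-1])        # first and last recorded loss
```

How closely the final predictions match XOR depends on the initialization and on these assumed settings, so it is worth plotting losses for a few different seeds.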

Let’s Run This!

Below are some neural networks trained to approximate the XOR function over many iterations.

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Omar U. Florez

Written by

Senior Research Manager in AI at Capital One - Conversational AI Research team. Teaching computers to see, read, and understand | Views & opinions are my own
