Machine Learning, Mathematics

One LEGO at a Time: Explaining the Math of how Neural Networks Learn with Implementation from Scratch

Omar U. Florez
Jun 1, 2019 · 9 min read

A neural network is a clever arrangement of linear and non-linear modules. When we choose and connect them wisely, we have a powerful tool to approximate any mathematical function. For example one that separates classes with a non-linear decision boundary.

Steps to run the code:

A topic that is not always explained in-depth, despite its intuitive and modular nature, is the backpropagation technique responsible for updating trainable parameters. Let’s build a neural network from scratch to see the internal functioning of a neural network using LEGO pieces as a modular analogy, one brick at a time.

Neural Networks as a Composition of Pieces

Image for post
Image for post

The above figure depicts some of the Math used for training a neural network. We will make sense of this during this article. The reader may find interesting that a neural network is a stack of modules with different purposes:

At this point, these operations only compute a general linear system, which doesn’t have the capacity to model non-linear interactions. This changes when we stack one more layer, adding depth to this modular structure. The deeper the network, the more subtle non-linear interactions we can learn and more complex problems we can solve, which may explain in part the rise of deep neural models.

Why should I read this?

If you understand the internal parts of a neural network, you will quickly know what to change first when things don’t work and define an strategy to test invariants and expected behaviors that you know are part the algorithm. This will also be helpful when you want to create new capabilities that are not currently implemented in the ML library you are using.

Because debugging machine learning models is a complex task. By experience, mathematical models don’t work as expected the first try. They may give you low accuracy for new data, spend long training time or too much memory, return a large number of false negatives or NaN predictions, etc. Let me show some cases when knowing how the algorithm works can become handy:

Concrete Example: Learning the XOR Function

Let’s open the blackbox. We will build now a neural network from scratch that learns the XOR function. The choice of this non-linear function is by no means random chance. Without back-propagation it would be hard to learn to separate classes with a straight line.

To illustrate this important concept, note below how a straight line cannot separate 0s and 1s, the outputs of the XOR function. Real-life problems are also non-linearly separable.

Image for post
Image for post

The topology of the network is simple:

More visually:

Image for post
Image for post

Let’s now train the model. In our simple example the trainable parameters are weights, but be aware that current research is exploring more types of parameters to be optimized. For example shortcuts between layers, regularized distributions, topologies, residual, learning rates, etc.

Backpropagation is a method to update the weights towards the direction (gradient) that minimizes a predefined error metric known as Loss function given a batch of labeled observations. This algorithm has been repeatedly rediscovered and is a special case of a more general technique called automatic differentiation in reverse accumulation mode.

Network Initialization

Let’s initialize the network weights with random numbers.

Image for post
Image for post

Forward Step:

This goal of this step is to forward propagate the input X to each layer of the network until computing a vector in the output layer h2.

This is how it happens:

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

Computing the Total Loss

Also known as “actual minus predicted”, the goal of the loss function is to quantify the distance between the predicted vector h2 and the actual label provided by humans y.

Note that the Loss function contains a regularization component that penalizes large weight values as in a Ridge regression. In other words, large squared weights values will increase the Loss function, an error metric we indeed want to minimize.

Image for post
Image for post

Backward step:

The goal of this step is to update the weights of the neural network in a direction that minimizes its Loss function. As we will see, this is a recursive algorithm, which can reuse gradients previously computed and heavily relies on differentiable functions. Since these updates reduce the loss function, a network ‘learns’ to approximate the label of observations with known classes. A property called generalization.

This step goes in backward order than the forward step. It computes first the partial derivative of the loss function with respect to the weights of the output layer (dLoss/dW2) and then the hidden layer (dLoss/dW1). Let’s explain in detail each one.

dLoss/dW2:

The chain rule says that we can decompose the computation of gradients of a neural network into differentiable pieces:

Image for post
Image for post

As a memory helper, these are the function definitions used above and their first derivatives:

Image for post
Image for post

More visually, we aim to update the weights W2 (in blue) in the below figure. In order to that, we need to compute three partial derivatives along the chain.

Image for post
Image for post

Plugging in values into these partial derivatives allow us to compute gradients with respect to weights W2 as follows.

Image for post
Image for post

The result is a 3x2 matrix dLoss/dW2, which will update the original W2 values in a direction that minimizes the Loss function.

Image for post
Image for post

dLoss/dW1:

Computing the chain rule for updating the weights of the first hidden layer W1 exhibits the possibility of reusing existing computations.

Image for post
Image for post

More visually, the path from the output layer to the weights W1 touches partial derivatives already computed in latter layers.

Image for post
Image for post

For example, partial derivatives dLoss/dh2 and dh2/dz2 have been already computed as a dependency for learning weights of the output layer dLoss/dW2 in the previous section.

Image for post
Image for post

Placing all derivatives together, we can execute the chain rule again to update the weights of the hidden layer W1:

Image for post
Image for post

Finally, we assign the new values to the weights and have completed a training step on the network.

Image for post
Image for post

Implementation

Let’s translate the above mathematical equations to code only using Numpy as our linear algebra engine. Neural networks are trained in a loop in which each iteration present already calibrated input data to the network. In this small example, let’s just consider the entire dataset in each iteration. The computations of Forward step, Loss, and Backward step lead to good generalization since we update the trainable parameters (matrices w1 and w2 in the code) with their corresponding gradients (matrices dL_dw1 and dL_dw2) in every cycle. Code is stored in this repository: https://github.com/omar-florez/scratch_mlp

Image for post
Image for post

Let’s Run This!

See below some neural networks trained to approximate the XOR function over many iterations.

Left plot: Accuracy. Central plot: Learned decision boundary. Right plot: Loss function.

First, let’s see how a neural network with 3 neurons in the hidden layer has a small capacity. This model learns to separate 2 classes with a simple decision boundary that starts being a straight line but then shows a non-linear behavior. The loss function in the right plot nicely gets low as training continues.

Image for post
Image for post

Having 50 neurons in the hidden layer notably increases the model’s power to learn more complex decision boundaries. This could not only produce more accurate results but also explode gradients, a notable problem when training neural networks. This happens when very large gradients, multiply weights during backpropagation and thus generate large updated weights. This is the reason why the Loss value suddenly increases during the last steps of the training (step > 90). The regularization component of the Loss function computes the squared values of weights that are already very large (sum(W²)/2N).

This problem can be avoided by reducing the learning rate as you can see below. Or by implementing a policy that reduces the learning rate over time. Or by enforcing a stronger regularization, maybe L1 instead of L2. Exploding and vanishing gradients are interesting phenomenons and we will devote an entire analysis later.

Towards AI — Multidisciplinary Science Journal

The Best of Tech, Science and Engineering.

Sign up for Towards AI Newsletter

By Towards AI — Multidisciplinary Science Journal

Towards AI publishes the best of tech, science, and engineering. Subscribe with us to receive our newsletter right on your inbox. Take a look

Create a free Medium account to get Towards AI Newsletter in your inbox.

Omar U. Florez

Written by

Senior Research Manager in AI at Capital One - Conversational AI Research team. Teaching computers to see, read, and understand | Views & opinions are my own

Towards AI — Multidisciplinary Science Journal

Towards AI is a world’s leading multidisciplinary science journal. Towards AI publishes the best of tech, science, and engineering. Read by thought-leaders and decision-makers around the world.

Omar U. Florez

Written by

Senior Research Manager in AI at Capital One - Conversational AI Research team. Teaching computers to see, read, and understand | Views & opinions are my own

Towards AI — Multidisciplinary Science Journal

Towards AI is a world’s leading multidisciplinary science journal. Towards AI publishes the best of tech, science, and engineering. Read by thought-leaders and decision-makers around the world.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store