Neural Networks implementation from the ground up part 3 — Backpropagation

Satvik Nema
Published in The Deep Hub
6 min read · Jul 13, 2024

Welcome back to the neural network series. It’s time for the main event!

This part will take you through how backpropagation can be implemented in a neural network. In the last blog, we covered the feedforward process with random weights and biases. In this part, we will see how we can nudge those weights and biases so that the network's output moves towards the expected output.

Time for some math first

A big preface: if you do not want to be overwhelmed by the math here, please go through the underlying derivations first.

Let the madness begin.

We need to define some error function which we can minimise. Let’s define the error function as

E = (1/2) (y' − y)^2        (1)

where y' is the neural network’s output and y is the actual output. We halve this value just to keep its derivative simpler. Here, y' and y can also be treated as vectors, which makes E a vector as well.
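To make equation (1) concrete, here is a minimal sketch of the error computed over a pair of output vectors, using plain arrays instead of the Matrix class from this series (the names are illustrative):

```java
// Halved squared error: sums 0.5 * (y'_i - y_i)^2 over the output vector.
public class ErrorDemo {
    public static double error(double[] predicted, double[] expected) {
        double total = 0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - expected[i];
            total += 0.5 * diff * diff; // halving keeps the derivative simple
        }
        return total;
    }

    public static void main(String[] args) {
        double[] networkOutput = {0.8, 0.3};
        double[] actual = {1.0, 0.0};
        System.out.println(error(networkOutput, actual)); // ~0.065
    }
}
```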

Our problem statement transforms to: given E, we need to figure out how to adjust Wij and Bij so that the next E is lower than the current E. The end goal is to reach the minimum possible value of E.

We take the partial derivative of E with respect to the weights of a layer l:

∂E/∂W^l = λ^l · (g(H^(l−1)))^T        (2)

where λ is called the error term, λ^L represents the error term of the layer L, and g is the activation function.

For the last layer, it is simply:

λ^L = H^L − y        (3)

H^L denotes the output matrix of the last layer. In our example, L=2 is the last layer which we saw in the last blog.

∂E/∂B^l = λ^l        (4)

We nudge the weights as follows:

W^l → W^l − α · (∂E/∂W^l)        (5)

and for biases:

B^l → B^l − α · (∂E/∂B^l)        (6)

Two things to note:

  1. Why the negative of the gradient? Check out why the gradient always points in the direction of steepest ascent: first 25 videos from this playlist. The negative of the gradient therefore points in the direction of steepest descent, i.e. the fastest way to reduce the error.
  2. Why do we need alpha? It scales down the step so the error reduces gradually; if we do not scale down the partial derivative, the update may overshoot the actual minimum and start to increase the error instead. alpha is also called the learning rate.
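Both points can be seen in a one-dimensional sketch. The quadratic f(w) = (w − 3)² below stands in for the real error function; nothing here is from the actual network:

```java
// Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
public class GradientDescentDemo {
    public static double minimize(double w, double alpha, int steps) {
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * (w - 3);
            w -= alpha * gradient; // move against the gradient
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(minimize(0.0, 0.1, 100)); // converges towards the minimum at 3
        System.out.println(minimize(0.0, 1.5, 5));   // alpha too large: overshoots and diverges
    }
}
```

With alpha = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, while with alpha = 1.5 it doubles in magnitude every step, which is exactly the overshoot described in point 2.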

Now, for the middle layers, we do not have y, so we need to leverage the already calculated error term (lambda) from the “next layer”. The error term for a middle layer is:

λ^l = g′(H^l) ⊙ ((W^(l+1))^T · λ^(l+1))        (7)

where g'(x) denotes the derivative of the activation function and ⊙ denotes the element-wise product.
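For the sigmoid used throughout this series, the derivative has the convenient closed form g′(x) = g(x) · (1 − g(x)). A sketch of what Functions::differentialSigmoid presumably computes (the actual implementation lives in the series' repo):

```java
// Sigmoid and its derivative: g'(x) = g(x) * (1 - g(x)).
public class SigmoidDemo {
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static double differentialSigmoid(double x) {
        double s = sigmoid(x);
        return s * (1 - s);
    }

    public static void main(String[] args) {
        System.out.println(differentialSigmoid(0.0)); // 0.25, the maximum of g'
    }
}
```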

Reiterating: I will not get into the derivation details here, but Brilliant has an excellent article that explains these from scratch.

Starting from the last layer, plugging (3) into equation (2) gives you the required adjustments for the last layer. Then, repeatedly substituting (7) into (2), you can find the adjustments for the middle layers.

As the error calculation and correction “moves” backwards through the network, this is called backpropagation.

CODE!

We add a function called backpropagation in the NeuralNetwork class, which accepts the training data and the learning rate.

    private void backpropagation(Pair<Matrix, Matrix> trainingData, double learningRate) {
        // back prop - the last layer's calculation is different from the hidden layers'
        Matrix outputLayerErrorTerm = backpropagationForLastLayer(trainingData, learningRate);
        Matrix nextLayerErrorTerm = outputLayerErrorTerm;
        outputErrorDiff = outputLayerErrorTerm; // stashed so the error can be reported later

        // process the hidden layers
        for (int i = layers - 2; i > 0; i--) {
            // equation (7): combine g'(H^l) with the next layer's error term
            Matrix thisLayerErrorTerm =
                    layerOutputs
                            .get(i)
                            .apply(Functions::differentialSigmoid)
                            .dot(weights.get(i + 1).transpose().cross(nextLayerErrorTerm));
            adjustWeightsAndBiases(learningRate, i, thisLayerErrorTerm);

            nextLayerErrorTerm = thisLayerErrorTerm;
        }

        // for the first hidden layer, the previous layer is the input; handle that separately
        backpropagationForSecondLayer(trainingData.getA(), nextLayerErrorTerm, learningRate);
    }

    private Matrix backpropagationForLastLayer(
            Pair<Matrix, Matrix> trainingData, double learningRate) {
        int layerInProcessing = layers - 1;
        // equation (3): the last layer's error term is the difference from the expected output
        Matrix outputLayerErrorTerm =
                layerOutputs.get(layerInProcessing).subtract(trainingData.getB());
        adjustWeightsAndBiases(learningRate, layerInProcessing, outputLayerErrorTerm);

        return outputLayerErrorTerm;
    }

    private void adjustWeightsAndBiases(double learningRate, int i, Matrix thisLayerErrorTerm) {
        // equations (5) and (6): step against the gradient, scaled by the learning rate
        Matrix deltaWeightI =
                thisLayerErrorTerm.cross(
                        layerOutputs.get(i - 1).apply(Functions::sigmoid).transpose());
        Matrix newWeights = weights.get(i).subtract(deltaWeightI.apply(x -> learningRate * x));
        weights.set(i, newWeights);

        Matrix newBiases = biases.get(i).subtract(thisLayerErrorTerm.apply(x -> learningRate * x));
        biases.set(i, newBiases);
    }

    private void backpropagationForSecondLayer(
            Matrix trainingData, Matrix nextLayerErrorTerm, double learningRate) {
        Matrix thisLayerErrorTerm =
                layerOutputs
                        .getFirst()
                        .apply(Functions::differentialSigmoid)
                        .dot(weights.get(1).transpose().cross(nextLayerErrorTerm));
        // the "previous layer output" here is the raw input, so no activation is applied
        Matrix deltaWeightI = thisLayerErrorTerm.cross(trainingData.transpose());
        Matrix newWeights = weights.get(0).subtract(deltaWeightI.apply(x -> learningRate * x));
        weights.set(0, newWeights);

        Matrix newBiases =
                biases.getFirst().subtract(thisLayerErrorTerm.apply(x -> learningRate * x));
        biases.set(0, newBiases);
    }

Notice how the error-term calculation stays simple because we’re using immutable Matrix objects and returning a new one from every operation:

    Matrix thisLayerErrorTerm =
            layerOutputs
                    .get(i)
                    .apply(Functions::differentialSigmoid)
                    .dot(weights.get(i + 1).transpose().cross(nextLayerErrorTerm));

This is essentially the implementation of equation (7).

Now that we have the backpropagation ready, we can actually start training on our data. Let’s tie up both the feedforward and the backpropagation flows:

    public void trainForOneInput(Pair<Matrix, Matrix> trainingData, double learningRate) {
        feedforward(trainingData.getA());
        backpropagation(trainingData, learningRate);
    }

In this example, every feedforward pass is followed by one backpropagation pass. This differs from conventional online tutorials, where feedforward is run on n different inputs and one iteration of backpropagation then happens on the average of those n errors.
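The difference can be sketched with a scalar weight and gradient standing in for the real matrices (both helpers below are illustrative, not part of the NeuralNetwork class):

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;

// Schematic contrast between per-sample updates (what this series does)
// and one averaged batch update.
public class BatchVsOnlineDemo {
    // Update the weight after every sample.
    public static double onlineStep(double w, List<Double> samples, double alpha,
                                    DoubleBinaryOperator grad) {
        for (double x : samples) {
            w -= alpha * grad.applyAsDouble(w, x);
        }
        return w;
    }

    // Accumulate all gradients first, then apply their average once.
    public static double batchStep(double w, List<Double> samples, double alpha,
                                   DoubleBinaryOperator grad) {
        double sum = 0;
        for (double x : samples) {
            sum += grad.applyAsDouble(w, x);
        }
        return w - alpha * sum / samples.size();
    }

    public static void main(String[] args) {
        // gradient of 0.5 * (w - x)^2 with respect to w is (w - x)
        DoubleBinaryOperator grad = (w, x) -> w - x;
        List<Double> samples = List.of(1.0, 2.0, 3.0);
        System.out.println(onlineStep(0.0, samples, 0.5, grad)); // 2.125
        System.out.println(batchStep(0.0, samples, 0.5, grad));  // 1.0
    }
}
```

The two variants generally reach different weights after one pass; per-sample updates are noisier but update more often.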

Example

Let’s continue our previous example, where we were trying to train a network to tell whether a 5-digit binary number is divisible by 3.

    public static void main(String[] args) throws IOException {
        List<Pair<Matrix, Matrix>> trainingData = List.of(
                Pair.of(new Matrix(new double[][]{{0, 1, 1, 1, 0}}).transpose(), new Matrix(new double[][]{{0, 1}}).transpose()), // 14
                Pair.of(new Matrix(new double[][]{{0, 1, 0, 0, 1}}).transpose(), new Matrix(new double[][]{{1, 0}}).transpose()), // 9
                Pair.of(new Matrix(new double[][]{{1, 0, 1, 1, 0}}).transpose(), new Matrix(new double[][]{{0, 1}}).transpose()), // 22
                Pair.of(new Matrix(new double[][]{{1, 1, 0, 0, 0}}).transpose(), new Matrix(new double[][]{{1, 0}}).transpose()), // 24
                Pair.of(new Matrix(new double[][]{{1, 0, 0, 0, 0}}).transpose(), new Matrix(new double[][]{{0, 1}}).transpose()), // 16
                Pair.of(new Matrix(new double[][]{{1, 1, 1, 1, 1}}).transpose(), new Matrix(new double[][]{{0, 1}}).transpose()), // 31
                Pair.of(new Matrix(new double[][]{{0, 1, 1, 1, 1}}).transpose(), new Matrix(new double[][]{{1, 0}}).transpose()), // 15
                Pair.of(new Matrix(new double[][]{{0, 0, 0, 1, 1}}).transpose(), new Matrix(new double[][]{{1, 0}}).transpose()), // 3
                Pair.of(new Matrix(new double[][]{{0, 0, 1, 0, 0}}).transpose(), new Matrix(new double[][]{{0, 1}}).transpose()) // 4
        );

        NeuralNetwork neuralNetwork = NNBuilder.create(5, 2, List.of(3, 3));

        for (int t = 0; t < 100; t++) {
            double error = 0;
            for (Pair<Matrix, Matrix> p : trainingData) {
                neuralNetwork.trainForOneInput(p, 0.1);
                double errorAdditionTerm =
                        neuralNetwork.getOutputErrorDiff().apply(x -> x * x).sum()
                                / trainingData.size();
                error += errorAdditionTerm;
            }

            if ((t == 0) || ((t + 1) % 10 == 0)) {
                System.out.println("after " + (t + 1) + " epochs, average error: " + error);
            }

            trainingData = MathUtils.shuffle(trainingData);
        }
    }

where MathUtils.shuffle randomly shuffles the list to add some more randomness to training, and prevent learning cycles in the network.
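MathUtils is the author's own helper; one way it could plausibly be implemented with the standard library, returning a copy so the original list is never mutated (important here, since List.of produces immutable lists):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MathUtilsSketch {
    // Returns a shuffled copy; the input list is left untouched.
    public static <T> List<T> shuffle(List<T> input) {
        List<T> copy = new ArrayList<>(input);
        Collections.shuffle(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> shuffled = shuffle(List.of(1, 2, 3, 4, 5));
        System.out.println(shuffled.size()); // 5: same elements, possibly reordered
    }
}
```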

The output shows the average error at the first epoch and at every 10th epoch after that:

after 1 epochs, average error: 0.9629180864940948
after 10 epochs, average error: 0.5746570723596827
after 20 epochs, average error: 0.5414166775340912
after 30 epochs, average error: 0.5579636049797371
after 40 epochs, average error: 0.5033812777536886
after 50 epochs, average error: 0.5250070917059113
after 60 epochs, average error: 0.4463143328593298
after 70 epochs, average error: 0.45224944047562077
after 80 epochs, average error: 0.4010041333405863
after 90 epochs, average error: 0.3382820133296076
after 100 epochs, average error: 0.2682029780437851

Since the error is clearly trending downwards across epochs, we can say that our weight adjustments are working correctly and hence the network is ‘learning’. Yaay!

But this example is too lame. In the next article, we’ll train this network on the MNIST dataset and see if it actually works.

Resources

  1. https://brilliant.org/wiki/backpropagation/
  2. https://medium.com/@satviknema/neural-networks-implementation-from-the-ground-up-part-2-feedforward-5698568ed9f8
