Maths Story behind Back-Propagation

MUHAMMAD ATEEB TASEER
Technological Singularity
5 min read · Apr 17, 2024

Unraveling the Maths: The Essential Mechanics of Backpropagation

As I mentioned in my previous post, what if your model’s equation differs from mine? Would the principles of Hebb’s learning or the ADALINE rule still be applicable? Or is it necessary to establish a generalized approach that applies to any model?
Yes, generalization was all we needed, and it arrived when Learning representations by back-propagating errors was published in 1986. With backpropagation, it became possible to use neural networks to solve problems that had previously been intractable. Today, backpropagation is the workhorse of learning in neural networks; without it, we would waste both time and energy. It all boils down to applying the chain rule in forward versus reverse accumulation mode.

Forward and Reverse Accumulation Modes:

Suppose we have a model equation/function:

Let us decompose the function with the help of intermediate variables:
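For concreteness, take a simple stand-in such as y = sin(x²) (this is only an illustrative choice; any single-input, single-output composition works the same way). Decomposing it into intermediate variables gives:

w1 = x²
w2 = sin(w1) = y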

To compute the derivative dy/dx, we can traverse the chain rule

  1. from inside to outside, or
  2. from outside to inside.

Chain Rule:
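For a composition that passes through intermediate variables w1, …, wn on the way from x to y, the chain rule reads:

dy/dx = (dy/dwn) · (dwn/dw(n−1)) · … · (dw2/dw1) · (dw1/dx)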

So we know the chain rule; let's start with an inside-first traversal of it, i.e., the forward accumulation mode:
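With the stand-in y = sin(x²), the inside-first traversal starts at the input and works outward:

dw1/dx = 2x
dy/dx = (dy/dw1) · (dw1/dx) = cos(w1) · 2x = 2x · cos(x²)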

On the other hand, the reverse accumulation mode performs an outside-first traversal of the chain rule, which is more commonly known as backpropagation:
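For the same stand-in, the outside-first traversal starts at the output and works inward:

dy/dw2 = 1
dy/dw1 = (dy/dw2) · (dw2/dw1) = cos(w1)
dy/dx = (dy/dw1) · (dw1/dx) = cos(x²) · 2x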

Both methods arrive at the same derivative, using the same number of computations; however, this is not always the case, as we will soon find out.

Note that the forward accumulation mode computes the recurrence relation as:
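Written out for a simple chain of intermediates w1, …, wn, this recurrence is:

∂wi/∂x = (∂wi/∂w(i−1)) · (∂w(i−1)/∂x), seeded with ∂x/∂x = 1.

Each intermediate carries its own derivative with respect to the chosen input forward through the computation.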

In contrast, the reverse accumulation mode computes the recurrence relation as:
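Again for a simple chain, the standard form is:

∂y/∂wi = (∂y/∂w(i+1)) · (∂w(i+1)/∂wi), seeded with ∂y/∂y = 1.

Each intermediate carries the derivative of the chosen output with respect to itself backward through the computation.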

Now, let us move on to a function f : ℝ³ → ℝ², where it will be easier to analyze the computational complexity of the forward and reverse accumulation modes.

Example:

To make a good comparison, we need an example with a different number of dependent variables than independent variables. The following function fulfills that requirement:
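The partial derivatives worked out below are all consistent with the following choice (assumed here for concreteness, since the derivatives pin the function down only up to additive constants):

y1 = x1 · (x2 − x3)
y2 = x3 · log(1 − x1)

so f : ℝ³ → ℝ² maps (x1, x2, x3) to (y1, y2).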

Next, to make the gradient computations as simple as possible, let's make sure that after decomposition we are left with only straightforward arithmetic operations and elementary functions:
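One possible decomposition of the assumed function above into elementary steps is:

w1 = x2 − x3
w2 = x1 · w1 (= y1)
w3 = 1 − x1
w4 = log(w3)
w5 = x3 · w4 (= y2)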

The partial derivatives ∂y1/∂x1, ∂y1/∂x2, ∂y1/∂x3, ∂y2/∂x1, ∂y2/∂x2, and ∂y2/∂x3 can all be computed using the chain rule, starting with an inside-first traversal.

The Forward Accumulation Mode:

Iteration 1:

Computing the partial derivative of every intermediate variable once gives us:
∂y1/∂x1 = x2 − x3,
∂y2/∂x1 = −x3 / (1 − x1).

as we know, d log(u)/du = 1/u, which is where the 1/(1 − x1) factor comes from.

This was all rather tedious by hand, so let's do the remaining calculations with the power of ChatGPT.
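The remaining two forward iterations, seeding x2 and then x3 in the same way, give:

∂y1/∂x2 = x1,
∂y2/∂x2 = 0,

∂y1/∂x3 = −x1,
∂y2/∂x3 = log(1 − x1).

That makes three forward iterations in total, one per input.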

Note: In forward mode, you start from the independent variables and move towards the dependent variables, calculating the derivatives as you go. In each iteration you seed one independent variable at a time and observe how the dependent variables change, i.e., you start from the independent variables and see how they affect the dependent variables.
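To make the note concrete, here is a minimal forward-mode sketch in Python. It assumes the example function and the w1…w5 decomposition reconstructed above (my notation, not taken from the original figures); each call seeds exactly one input with derivative 1 and pushes values and derivatives forward together.

import math

# Minimal forward-mode sketch for the assumed example y1 = x1*(x2 - x3),
# y2 = x3*log(1 - x1). Each pass propagates a (value, derivative) pair for
# ONE seeded input, so three inputs mean three passes to fill the 2x3 Jacobian.
def forward_pass(x1, x2, x3, dx1, dx2, dx3):
    w1, dw1 = x2 - x3, dx2 - dx3
    w2, dw2 = x1 * w1, dx1 * w1 + x1 * dw1      # product rule; w2 = y1
    w3, dw3 = 1 - x1, -dx1
    w4, dw4 = math.log(w3), dw3 / w3            # d log(u)/du = 1/u
    w5, dw5 = x3 * w4, dx3 * w4 + x3 * dw4      # w5 = y2
    return (w2, w5), (dw2, dw5)

x = (0.5, 2.0, 3.0)                              # arbitrary test point
jacobian_columns = [forward_pass(*x, *seed)[1]
                    for seed in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
print(jacobian_columns)                          # one Jacobian column per forward pass

Three such passes, one per input, yield the three columns of the 2×3 Jacobian, which is exactly why forward mode scales with the number of inputs.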

Before drawing any conclusions, let us work through the same example again. This time around, we will perform an outside-first traversal of the chain rule, which we call the “reverse accumulation mode”.

The Reverse Accumulation Mode:

Using ChatGPT:

“Behold the power of backpropagation! Computing the partial derivative with respect to every intermediate variable once gives us
∂y1/∂x1 = x2 − x3,
∂y1/∂x2 = x1,
∂y1/∂x3 = −x1.”

Stop! Read and understand these notes carefully to grasp the actual application of backpropagation.

Note: As we go from the outside (the dependent variable) to the inside (the independent variables), the dependent variable stays the same within each iteration while we sweep over the independent variables, because we are finding the change in one dependent variable with respect to every independent variable in a single iteration.

In each iteration of reverse mode (backpropagation), you calculate how the dependent variable (usually the loss or output) is affected by changes in the intermediate variables and then by the independent variables (usually the inputs or weights), i.e., you start from the dependent variable (the output) and trace back, step by step in reverse, how each part of the function contributed to its change.

A second and final iteration concludes with:

∂y2/∂x1 = −x3 / (1 − x1),
∂y2/∂x2 = 0,
∂y2/∂x3 = log(1 − x1).
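To see the whole backward sweep in one place, here is a minimal reverse-mode sketch in Python, again assuming the reconstructed example function and decomposition above (the _bar variables are the adjoints, i.e., the derivative of the seeded output with respect to each node).

import math

# Minimal reverse-mode (backpropagation) sketch on the same assumed example:
# one forward evaluation to record the intermediates, then ONE backward pass
# per output, so two outputs mean two passes for the same 2x3 Jacobian.
def reverse_pass(x1, x2, x3, y1_bar, y2_bar):
    # forward sweep: evaluate and store the intermediates
    w1 = x2 - x3
    w2 = x1 * w1                      # = y1
    w3 = 1 - x1
    w4 = math.log(w3)
    w5 = x3 * w4                      # = y2
    # backward sweep: adjoints flow from the outputs toward the inputs
    w4_bar = y2_bar * x3              # ∂w5/∂w4 = x3
    w3_bar = w4_bar / w3              # ∂w4/∂w3 = 1/w3
    w1_bar = y1_bar * x1              # ∂w2/∂w1 = x1
    x1_bar = y1_bar * w1 - w3_bar     # ∂w2/∂x1 = w1, ∂w3/∂x1 = -1
    x2_bar = w1_bar                   # ∂w1/∂x2 = 1
    x3_bar = -w1_bar + y2_bar * w4    # ∂w1/∂x3 = -1, ∂w5/∂x3 = w4
    return x1_bar, x2_bar, x3_bar

x = (0.5, 2.0, 3.0)
jacobian_rows = [reverse_pass(*x, *seed) for seed in ((1, 0), (0, 1))]
print(jacobian_rows)                   # one Jacobian row per backward pass

Two such passes, one per output, yield the two rows of the same Jacobian, which is why reverse mode scales with the number of outputs.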

Do you start to recognize any patterns?

Computational Complexity:

Analyzing the pen-and-paper example: in the forward accumulation mode we needed three iterations, because we had three independent variables (i.e., three inputs). In the reverse accumulation mode, on the other hand, we only needed two iterations, because we had two dependent variables (i.e., two outputs).
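In general, forward mode builds the Jacobian one column per pass (n passes for n inputs), while reverse mode builds it one row per pass (m passes for m outputs). As an illustrative count (not part of the original example): a model with n = 1,000,000 trainable parameters and a single scalar loss (m = 1) would need a million forward-mode passes to assemble the full gradient, but only a single backward pass in reverse mode.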

Words To Notice!!!

In closing, deep learning models may very well have trainable parameters in the millions, but they always have only one scalar cost function; hence, we always work with problems where n ≫ m = 1, which is exactly where backpropagation excels. Now, do you understand how backpropagation is able to reduce the time spent on computing gradients? 🏎

Let's look at forward and backward propagation with respect to neural networks in depth in the next blog.
