Training an MLP From Scratch Using Backpropagation for Solving Mathematical Equations

Subodh Lonkar · Published in The Startup · Aug 10, 2020 · 8 min read

This post demonstrates the concept and use of backpropagation for solving a mathematical equation.

First of all, why should we even care about Neural Networks?

Well, we do have a biological inspiration for that! Rigorous studies in biology have revealed that the human brain consists of connected structures of neurons. Each neuron performs a specific task, and millions of such neurons together are capable of performing highly complex tasks. The first and simplest neural network model, called the Perceptron, was built by Rosenblatt in 1958.
The Perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input x (a real-valued vector) to an output value f(x) (a single binary value). The value of f(x) (0 or 1) is used to classify x as either a positive or a negative instance. The Perceptron is a single neuron, loosely inspired by the biological neuron; it is not an exact replica, but it is powerful enough to solve many interesting problems.
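As a quick illustration of the threshold function (the weight and bias values here are made up, not from any trained model), a minimal sketch in Python:

import numpy as np

def perceptron(x, w, b):
    '''Threshold function: output 1 if w.x + b > 0, else 0.'''
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.5, -0.3])                        # hypothetical weights
b = 0.1                                          # hypothetical bias
print(perceptron(np.array([1.0, 2.0]), w, b))    # classifies x as positive (1) or negative (0)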

But for performing complex tasks, a bunch of such Perceptrons need to work together.

Multi-Layered Perceptron

In a Multi-Layered Perceptron, a bunch of single neurons are stacked to form a layer, and many such layers are stacked to form a network of neurons. This can be thought of as a function of functions, analogous to the composite functions we learnt in high school. Thus, with an MLP we can have complex functions act on the inputs to get the output. MLPs can easily overfit, so regularizers are applied to avoid overfitting. An MLP can be thought of as a graphical way of representing functional compositions.

Consider a simple equation where we perform some basic arithmetic operations:

y = (x1 + x2) * (x3 - x4)

Here we are using 3 arithmetic operations: addition, subtraction and multiplication.
Let’s write 3 functions add(), sub() and mul() which perform addition, subtraction and multiplication respectively.
The above equation can now be written as follows:

y = mul(add(x1, x2), sub(x3, x4))

The computational graph for this equation can be constructed as shown below:
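The same composition can also be sketched in plain Python (just to illustrate the idea; this is not the network we train later):

def add(p, q):
    return p + q

def sub(p, q):
    return p - q

def mul(p, q):
    return p * q

# y = (x1 + x2) * (x3 - x4), written as a composition of the three functions
x1, x2, x3, x4 = 1.0, 2.0, 5.0, 3.0        # sample values for illustration
y = mul(add(x1, x2), sub(x3, x4))
print(y)                                   # (1 + 2) * (5 - 3) = 6.0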

Backpropagation

Backpropagation can simply be thought of as Chain Rule + Memoization. Chain rule refers to the one we generally use in differentiation. Memoization refers to storing frequently used values, which reduces the effort of performing the same computations again and again.
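As a toy example of this idea (separate from the network below), consider h(x) = exp(sin(x)): the forward value sin(x) is computed once, stored, and then reused when applying the chain rule:

import numpy as np

x = 0.5
s = np.sin(x)           # forward value, stored (memoized)
h = np.exp(s)           # h(x) = exp(sin(x))
dhdx = h * np.cos(x)    # chain rule: dh/dx = exp(sin(x)) * cos(x); exp(sin(x)) is reused, not recomputed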

As per Wikipedia, “Backpropagation is a widely used algorithm in training feed forward neural networks for supervised learning. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.”

One very important thing to note is that for backpropagation to work, our activation functions must be differentiable. Functions which are easy/fast to differentiate speed up the overall process, since computing their derivatives takes less time.
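The sigmoid we use later in this post is a good example: its derivative sigmoid(z) * (1 - sigmoid(z)) reuses the already-computed forward value. A minimal sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)          # reuse the forward value
    return s * (1.0 - s)    # d(sigmoid)/dz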

Having got a taste of some key concepts, let's jump in to see how the idea of backpropagation can be used for solving a complex mathematical equation.

Consider the equation shown below, where y' denotes the prediction:

y' = exp(sin((w1 + f1) * (w2 * f2) - ((w1 + f1) + (w2 * f2)))) + exp(sigmoid(w3 * f3) + w5) * exp(sigmoid(w4 + f4) + w6)

As previously seen, we can write the above equation as:

q = add(exp(sin(sub(mul(add(w1, f1), mul(w2, f2)), add(add(w1, f1), mul(w2, f2))))),
        mul(exp(add(sigmoid(mul(w3, f3)), w5)), exp(add(sigmoid(add(w4, f4)), w6))))

Let’s just say y_pred = q.

Phew! This seems complex. But fortunately, it's not :) Refer to the computational graph below for a better understanding.

We can think of each node or neuron as one of the functions we are using in this equation.

Here we have 4-dimensional inputs and a weight vector of size 6. The loss used is the squared loss. We need to find the best possible solution by reducing the loss as much as possible. The flow of the Gradient Descent algorithm (one can use any of GD, SGD or mini-batch SGD as per their choice, although mini-batch SGD is preferred by many) for solving this optimization problem looks as shown below.

Given D = {x_i, y_i} // x_i is the input data point and y_i is the corresponding target value
1. Initialize weights → randomly (there are effective initialization methods available)
2. For each x_i:
→ Pass x_i forward through network:
— → Get y_i_hat
— → This is forward propagation
→ Compute loss
→ Compute derivatives using chain rule and memoization
→ Update weights from end to start
3. Repeat step 2 until convergence, i.e. W_new ~ W_old

Forward propagation → using x_i to calculate y_i_hat and L
Backward propagation → using L to update the weights
One forward and one backward pass over every point in the dataset together form an epoch.
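As a minimal sketch of the weight update itself (the learning rate and gradient values here are assumptions, only for illustration), every weight takes a small step opposite to its gradient:

import numpy as np

learning_rate = 0.001                                   # assumed value, tune as needed
w = np.random.normal(0, 0.1, 6)                         # random initialization of the 6 weights
dLdw = np.array([0.2, -0.1, 0.05, 0.0, 0.3, -0.2])      # hypothetical gradients from backpropagation
w = w - learning_rate * dLdw                            # gradient descent update: W_new = W_old - lr * dL/dW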

We will be using numpy which can be imported as follows:

import numpy as np

Let’s have a look at the forward pass or forward propagation.

Forward propagation can be considered as the process of moving from left to right. We calculate y_pred and Loss from x_i in this pass.

def forward_pass(x, y, w):
    '''In this function, we will compute the forward propagation.'''
    # x: input data point; in the current equation we have 4-d data points
    # y: output variable
    # w: weight array of length 6; w[0] corresponds to w1 in the
    #    computational graph, w[1] corresponds to w2, ..., w[5] corresponds to w6

    a = w[0] + x[0]                 # w1 + f1
    b = w[1] * x[1]                 # w2 * f2
    c = a * b
    d = a + b
    e = c - d
    f = np.sin(e)
    g = np.exp(f)
    h = w[2] * x[2]                 # w3 * f3
    i = sigmoid(h)
    k = w[4] + i                    # w5 + sigmoid(w3 * f3)
    m = np.exp(k)
    j = w[3] + x[3]                 # w4 + f4
    l = sigmoid(j)
    n = w[5] + l                    # w6 + sigmoid(w4 + f4)
    o = np.exp(n)
    p = m * o
    q = g + p
    y_pred = q                      # the prediction

    L = (y - y_pred) ** 2           # squared loss
    dldy_pred = -2 * (y - y_pred)   # dL/dy_pred, reused in the backward pass

    dictionary = {
        "a": a, "b": b, "c": c, "d": d, "e": e, "f": f, "g": g, "h": h,
        "i": i, "j": j, "k": k, "l": l, "m": m, "n": n, "o": o, "p": p,
        "q": q, "y_pred": y_pred, "loss": L, "dl": dldy_pred
    }

    return dictionary

The sigmoid function we used above can be written as follows:

def sigmoid(z):
    '''In this function, we will compute sigmoid(z).'''
    # we can use this function in both forward and backward propagation
    return 1 / (1 + np.exp(-z))
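With forward_pass() and sigmoid() in place, here is a quick usage sketch with made-up values (these numbers are arbitrary, just to show the call):

x = np.array([1.0, 2.0, 0.5, 1.5])      # a single 4-d data point (f1, f2, f3, f4)
y = 3.0                                 # its target value
w = np.random.normal(0, 0.1, 6)         # random initialization of w1..w6
d = forward_pass(x, y, w)
print(d["y_pred"], d["loss"])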

Now let's look at the backward pass, or backward propagation.

Backward propagation can be considered as the process of moving from right to left, that is, moving backwards to update the weights using the loss L. Remember, we need to find the optimal weights to get the best result.

Notation: dydx = differentiation of y w.r.t. x.

Here, we differentiate L w.r.t. all 6 weights, that is, we compute dLdw1, dLdw2, dLdw3, dLdw4, dLdw5 and dLdw6. We can easily do it as follows:

dqdg = 1
dqdf = dqdg * dgdf = 1 * np.e**(f) = np.e**(f)
dqde = dqdf * dfde = np.e**(f) * np.cos(e)
dqdc = dqde * dedc = [np.e**(f) * np.cos(e)] * 1 = np.e**(f) * np.cos(e)
dqdd = dqde * dedd = [np.e**(f) * np.cos(e)] * (-1) = -[np.e**(f) * np.cos(e)]
dqda = dqde * (dedc * dcda + dedd * ddda) = [np.e**(f) * np.cos(e)] * (b-1)
dqdb = dqde * (dedc * dcdb + dedd * dddb) = [np.e**(f) * np.cos(e)] * (a-1)
dqdw1 = dqda * dadw1 = {[np.e**(f) * np.cos(e)] * (b-1)} * 1 = [np.e**(f) * np.cos(e)] * (b-1)
dqdw2 = dqdb * dbdw2 = {[np.e**(f) * np.cos(e)] * (a-1)} * f2

dqdp = 1
dqdm = dqdp * dpdm = 1 * o = o
dqdk = dqdm * dmdk = o * np.e**(k)
dqdi = dqdk * dkdi = (o * np.e**(k)) * 1 = o * np.e**(k)
dqdh = dqdi * didh = (o * np.e**(k)) * [sigmoid(h) * (1-sigmoid(h))]
dqdw3 = dqdh * dhdw3 = (o * np.e**(k)) * [sigmoid(h) * (1-sigmoid(h))] *f3
dqdw5 = dqdk * dkdw5 = (o * np.e**(k)) * 1 = o * np.e**(k)
dqdo = dqdp * dpdo = 1 * m = m
dqdn = dqdo * dodn = m * np.e**(n)
dqdw6 = dqdn * dndw6 = (m * np.e**(n)) * 1 = m * np.e**(n)
dqdl = dqdn * dndl = (m * np.e**(n)) * 1 = m * np.e**(n)
dqdj = dqdl * dldj = (m * np.e**(n)) * [sigmoid(j) * (1-sigmoid(j))] * 1 = (m * np.e**(n)) * [sigmoid(j) * (1-sigmoid(j))]
dqdw4 = dqdj * djdw4 = (m * np.e**(n)) * [sigmoid(j) * (1-sigmoid(j))] * 1 = (m * np.e**(n)) * [sigmoid(j) * (1-sigmoid(j))]

Thus, we can write the backward_pass function as:

def backward_pass(x, w, d):
    '''In this function, we will compute the backward propagation.'''
    # x: input data point
    # w: weight array of length 6
    # d: the dictionary returned by the forward_pass() function
    # compute the gradients of each weight [w1, w2, w3, ..., w6]
    dqdg = 1
    dqdf = np.exp(d["f"])
    dqde = np.exp(d["f"]) * np.cos(d["e"])
    dqdc = np.exp(d["f"]) * np.cos(d["e"])
    dqdd = -(np.exp(d["f"]) * np.cos(d["e"]))
    dqda = (np.exp(d["f"]) * np.cos(d["e"])) * (d["b"] - 1)
    dqdb = (np.exp(d["f"]) * np.cos(d["e"])) * (d["a"] - 1)
    dqdw1 = (np.exp(d["f"]) * np.cos(d["e"])) * (d["b"] - 1)
    dqdw2 = (np.exp(d["f"]) * np.cos(d["e"])) * (d["a"] - 1) * x[1]

    dqdp = 1
    dqdm = d["o"]
    dqdk = d["o"] * np.exp(d["k"])
    dqdi = d["o"] * np.exp(d["k"])
    dqdh = (d["o"] * np.exp(d["k"])) * (sigmoid(d["h"]) * (1 - sigmoid(d["h"])))
    dqdw3 = (d["o"] * np.exp(d["k"])) * (sigmoid(d["h"]) * (1 - sigmoid(d["h"]))) * x[2]
    dqdw5 = d["o"] * np.exp(d["k"])
    dqdo = d["m"]
    dqdn = d["m"] * np.exp(d["n"])
    dqdw6 = d["m"] * np.exp(d["n"])
    dqdl = d["m"] * np.exp(d["n"])
    dqdj = (d["m"] * np.exp(d["n"])) * (sigmoid(d["j"]) * (1 - sigmoid(d["j"])))
    dqdw4 = dqdj                    # j = w4 + f4, so djdw4 = 1

    # chain with dL/dy_pred, stored as "dl" in the forward pass dictionary
    dLdw1 = d["dl"] * dqdw1
    dLdw2 = d["dl"] * dqdw2
    dLdw3 = d["dl"] * dqdw3
    dLdw4 = d["dl"] * dqdw4
    dLdw5 = d["dl"] * dqdw5
    dLdw6 = d["dl"] * dqdw6

    dW = {
        "dw1": dLdw1, "dw2": dLdw2, "dw3": dLdw3,
        "dw4": dLdw4, "dw5": dLdw5, "dw6": dLdw6
    }

    # return dW, a dictionary with the gradients of all the weights
    return dW
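Putting the two passes together, a minimal training loop could look like the sketch below. The dataset, learning rate and number of epochs here are assumptions for illustration only:

learning_rate = 0.001                                  # assumed value
epochs = 100                                           # assumed value
w = np.random.normal(0, 0.1, 6)                        # random weight initialization

X = np.random.rand(50, 4)                              # hypothetical 4-d data points
Y = np.random.rand(50)                                 # hypothetical targets

for epoch in range(epochs):
    for x_i, y_i in zip(X, Y):
        d = forward_pass(x_i, y_i, w)                  # forward propagation
        dW = backward_pass(x_i, w, d)                  # backward propagation
        grads = np.array([dW["dw1"], dW["dw2"], dW["dw3"],
                          dW["dw4"], dW["dw5"], dW["dw6"]])
        w = w - learning_rate * grads                  # update weights from end to start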

We can do a sanity check with gradient checking: the analytical gradients returned by backward_pass() should match numerical gradients obtained by slightly perturbing each weight, so their differences have to be close to zero. We haven't discussed the details of gradient checking in this blog.
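As a rough sketch of the idea (the function name, epsilon value and the assumption that w is a NumPy array are mine, not from the original post), the analytical gradients can be compared against finite-difference gradients:

def gradient_check(x, y, w, eps=1e-6):
    '''Compare analytical gradients with numerical (finite-difference) gradients.'''
    d = forward_pass(x, y, w)
    dW = backward_pass(x, w, d)
    analytical = np.array([dW["dw" + str(i + 1)] for i in range(6)])
    numerical = np.zeros(6)
    for i in range(6):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps                                 # perturb one weight at a time
        w_minus[i] -= eps
        numerical[i] = (forward_pass(x, y, w_plus)["loss"] -
                        forward_pass(x, y, w_minus)["loss"]) / (2 * eps)
    return np.abs(analytical - numerical)                # each entry should be close to zero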

Thus, this is how we can convert a mathematical equation into a computational graph, write it as a function of functions, i.e. using the concept of composite functions, and then find the optimal edge weights using backpropagation.

If you like this article, don’t forget to leave a “clap”!

Thank you for your time!
