Implementing Batched Backpropagation (from scratch)

Harsh-Sensei
7 min read · Dec 15, 2022


In this blog, I would like to go through the implementation of the backpropagation algorithm as used in deep learning libraries like PyTorch, TensorFlow, JAX, etc. Through this, I aim to give better insight into the backpropagation algorithm, and perhaps some appreciation of the ease that the above-mentioned libraries have brought to this realm of deep learning, which is often taken for granted ;) .

Source : One Piece

We will be using the NumPy library for matrix operations, and all of the implementation (forward pass and backward pass) is carried out for batched inputs. Much of the math used in this blog can be found in the Stanford slides linked in the references. I would highly recommend going through those slides in case anything here is unclear.

The incomplete code (in assignment.py) that we will use as the base is shown class by class below, and we will go into the details of the implementation of each function.

Now, there are already many online resources explaining the math behind backpropagation, a few of which are listed in the references at the end.

The focus of this blog will instead be on handling batched inputs without using for loops (that is, with vectorized computations). Let's go…

FlattenLayer

class FlattenLayer:
    '''
    This class converts a multi-dimensional input into a 1-D vector
    '''
    def __init__(self, input_shape):
        self.input_shape = input_shape

    def forward(self, input):
        ## TODO
        return input
        ## END TODO

    def backward(self, output_error):
        ## TODO
        return output_error
        ## END TODO

forward requires reshaping the batched input to its flattened form. So, we preserve the first (batch) dimension and flatten the rest.

def forward(self, input):
    input = input.reshape(self.input_shape[0], -1)  # -1 flattens the remaining dimensions
    return input

For backward, the error just needs to be reshaped to the original shape (the input_shape). Hence,

def backward(self, output_error, learning_rate=None):
    # learning_rate is unused here; the default keeps the interface uniform with
    # the other layers, since the training loop passes it to every layer
    return output_error.reshape(self.input_shape)
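As a quick sanity check, here is a hypothetical shape demo (not part of assignment.py) for a batch of 4 images of size 28x28:

import numpy as np

layer = FlattenLayer(input_shape=(4, 28, 28))   # first dimension is the batch size
x = np.random.randn(4, 28, 28)                  # batch of 4 images
print(layer.forward(x).shape)                   # (4, 784)
print(layer.backward(np.ones((4, 784))).shape)  # (4, 28, 28)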
Easy, huh? (Source : Bleach)

Alright, moving to the next class…

import numpy as np

class FCLayer:
    '''
    Implements a fully connected layer
    '''
    def __init__(self, input_size, output_size):
        self.input_size = input_size
        self.output_size = output_size
        self.weights = np.random.randn(input_size, output_size)
        self.bias = np.random.randn(1, output_size)

    def forward(self, input):
        ## TODO
        return None
        ## END TODO

    def backward(self, output_error, learning_rate):
        ## TODO
        return None
        ## END TODO

forward requires carrying out an affine transformation on the batched input, that is, multiplying by the weights and then adding the bias to the product. Note that we also need to save the input because it is required in the backward function.

def forward(self, input):
    self.X = input.copy()
    return np.matmul(input, self.weights) + self.bias

backward is different from that of FlattenLayer: here we need to update the weights and then propagate (return) the gradient backwards.

def backward(self, output_error, learning_rate):
    ret_error = output_error @ self.weights.T
    self.weights = self.weights - learning_rate*(self.X.T @ output_error)
    # the bias gradient is the output_error summed over the batch dimension
    self.bias = self.bias - learning_rate*output_error.sum(axis=0, keepdims=True)
    return ret_error

To understand this, we need a bit of matrix calculus, but there are some tricks that can be used to get the derivatives of matrix products easily. Let Y = X*W. The shapes of these matrices are: Y : b*c ; X : b*a ; W : a*c.

Consider these 2 important results (or thumb rules):

  1. The derivative of a scalar w.r.t. a matrix M is a matrix of the same dimensions as M.
  2. For linear transformations, like the matrix multiplication above, the derivative can be found by matching the shapes of the matrices (see the example below to understand this).

Using these 2 results, we know output_error (the gradient of the loss w.r.t. Y) has shape b*c. We need the derivative w.r.t. W to update W; matching shapes, we need an a*c matrix built from X and output_error, which is obtained by transposing X and post-multiplying it by output_error. Further, we need the derivative w.r.t. X to propagate backwards (here, we just return it). A similar analysis gives ret_error as output_error post-multiplied by the transpose of W.
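To sanity-check this shape-matching trick, here is a small numerical gradient check. This is a standalone sketch, not part of assignment.py; the scalar loss is arbitrary and chosen only for illustration.

import numpy as np

b, a, c = 4, 3, 5
X = np.random.randn(b, a)
W = np.random.randn(a, c)

def loss(X, W):
    return np.sum((X @ W)**2)    # arbitrary scalar loss L

dY = 2*(X @ W)                   # dL/dY, shape b*c (plays the role of output_error)
dW_analytic = X.T @ dY           # shape a*c, matches W
dX_analytic = dY @ W.T           # shape b*a, matches X

# compare dW against finite differences
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(a):
    for j in range(c):
        W_pert = W.copy()
        W_pert[i, j] += eps
        dW_numeric[i, j] = (loss(X, W_pert) - loss(X, W))/eps

print(np.allclose(dW_analytic, dW_numeric, atol=1e-3))  # True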

Still easy, huh? (Source : One piece)

From now on, I will skip the #TODO skeletons and move directly to the complete code snippets.

ActivationLayer

class ActivationLayer:
    '''
    Implements an activation layer which applies an activation function on the inputs.
    '''
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.X = input.copy()
        return self.activation(input)

    def backward(self, output_error, learning_rate):
        return output_error*self.activation_prime(self.X)

This class depends on the choice of activation function, which is set in the constructor. forward and backward are pretty straightforward (pun intended). Let's now go over the forward and backward (derivative) functions for various activation functions.

Sigmoid

def sigmoid(x):
    x = np.clip(x, a_min=-100, a_max=100)  # clipping for numerical stability
    return 1/(1 + np.exp(-x))

def sigmoid_prime(x):
    x = sigmoid(x)*(1 - sigmoid(x))
    return x

Tanh

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    x = 1 - np.tanh(x)**2
    return x

ReLU

def relu(x):
    return 0.5*(np.abs(x) + x)

def relu_prime(x):
    return x > 0

Note that the "*" operator performs element-wise multiplication between NumPy arrays. The ActivationLayer class uses element-wise multiplication (also called the Hadamard product) in backward, not matrix multiplication.
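For instance, here is a hypothetical usage of ActivationLayer with the ReLU pair defined above (not from the script); backward simply masks the incoming error element-wise:

import numpy as np

layer = ActivationLayer(relu, relu_prime)
x = np.array([[-1.0,  2.0, -3.0],
              [ 4.0, -5.0,  6.0]])            # batch of 2 samples, 3 features
out = layer.forward(x)                        # negative entries become 0

output_error = np.ones_like(x)                # pretend upstream gradient
grad = layer.backward(output_error, learning_rate=0.1)
print(grad)                                   # 1 where x > 0, else 0
# [[0. 1. 0.]
#  [1. 0. 1.]]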

Softmax

class SoftmaxLayer:
    '''
    Implements a Softmax layer which applies the softmax function on the inputs.
    '''
    def __init__(self, input_size):
        self.input_size = input_size

    def forward(self, input):
        input = np.clip(input, a_min=-100, a_max=100)
        self.softm = np.divide(np.exp(input), np.sum(np.exp(input), axis=1)[:, None])
        return self.softm.copy()

    def backward(self, output_error, learning_rate=None):
        # learning_rate is unused; the default keeps the layer interface uniform with the training loop
        # diagflat for batched input (see references)
        tmp = self.softm.copy()
        N = tmp.shape[1]
        tmp = np.expand_dims(tmp, axis=1)
        tmp_diag = tmp*np.eye(N)  # dims = (batch, N, N)

        # 2nd term that needs to be subtracted from the first
        tmp = self.softm.copy()
        term2 = np.matmul(np.expand_dims(tmp, axis=2), np.expand_dims(tmp, axis=1))  # (b, N, 1)X(b, 1, N) = (b, N, N)

        grad = tmp_diag - term2  # (b, N, N)
        return np.matmul(grad, np.expand_dims(output_error, axis=2)).squeeze(axis=2)

Here the forward function implements the softmax operation along each row (remember that we have batched inputs).

backward requires a more careful explanation. The derivative of softmax is derived in reference 2: for s = softmax(x), we have ∂s_i/∂x_j = s_i(δ_ij − s_j), which in matrix form is diag(s) − s sᵀ.

Derivative of softmax (Source : link)

We now want to implement this for batched inputs without using any for loops. If we wanted to do it for a single datapoint (unbatched), the following would have worked:

def backward(self, output_error, learning_rate):
    tmp = self.softm.copy()
    grad = np.diagflat(tmp) - tmp.T @ tmp
    return output_error @ grad

grad above computes the following matrix (ignore the specific numbers):

Derivative of softmax (Source : link)

We need to extend the above idea to batched inputs. So we first implement diagflat (link) for batched inputs, and then the outer product for batched inputs. Then, using the shape thumb rule (with a squeeze at the end to keep the dimensions consistent), we get the matrix to be returned.
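A quick way to convince yourself that the vectorized backward matches the per-sample formula is to compare it against an explicit loop. This is a hypothetical check, not part of the script:

import numpy as np

np.random.seed(0)
b, N = 4, 10
logits = np.random.randn(b, N)
output_error = np.random.randn(b, N)

layer = SoftmaxLayer(N)
s = layer.forward(logits)
batched = layer.backward(output_error)       # vectorized over the batch

looped = np.zeros_like(batched)
for i in range(b):
    row = s[i:i+1]                           # shape (1, N)
    grad = np.diagflat(row) - row.T @ row    # (N, N) Jacobian for one sample
    looped[i] = output_error[i] @ grad

print(np.allclose(batched, looped))          # True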

Still easy?? (Source : link)

Cross-entropy loss

def cross_entropy(y_true, y_pred):
    # y_true is a 1-D array containing the gt classes
    # y_pred is of shape batch X num_classes, each row containing the probabilities of the respective classes
    preds = y_pred[range(len(y_pred)), y_true]  # fetch the predicted probability of the gt class from each row
    return -np.mean(np.log(preds))  # NLL followed by averaging

def cross_entropy_prime(y_true, y_pred):
    tmp = np.zeros_like(y_pred)
    rows = range(len(y_pred))
    # derivative of the averaged NLL: non-zero only at the gt entry of each row (see references)
    tmp[rows, y_true] = -1/(y_pred[rows, y_true]*len(y_pred))
    return tmp
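A small hypothetical example of how these two functions are called (class indices in y_true, rows of softmax probabilities in y_pred):

import numpy as np

y_true = np.array([2, 0])                    # ground-truth class indices for a batch of 2
y_pred = np.array([[0.1, 0.2, 0.7],          # softmax outputs, each row sums to 1
                   [0.8, 0.1, 0.1]])

print(cross_entropy(y_true, y_pred))         # -(log(0.7) + log(0.8))/2 ≈ 0.290
print(cross_entropy_prime(y_true, y_pred))
# non-zero only at the ground-truth entries (averaged over the batch of 2):
# [[ 0.      0.     -0.714]      -1/(0.7*2)
#  [-0.625   0.      0.   ]]     -1/(0.8*2)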

With this, we have completed the implementation of all the functions and classes. Two important things still remain: creating the architecture using these classes, and running forward and back propagation with the defined model. Both of these can be found in the fit function in the script.

Defining architecture

# defining the architecture of the model
network = [
    FlattenLayer(input_shape=(1, 28, 28)),  # flattening the input image
    FCLayer(28 * 28, 12),
    ActivationLayer(sigmoid, sigmoid_prime),
    FCLayer(12, 10),
    SoftmaxLayer(10)
]  # this creates a feed-forward network

Here the batch size is chosen to be 1 (it can be changed once the input datapoints are batched; note that the first element of input_shape acts as the batch dimension in FlattenLayer).

Forward propagation

# forward propagation
output = x  # x is the input data
for layer in network:
    output = layer.forward(output)
error += cross_entropy(y_true, output)  # not used for training (only for tracking the loss)

Back propagation

output_error = cross_entropy_prime(y_true, output)
for layer in reversed(network):
    output_error = layer.backward(output_error, learning_rate)

With this, we have gone through all parts of the script. Unfortunately, the script does not batch the input data yet, so our batched implementations will not work out of the box and might throw errors. After batching the inputs, the script can run and train a classification model on the "mnist" and "flower" datasets.
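As a rough illustration of what that batching could look like, here is a minimal sketch (not the actual fit function). The names X_train and y_train, the hyperparameters, and the assumption that the number of samples is a multiple of batch_size are all hypothetical; the first element of input_shape must match the batch size.

batch_size = 32
learning_rate = 0.1

# rebuild the network for the chosen batch size
network = [
    FlattenLayer(input_shape=(batch_size, 28, 28)),
    FCLayer(28 * 28, 12),
    ActivationLayer(sigmoid, sigmoid_prime),
    FCLayer(12, 10),
    SoftmaxLayer(10)
]

for epoch in range(10):
    error = 0
    for start in range(0, len(X_train), batch_size):
        x = X_train[start:start + batch_size]        # (batch_size, 28, 28)
        y_true = y_train[start:start + batch_size]   # (batch_size,)

        # forward pass
        output = x
        for layer in network:
            output = layer.forward(output)
        error += cross_entropy(y_true, output)

        # backward pass
        output_error = cross_entropy_prime(y_true, output)
        for layer in reversed(network):
            output_error = layer.backward(output_error, learning_rate)

    print(f"epoch {epoch}: mean loss {error/(len(X_train)//batch_size):.4f}")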

This isn't even a fraction of what libraries like PyTorch and TensorFlow are doing. We haven't considered operations that combine outputs from two different networks, nor have we built any computational graph for backpropagation. I hope you now have an idea of how powerful these ML libraries are, and I certainly hope you found this blog insightful.

References

  1. Batched diagflat in numpy: https://stackoverflow.com/questions/53741481/vectorized-creation-of-an-array-of-diagonal-square-arrays-from-a-liner-array-in
  2. Softmax derivative : https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
  3. Cross-Entropy and softmax derivative: https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
  4. Stanford slides for backpropagation and gradients : http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf
