Implementing Batched Backpropagation (from scratch)

Harsh-Sensei
7 min read · Dec 15, 2022


In this blog, I would like to go through the implementation of the backpropagation algorithm as used in deep learning libraries like PyTorch, TensorFlow, JAX, etc. Through this, I aim to give better insight into the backpropagation algorithm, and perhaps some appreciation of the ease that the above-mentioned libraries have brought to this realm of deep learning, which is often taken for granted ;) .

Source : One Piece

We will be using the NumPy library for matrix operations, and all of the implementation (forward pass and backward pass) is carried out for batched inputs. Much of the math used in this blog can be found in the Stanford slides linked in the references. I would highly recommend going through those slides in case anything here is unclear.

The incomplete code (in assignment.py) that we will use as the base is shown class by class below, and we will go into the details of the implementation of each function.

Now, there are already many online resources explaining the math behind backpropagation, a few of which are listed in the references at the end.

The focus of this blog will instead be on handling batched inputs without using for loops (that is, with vectorized computations). Let's go…

FlattenLayer

class FlattenLayer:
    '''
    This class converts a multi-dimensional input into a 1-D vector
    '''
    def __init__(self, input_shape):
        self.input_shape = input_shape

    def forward(self, input):
        ## TODO
        return input
        ## END TODO

    def backward(self, output_error):
        ## TODO
        return output_error
        ## END TODO

forward requires reshaping the batched input to its flattened form. So, we preserve the first (batch) dimension and flatten the rest.

def forward(self, input):
    input = input.reshape(self.input_shape[0], -1)  # -1 flattens the remaining dimensions
    return input

For backward, the error just needs to be reshaped to the original shape (the input_shape). Hence,

def backward(self, output_error, learning_rate=None):
    # learning_rate is unused here; the default keeps the interface uniform with
    # the other layers, since the training loop passes it to every layer
    return output_error.reshape(self.input_shape)
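As a quick sanity check, here is a hypothetical shape demo (not part of assignment.py) for a batch of 4 images of size 28x28:

import numpy as np

layer = FlattenLayer(input_shape=(4, 28, 28))   # first dimension is the batch size
x = np.random.randn(4, 28, 28)                  # batch of 4 images
print(layer.forward(x).shape)                   # (4, 784)
print(layer.backward(np.ones((4, 784))).shape)  # (4, 28, 28)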
Easy, huh? (Source : Bleach)

Alright, moving to the next class…

import numpy as np

class FCLayer:
    '''
    Implements a fully connected layer
    '''
    def __init__(self, input_size, output_size):
        self.input_size = input_size
        self.output_size = output_size
        self.weights = np.random.randn(input_size, output_size)
        self.bias = np.random.randn(1, output_size)

    def forward(self, input):
        ## TODO
        return None
        ## END TODO

    def backward(self, output_error, learning_rate):
        ## TODO
        return None
        ## END TODO

forward requires carrying out an affine transformation on the batched input, that is, multiplying by the weights and then adding the bias to the product. Note that we also need to save the input because it is required in the backward function.

def forward(self, input):
    self.X = input.copy()
    return np.matmul(input, self.weights) + self.bias

backward is different from that of FlattenLayer: here we need to update the weights and then propagate (return) the gradient backwards.

def backward(self, output_error, learning_rate):
    ret_error = output_error @ self.weights.T
    self.weights = self.weights - learning_rate*(self.X.T @ output_error)
    # the bias gradient is the output_error summed over the batch dimension
    self.bias = self.bias - learning_rate*output_error.sum(axis=0, keepdims=True)
    return ret_error

To understand this, we need a bit of matrix calculus, but there are some tricks that can be used to get the derivatives of matrix products easily. Let Y = X*W. The shapes of these matrices are: Y : b*c ; X : b*a ; W : a*c.

Consider these 2 important results (or thumb rules):

  1. The derivative of a scalar w.r.t. a matrix M is a matrix of the same dimensions as M.
  2. For linear transformations, like the matrix multiplication above, the derivative can be found by matching the shapes of the matrices (see the example below to understand this).

Using these 2 results, we know output_error (the gradient of the loss w.r.t. Y) has shape b*c. We need the derivative w.r.t. W to update W; matching shapes, we need an a*c matrix built from X and output_error, which is obtained by transposing X and post-multiplying it by output_error. Further, we need the derivative w.r.t. X to propagate backwards (here, we just return it). A similar analysis gives ret_error as output_error post-multiplied by the transpose of W.
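To sanity-check this shape-matching trick, here is a small numerical gradient check. This is a standalone sketch, not part of assignment.py; the scalar loss is arbitrary and chosen only for illustration.

import numpy as np

b, a, c = 4, 3, 5
X = np.random.randn(b, a)
W = np.random.randn(a, c)

def loss(X, W):
    return np.sum((X @ W)**2)    # arbitrary scalar loss L

dY = 2*(X @ W)                   # dL/dY, shape b*c (plays the role of output_error)
dW_analytic = X.T @ dY           # shape a*c, matches W
dX_analytic = dY @ W.T           # shape b*a, matches X

# compare dW against finite differences
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(a):
    for j in range(c):
        W_pert = W.copy()
        W_pert[i, j] += eps
        dW_numeric[i, j] = (loss(X, W_pert) - loss(X, W))/eps

print(np.allclose(dW_analytic, dW_numeric, atol=1e-3))  # True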

Still easy, huh? (Source : One piece)

From now on, I will skip the #TODO skeletons and move directly to the complete code snippets.

ActivationLayer

class ActivationLayer:
    '''
    Implements an activation layer which applies an activation function on the inputs.
    '''
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.X = input.copy()
        return self.activation(input)

    def backward(self, output_error, learning_rate):
        return output_error*self.activation_prime(self.X)

This class depends on the choice of activation function, which is set in the constructor. forward and backward are pretty straightforward (pun intended). Let's now go over the forward and backward (derivative) functions for various activation functions.

Sigmoid

def sigmoid(x):
    x = np.clip(x, a_min=-100, a_max=100)  # clipping for numerical stability
    return 1/(1 + np.exp(-x))

def sigmoid_prime(x):
    x = sigmoid(x)*(1 - sigmoid(x))
    return x

Tanh

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    x = 1 - np.tanh(x)**2
    return x

ReLU

def relu(x):
    return 0.5*(np.abs(x) + x)

def relu_prime(x):
    return x > 0

Note that the "*" operator performs element-wise multiplication between NumPy arrays. The ActivationLayer class uses element-wise multiplication (also called the Hadamard product) in backward, not matrix multiplication.
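For instance, here is a hypothetical usage of ActivationLayer with the ReLU pair defined above (not from the script); backward simply masks the incoming error element-wise:

import numpy as np

layer = ActivationLayer(relu, relu_prime)
x = np.array([[-1.0,  2.0, -3.0],
              [ 4.0, -5.0,  6.0]])            # batch of 2 samples, 3 features
out = layer.forward(x)                        # negative entries become 0

output_error = np.ones_like(x)                # pretend upstream gradient
grad = layer.backward(output_error, learning_rate=0.1)
print(grad)                                   # 1 where x > 0, else 0
# [[0. 1. 0.]
#  [1. 0. 1.]]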

Softmax

class SoftmaxLayer:
    '''
    Implements a Softmax layer which applies the softmax function on the inputs.
    '''
    def __init__(self, input_size):
        self.input_size = input_size

    def forward(self, input):
        input = np.clip(input, a_min=-100, a_max=100)
        self.softm = np.divide(np.exp(input), np.sum(np.exp(input), axis=1)[:, None])
        return self.softm.copy()

    def backward(self, output_error, learning_rate=None):
        # learning_rate is unused; the default keeps the layer interface uniform with the training loop
        # diagflat for batched input (see references)
        tmp = self.softm.copy()
        N = tmp.shape[1]
        tmp = np.expand_dims(tmp, axis=1)
        tmp_diag = tmp*np.eye(N)  # dims = (batch, N, N)

        # 2nd term that needs to be subtracted from the first
        tmp = self.softm.copy()
        term2 = np.matmul(np.expand_dims(tmp, axis=2), np.expand_dims(tmp, axis=1))  # (b, N, 1)X(b, 1, N) = (b, N, N)

        grad = tmp_diag - term2  # (b, N, N)
        return np.matmul(grad, np.expand_dims(output_error, axis=2)).squeeze(axis=2)

Here the forward function implements the softmax operation along each row (remember that we have batched inputs).

backward requires a more careful explanation. The derivative of softmax is derived in reference 2: for s = softmax(x), we have ∂s_i/∂x_j = s_i(δ_ij − s_j), which in matrix form is diag(s) − s sᵀ.

Derivative of softmax (Source : link)

We now want to implement this for batched inputs without using any for loops. If we wanted to do it for a single datapoint (unbatched), the following would have worked:

def backward(self, output_error, learning_rate):
    tmp = self.softm.copy()
    grad = np.diagflat(tmp) - tmp.T @ tmp
    return output_error @ grad

grad above computes the following matrix (ignore the specific numbers):

Derivative of softmax (Source : link)

We need to extend the above idea to batched inputs. So we first implement diagflat (link) for batched inputs, and then the outer product for batched inputs. Then, using the shape thumb rule (with a squeeze at the end to keep the dimensions consistent), we get the matrix to be returned.
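A quick way to convince yourself that the vectorized backward matches the per-sample formula is to compare it against an explicit loop. This is a hypothetical check, not part of the script:

import numpy as np

np.random.seed(0)
b, N = 4, 10
logits = np.random.randn(b, N)
output_error = np.random.randn(b, N)

layer = SoftmaxLayer(N)
s = layer.forward(logits)
batched = layer.backward(output_error)       # vectorized over the batch

looped = np.zeros_like(batched)
for i in range(b):
    row = s[i:i+1]                           # shape (1, N)
    grad = np.diagflat(row) - row.T @ row    # (N, N) Jacobian for one sample
    looped[i] = output_error[i] @ grad

print(np.allclose(batched, looped))          # True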

Still easy?? (Source : link)

Cross-entropy loss

def cross_entropy(y_true, y_pred):
    # y_true is a 1-D array containing the gt classes
    # y_pred is of shape batch X num_classes, each row containing the probabilities of the respective classes
    preds = y_pred[range(len(y_pred)), y_true]  # fetch the predicted probability of the gt class from each row
    return -np.mean(np.log(preds))  # NLL followed by averaging

def cross_entropy_prime(y_true, y_pred):
    tmp = np.zeros_like(y_pred)
    rows = range(len(y_pred))
    # derivative of the averaged NLL: non-zero only at the gt entry of each row (see references)
    tmp[rows, y_true] = -1/(y_pred[rows, y_true]*len(y_pred))
    return tmp
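A small hypothetical example of how these two functions are called (class indices in y_true, rows of softmax probabilities in y_pred):

import numpy as np

y_true = np.array([2, 0])                    # ground-truth class indices for a batch of 2
y_pred = np.array([[0.1, 0.2, 0.7],          # softmax outputs, each row sums to 1
                   [0.8, 0.1, 0.1]])

print(cross_entropy(y_true, y_pred))         # -(log(0.7) + log(0.8))/2 ≈ 0.290
print(cross_entropy_prime(y_true, y_pred))
# non-zero only at the ground-truth entries (averaged over the batch of 2):
# [[ 0.      0.     -0.714]      -1/(0.7*2)
#  [-0.625   0.      0.   ]]     -1/(0.8*2)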

With this, we have completed the implementation of all the functions and classes. Two important things still remain: creating the architecture using these classes, and running forward and back propagation with the defined model. Both of these can be found in the fit function in the script.

Defining architecture

# defining the architecture of the model
network = [
    FlattenLayer(input_shape=(1, 28, 28)),  # flattening the input image
    FCLayer(28 * 28, 12),
    ActivationLayer(sigmoid, sigmoid_prime),
    FCLayer(12, 10),
    SoftmaxLayer(10)
]  # this creates a feed-forward network

Here the batch size is chosen to be 1 (it can be changed once the input datapoints are batched; note that the first element of input_shape acts as the batch dimension in FlattenLayer).

Forward propagation

# forward propagation
output = x  # x is the input data
for layer in network:
    output = layer.forward(output)
error += cross_entropy(y_true, output)  # not used for training (only for tracking the loss)

Back propagation

output_error = cross_entropy_prime(y_true, output)
for layer in reversed(network):
    output_error = layer.backward(output_error, learning_rate)

With this, we have gone through all parts of the script. Unfortunately, the script does not batch the input data yet, so our batched implementations will not work out of the box and might throw errors. After batching the inputs, the script can run and train a classification model on the "mnist" and "flower" datasets.
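As a rough illustration of what that batching could look like, here is a minimal sketch (not the actual fit function). The names X_train and y_train, the hyperparameters, and the assumption that the number of samples is a multiple of batch_size are all hypothetical; the first element of input_shape must match the batch size.

batch_size = 32
learning_rate = 0.1

# rebuild the network for the chosen batch size
network = [
    FlattenLayer(input_shape=(batch_size, 28, 28)),
    FCLayer(28 * 28, 12),
    ActivationLayer(sigmoid, sigmoid_prime),
    FCLayer(12, 10),
    SoftmaxLayer(10)
]

for epoch in range(10):
    error = 0
    for start in range(0, len(X_train), batch_size):
        x = X_train[start:start + batch_size]        # (batch_size, 28, 28)
        y_true = y_train[start:start + batch_size]   # (batch_size,)

        # forward pass
        output = x
        for layer in network:
            output = layer.forward(output)
        error += cross_entropy(y_true, output)

        # backward pass
        output_error = cross_entropy_prime(y_true, output)
        for layer in reversed(network):
            output_error = layer.backward(output_error, learning_rate)

    print(f"epoch {epoch}: mean loss {error/(len(X_train)//batch_size):.4f}")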

This isn't even a fraction of what libraries like PyTorch and TensorFlow are doing. We haven't considered operations that combine outputs from two different networks, nor have we built any computational graph for backpropagation. I hope you now have an idea of how powerful these ML libraries are, and I certainly hope you found this blog insightful.

References

  1. Batched diagflat in numpy: https://stackoverflow.com/questions/53741481/vectorized-creation-of-an-array-of-diagonal-square-arrays-from-a-liner-array-in
  2. Softmax derivative : https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
  3. Cross-Entropy and softmax derivative: https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
  4. Stanford slides for backpropagation and gradients : http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf
