Simple CNN using NumPy Part IV (Back Propagation Through Fully Connected Layers)
In the previous sections, we covered the forward pass through the network: the convolution, ReLU, max-pooling and fully connected layers.
In this fourth part of the series, we will cover back propagation through the fully connected layers of the network.
Just to recap, our network performs the following set of operations, in order from start to finish (a code sketch of this forward pass follows the list):
Let N be the total number of images
- It accepts an input matrix of size (N,1,28,28)
- This is then followed by the first convolutional filter of size (2,1,5,5)
- The convolution operation results in the matrix transforming to the size (N,2,24,24). This result is passed to a ReLU function.
- A max pooling operation with a (2x2) filter with a stride of 2. This results in a matrix of size (N,2,12,12)
- We then flatten this to an array of size (288,N)
- This is then followed by a matrix multiplication that converts the array to shape (60,N). This result is fed to the ReLU function.
- The final operation multiplies the result of the previous layer with another matrix, changing the shape to (10,N). This result is fed to a softmax function.
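The gradients in this part reuse the intermediate results of this forward pass. Below is a rough sketch with the shapes above; conv2d, maxpool2d, ReLU and softmax stand for the functions built in the earlier parts, so their exact signatures here are assumptions, while the variable names match the ones used in the gradient code later in this post.
X_conv = conv2d(X, conv_filter)                   # (N,1,28,28) -> (N,2,24,24)
X_relu = ReLU(X_conv)
X_maxpool = maxpool2d(X_relu, size=2, stride=2)   # (N,2,24,24) -> (N,2,12,12)
X_maxpool_flatten = X_maxpool.reshape(N, 288).T   # flatten to (288,N)
fc1 = ReLU(W1 @ X_maxpool_flatten + B0)           # W1: (60,288), B0: (60,1) -> (60,N)
final_fc = softmax(W2 @ fc1 + B1)                 # W2: (10,60), B1: (10,1) -> (10,N)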
The model parameters that we initialize will, at first, lead to incorrect estimates. The deviation between the estimates and the actual labels is quantified by a measure called cross entropy. Cross entropy is iteratively reduced through a process called gradient descent.
Intuition behind cross entropy
Let’s take a simple example with three classes. Let the one-hot encoded representation of the actual label be [1,0,0]. Let’s go through a few scenarios.
- Let the predicted output be [1,0,0]. The cross entropy error would then be (-1*log(1))+(-0*log(0))+(-0*log(0)), which is 0.
- Let the predicted output be [0,1,0]. The cross entropy error here would be (-1*log(0))+(-0*log(1))+(-0*log(0)). As log(0) is an infinitely large negative value, the error would be infinitely high.
- Let the predicted output be [0,0,1]. The cross entropy error would be (-1*log(0))+(-0*log(0))+(-0*log(1)). This would also result in an infinitely high error.
Cross Entropy penalizes classifiers that give wrong outputs with high confidence.
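As a quick standalone illustration (not part of the network code), the snippet below computes the cross entropy for the three scenarios above; a small epsilon is added inside the log to avoid evaluating log(0) directly.
import numpy as np
def cross_entropy(y_true, y_pred, eps=1e-12):
    # -sum(y_true * log(y_pred)); eps keeps log() finite when a predicted probability is 0
    return -np.sum(y_true * np.log(y_pred + eps))
y_true = np.array([1.0, 0.0, 0.0])
print(cross_entropy(y_true, np.array([1.0, 0.0, 0.0])))   # ~0
print(cross_entropy(y_true, np.array([0.0, 1.0, 0.0])))   # large (~27.6 with this eps), standing in for an infinite error
print(cross_entropy(y_true, np.array([0.0, 0.0, 1.0])))   # large as well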
So during the learning process, we need to decrease the cross entropy. This is done through a process called gradient descent: an iterative procedure that adjusts the parameters of the model so that its outputs become as close as possible to the actual labels.
We now apply this to the model parameters in the fully connected layers. Each parameter is repeatedly nudged in the direction opposite to its gradient, i.e. W_new = W_old - alpha*(dLoss/dW).
Here, the Greek letter alpha refers to the learning rate. It dictates how long it takes for the model to converge to the minimum.
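In code, a single gradient-descent step looks like the following toy sketch; the value of alpha and the random matrices are just assumptions for illustration, with dW standing in for the gradient of the loss with respect to W.
import numpy as np
alpha = 0.01                  # learning rate (an assumed value)
W = np.random.randn(10, 60)   # a parameter matrix, e.g. the same shape as W2
dW = np.random.randn(10, 60)  # stands in for the gradient of the loss w.r.t. W
W = W - alpha * dW            # move W a small step against the direction of its gradient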
Gradient Descent Through the Fully Connected Layers
Calculating derivative of cross entropy with respect to W2 and B1
For a softmax output combined with a cross-entropy loss, the derivative of the loss with respect to the pre-softmax values simplifies to (predicted output - actual label); see the video tutorial in the Resources section for the derivation. The code snippet below uses this to compute the gradients with respect to W2 and B1.
delta_2 = (final_fc - y_batch)                   # gradient of the loss w.r.t. the pre-softmax outputs, shape (10,N)
dW2 = delta_2 @ fc1.T                            # gradient w.r.t. W2, shape (10,60)
dB1 = np.sum(delta_2, axis=1, keepdims=True)     # gradient w.r.t. B1, summed over the batch, shape (10,1)
Calculating derivative of cross entropy with respect to W1 & B0
The code snippet below calculates the derivative of the loss function with respect to W1 and B0 by first propagating the error back through W2 and the first ReLU.
delta_1 = np.multiply(W2.T @ delta_2, dReLU(W1 @ X_maxpool_flatten + B0))   # error propagated back through W2 and the first ReLU, shape (60,N)
dW1 = delta_1 @ X_maxpool_flatten.T                                         # gradient w.r.t. W1, shape (60,288)
dB0 = np.sum(delta_1, axis=1, keepdims=True)                                # gradient w.r.t. B0, shape (60,1)
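With all four gradients available, the fully connected parameters can be updated using the gradient-descent rule from earlier. A minimal sketch is shown below (alpha is the learning rate; depending on how the loss is averaged, the gradients may also be divided by the batch size):
W2 = W2 - alpha * dW2   # shape (10,60)
B1 = B1 - alpha * dB1   # shape (10,1)
W1 = W1 - alpha * dW1   # shape (60,288)
B0 = B0 - alpha * dB0   # shape (60,1)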
The dReLU() function used above is the derivative of the ReLU function: it maps every positive input to 1 and everything else to 0. The function is written as follows
def dReLU(x):
    return (x > 0) * 1.0
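For example:
dReLU(np.array([-2.0, 0.0, 3.0]))   # returns array([0., 0., 1.])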
Calculating the derivative with respect to X0
We also need the derivative of the loss function with respect to X0, the flattened output of the max-pooling layer (X_maxpool_flatten). This will be used in the next part to calculate the derivative with respect to the convolutional filter.
The following code snippet finds the derivative of the loss function with respect to X0
delta_0 = np.multiply(W1.T @ delta_1, 1.0)   # gradient of the loss w.r.t. X_maxpool_flatten, shape (288,N)
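delta_0 has the same shape as X_maxpool_flatten, i.e. (288,N). Before it can be pushed back through the max-pooling and convolution layers, it has to be reshaped to the pooled feature-map shape. A sketch of that step, assuming the flattening was done with a reshape as in the forward-pass sketch above (the details are covered in the next part):
delta_0_maxpool = delta_0.T.reshape(N, 2, 12, 12)   # undo the flattening, back to (N,2,12,12)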
Given that the gradient of the loss with respect to each model parameter is calculated from the end of the network back towards the front, this process is also called “back propagation”.
In the next section, we will go through calculating the derivative of the loss function with respect to the convolutional filter.
Resources
- Video Tutorial on finding error of layer with cross entropy and softmax
- Michael Nielsen’s blog on Neural Networks
Feedback
Thank you for reading! If you have any feedback or suggestions, please feel free to comment below or email me at padhokshaja@gmail.com