Simple CNN using NumPy Part IV (Back Propagation Through Fully Connected Layers)
In the previous sections, we covered the forward pass through the network: the convolution, ReLU, max-pooling and fully connected layers.
In this fourth part of the series, we will cover back propagation through the fully connected layers of the network.
Just to recap, our network performs the following set of operations, in order from start to finish (a code sketch of this forward pass follows the list):
Let N be the total number of images
- It accepts an input matrix of size (N,1,28,28)
- This is then followed by the first convolutional filter of size (2,1,5,5)
- The convolution operation results in the matrix transforming to the size (N,2,24,24). This result is passed to a ReLU function.
- A max pooling operation with a (2x2) filter with a stride of 2. This results in a matrix of size (N,2,12,12)
- We then flatten this to an array of size (288,N)
- This is then followed by a matrix multiplication that converts the array to shape (60,N). This result is fed to the ReLU function.
- The final operation multiplies the result of the previous layer with another matrix, changing the shape to (10,N). This result is fed to a softmax function.
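The gradients in this part reuse the intermediate results of this forward pass. Below is a rough sketch with the shapes above; conv2d, maxpool2d, ReLU and softmax stand for the functions built in the earlier parts, so their exact signatures here are assumptions, while the variable names match the ones used in the gradient code later in this post.
X_conv = conv2d(X, conv_filter)                   # (N,1,28,28) -> (N,2,24,24)
X_relu = ReLU(X_conv)
X_maxpool = maxpool2d(X_relu, size=2, stride=2)   # (N,2,24,24) -> (N,2,12,12)
X_maxpool_flatten = X_maxpool.reshape(N, 288).T   # flatten to (288,N)
fc1 = ReLU(W1 @ X_maxpool_flatten + B0)           # W1: (60,288), B0: (60,1) -> (60,N)
final_fc = softmax(W2 @ fc1 + B1)                 # W2: (10,60), B1: (10,1) -> (10,N)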
The model parameters that we initialize will, at first, lead to incorrect estimates. The deviation between the estimates and the actual labels is quantified by a measure called cross entropy. Cross entropy is iteratively reduced through a process called gradient descent.
Intuition behind cross entropy
Let’s take a simple example with three classes. Let the one-hot encoded representation of the actual label be [1,0,0]. Let’s go through a few scenarios.
- Let the predicted output be [1,0,0]. The cross entropy error would then be (-1*log(1))+(-0*log(0))+(-0*log(0)), which is 0.
- Let the predicted output be [0,1,0]. The cross entropy error here would be (-1*log(0))+(-0*log(1))+(-0*log(0)). As log(0) is an infinitely large negative value, the error would be infinitely high.
- Let the predicted output be [0,0,1]. The cross entropy error would be (-1*log(0))+(-0*log(0))+(-0*log(1)). This would also result in an infinitely high error.
Cross Entropy penalizes classifiers that give wrong outputs with high confidence.
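As a quick standalone illustration (not part of the network code), the snippet below computes the cross entropy for the three scenarios above; a small epsilon is added inside the log to avoid evaluating log(0) directly.
import numpy as np
def cross_entropy(y_true, y_pred, eps=1e-12):
    # -sum(y_true * log(y_pred)); eps keeps log() finite when a predicted probability is 0
    return -np.sum(y_true * np.log(y_pred + eps))
y_true = np.array([1.0, 0.0, 0.0])
print(cross_entropy(y_true, np.array([1.0, 0.0, 0.0])))   # ~0
print(cross_entropy(y_true, np.array([0.0, 1.0, 0.0])))   # large (~27.6 with this eps), standing in for an infinite error
print(cross_entropy(y_true, np.array([0.0, 0.0, 1.0])))   # large as well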
So during the learning process, we need to decrease the cross entropy. This is done through a process called gradient descent: an iterative procedure that adjusts the parameters of the model so that its outputs become as close as possible to the actual labels.
We now apply this to the model parameters in the fully connected layers. Each parameter is repeatedly nudged in the direction opposite to its gradient, i.e. W_new = W_old - alpha*(dLoss/dW).
Here, the Greek letter alpha refers to the learning rate. It dictates how long it takes for the model to converge to the minimum.
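In code, a single gradient-descent step looks like the following toy sketch; the value of alpha and the random matrices are just assumptions for illustration, with dW standing in for the gradient of the loss with respect to W.
import numpy as np
alpha = 0.01                  # learning rate (an assumed value)
W = np.random.randn(10, 60)   # a parameter matrix, e.g. the same shape as W2
dW = np.random.randn(10, 60)  # stands in for the gradient of the loss w.r.t. W
W = W - alpha * dW            # move W a small step against the direction of its gradient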
Gradient Descent Through the Fully Connected Layers
Calculating derivative of cross entropy with respect to W2 and B1
For a softmax output combined with a cross-entropy loss, the derivative of the loss with respect to the pre-softmax values simplifies to (predicted output - actual label); see the video tutorial in the Resources section for the derivation. The code snippet below uses this to compute the gradients with respect to W2 and B1.
delta_2 = (final_fc - y_batch)                   # gradient of the loss w.r.t. the pre-softmax outputs, shape (10,N)
dW2 = delta_2 @ fc1.T                            # gradient w.r.t. W2, shape (10,60)
dB1 = np.sum(delta_2, axis=1, keepdims=True)     # gradient w.r.t. B1, summed over the batch, shape (10,1)
Calculating derivative of cross entropy with respect to W1 & B0
The code snippet below calculates the derivative of the loss function with respect to W1 and B0 by first propagating the error back through W2 and the first ReLU.
delta_1 = np.multiply(W2.T @ delta_2, dReLU(W1 @ X_maxpool_flatten + B0))   # error propagated back through W2 and the first ReLU, shape (60,N)
dW1 = delta_1 @ X_maxpool_flatten.T                                         # gradient w.r.t. W1, shape (60,288)
dB0 = np.sum(delta_1, axis=1, keepdims=True)                                # gradient w.r.t. B0, shape (60,1)
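With all four gradients available, the fully connected parameters can be updated using the gradient-descent rule from earlier. A minimal sketch is shown below (alpha is the learning rate; depending on how the loss is averaged, the gradients may also be divided by the batch size):
W2 = W2 - alpha * dW2   # shape (10,60)
B1 = B1 - alpha * dB1   # shape (10,1)
W1 = W1 - alpha * dW1   # shape (60,288)
B0 = B0 - alpha * dB0   # shape (60,1)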
The dReLU() function used above is the derivative of the ReLU function: it maps every positive input to 1 and everything else to 0. The function is written as follows
def dReLU(x):
    return (x > 0) * 1.0
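For example:
dReLU(np.array([-2.0, 0.0, 3.0]))   # returns array([0., 0., 1.])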
Calculating the derivative with respect to X0
We also need the derivative of the loss function with respect to X0, the flattened output of the max-pooling layer (X_maxpool_flatten). This will be used in the next part to calculate the derivative with respect to the convolutional filter.
The following code snippet finds the derivative of the loss function with respect to X0
delta_0 = np.multiply(W1.T @ delta_1, 1.0)   # gradient of the loss w.r.t. X_maxpool_flatten, shape (288,N)
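delta_0 has the same shape as X_maxpool_flatten, i.e. (288,N). Before it can be pushed back through the max-pooling and convolution layers, it has to be reshaped to the pooled feature-map shape. A sketch of that step, assuming the flattening was done with a reshape as in the forward-pass sketch above (the details are covered in the next part):
delta_0_maxpool = delta_0.T.reshape(N, 2, 12, 12)   # undo the flattening, back to (N,2,12,12)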
Given that the gradient of the loss with respect to each model parameter is calculated from the end of the network back towards the front, this process is also called “back propagation”.
In the next section, we will go through calculating the derivative of the loss function with respect to the convolutional filter.
Resources
- Video Tutorial on finding error of layer with cross entropy and softmax
- Michael Nielsen’s blog on Neural Networks
Feedback
Thank you for reading! If you have any feedback or suggestions, please feel free to comment below or email me at padhokshaja@gmail.com