Only Numpy: Understanding Back Propagation for Max Pooling Layer in Multi Layer CNN with Example and Interactive Code. (With and Without Activation Layer)
Today, I wanted to understand the math behind back propagation through a Max Pooling layer, so I implemented a simple CNN to work through the concept.
Example 1 without Activation Function:
Network Architecture / Forward Feed Operation
XX → Original Image, Dimension of (6*6)
Green W → First Weight (Kernel), Dimension of (3*3)
Green WW → Second Weight Vector, Dimension of (4*1)
L1 → Layer 1 Image, Dimension of (4*4), since we are not zero padding XX
L1Max → Max Pooling Applied to L1, Dimension of (2*2)
L1Max Reshape → Vectorized L1Max, Dimension of (1*4)
Blue Circle Coordinates → Where the largest values are located. Please take note of these coordinates; they become very important when performing Back Propagation!!
L2 → Layer 2 Dimension of (1*1)
Cost Function → L2 Norm Function
Example 1 without Activation Function: Back Propagation
Okay, so there is a lot going on here; let me explain it piece by piece.
Green Star → Back Propagation with Respect to WW
Pink Star → Back Propagation with Respect to W
Purple Star → Convolution Operation with the Kernel Rotated by 180 Degrees (this is what makes it a transposed convolution); we need this to propagate the gradient correctly.
Red Box → Mathematical form of finding the coordinates of the highest values in L1
Blue Box → Matrix form of the actual coordinates where the highest values were in L1. (This is why I asked you to take note of the Blue Circled Numbers in the Forward Feed Process.)
However, one question arises. Look at the Orange Star: how can we perform element-wise multiplication between matrices with dimensions of (2*2) and (4*4)? I will answer that in a moment; for now, let's take a look at the actual implementation of the code.
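To make the Purple Star step concrete, here is a small sketch, assuming nothing beyond NumPy (the variable names and the toy gradient values are my own): rotating a kernel by 180 degrees is just flipping it along both axes, and sliding the rotated kernel over the zero-padded gradient produces a gradient with the input's (6*6) dimension.

```python
import numpy as np

np.random.seed(1)
k = np.arange(1, 10, dtype=float).reshape(3, 3)   # toy (3*3) kernel

# Rotating a kernel by 180 degrees = flipping it along both axes
k_rot = np.rot90(k, 2)
assert np.array_equal(k_rot, k[::-1, ::-1])

g = np.random.randn(4, 4)   # gradient flowing back into the conv layer, (4*4)

# "Full" convolution: zero pad the gradient by (kernel size - 1) on each
# side, then slide the rotated kernel over it -> gradient w.r.t. the (6*6) input
pad = np.pad(g, 2)
dx = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        dx[i, j] = np.sum(pad[i:i+3, j:j+3] * k_rot)
print(dx.shape)   # (6, 6)
```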
Example 1: Training Data and Forward Feed Operation Implementation
As seen above, the input image has a dimension of (6*6), the first weight (kernel) (W1) has a dimension of (3*3), and the second weight (WW) (W2) has a dimension of (4*1).
We can also see how the shape of the input changes as it moves through the network, and the Red Boxed line calculates our cost.
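Since the code itself is only shown as a screenshot, here is a minimal NumPy sketch of the same forward pass. The variable names (x, w1, w2) and the target y are my own assumptions, and the convolution is written as the sliding-window sum (cross-correlation) that deep learning code typically uses.

```python
import numpy as np

np.random.seed(0)

# Hypothetical stand-ins for the article's variables (names are assumptions)
x = np.random.randn(6, 6)     # XX: input image, (6*6)
w1 = np.random.randn(3, 3)    # W: first weight kernel, (3*3)
w2 = np.random.randn(4, 1)    # WW: second weight vector, (4*1)
y = np.array([[1.0]])         # made-up ground-truth value

# L1: convolution without zero padding -> (4*4)
l1 = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        l1[i, j] = np.sum(x[i:i+3, j:j+3] * w1)

# L1Max: (2*2) max pooling -> (2*2)
l1_max = l1.reshape(2, 2, 2, 2).max(axis=(1, 3))

# L1Max Reshape: vectorize to (1*4), then L2 = L1Max_vec . WW -> (1*1)
l1_vec = l1_max.reshape(1, 4)
l2 = l1_vec.dot(w2)

# Cost: L2 norm of the difference to the target
cost = 0.5 * np.sum(np.square(l2 - y))
print(l1.shape, l1_max.shape, l2.shape)   # (4, 4) (2, 2) (1, 1)
```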
Example 1: Back Propagation Implementation
Blue Box → Coordinates of where the max values were located in the original L1
Red Box → Gradient from the layer before, reshaped to (2*2)
Green Box → The secret to performing the element-wise multiplication: we expand the gradient to fit the dimension of the Blue Box.
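A minimal NumPy sketch of these three boxes (the L1 values and the gradient numbers are made up for illustration; each (2*2) window has a unique max so the mask is unambiguous):

```python
import numpy as np

# Made-up L1 with a unique max in each (2*2) window
l1 = np.array([[1., 3., 2., 1.],
               [4., 2., 1., 5.],
               [0., 1., 2., 2.],
               [7., 0., 4., 3.]])

# Blue Box: mask marking where the max values sit in the original L1
pooled = l1.reshape(2, 2, 2, 2).max(axis=(1, 3))
mask = (l1 == np.kron(pooled, np.ones((2, 2)))).astype(float)

# Red Box: gradient from the layer before, reshaped to (2*2)
grad = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

# Green Box: expand the (2*2) gradient to (4*4) so the element-wise
# multiplication with the (4*4) mask is well defined
grad_big = np.kron(grad, np.ones((2, 2)))
d_l1 = grad_big * mask   # gradient flows only to the max locations
print(d_l1)
```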
Example 1: Interactive Code and Training Results
As seen above, after training, the network predicts well. To access the code, please click on this link.
Now let's take a look at the case where we include an activation function. I'll give you a heads up: there is literally nothing different from what we just did; the only difference is one additional element-wise multiplication.
Example 2 with Activation Function:
Network Architecture / Forward Feed Operation
As seen above, the network architecture and its forward feed operation are straightforward.
Orange W1 → First Weight (Kernel), Dimension of (3*3)
Orange W2 → Second Weight, Dimension of (4*1)
The Red Boxed region is where we perform the forward feed operation, and next to the Red Box, all of the equations written in Blue Marker are the back propagation process.
Please take note of the Blue Boxed region, since that is where we reshape the back propagated (1*4) vector into a (2*2) matrix. We then feed it to the back propagation process of the Max Pooling layer to create a (4*4) matrix, which we element-wise multiply with dReLU(L1).
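That extra element-wise multiplication with dReLU(L1) can be sketched like this (all numbers are made up for illustration):

```python
import numpy as np

# Toy (4*4) pre-activation L1 (values are made up)
l1 = np.array([[ 1., -2.,  3., -4.],
               [-1.,  2., -3.,  4.],
               [ 5., -6.,  7., -8.],
               [-5.,  6., -7.,  8.]])

# dReLU: 1 where the pre-activation was positive, 0 elsewhere
d_relu = (l1 > 0).astype(float)

# (4*4) gradient coming out of the max pooling back propagation (made up)
grad = np.full((4, 4), 0.5)

# The only step that Example 1 did not have: one more element-wise multiply
d_l1 = grad * d_relu
print(d_l1)
```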
Also take note of the Green Boxed region, since I will explain it in detail below.
More “Detailed” Process of Forward Feed and Back Prop with Max Pooling
The Red Arrow indicates the Forward Feed Process, and the Blue Arrow indicates the Back Propagation Process.
Let's say our L1 matrix was the (4*4) matrix shown above. After the ReLU() layer, all values smaller than zero become zero.
The Blue Star region is where we apply the Max Pooling layer with a (2*2) window. As seen in the image above, we use the same coordinate information while performing back propagation as well.
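Putting the Red Arrow and the Blue Arrow together, here is a small sketch (with made-up values) of max pooling that records the winning coordinates on the forward pass and reuses them on the backward pass:

```python
import numpy as np

l1 = np.array([[1., 0., 2., 3.],
               [4., 6., 6., 8.],
               [3., 1., 1., 0.],
               [1., 2., 2., 4.]])

# Forward (Red Arrow): (2*2) max pool, remembering where each max came from
pooled = np.zeros((2, 2))
coords = {}
for i in range(2):
    for j in range(2):
        window = l1[2*i:2*i+2, 2*j:2*j+2]
        r, c = np.unravel_index(np.argmax(window), window.shape)
        pooled[i, j] = window[r, c]
        coords[(i, j)] = (2*i + r, 2*j + c)   # coordinates in L1

# Backward (Blue Arrow): route each pooled gradient back to its coordinate
grad = np.array([[1., 2.],
                 [3., 4.]])                    # gradient w.r.t. the pooled output
d_l1 = np.zeros_like(l1)
for (i, j), (r, c) in coords.items():
    d_l1[r, c] = grad[i, j]
print(d_l1)
```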
So remember when I told you to take note of the green box? The green box region above shows an example of how the element-wise multiplication works during back propagation.
Example 2: Interactive Code and Training Results
As seen above, after training, the results are closer to the ground truth values. To access the interactive code, please click on this link.
I like the Max Pooling layer, but I think I am going to use Mean Pooling from now on; it seems more interesting.
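For comparison, the backward pass of a Mean Pooling layer needs no coordinate bookkeeping at all: every entry of a (2*2) window contributed equally to the mean, so each receives a quarter of the pooled gradient (a sketch with made-up numbers):

```python
import numpy as np

grad = np.array([[1., 2.],
                 [3., 4.]])    # gradient w.r.t. the (2*2) pooled output

# Each of the four values in a (2*2) window contributed equally to the
# mean, so each receives grad / 4 on the way back
d_l1 = np.kron(grad, np.ones((2, 2))) / 4.0
print(d_l1)
```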
If any errors are found, please email me at firstname.lastname@example.org.