Only Numpy: Understanding Back Propagation for Max Pooling Layer in Multi Layer CNN with Example and Interactive Code. (With and Without Activation Layer)

Jae Duk Seo
Jan 29, 2018

Today I wanted to understand the math behind back propagation through a Max Pooling layer, so I implemented a simple CNN to fully understand the concept.

Example 1 without Activation Function :
Network Architecture / Forward Feed Operation

XX → Original Image Dimension of (6*6)
Green W → First Weight Kernel Dimension of (3*3)
Green WW → Second weight Vector Dimension of (4*1)

L1 → Layer 1 Image Dimension of (4*4), since we are not doing zero padding for XX
L1Max → Max Pooling Layer Applied to L1 Dimension of (2*2)
L1Max Reshape → Vectorized L1Max

Blue Circle Coordinates → Where the largest values are located. Please take note of the locations of these coordinates, since they become very important while performing Back Propagation!

L2 → Layer 2 Dimension of (1*1)
Cost Function → L2 Norm Function
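
The forward feed steps listed above can be sketched in plain NumPy. This is a minimal sketch, not the original code from the post: the random input, the target `y`, the helper names `conv2d_valid` and `max_pool_2x2`, the use of cross-correlation (no kernel flip) for the convolution step, and the 0.5 * squared error form of the L2 cost are all assumptions made for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid (no zero padding) 2-D cross-correlation with plain loops."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; height and width must be even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
XX = rng.standard_normal((6, 6))   # input image (assumed random data)
W = rng.standard_normal((3, 3))    # first weight kernel
WW = rng.standard_normal((4, 1))   # second weight vector
y = np.array([[1.0]])              # hypothetical target value

L1 = conv2d_valid(XX, W)             # (4, 4), no zero padding
L1Max = max_pool_2x2(L1)             # (2, 2)
L1MaxReshape = L1Max.reshape(1, 4)   # vectorized L1Max
L2 = L1MaxReshape @ WW               # (1, 1)
cost = 0.5 * np.sum((L2 - y) ** 2)   # assumed form of the L2 cost

print(L1.shape, L1Max.shape, L1MaxReshape.shape, L2.shape)
```

The reshape trick in `max_pool_2x2` splits each axis into (window index, position inside window), so taking the max over the two inner axes gives one value per (2*2) window.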

Example 1 without Activation Function : Back Propagation

Okay, there is a lot going on here, so let me explain it piece by piece.

Green Star → Back Propagation with Respect to WW
Pink Star → Back Propagation with Respect to W

Purple Star → Convolution Operation with the Kernel Rotated by 180 degrees (in other words, flipped both vertically and horizontally, which is not the same as a transpose). We need to do this for a proper update of the gradient.
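
To make the 180 degree rotation concrete, here is a small NumPy sketch. The kernel values are made up for illustration; `np.rot90(W, 2)` rotates by 90 degrees twice, which is the same as reversing both axes, and it generally differs from `W.T`.

```python
import numpy as np

# Hypothetical 3x3 kernel with distinct values so the rotation is easy to see.
W = np.arange(9).reshape(3, 3)

rotated = np.rot90(W, 2)   # rotate by 180 degrees
flipped = W[::-1, ::-1]    # reverse rows and columns

print(rotated)
print(W.T)
```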

Red Box → Mathematical Form of finding the Coordinates of the highest signal in L1
Blue Box → Matrix Form of the actual Coordinates of where the highest values were in variable L1. (This is why I asked you to take note of the Blue Circled Numbers in the Forward Feed Process.)
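
One way to build such a coordinate matrix in NumPy is a 0/1 mask that marks where each (2*2) window had its maximum. This is a sketch under assumptions: the helper name `max_pool_mask_2x2` and the example values of `L1` are hypothetical, and if a window contains tied maxima this version marks all of them.

```python
import numpy as np

def max_pool_mask_2x2(x):
    """Return a mask with 1 where x holds the max of its 2x2 window, else 0."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2)
    window_max = windows.max(axis=(1, 3), keepdims=True)
    mask = (windows == window_max).astype(float)
    return mask.reshape(h, w)

# Hypothetical L1 values, chosen so every window has a unique maximum.
L1 = np.array([[1.0, 3.0, 2.0, 0.0],
               [4.0, 2.0, 1.0, 5.0],
               [0.0, 1.0, 7.0, 2.0],
               [6.0, 2.0, 3.0, 4.0]])

mask = max_pool_mask_2x2(L1)
print(mask)
```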

However, one question arises. Please look at the Orange Star: how can we perform element wise multiplication between matrices that have dimensions of (2*2) and (4*4)? I will answer that in a moment. For now, let's take a look at the actual implementation of the code.

Example 1: Training Data and Forward Feed Operation Implementation

As seen above, the input image has dimension of (6*6), the first weight (Kernel) (W1) has dimension of (3*3), and the weight (WW) (W2) has dimension of (4*1).

We can also see how the shape of the input changes at each step, and the Red Boxed line is calculating our cost.

Example 1: Back Propagation Implementation

Blue Box → Coordinates of where Max Values were in the position of Original L1

Red Box → Reshaped Calculated Gradient from Layer Before (2*2)

Green Box → The Secret of performing element wise multiplication: we expand the (2*2) gradient to (4*4) so that it fits the dimension of the Blue Box.
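
The expansion step can be sketched with `np.repeat`, which copies each gradient entry across its (2*2) window. The gradient and mask values below are made up for illustration and are not taken from the original code.

```python
import numpy as np

# Hypothetical (2*2) gradient arriving from the layer above.
grad_pooled = np.array([[0.1, 0.2],
                        [0.3, 0.4]])

# Hypothetical (4*4) mask of where the max values were in L1.
mask = np.array([[0.0, 0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0, 0.0]])

# Repeat every entry twice along rows and twice along columns: (2*2) -> (4*4).
grad_expanded = grad_pooled.repeat(2, axis=0).repeat(2, axis=1)

# Only the positions that held the max receive gradient.
grad_L1 = grad_expanded * mask
print(grad_L1)
```

`np.kron(grad_pooled, np.ones((2, 2)))` would give the same expanded matrix; `repeat` is used here only because it reads closer to the description.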

Example 1: Interactive Code and Training Results

As seen above, after training the network is predicting well. To access the code, please click on this link.

Now let's take a look at the case where we include an activation function. I’ll give you a heads up: there is almost nothing different from what we just did. The only difference is one additional element wise multiplication.

Example 1 with Activation Function :
Network Architecture / Forward Feed Operation

As seen above, the network architecture and its forward feed operation are very clear.

Orange W1 → First Weight Kernel of (3*3) Dimension
Orange W2 → Second Weight of (4*1) Dimension

The Red Boxed Region is where we perform the forward feed operation, and next to the Red Box, all of the equations written in Blue Marker are the back propagation process.

Please take note of the Blue Boxed region, since that is the part where we reshape the back propagated vector (1*4) into a (2*2) Matrix. We then feed it to the Back Propagation Process of the Max Pooling layer to create a (4*4) Matrix, which we multiply element wise with dReLU(L1).
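
Putting the reshape, the max pooling back propagation, and the dReLU(L1) multiplication together, one full forward and backward pass might look like the sketch below. It is not the original code from the post: the function names, the random data, the target `y`, the cross-correlation form of the convolution, and the 0.5 * squared error cost are assumptions, and the reshapes are hard coded for a (4*4) L1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid (no zero padding) 2-D cross-correlation with plain loops."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def forward_backward(XX, W1, W2, y):
    # Forward feed
    L1 = conv2d_valid(XX, W1)                    # (4, 4)
    L1A = np.maximum(L1, 0.0)                    # ReLU(L1)
    windows = L1A.reshape(2, 2, 2, 2)            # split into 2x2 windows
    window_max = windows.max(axis=(1, 3), keepdims=True)
    mask = (windows == window_max).astype(float).reshape(4, 4)
    L1Max = window_max.reshape(2, 2)             # (2, 2)
    L1MaxReshape = L1Max.reshape(1, 4)           # (1, 4)
    L2 = L1MaxReshape @ W2                       # (1, 1)
    cost = 0.5 * np.sum((L2 - y) ** 2)           # assumed cost form

    # Back propagation
    grad_L2 = L2 - y                             # (1, 1)
    grad_W2 = L1MaxReshape.T @ grad_L2           # (4, 1)
    grad_vec = grad_L2 @ W2.T                    # (1, 4)
    grad_pooled = grad_vec.reshape(2, 2)         # (1*4) -> (2*2)
    grad_expanded = grad_pooled.repeat(2, axis=0).repeat(2, axis=1)  # (4, 4)
    d_relu = (L1 > 0).astype(float)              # dReLU(L1)
    grad_L1 = grad_expanded * mask * d_relu      # (4, 4)
    grad_W1 = conv2d_valid(XX, grad_L1)          # (3, 3)
    return cost, grad_W1, grad_W2

rng = np.random.default_rng(1)
XX = rng.standard_normal((6, 6))
W1 = rng.standard_normal((3, 3))
W2 = rng.standard_normal((4, 1))
y = np.array([[1.0]])                            # hypothetical target

cost, grad_W1, grad_W2 = forward_backward(XX, W1, W2, y)
print(cost, grad_W1.shape, grad_W2.shape)
```

Because the forward pass here uses cross-correlation, the gradient with respect to W1 is itself a valid cross-correlation of XX with the (4*4) gradient, so no kernel rotation appears in this particular sketch. A finite difference check on a single weight is a cheap way to confirm the shapes and signs line up.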

Also, take note of the Green Region Box as well, since I will give a detailed explanation of that region.

More “Detailed” Process of Forward Feed and Back Prop with Max Pooling

The Red Arrow indicates the Forward Feed Process, and the Blue Arrow indicates the Back Propagation Process.

Let's say our L1 Matrix was the (4*4) matrix shown above. After the ReLU() layer, all of the values smaller than zero will turn to zero.
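
As a small illustration of that step, here is ReLU and its derivative applied to a made-up (4*4) matrix. The values are hypothetical and are not the ones from the figure, and the derivative at exactly zero is taken to be 0 here by assumption.

```python
import numpy as np

# Hypothetical L1 values with a mix of signs.
L1 = np.array([[-1.0,  2.0,  0.5, -3.0],
               [ 4.0, -2.0,  1.0,  0.0],
               [-0.5,  3.0, -1.0,  2.0],
               [ 1.0, -4.0,  6.0, -2.0]])

L1A = np.maximum(L1, 0.0)          # ReLU: negative values become zero
d_relu = (L1 > 0).astype(float)    # dReLU: 1 where L1 > 0, else 0

print(L1A)
print(d_relu)
```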

The Blue Star Region is where we are applying the Max Pooling Layer with (2*2) Window. And as seen in the image above, we use the same coordinate information while performing back propagation as well.

So remember when I told you guys to take note of the green box? The above green box region shows an example of how element wise multiplication works while performing back propagation.

Example 2: Interactive Code and Training Results

As seen above, after training the results are closer to ground truth values. To access the interactive code please click on this link.

Huge Warning!!!!!

Though the Max Pooling Layer is great, we are now realizing that it might not be the best. Even Dr. Hinton thinks Max Pooling might not be the best idea. Please see these links to find out more.

Click here for Blog Post
Click here for Dr. Hinton’s lecture “Whats wrong with CNN’s”

Final Words

I like Max Pooling layer, but I think I am going to use Mean Pooling from now on, it seems more interesting.

If any errors are found, please email me at

Meanwhile, follow me on my twitter here, and visit my website or my Youtube channel for more content. I also did a comparison of Decoupled Neural Network here if you are interested.


The Bioinformatics Press