Do we really need the Pooling layer in our CNN architecture?
Introduction
A typical CNN architecture consists of Convolution layers, Activation layers, Pooling layers, and a Fully Connected layer. In this article, we’ll look at what the Pooling layer is and what it’s used for, and analyze some alternative methods.
What is Pooling?
Pooling is the process of downsampling: it reduces the size of the feature map obtained after passing the image through the Convolution layer. In the Pooling layer, a window slides over the output of the previous layer and summarizes each group of values into a single number (for example, their maximum or average).
There are different types of Pooling strategies available, e.g., Max, Average, Global, and Attention pooling. Most of these are available in Keras out of the box, but for some of the more unusual ones, visit this.
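For reference, here’s a minimal sketch of a convolution-plus-max-pooling block, assuming TensorFlow’s Keras API; the filter counts and input shape are illustrative, not a prescribed architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),  # (28, 28, 32)
    # The 2x2 max pool keeps the largest value in each window, halving the
    # spatial dimensions while leaving the number of channels unchanged.
    layers.MaxPooling2D(pool_size=(2, 2)),                         # (14, 14, 32)
    # AveragePooling2D and GlobalAveragePooling2D are drop-in alternatives
    # that summarize the feature map differently.
])
model.summary()
```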
Useful or not!
The main selling points of the Pooling layer are:
- Drastic reduction in the number of parameters [see figure 1] due to the dimensionality reduction, which also translates to lower epoch time (see the sketch after this list).
- It helps avoid overfitting by keeping only the dominant features, which, as a side effect, stops noise from driving the weight updates.
- The above point also gives the model a degree of translation invariance.
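To make the parameter-reduction point concrete, here’s a hedged sketch (TensorFlow’s Keras, a toy Fashion-MNIST-sized model of my own, not the architectures from figure 1) comparing parameter counts with and without a pooling layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def tiny_cnn(use_pooling: bool) -> tf.keras.Model:
    """Toy model: the only difference is whether pooling shrinks the
    feature map before it is flattened into the dense layer."""
    net = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),  # (28, 28, 32)
    ])
    if use_pooling:
        net.add(layers.MaxPooling2D(2))                           # (14, 14, 32)
    net.add(layers.Flatten())
    net.add(layers.Dense(128, activation="relu"))  # most of the parameters live here
    net.add(layers.Dense(10, activation="softmax"))
    return net

print("with pooling:   ", tiny_cnn(True).count_params())
print("without pooling:", tiny_cnn(False).count_params())
# Without pooling, the flattened vector is 4x longer, so the first
# dense layer carries roughly 4x the weights.
```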
But….
- Because it keeps only the dominant features, Pooling tends to fail on datasets [e.g., cancer cell detection] where the key features are minute and irregular.
- While a single epoch takes less time, the number of epochs [and the amount of hyper-parameter tuning] needed to reach convergence is higher than for the other methods, at least in my experiments.
- Translation invariance cuts both ways: if the components of an image are shuffled around, Pooling can still produce the same output even though the image no longer makes any sense.
Okay, so what can be done to overcome these disadvantages?
Let’s discuss some alternatives in the next section. I’ve used the Fashion-MNIST dataset for experimentation. Hyperparameters remain almost the same across all the alternative methods; only the architecture changes.
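As a side note, Fashion-MNIST ships with Keras, so loading it for these experiments is straightforward; a minimal sketch:

```python
import tensorflow as tf

# Fashion-MNIST: 60,000 training and 10,000 test images, 28x28 grayscale, 10 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale pixels to [0, 1] and add the channel dimension expected by Conv2D.
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0
```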
Alternative Methods
We implement three different strategies to gauge how they stand up against pooling layers.
1. No Pooling Layer: Simply eliminating the pooling layers from the architecture.
2. Convolution with strides: Replacing the pooling layer with a convolution layer that has a stride of 2, as shown in the sketch after this list. This approach is inspired by the Striving for Simplicity research paper.
3. Capsule Network: It’s a novel architecture that tries to capture an object’s information using vectors.
There are some great articles that discuss CapsNet in depth, like this one. Also, please do read the original research paper by Geoffrey Hinton.
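To illustrate option 2, here’s a minimal sketch of swapping the pooling layer for a strided convolution (again TensorFlow’s Keras; the filter counts are illustrative, not the exact experimental architecture).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pooling variant: downsampling is done by a parameter-free max pool.
pooled = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),             # (28, 28, 32)
    layers.MaxPooling2D(2),                                              # (14, 14, 32)
])

# Strided variant: the same downsampling is done by a learnable
# convolution with stride 2, in the spirit of Striving for Simplicity.
strided = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),             # (28, 28, 32)
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),  # (14, 14, 32)
])
```

Both blocks map a 28x28 input to a 14x14 feature map; the difference is that the strided convolution learns its downsampling, while max pooling does not.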
Analysis
We’ll analyze different aspects of the models: architecture, training and validation loss and accuracy, F1 score, number of parameters, and the time taken to reach convergence.
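As a quick aside, precision, recall, and F1 can be computed in one call with scikit-learn’s classification_report; a tiny sketch with placeholder labels (not the actual experimental predictions):

```python
import numpy as np
from sklearn.metrics import classification_report

# In practice y_true would be the test labels and y_pred the model's
# predicted classes, e.g. y_pred = np.argmax(model.predict(x_test), axis=1).
y_true = np.array([0, 1, 2, 2, 1])   # placeholder labels for illustration
y_pred = np.array([0, 1, 2, 1, 1])

# Prints per-class precision, recall and F1, plus macro/weighted averages.
print(classification_report(y_true, y_pred))
```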
Architecture:
As we can see in figure 1.b, eliminating pooling layers from the architecture leaves us with 32M parameters, compared to 764K parameters in figure 1.a. CapsNet [fig 1.d] and Convolution with stride [fig 1.c] sit between these extremes, with 8M and 1M parameters respectively.
But does a larger number of parameters translate to better performance? Let’s find out.
Precision, Recall and F1 Score:
So umm... the results are pretty anticlimactic 😐
As we can see in figure 2, performance does increase with the number of parameters, but not substantially. The accuracy difference between the worst and best performing models is just 1.8%, while the difference in the number of parameters is ~32M.
Training and Validation Loss and Accuracy:
All the models other than CapsNet were able to converge within 10 epochs, but for the sake of keeping the hyper-parameters unchanged, CapsNet had to suffer.
Note: I could have used total training time rather than the number of epochs as the criterion, but given CapsNet’s high training time [3 min per epoch] compared to the model with the Pooling layer [11 sec per epoch], I chose otherwise. Tuning the hyper-parameters for CapsNet brings the epoch time down to 2 min and the total down to 7 epochs to reach convergence.
Conclusion
From this data and my experimentation with other datasets, I concluded that CapsNet was the best performing model overall, given further hyper-parameter tuning.
But, and this is a big one, everything depends on your dataset. Yeah, I’m sorry, there is no one-size-fits-all solution in Machine Learning 😁.
CapsNet worked for me on the Fashion-MNIST and Doodle datasets, but it might not work for some other dataset. So spend more time analyzing your data and learning in depth why a particular architecture was developed; it will save you a lot of time and energy.
Links to code: Pooling, No-Pooling, Convolution with stride, and CapsNet.
This is my first article, so please be forgiving in the comments, and leave a clap if you liked it. Again, thanks for reading.