Representation Learning Breakthroughs Every ML Engineer Should Know: Mind the Pool - Convolutional Neural Networks Can Overfit Input Size

Published in Superlinear, 7 min read, Sep 18, 2023

This year, I was fortunate enough to attend the International Conference on Learning Representations (ICLR) held in Kigali, Rwanda. I had the opportunity to explore the latest research, connect with fellow professionals, and soak in the collective wisdom of the AI community.

Via this blog series, I’d like to share four things every Machine Learning Engineer should know. This post is the third in a series of four, and will discuss a specific paper I got to discover at ICLR this year: “Mind the Pool: Convolutional Neural Networks Can Overfit Input Size”. The complete blog series will cover the following:

What is representation learning?: A refresher or introduction, depending on your familiarity, to set the stage for the upcoming papers.
https://medium.com/radix-ai-blog/representation-learning-breakthroughs-what-is-representation-learning-5dda2e2fed2e
Token Merging: Your ViT but faster: We’ll explore how this advancement makes more efficient use of the hidden representations found in Vision Transformers (ViT), making them substantially faster.
https://medium.com/radix-ai-blog/representation-learning-breakthroughs-token-merging-your-vit-but-faster-e3f88f25d6d1
Mind the pool: CNNs can overfit input size: A highlight of an underrecognized pitfall in Convolutional Neural Networks (CNNs) where they are biased by the input size of the images, and an approach on how to avoid it.
https://medium.com/radix-ai-blog/representation-learning-breakthroughs-convolutional-neural-networks-can-overfit-input-size-2aba1cb94c01
No reason for no supervision, improved generalization in supervised models: A showcase for exploiting representation learning to make more robust and general models that are trained on a supervised task.
https://medium.com/radix-ai-blog/representation-learning-breakthroughs-improved-generalization-in-supervised-models-d60a43a7f354

Each post aims to bring the ICLR 2023 experience to you, providing both practical applications and food for thought as we navigate AI’s exciting, ever-evolving landscape together. I hope you learn something new in each section of this blog post and that you find the research interesting. I surely did! Let’s dive into it.

Mind the Pool: Convolutional Neural Networks Can Overfit Input Size

Before diving into another theoretical topic, let’s use this smaller and easier-to-grasp paper as a moment to catch a mental breath. It zooms in on a seemingly niche yet critical problem in convolutional neural networks (CNNs): When designed to accommodate variable input sizes, these networks might be betraying us, subtly overfitting on their input’s size. In those cases, the pooling mechanism is likely to blame, embedding a persistent bias within the system. Luckily, it’s a problem that’s easy to solve by making the pooling stochastic during training.

This blog post only scratches the surface of the original paper, so if you’re intrigued and want to learn more about it, I highly recommend going through the paper: https://openreview.net/pdf?id=cWmtUcsYC3V. Nonetheless, this blog post touches on the most important aspects of the paper, with a focus on its practical applications. So, without further ado, let’s dive in and explore the key takeaways.

The TL;DR of Mind the Pool

CNNs and Input Size Overfitting: Convolutional Neural Networks, while remarkable in their capacity, have an Achilles’ heel when it comes to varying input sizes. They can overfit to certain sizes, so accuracy dips substantially on inputs that are only slightly smaller or larger.
Underlying Cause: At the root of this overfitting issue is the pooling arithmetic, a standard element in many CNN architectures. It inadvertently primes the network to develop a bias for specific input sizes, skewing its predictive capabilities.
Introducing Spatially-Balanced Pooling: This paper brings forth a novel solution in the form of spatially-balanced pooling. By depriving the layers of their typical arithmetic cues, this method ensures that the CNN doesn’t play favorites with its input sizes.
Enhanced Generalization and Robustness: Not only does spatially-balanced pooling level the playing field for varied input sizes, but it also fortifies the CNN against translational shifts, making it more resilient and adaptable.
Niche Yet Noteworthy: While it might seem like we’re diving deep into a tiny corner of the CNN world, this issue is of significant relevance, albeit in niche scenarios. Specifically, it resides in networks that employ global pooling. Even if it doesn’t directly apply to your current projects, it’s a fascinating facet of CNN behavior worth being aware of.

What is the Problem and When Does it Apply?

When it comes to tasks like image classification where the input size can vary greatly, we run into a somewhat niche yet critical hiccup. Picture this: you’re using a popular model architecture like Inception or ResNet, which typically incorporates a global average pooling layer at the final stage to handle variable-sized inputs. This is where the trouble begins. Because the input size varies, a bias can be introduced in the various pooling layers of your model, leading to a significant degradation in performance.
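To see why global average pooling lets a network accept variable-sized inputs, here is a minimal pure-Python sketch. The helper names `global_average_pool` and `constant_map` are hypothetical, and the feature maps are made-up stand-ins rather than real network activations. The point is only that each channel collapses to one number, whatever the spatial size.

```python
def global_average_pool(feature_map):
    """Average each channel of a [C][H][W] nested list down to one number."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def constant_map(channels, height, width):
    # Hypothetical stand-in for the last convolutional feature map:
    # channel c is filled with the value c.
    return [[[float(c)] * width for _ in range(height)] for c in range(channels)]


# A ResNet18-style network reduces a 224x224 input to a 7x7 map
# and a 256x256 input to an 8x8 map.
small = global_average_pool(constant_map(4, 7, 7))
large = global_average_pool(constant_map(4, 8, 8))
print(len(small), len(large))  # 4 4
```

Both inputs produce a vector with one entry per channel, so the classifier on top never sees the spatial size directly. The size can only leak in through what happens inside the earlier layers.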

Most models avoid problems like these by resizing and warping the input to a fixed format, as an R-CNN model does, for example. However, this makes the job harder for the classification model, since the inputs get distorted significantly, messing with the very patterns and details the network needs to recognize.

R-CNN model that resizes and warps the inputs before feeding them into the classification component.

The following graph illustrates a curious trend in the performance of a ResNet18 model as the input size scales up while remaining square (NxN). This specific model was trained on an input size of 224x224 pixels. One might expect a small increase in performance when moving to a larger input size, given that less information is thrown away when downsampling the input image, albeit at the cost of inference speed. However, what we observe is quite peculiar: accuracy oscillates periodically, rising and dropping for even and odd input sizes, respectively. Strikingly, significant drops in performance surface at intervals of 32 pixels in both dimensions, immediately following each accuracy peak. It’s definitely something to keep an eye on if you’re working with images of different sizes.

Visualization of the observed accuracy drops over changes in input size.

The hiccup in performance is largely attributed to the pooling arithmetic embedded in the model. Downsampling in ResNet18 starts with a 7x7 window followed by 3x3 windows, all with a stride of 2 to halve the input’s dimension each time. With a 224x224 input, the feature map has an even size at every one of these layers, and because the windows are odd-sized, the outermost padding on the right and bottom sides is never consumed. The model is trained this way, so it never sees that padding during training and cannot handle it when it does get consumed. This contrasts starkly with a 225x225 input, where all the extra edges are consumed, leading to different and somewhat compromised behavior. The significant dips at 32-pixel intervals (like the transition from 256 to 257) arise because this model has five such downsampling layers. Each one halves the size, so the pattern repeats every 2⁵=32 pixels, causing the model, originally trained on 224x224 images, to behave in a biased way on inputs that deviate from the expected size.
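The arithmetic above is easy to check yourself. The sketch below tracks one spatial axis through five stride-2 layers and reports how much padding is left unconsumed at each one. The layer list is my assumption of a ResNet18-style stack (a 7x7 layer with padding 3, then four 3x3 layers with padding 1), and the names `RESNET18_DOWNSAMPLING` and `padding_trace` are hypothetical, not taken from the paper’s code.

```python
# (kernel, stride, padding) per downsampling layer.
# Assumption: a ResNet18-style main path, not copied from the paper.
RESNET18_DOWNSAMPLING = [(7, 2, 3)] + [(3, 2, 1)] * 4


def padding_trace(size, layers=RESNET18_DOWNSAMPLING):
    """Return (input_size, unconsumed) per layer along one spatial axis.

    `unconsumed` counts the padded rows/columns at the far (right/bottom)
    side that no window reaches.
    """
    trace = []
    for kernel, stride, padding in layers:
        span = size + 2 * padding - kernel
        trace.append((size, span % stride))
        size = span // stride + 1  # standard (floor) output-size formula
    return trace


for n in (224, 225, 256, 257):
    print(n, [unconsumed for _, unconsumed in padding_trace(n)])
# 224 [1, 1, 1, 1, 1]
# 225 [0, 0, 0, 0, 0]
# 256 [1, 1, 1, 1, 1]
# 257 [0, 0, 0, 0, 0]

# Sizes that leave padding unconsumed at all five layers.
matches = [n for n in range(200, 300)
           if all(u == 1 for _, u in padding_trace(n))]
print(matches)  # [224, 256, 288]
```

Under these assumptions, the sizes that behave like the training size of 224 are exactly 32 pixels apart, and one pixel more flips every layer at once. This lines up with the drops described above.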

Visualization of the padding consumption of different input sizes. For both 224 and 256 input shapes, the model does not consume any right and bottom padding, leading to maximal performance.

In essence, while CNNs with the capability to process arbitrary input sizes seem versatile, they can exhibit unexpected performance behavior due to nuances in their pooling arithmetic. This is critical when designing or choosing architectures for tasks with variable input dimensions.

How to Fix It

While the issue with varying input sizes may seem perplexing, the solution is refreshingly straightforward: Spatially-Balanced Pooling (SBPool). In typical CNN downsampling, the unconsumed padding always occurs at the right and bottom sides. This results in the network being overexposed to padding at the left and top sides during training. This repetitive behavior serves as a telltale cue for the CNN, signifying the specific input size, thus leading to the mentioned overfitting.

SBPool is designed to tackle this. During training, it modifies the downsampling layers to ensure that the potentially unconsumed padding can occur on any side of the input with equal probability. This simple twist ensures that over time, as the CNN processes various training samples, the unconsumed padding averages out across different sides. This elimination of the spatial bias in the training phase helps the model generalize better to arbitrary input sizes. An alternative approach to deal with this problem is to train on variable-sized inputs, but this would cause difficulties batching the different samples, making the SBPool approach the preferred option.
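To make the idea tangible, here is a minimal sketch of a max pooling step that randomly decides, per axis, which side receives the unconsumed padding during training. It is my own illustration of the principle described above, not the paper’s implementation. The names `leftover` and `sb_max_pool2d` are hypothetical, and it works on plain Python lists rather than framework tensors.

```python
import random


def leftover(n, k, s, p):
    """Padded rows/columns that no window reaches under standard pooling."""
    return (n + 2 * p - k) % s


def sb_max_pool2d(x, k, s, p, rng=None, training=True):
    """Max-pool a 2D list of numbers with kernel k, stride s, padding p.

    In training mode, randomly choose per axis which side gets the
    unconsumed padding. In eval mode this is ordinary pooling.
    """
    h, w = len(x), len(x[0])
    pad = float("-inf")  # padding value that never wins a max
    padded = [[pad] * (w + 2 * p) for _ in range(p)]
    padded += [[pad] * p + list(row) + [pad] * p for row in x]
    padded += [[pad] * (w + 2 * p) for _ in range(p)]

    # Standard pooling starts its windows at offset 0, so any leftover ends
    # up at the bottom/right. Starting at offset == leftover instead moves
    # the unconsumed padding to the top/left.
    top = left = 0
    if training:
        rng = rng or random
        top = rng.choice([0, leftover(h, k, s, p)])
        left = rng.choice([0, leftover(w, k, s, p)])

    out_h = (h + 2 * p - k) // s + 1
    out_w = (w + 2 * p - k) // s + 1
    return [
        [
            max(
                padded[top + i * s + di][left + j * s + dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(sb_max_pool2d(x, 3, 2, 1, training=False))  # [[6, 8], [14, 16]]
print(sb_max_pool2d(x, 3, 2, 1, rng=random.Random(0)))  # one of four placements
```

The output shape is the same for every placement, so batching is unaffected. Only the alignment of the pooling windows relative to the image borders changes from step to step.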

Visualization of how Spatially-Balanced Pooling operates.

The icing on the cake? You only need this adjustment during training. There’s no need to alter the pooling approach during inference, ensuring you get the robustness without compromising on speed. In essence, while the challenge of variable input sizes is intricate, solutions like SBPool prove that sometimes, a touch of spatial balance is all you need. The downside is that the fix has to be built into the training process, which means existing pre-trained models still have this problem, so be aware when you use them!

Conclusion

It’s clear that CNNs, while powerful, have their quirks, especially when it comes to handling different input sizes. Thankfully, straightforward yet effective solutions like Spatially-Balanced Pooling are at hand, ready to tackle these issues head-on. Implementing such strategies can streamline your CNN projects, making them not only more efficient but also more adaptable to various inputs. This paper also shows that when working with Deep Learning models, you should always keep an eye out for subtle biases and changes in performance, because not everything is always as it seems.