Why don’t over-parameterised deep nets overfit?

Surprising property of training over-parameterised deep neural networks

Sharad Joshi
4 min read · Feb 27, 2022

Over-parameterisation: the regime where the number of model parameters exceeds the number of training examples.

Generalisation: how well a trained model performs on unseen test data.
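To make the over-parameterised regime concrete, here is a quick back-of-the-envelope count (my own illustration, not from the article): even a modest fully connected network for 28x28 images carries far more parameters than an MNIST-sized training set has examples.

```python
# Hypothetical illustration: parameters of a small fully connected net
# vs. the number of training examples in an MNIST-sized dataset.
def mlp_param_count(layer_sizes):
    """Total weights + biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 784-dim inputs (28x28 images), two hidden layers of 2048 units, 10 classes.
params = mlp_param_count([784, 2048, 2048, 10])
n_train = 60_000  # training-set size of an MNIST-sized dataset

print(f"parameters: {params:,}  training examples: {n_train:,}")
# ~5.8M parameters vs. 60,000 examples: already ~100x over-parameterised.
```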

Conventional statistical theory says that in an over-parameterised regime, overfitting is highly likely. We talk about the bias-variance tradeoff and how increasing model complexity beyond a certain threshold leads to potential overfitting; how simple regularisation methods like L1 and L2 norms, local linearisation, etc. help reduce the effective complexity to combat overfitting; Occam’s razor; and how increasing the number of training examples (so we’re no longer over-parameterised) also helps fight overfitting.
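As a minimal sketch of that classical picture (my own example, not from the article), consider a degree-14 polynomial fit to just 15 noisy points: without regularisation it tends to interpolate the noise, while a small L2 (ridge) penalty reins in the effective complexity. The data, polynomial degree, and alpha below are arbitrary choices for illustration.

```python
# Classical overfitting vs. L2 regularisation on a tiny 1-D problem.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 15)).reshape(-1, 1)      # 15 training points
y = np.sin(3 * x).ravel() + 0.1 * rng.normal(size=15)   # noisy targets

x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_test = np.sin(3 * x_test).ravel()

for name, model in [("unregularised", LinearRegression()),
                    ("L2 (ridge)", Ridge(alpha=1e-3))]:
    fit = make_pipeline(PolynomialFeatures(degree=14), model).fit(x, y)
    test_mse = np.mean((fit.predict(x_test) - y_test) ** 2)
    print(f"{name:>14}: test MSE = {test_mse:.4f}")
# The conventional expectation: the ridge fit generalises better.
```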

Yet every day we see deep neural nets with millions, billions, or even trillions of parameters, trained on only tens of thousands to millions of examples (often several orders of magnitude fewer data points than parameters), reach almost zero training error and still generalise well on unseen test data. It is fascinating, and somewhat mysterious, to think about why this happens. What are we missing here? We have a huge amount of empirical evidence that such over-parameterised models regularly optimise and generalise well in deep learning, yet overfit easily in conventional ML.
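To see the puzzle in miniature, here is a rough sketch (my own, not the study cited below) using scikit-learn’s small digits dataset: an MLP with roughly 300,000 parameters trained on about 1,300 images typically hits near-perfect training accuracy and still scores well on the held-out test split. The layer sizes and train/test split are arbitrary illustrative choices.

```python
# Over-parameterised MLP on a tiny dataset: ~300k parameters, ~1,300 examples.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # 1,797 8x8 digit images, 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Two hidden layers of 512 units, no weight decay (alpha=0.0).
net = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                    max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)

print("train accuracy:", net.score(X_tr, y_tr))  # typically ~1.0 (near-zero training error)
print("test accuracy :", net.score(X_te, y_te))  # typically still high, far above chance
```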

One of the seminal studies¹ that motivated this generalisation question is the…
