What is overfitting, *exactly*?

Examining the meaning of an imprecise term

Jack Bennetto
Slalom Build
6 min read · Jun 1, 2022


Abstract visualization suggesting extreme overfitting. Graphic by Sarah Kowalis

A big part of working at Slalom Build is embracing a growth mindset, learning the latest tools and technologies to solve novel engineering problems at scale. Lately I’ve been studying the many AWS machine learning services through an online course. It’s great for the things I don’t know, for detailing the far-reaching Amazon landscape. But it also covers topics in which I am an expert (things I taught for years at my previous job), and a few of those teachings are not quite right.

At one point the speaker, demonstrating the fitting of a neural network, points out that the network is more accurate on the training data than the testing data. This, he says, means the model is overfit.

The usual explanation of an overfit model is vague: it fits too closely to the data on which it was trained. This is generally accompanied by a graph, perhaps of a high-order polynomial behaving poorly in the gaps and at the edges of the data.

High-order polynomial function fitting too closely to noisy data

Clearly this is overfit. Don’t do this. From there, the discussion moves to regularization: generalize the model to smooth things out.
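
To make the picture concrete, here’s a minimal sketch of that failure mode; the synthetic data and the degree-12 fit are my own construction, just for illustration:

```python
# Fit a high-order polynomial to a small noisy sample: it passes near
# every training point while oscillating wildly between them.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=15))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=15)  # signal plus noise

coeffs = np.polynomial.polynomial.polyfit(x, y, deg=12)
fitted = np.polynomial.polynomial.polyval(x, coeffs)
print("training MSE:", np.mean((fitted - y) ** 2))  # tiny, and misleading
```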

The dilemma

Definitions online are surprisingly fuzzy. Wikipedia describes it as “the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.” An Introduction to Statistical Learning introduces it by saying “These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.” (p. 22)

But in math we like clear definitions. A term should be bound to a formula giving an absolute answer or quantitative value. Clarity matters above all.

First we’ll need to define our training error as the loss function (perhaps the mean squared error of our predictions) measured on the data we used to train the model, and our testing error as the same loss on some hold-out set.

Let’s start with the speaker’s implied definition, a pretty common one I’ve heard elsewhere.

First proposed definition: a model is overfit if the training error is lower than the testing error.

Except this is usually true. One of the two errors will almost always be at least slightly lower than the other, and it would be surprising if a model did better on data it hadn’t seen than on data it had. Doing a tiny bit better on the training data can’t be the problem.
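
Here’s a quick sketch of that point (scikit-learn and the synthetic dataset are my assumptions, not the course’s example): even a perfectly reasonable model scores better on the data it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)  # signal plus noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("training MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("testing MSE: ", mean_squared_error(y_test, model.predict(X_test)))
# The training error comes out lower, as it does for almost any model.
```

By the first definition this model is overfit, and so is nearly everything else we’d ever fit.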

Second proposed definition: a model is overfit if the training error is a lot lower than the testing error.

This does fit our intuition: it’s OK for a model to do a bit better on training data, but not a lot better. Beyond that it’s just trying to cheat, to do better than is possible in the real world. That must be bad.

But how to define “a lot”? When does the difference become a problem? It turns out there isn’t an easy answer, or any answer at all. Many models, notably boosted decision trees, fit very closely to training data while producing state-of-the-art results. Test error is all that really matters.
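
A small sketch of that behavior (scikit-learn’s gradient boosting on the same kind of synthetic data, my construction): the boosted model drives training error far below testing error, yet its testing error can still be the best of the models we’ve tried.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boosted = GradientBoostingRegressor(n_estimators=500, random_state=0)
boosted.fit(X_train, y_train)
print("training MSE:", mean_squared_error(y_train, boosted.predict(X_train)))
print("testing MSE: ", mean_squared_error(y_test, boosted.predict(X_test)))
# A large gap; but only the testing error matters when comparing models.
```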

Before moving on, take a few minutes to think about how we could define overfitting in a clear, consistent, meaningful way.

A definite definition

Here are my thoughts based on my own understanding and experience. First, we need to review a related topic that’s foundational to machine learning: the bias/variance tradeoff.

Most supervised-learning models include hyper-parameters that affect their performance. Many of these govern the bias/variance tradeoff: adjusting such a hyper-parameter in one direction lowers the bias and raises the variance; the other direction does the opposite. The variance, in particular, is the error due to the variation in the predictions of models fit on different training samples. If a model tries to match its training data exactly, it will likely look very different from one fit to different data, and will therefore have high variance.
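
We can see the tradeoff empirically with a short sketch (k-nearest neighbors on synthetic data; both are my choices). Small k means low bias but high variance; large k means the reverse.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep the hyper-parameter from high-variance (small k) to high-bias
# (large k); the testing error traces out a U-shaped total-error curve.
for k in (1, 2, 3, 5, 10, 25, 100):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  testing MSE: "
          f"{mean_squared_error(y_test, knn.predict(X_test)):.3f}")
```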

Graphs of bias, variance, irreducible error, and total mean-squared error over a range of models, showing high bias on one end, high variance on the other, and high total error at both extremes

The graph above shows bias and variance for a typical model as a function of such a hyper-parameter. The best model is at k = 3, where the total error is at a minimum. When k < 3 the variance is high enough that the total error is high even though the bias is low: this region is overfit. So the answer is… well, this is a bit of a trick question.

Looking at one of those points alone isn’t enough; we see the overfitting from the entire graph. Overfitting isn’t a property of a single model but a comparison between models. Without that comparison the concept is meaningless.

Third proposed definition: A model is overfit compared to another if it has a higher error but lower bias.

Great!…except what are bias and variance, exactly?

Bias and variance, exactly

Again, there are plenty of rough explanations, often involving a picture like this.

Two-by-two grid of bull’s-eye targets with a scattering of points in each. The high variance points on the bottom are more spread out than the low-variance points at the top. The high-bias points on the right are off center, while the low-bias points on the left are centered around the bull’s eye.

But these do have clear definitions. Without going into the equations, we can consider the expected predictions of a model over a population (the average predictions if we fit the model to many different samples) and the expected target (the actual target values, averaging out any unpredictable noise). The difference between these is the bias.

We might naively assume this should be minimized (or rather, the square of the bias, since the bias might be negative). But that ignores the other source of error that we can control: the variance of the predictions across models fitted on different samples (the average squared difference between the expected predictions and the predictions of any one fitted model).
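
Because those definitions depend on refitting the model to many samples, the cleanest way to see them is a simulation where we control the data-generating process. This sketch (entirely my construction) estimates the bias and variance of a model’s predictions at a single point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
true_f = np.sin          # the noiseless target function
x0 = np.array([[1.0]])   # the point where we inspect the predictions

preds = []
for _ in range(200):     # fit the same model to 200 fresh samples
    X = rng.uniform(-3, 3, size=(200, 1))
    y = true_f(X[:, 0]) + rng.normal(scale=0.3, size=200)
    model = KNeighborsRegressor(n_neighbors=3).fit(X, y)
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
bias = preds.mean() - true_f(x0[0, 0])  # expected prediction minus expected target
variance = preds.var()                  # spread of predictions across refits
print(f"bias^2: {bias**2:.4f}  variance: {variance:.4f}")
```

Note that we could only compute these because we wrote the data-generating process ourselves.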

The key point for our purposes is that all of these are defined with respect to an entire population of data, not a single sample. None of this applies to a model that has already been fit, but rather to a model that has not yet been fit. So we can’t really define overfitting for a fitted model either, only for a model together with a population.

Formal definition: A model is overfit compared to another with respect to some population of data if it has a higher error but lower bias.

A definition we can use

But a good definition should be useful as well as clear. That one is not: since we can’t measure bias directly, we can’t say for sure whether one model is overfit compared to another. But for any given hyper-parameter, we do know which direction tends to increase bias, so we can use that to compare similar models.

So here’s one I think we can use:

Practical definition: A model is overfit with respect to a hyper-parameter if adjusting it slightly in the direction of increased bias lowers the error.

The value of this definition is that it requires us to provide context. A neural network trained too long might be overfit with respect to the number of training epochs, but that doesn’t mean we need to reduce the number of layers.
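
As a sketch of how the definition plays out in practice (k-nearest neighbors again, with k as the hyper-parameter; raising k raises the bias):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def val_mse(k):
    """Hold-out error of a k-nearest-neighbors fit with the given k."""
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    return mean_squared_error(y_val, knn.predict(X_val))

# Nudge k toward higher bias; if the error drops, the model is overfit
# with respect to k.
k = 1
if val_mse(k + 1) < val_mse(k):
    print(f"overfit with respect to k at k={k}: raising k lowers the error")
```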

Focusing on why something is overfit, rather than just the idea that it is, emphasizes how we can improve our model rather than just point out the problem. And that’s a definitive improvement.


Jack is a machine learning architect at Slalom Build.