Investigating Underfitting and Overfitting

Arun Addagatla · Published in Geek Culture · Apr 17, 2021

In machine learning, a model that fits the training data well will not necessarily perform well on test data. This disparity between training and test performance is called the generalization gap, and it is common to observe one in practice.

Usually, increasing model complexity helps reduce training error, but it also increases the risk of overfitting, which widens the generalization gap. So what is overfitting?

Overfitting

Overfitting is a fundamental issue in supervised machine learning: it prevents a model from generalizing, that is, from fitting the observed training data and unseen test data equally well.

It is the scenario where the model learns the noise in the data along with the underlying signal, trying to pass the fitted curve through every single data point.

Because the fitted curve is tailored so tightly to the training points, the model fails to generalize and mispredicts new data points. In overfitting, the model has very low bias but high variance.

Reasons for Overfitting

  • The training data is not cleaned and contains noise (garbage values)
  • The model has high variance
  • The training dataset is too small
  • The model is too complex

Methods to Avoid Overfitting

  • Cross-validation
  • Training with more data
  • Removing redundant features
  • Early Stopping
  • Regularization (see the sketch after this list)
  • Ensembling
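As a concrete illustration of two of these methods, here is a minimal sketch combining L2 regularization with cross-validation. It uses scikit-learn; the toy dataset and the penalty strength alpha=1.0 are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy regression data: a linear trend plus noise (illustrative only)
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=1.0, size=100)

# Ridge = linear regression with an L2 penalty; alpha sets its strength
model = Ridge(alpha=1.0)

# 5-fold cross-validation estimates out-of-sample performance,
# letting us detect overfitting without touching a test set
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```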

Underfitting

One way to avoid overfitting is to stop training at an earlier stage, but stopping too early can leave the model unable to learn enough from the training data to capture the dominant trend. This is known as underfitting.

It is the scenario where the model can neither learn the relationships between the variables in the training data nor reliably predict or classify new data points.

Because the model never fully learns the patterns, it performs poorly on training data and new data alike. An underfitted model has low variance and high bias.

Reasons for Underfitting

  • The training data is not cleaned and contains noise (garbage values)
  • The model has high bias
  • The training dataset is too small
  • The model is too simple

Consider the plot below.

[Figure: two models fit to the same noisy dataset: a straight-line fit (left) and a high-order polynomial fit (right)]

It is clear that neither of these models is a particularly good fit for the data, but they fail in different ways.

Model on the left

  • Attempts to find a straight-line fit through the data. Because the data are intrinsically more complicated than a straight line, the straight-line model will never be able to describe this dataset well.
  • Such a model is said to underfit the data; that is, it does not have enough model flexibility to suitably account for all the features in the data.
  • Another way of saying this is that the model has a high bias.

Model on the right

  • Attempts to fit a high-order polynomial through the data. Here the model fit has enough flexibility to nearly perfectly account for the fine features in the data, but even though it very accurately describes the training data, its precise form seems to be more reflective of the particular noise properties of the data.
  • Such a model is said to overfit the data; that is, it has so much model flexibility that the model ends up accounting for random errors as well as the underlying data distribution.
  • Another way of saying this is that the model has a high variance. (Both failure modes are reproduced in the code sketch below.)
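To make this concrete, here is a minimal sketch that fits polynomials of increasing degree to synthetic data and compares training and test error. The cubic ground truth, the noise level, and the degree values 1, 3, and 15 are all assumptions chosen for illustration: degree 1 underfits, degree 15 overfits, and the gap between train and test error makes the generalization gap visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a cubic trend plus noise (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(scale=2.0, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 1 underfits (high bias), degree 15 overfits (high variance);
# watch the gap between train and test error grow with model complexity
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:7.2f}  test MSE={test_mse:7.2f}")
```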


So what is the best fit?

The best fit is a line or curve that neither overfits nor underfits the data; it is just right.

You might wonder how we can know whether a model will perform well on unseen data. We cannot test a model on data points that are truly unseen, but we can hold out a portion of our data during training and evaluate the model on that held-out portion.

Typically, we divide our data into three sets: training, validation, and test sets.

It is common practice to use 80% of the data for training, 10% for validation, and 10% for testing on small or mid-sized datasets. For larger datasets with millions of samples, even 1% of the data may be enough for the validation and test sets, as long as they are partitioned in an unbiased way.

For an unbiased split, randomly shuffling the data before partitioning is preferred.
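As a sketch, the 80/10/10 split above can be produced with two calls to scikit-learn's train_test_split, which shuffles by default. The toy arrays here are placeholders standing in for real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for real features and labels
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split off 20% as a holdout, then split the holdout in half,
# giving 80% train / 10% validation / 10% test.
# shuffle=True (the default) randomizes the order before partitioning.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, shuffle=True, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```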

We train our model on the training set and use the validation set for configuring the model, which involves hyperparameter tuning.

Then why do we need a test set?

Well, we don't strictly need one. In some cases it might be fine to have only training and validation sets.

The purpose of the test set is to help us get an unbiased estimate of the generalization performance. Especially when we have a lot of hyperparameters to tune, there might be a risk of overfitting to the validation set.

Although the model never sees the validation set, we do, and we tune the hyperparameters accordingly. The model can therefore end up overly tuned to perform well on the validation set while failing to generalize to truly unseen data.

That’s why it might be beneficial to have a separate test set to use once we are done with training and configuring our model.

Summary

Overfitting is a general issue in supervised machine learning that cannot be completely avoided. It arises either from the limits of the training data, which may be small in size or contain plenty of noise, or from the complexity of the model.

But we can still reduce its effect. On the one hand, to deal with noise in the training set, early-stopping strategies let us halt training before the model starts fitting the noise.

On the other hand, expanding the dataset helps complex models, which require plentiful data to tune their hyperparameters. Regularization also helps the model distinguish meaningful features from meaningless ones by assigning them different weights.
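To close, here is a minimal sketch of the early-stopping strategy described above: train incrementally, watch the validation error, and stop once it stops improving. The synthetic data, the SGDRegressor model, and the patience of 5 epochs are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative); a real use case plugs in its own dataset
rng = np.random.RandomState(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=1)
best_val, patience, bad_epochs = np.inf, 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train)       # one pass over the training set
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val - 1e-6:
        best_val, bad_epochs = val_mse, 0     # improvement: reset the counter
    else:
        bad_epochs += 1                       # no improvement this epoch
    if bad_epochs >= patience:                # stop before fitting the noise
        print(f"Stopped early at epoch {epoch}, best val MSE = {best_val:.3f}")
        break
```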

Thanks for reading this article! Leave a comment below if you have any questions. Be sure to follow @ArunAddagatla to get notified about the latest articles on Data Science and Deep Learning.

You can connect with me on LinkedIn, Github, Kaggle, or by visiting Medium.com.
