Overfitting vs. Underfitting

A guide to recognize and remedy your machine learning model

Nabil M Abbas
The Startup
4 min readJan 13, 2020


One of the clearest indicators of a poorly performing machine learning model is how its accuracy compares on the training and testing data. Comparing the two tells you whether your model is overfit, underfit, or well balanced. This is the reason we use a train-test split: it lets us measure and adjust the performance of our models. Otherwise we would be training our models blindly, with no insight into how well they actually predict.
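Here is a minimal sketch of that check, assuming scikit-learn and using synthetic data as a stand-in for your own dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic classification data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# Low accuracy on both sets suggests underfitting; a large gap
# (high train accuracy, much lower test accuracy) suggests overfitting.
print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")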

Underfitting

“Your model is underfitting the training data when the model performs poorly on the training data.”

Causes

  • Trying to fit a linear model to non-linear data.
  • Having too little data to build an accurate model.
  • Using a model that is too simple or has too few features.

Underfit learners tend to have low variance but high bias. The model simply does not capture the relationships in the training data, leading to inaccurate predictions even on the training data.

Remedies

  • Add more features during Feature Selection.
  • Engineer additional features that make sense within the scope of your problem.

Having more features limits bias within your model.
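As a rough illustration, here is a sketch (scikit-learn on synthetic quadratic data, not taken from the original article) of a linear model underfitting non-linear data, and an engineered squared feature fixing it:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data: a straight line cannot capture the relationship.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

linear = LinearRegression().fit(X, y)
print("linear model R^2:", round(linear.score(X, y), 3))  # low: underfit

# Engineering a squared feature gives the model enough capacity.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("with x^2 feature R^2:", round(poly.score(X, y), 3))  # much higher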

Overfitting

“Your model is overfitting your training data when you see that the model performs well on the training data but does not perform well on the evaluation data.”

Causes

The primary cause of overfitting is that the algorithm captures the “noise” in the data. Overfitting occurs when the model fits the training data too closely. An overfit model shows low bias but high variance, and is often excessively complicated, for example because of redundant features.

Remedies

When a model is overfit, it memorizes noise in the training data rather than the true relationship between the features and the target variable, so it fails to generalize to new data.

One remedy for this is k-fold cross-validation, a powerful preventative measure against overfitting. The idea behind cross-validation is to perform multiple mini train-test splits and use them to tune your model.

In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”).

Source: https://elitedatascience.com/overfitting-in-machine-learning
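A short sketch of k-fold cross-validation with scikit-learn (the data here is synthetic and only illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 5 folds: each iteration trains on 4 folds and scores on the held-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kfold)

# A large spread across folds, or fold scores far below training accuracy,
# is a warning sign of overfitting.
print("fold scores:", scores.round(3))
print("mean CV accuracy:", scores.mean().round(3))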

A second remedy is to train with more data. This won’t work in every case, but when you are working from a skewed sample, collecting additional data can make the sample more representative. For example, if you are modeling height vs. age in children, sampling from more school districts will help your model generalize.

A third remedy is to remove features. It is important to understand feature importance, to be mindful of the problem you are trying to address, and to have some domain knowledge. Ultimately, redundant features will not help and should not be included in your machine learning model.
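The article does not prescribe a particular way to measure feature importance; one common option, sketched here with a scikit-learn random forest on synthetic data, is to use impurity-based importances as a rough ranking:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 20 features, only 5 of which are actually informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Features with importance near zero are candidates to drop,
# but always sanity-check the ranking against domain knowledge.
ranked = sorted(enumerate(forest.feature_importances_), key=lambda pair: pair[1], reverse=True)
for index, importance in ranked[:5]:
    print(f"feature {index}: importance {importance:.3f}")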

Additional Remedies

Regularization covers a variety of techniques for artificially forcing your model to be simpler. Which technique applies depends on the type of learner you are using; for a linear regression, for example, you can add a penalty term to the cost function. “But oftentimes, the regularization method is a hyperparameter as well, which means it can be tuned through cross-validation.” To learn more about regularization for particular algorithms, have a look at the link above.
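For instance, here is a sketch of ridge regression (linear regression with an L2 penalty) in scikit-learn, with the penalty strength alpha treated as a hyperparameter and tuned through cross-validation, as the quote above describes:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=42)

# Ridge adds a penalty term to the linear-regression cost function;
# its strength (alpha) is itself tuned with 5-fold cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best cross-validated R^2:", round(search.best_score_, 3))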

Ensembles are machine learning methods that combine the predictions of multiple separate models. Bagging attempts to reduce the chance of overfitting complex models, while boosting attempts to improve the “predictive flexibility of simple models.”
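The article does not name specific ensemble algorithms, so the sketch below assumes two common scikit-learn choices: bagged decision trees (bagging) and gradient boosting (boosting):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging: average many deep trees fit on bootstrap samples to reduce variance.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Boosting: add many shallow trees sequentially to reduce bias.
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=42)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")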

Bias-Variance Trade-Off

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

Ultimately, data scientists have to make decisions about how they want their model to predict, and they have to understand why it predicts the way it does. The ideas of overfitting and underfitting fall under the umbrella of the bias-variance trade-off: error can come from both bias and variance, so the data scientist needs to find a balance between the two. But I’ll leave the bias-variance trade-off for a future post.

Thanks for reading!

Sources:

https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
