Overfitting and Underfitting

Ashish Gusain · Published in Analytics Vidhya · May 27, 2020

Before getting to the topic, let's look at two other terms: bias and variance.

Error due to bias : Error due to bias is the amount by which the expected model prediction differs from the true values in the training data. It is introduced by approximating a complicated real-world relationship with a much simpler model. High-bias algorithms are easier to train but less flexible, so they have lower predictive performance on complex problems. Linear algorithms and oversimplified models lead to high bias.

Error due to variance : Error due to variance is the amount by which the prediction learned from one training set differs from the expected prediction over all training sets. In machine learning, different training sets produce different estimates, but ideally the estimate should not vary too much between training sets. If a method has high variance, small changes in the training data result in large changes in the predictions.

Low bias and high variance can be seen in the first image.

High bias and low variance can be seen in the second image. There is therefore always a trade-off between bias and variance.
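To make these two error sources concrete, here is a minimal sketch (not from the original article) that assumes NumPy and scikit-learn are installed. It trains the same polynomial model on many resampled training sets drawn from a noisy sine curve and reports rough squared-bias and variance estimates for a degree-1 and a degree-15 fit; the helper `true_fn`, the noise level and all other constants are invented for the demo.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth function for the demo.
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50).reshape(-1, 1)

def predictions_over_datasets(degree, n_datasets=100, n_samples=30):
    """Train the same model on many resampled training sets and
    collect its predictions on a fixed test grid."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_samples).reshape(-1, 1)
        y = true_fn(x).ravel() + rng.normal(0, 0.3, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x, y)
        preds.append(model.predict(x_test))
    return np.array(preds)

for degree in (1, 15):
    preds = predictions_over_datasets(degree)
    avg_pred = preds.mean(axis=0)
    # Bias: how far the average prediction sits from the true function.
    bias_sq = np.mean((avg_pred - true_fn(x_test).ravel()) ** 2)
    # Variance: how much the predictions swing between training sets.
    variance = preds.var(axis=0).mean()
    print(f"degree={degree:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

The degree-1 line typically shows the larger bias term, while the degree-15 fit shows the larger variance term, which is exactly the trade-off described above.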

Now, coming to overfitting and underfitting.

Underfitting:
Underfitting simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases the model is too simple to capture the underlying pattern, so it will probably make a lot of wrong predictions.

In this case, switch to a more expressive model, add or engineer more informative features, or train the model for longer.
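As a rough illustration of underfitting (my own sketch, assuming scikit-learn is available), the snippet below fits a straight line to data generated from a quadratic relationship; the synthetic data and the random seed are made up for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Non-linear ground truth: y depends quadratically on x.
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# A straight line cannot capture the curvature, so it underfits:
# the score is poor on the training data and the validation data alike.
linear = LinearRegression().fit(X_train, y_train)
print("train R^2:", linear.score(X_train, y_train))
print("val   R^2:", linear.score(X_val, y_val))
```

Both scores come out low, which is the underfitting signature: the model is wrong on data it has already seen, not just on new data.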

Overfitting:

When a model trains too closely on its data set, it starts learning the noise and inaccurate entries in that data set. The model then fails to categorise new data correctly because it has picked up too many details and too much noise. Overfitting is most often caused by non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the data set and can therefore end up with unrealistic models.

The main reasons behind overfitting are training for too long or using an overly flexible model such as a high-degree polynomial. Ways to avoid it include using a linear algorithm when the data is linear, or constraining parameters such as the maximum depth when using decision trees.
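Here is a hedged sketch of the decision-tree remedy just mentioned, assuming scikit-learn; the synthetic classification problem and its parameters are chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y mislabels 10% of samples on purpose).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=2)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)

# An unconstrained tree memorises the training set, noise included...
deep = DecisionTreeClassifier(random_state=2).fit(X_train, y_train)
print("unrestricted tree  train:", deep.score(X_train, y_train),
      " val:", deep.score(X_val, y_val))

# ...while capping max_depth trades a little training accuracy for
# better validation accuracy.
shallow = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X_train, y_train)
print("max_depth=4 tree   train:", shallow.score(X_train, y_train),
      " val:", shallow.score(X_val, y_val))
```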

The above diagram clearly shows how bias and variance relate to underfitting and overfitting: if our model is underfitting, we are suffering from high bias, and if our model is overfitted, we have a high-variance problem.

The same pattern shows up when we compare the errors on our training and validation sets: with overfitting, the training error is very low while the validation error is very high; with underfitting, the training and validation errors are nearly equal and both high.
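Purely as an illustration of this diagnostic (a toy helper I am adding, not any standard API, with thresholds picked arbitrarily), the comparison can be written as a rule of thumb:

```python
def diagnose(train_error, val_error, tolerance=0.05):
    """Toy rule of thumb based on the gap between training and
    validation error; the tolerance is arbitrary for illustration."""
    if val_error - train_error > tolerance:
        return "likely overfitting: validation error much higher than training error"
    if train_error > tolerance:
        return "likely underfitting: both errors are high and close together"
    return "looks reasonable: both errors are low and close together"

print(diagnose(train_error=0.02, val_error=0.25))  # overfitting pattern
print(diagnose(train_error=0.30, val_error=0.32))  # underfitting pattern
```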

How to Prevent Overfitting

  1. Cross-validation : Cross-validation is a powerful preventative measure against overfitting, since the model is scored on data it was not trained on.
  2. Train with more data : It won't work every time, but training with more data can help the algorithm detect the signal better.
  3. Remove features : Sometimes dropping irrelevant or redundant features leaves the model less room to fit noise.
  4. Early stopping : This stops training before the model starts to overfit, but if stopped too early the model may not be fully trained.
  5. Regularization : One of the most widely used techniques to prevent overfitting; it penalises overly complex models (see the sketch after this list).
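Below is a minimal sketch of points 1 and 5 together, assuming scikit-learn: 5-fold cross-validation scores a plain linear model against an L2-regularised Ridge model on a noisy synthetic regression task. The dataset parameters and the alpha value are arbitrary choices for the demo, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Many features, few samples and added noise: a setting that invites overfitting.
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=10.0, random_state=3)

for name, model in [("plain linear", LinearRegression()),
                    ("ridge (L2 regularised)", Ridge(alpha=10.0))]:
    # Cross-validation scores the model on folds it was not trained on,
    # giving a more honest picture than training error alone.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>24}: mean CV R^2 = {scores.mean():.3f}")
```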

How to Prevent Underfitting

  1. Increase the size or number of parameters in the ML model.
  2. Increase the complexity of the model, for example by adding more layers or increasing their size.
  3. Increase the training time until the cost function is minimised (a short sketch follows this list).
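A short sketch standing in for these three points, under the assumption that scikit-learn is available: adding parameters by increasing the polynomial degree lets the model capture the curvature it was missing, and the cross-validated score improves and then levels off. The data, degrees and seed are invented for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=300)

# Growing the feature space (more parameters / a more complex model)
# fixes the underfit; past a point the gains flatten out.
for degree in (1, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree}: mean CV R^2 = {score:.3f}")
```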

This is all from my side. You can reach me via:

Email : ashishgusain12345@gmail.com

Github : https://github.com/AshishGusain17

LinkedIn : https://www.linkedin.com/in/ashish-gusain-257b841a2/
