Overfitting vs. Underfitting

Chaitra Naik
5 min read · Jun 18, 2022


(Image source: Analytics Educator)

In machine learning, overfitting and underfitting are phenomena that result in a poor model during the training process. These are the kinds of models you should avoid producing, because they cannot be used in production and the time spent training them is wasted.

So, in this article, we’ll look at what overfitting and underfitting are, as well as the reasons for them.

Overfitting

Overfitting is the scenario where a machine learning model learns not only the underlying pattern in the training data but also the noise in it, trying to fit each and every point on the curve.

So basically, when a model fits the training data more closely than it needs to, it starts capturing the noise and inaccurate values in that data. As a result, the efficiency and accuracy of the model decrease.

Because such a model is tuned so tightly to the training points, it has very little flexibility and handles new data points poorly, so its accuracy on the testing data is low. Therefore, we can say that when training accuracy is high and testing accuracy is low, overfitting has occurred.
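To see that gap concretely, here is a minimal sketch (using scikit-learn on a synthetic noisy dataset of my own choosing, not data from this article): an unconstrained decision tree memorizes the training set, so training accuracy is near perfect while testing accuracy lags behind.

```python
# Minimal sketch: an unconstrained decision tree memorizing noisy data.
# High training accuracy + low testing accuracy signals overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, deliberately noisy data (flip_y corrupts 20% of the labels)
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# No depth limit, so the tree can fit every training point, noise included
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, tree.predict(X_train)))  # near 1.0
print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))     # noticeably lower
```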

Common reasons for Overfitting:

  • The data used for training is not cleaned and contains noise, i.e., garbage values.
  • The model has high variance. Variance, in simple words, is the error on the testing data.
  • The size of the training data used is not big enough.
  • The model is too complex.

We came upon the term “Variance” here. So, first and foremost, let’s define Variance in very simple terms.

Variance refers to the variability of the model's prediction for a given data point, which shows us the spread of our predictions. A model with high variance pays too much attention to the training data and does not generalize to new data, so the error shows up on the testing data.
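In standard textbook notation (included here for precision; this formula is not specific to any one model), the variance of a learned model at a point x is:

```latex
\mathrm{Variance} = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^{2}\right]
```

where the expectation is taken over different possible training sets: it measures how much the prediction at x would change if the model were retrained on fresh data.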

Underfitting

In simple words, underfitting is the opposite of overfitting. One way to avoid overfitting is to stop the training at an earlier stage, but stopping too early may leave the model unable to learn enough from the training data, so that it finds it difficult to capture the dominant trend.

So basically, underfitting is the scenario where the machine learning model can neither learn the relationships between the variables in the data nor predict or classify a new data point correctly. Such a model never fully learns the patterns, so it performs poorly on the training data and on new data points alike.

Common Reasons for Underfitting:

  • The model has high bias. Bias, in simple words, is the error on the training data.
  • The data used for training is not cleaned and contains noise, i.e., garbage values.
  • The model is too simple (see the sketch right after this list).
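As a quick sketch of that last point (again on a synthetic dataset of my own choosing, not data from this article), a linear classifier applied to data shaped like two concentric circles is too simple for the pattern, so its accuracy stays near chance on both splits:

```python
# Minimal sketch: a model that is too simple for the data underfits,
# scoring poorly on BOTH the training and testing splits.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Two concentric circles: no straight line can separate these classes
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A linear classifier cannot capture the circular boundary
linear = LogisticRegression()
linear.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, linear.predict(X_train)))  # near chance
print("Test accuracy:", accuracy_score(y_test, linear.predict(X_test)))     # also near chance
```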

We came upon the term “Bias” here. So, first and foremost, let’s define Bias in very simple terms.

Bias is the gap between our model's average prediction and the correct value we're aiming to predict. A model with high bias pays too little attention to the training data and oversimplifies the relationship, so the error shows up already on the training data.
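In the same textbook notation, bias is the gap between the average prediction and the true function f, and together with variance it decomposes the expected squared error:

```latex
\mathrm{Bias} = \mathbb{E}[\hat{f}(x)] - f(x), \qquad
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^{2}\right] = \mathrm{Bias}^{2} + \mathrm{Variance} + \sigma^{2}
```

where sigma squared is the irreducible noise in the data. Lowering one of bias or variance typically raises the other, which is the classic bias-variance trade-off.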

Let’s look at an example to better comprehend underfitting and overfitting.

Consider a scenario in which three students are taking an English exam today, and all three have studied for it.

Student one has studied only a few specific topics, such as figures of speech and adjectives, while leaving other topics, such as nouns, tenses, and pronouns, unprepared. Student two has prepared all the topics but has memorized every question from his or her notebook. Student three is completely prepared, having thoroughly studied all of the topics. In the exam, student one could only answer questions about figures of speech and adjectives. Student two could only answer the questions he or she had memorized. Student three, on the other hand, was able to answer all of the questions well.

Our machine learning models follow the same pattern. The first scenario is similar to an underfitting model, because the student knows only two topics and has therefore seen too little of the material. Likewise, an underfitting model learns too little from the data and is unable to make accurate predictions.

Student two is familiar with all of the topics but could only answer the questions that appeared in his or her notebook. This is overfitting, in which the model memorizes each and every data point but fails to perform correctly on new data.

The final scenario corresponds to the best fit, in which the student is an expert on each and every topic. A best fit is obtained when the model performs well on both training and testing data.

Now, let's look at the iris dataset and check whether a model trained on it is underfitting, overfitting, or a best fit.

Divide the dataset into x and y

As a result, our x will contain all of the columns except species, whereas our y will contain only species. We must also convert nominal attributes into numbers, since machine learning models do not handle nominal attributes well (nominal basically means related to a name).
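The article's original code screenshots are not reproduced here, so the following is a sketch of what these steps typically look like (variable names such as df, x, and y are my assumptions):

```python
# Load iris, split into x (features) and y (species), encode species as numbers
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

data = load_iris(as_frame=True)
df = data.frame.drop(columns=["target"])
df["species"] = data.target_names[data.target]  # 'setosa', 'versicolor', 'virginica'

# x: every column except species; y: species only
x = df.drop(columns=["species"])
y = df["species"]

# Encode the nominal species names as numbers (0, 1, 2)
y = LabelEncoder().fit_transform(y)
```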

The species column has now been converted to numbers. Let's split the dataset into training and testing sets.

Here I have considered a test size of 20%. Now apply an algorithm (here I have used a decision tree).
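Continuing the sketch, the split and the model might look like this (random_state is an assumption I added for reproducibility):

```python
# 80/20 train/test split, then fit a decision tree classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(x_train, y_train)
```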

Let's have a look at the accuracy scores.
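Continuing the sketch, we can compare accuracy on both splits:

```python
# Compare accuracy on the training and testing splits
from sklearn.metrics import accuracy_score

print("Training accuracy:", accuracy_score(y_train, model.predict(x_train)))
print("Testing accuracy:", accuracy_score(y_test, model.predict(x_test)))
```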

The accuracy on both the training and testing data is good, indicating that this is a best fit model.

Thank you so much for reading this blog. If you liked it, please give it a clap and share it with your friends, colleagues, and family. Please share your ideas in the comments section, because your feedback allows me to improve and provide better content in the future!
