Overfitting and Underfitting in Machine Learning

Yash Goel
11 min read · Jan 31, 2022


Introduction

Suppose we’re in a math class and today we’re going to get familiar with the Pythagorean Theorem, otherwise known as how to compute the length of the hypotenuse of a right triangle given the lengths of the other two sides. I’m sure this isn’t too tough, right?

So in math class, what if, instead of directly telling you what the theorem is, I ask you to find the right equation or formula yourself using some given data? The data provided to you will have the lengths of the three sides of eight different right-angled triangles, labeled a, b and c, with c being the hypotenuse.

The data is as given below:

You have 30 minutes to figure out the correct formula, as we’re going to take a test where you will be given values of a and b, and you have to predict the correct value of c. Whoever scores the highest in class will be free from any homework for two complete weeks. Isn’t that cool?

Your lazy friend in class, say Atharv, is really excited about this. He isn’t interested in doing any homework, and so he wants to score first place on the test. So Atharv memorized all 8 rows of data, and he can now correctly tell the value of c for given a and b, provided that they came from the data above. He really thinks he’d win. Would he?

Now your other friend, Dhriti, found this test amusing not because she wouldn’t have to do homework, but because she really wants to prove her smartness to everyone in class. For 30 minutes she struggled and came up with different equations. When 30 minutes were over, all she had was a + b = c.

It’s time to put everything to the test.

Now unfortunately for Atharv, none of the values in the test were from the data given to him, and he was extremely disappointed as he scored 0 and would now have to do homework. Dhriti was pretty bummed too, as her equation didn’t work out. But when she came to know that the equation was a² + b² = c², she thought that with some extra time, she could have come up with that equation too.

With some backstory ready, we are all set to understand underfitting and overfitting now, but before that we need to get some terminology clear.

Noise — Any unnecessary or irrelevant data that can reduce our model’s performance is termed noise.

Signal — The true underlying pattern of the data that helps the machine learning model to learn from the data is known as Signal.

Bias — To make learning the relationship between the input features and the output easier, the model makes certain simplifying assumptions; these assumptions are known as bias.

Variance — By definition, a kind of error that occurs because of a model’s sensitivity to small fluctuations in the training data set. Variance is simply the extent to which your model’s performance changes in response to changes in the training dataset.

Fit — How well you approximate a target function.

What is Overfitting?

Coming back to our math class, what Atharv did is called overfitting. He memorized everything given to him in the dataset, but couldn’t do well on any new data given to him. His results relied a bit too much on the training data given to him.

The same concept applies to our machine learning models. When the model learns the training data too well, it learns not only the underlying pattern but also the noise in the training data. This negatively impacts the performance of the model on any new or unseen data, because the model has learned the random fluctuations in the training data as if they were concepts. What is the issue with that? Those “concepts” do not carry over to unseen data, so the model becomes unable to generalize.

Let’s compare a normal model, and an overfitted model with the help of a graph.

What do we mean by Generalize here though?

Generalization is essentially a measure of how well a particular model has been trained, how well it has learned the underlying patterns, and how well it performs on unseen data. Any model’s goal is to perform well on new data from the problem domain, provided the input and output features remain unchanged.

Coming back to overfitting, let’s take another example. Say you visit city “X”, and since you’re a newcomer there, the street vendors charge you more than the actual price and the taxi drivers charge excessively high fares. What do you do then? You decide that the people of this city are dishonest. This is a common human trait of “generalization”.

Even if another taxi driver charges you a fair price rather than a higher fare, you still don’t trust that driver, and assume from your past experiences that they are dishonest. This is an example of “overfitting” (or overgeneralization). Whenever a model is overfitting, we say that the model has “high variance”. If we take the example of target shooting or archery, then high variance, by that analogy, would be equivalent to having an unsteady aim.

From this, we can say that if a model performs well on an unseen dataset, then it is a good fit or a best-fit model. But if it doesn’t perform well on the test dataset, yet performs well enough on the training dataset, then it is an overfit.

What is Underfitting?

Remember Dhriti? What had she done? Her approach was definitely good. She tried her level best but couldn’t find the correct formula. Her formula didn’t give good results on the data provided by the teacher, and of course, no good results on the test either. This is what we call underfitting in Machine Learning. Even if she had been given more examples, she would have gotten similarly poor outcomes on the test set. This is what we call “high bias”.

Going back to our archery example, this is how high / low variance and bias affect the performance of the model:

Thus we can summarize that an underfit model cannot generalize well to new data and can’t even model the basic training data, while a best fit is achieved when both bias and variance are low.

NOTE: What follows is a more visual representation of underfitting and overfitting, involving some code and graphs. Even if you’re not too comfortable with the code, don’t worry; the graphs will give you a clear idea of the topic. You’ve already understood overfitting and underfitting!

To understand it better, let’s take a look at an example. We’re going to generate two variables, say P and Q. P will have some random numbers / samples, while Q will come from a cosine function applied to P.
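We won’t dwell on the exact numbers, but here’s a minimal sketch of how such data could be generated (the sample count, noise level and cosine frequency are illustrative assumptions, not values from the original example):

import numpy as np

np.random.seed(0)
n_samples = 30

# p: random sample points in [0, 1], sorted so the curve plots cleanly
p = np.sort(np.random.rand(n_samples))

# q: a cosine function of p, with a little random noise added
q = np.cos(1.5 * np.pi * p) + np.random.randn(n_samples) * 0.1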

Based on the data, let’s create a graph, a simple plot of P against Q. We’re not going to discuss the values of the variables or how we came up with them. We’re going to focus on how the degree of the polynomial affects the fit of our model.

Woah! Big terms! What do these terms mean?

A common practice in machine learning is to create new features by raising existing features to an exponent. For example, if we had one input feature Z, then a “polynomial feature” would be a new column, i.e. a new feature, containing the values Z², the squares of the values in column Z. Why do we do this? It can help improve the performance of our model.

The number of features added to the model can be controlled by the “degree” of the polynomial.
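Here’s a tiny sketch of what a polynomial feature looks like in practice, using scikit-learn’s PolynomialFeatures (the numbers are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

Z = np.array([[1.0], [2.0], [3.0]])  # a single input feature Z

# degree=2 adds a bias column, keeps Z, and appends the new Z**2 column
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(Z))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]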

import matplotlib.pyplot as plt

# Plot the true relationship and the sampled points (p and q as generated above)
plt.plot(p, q, color='c', label="Actual")
plt.scatter(p, q, edgecolor='r', s=20, label="Samples")
plt.xlabel("p")
plt.ylabel("q")
plt.legend(loc="best")
plt.show()

The resultant of this code, which gives us the basic relationship between p and q is the following graph:

Now, it’s time to put these values to our model. We’re going to use “Linear Regression”. Curious? Google it!
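If you want to follow along, here’s a minimal sketch of how that model could be fit (assuming p and q are the arrays generated earlier; p1 and q_pred are the names used in the plotting code below):

import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects 2-D inputs, so reshape p into a column vector
lin_reg = LinearRegression()
lin_reg.fit(p.reshape(-1, 1), q)

# p1 is a fine grid over the same range, used only to draw the fitted line
p1 = np.linspace(p.min(), p.max(), 100)
q_pred = lin_reg.predict(p1.reshape(-1, 1))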

# Plot the fitted model alongside the actual curve and the samples
plt.plot(p, q, color='c', label="Actual")
plt.scatter(p, q, edgecolor='r', s=20, label="Samples")
plt.plot(p1, q_pred, label="Model")  # q_pred is the model's predictions over p1
plt.xlabel("p")
plt.ylabel("q")
plt.legend(loc="best")
plt.show()

We can see that our straight line (the model) is unable to capture the patterns in the data. This is a clear example of underfitting.

So, back to overfitting and underfitting, we’re going to test our model with three polynomial degrees [1, 4, 15] and try to find the best fit. The best fit is basically the sweet spot between overfitting and underfitting.
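One way this comparison could be coded (again just a sketch, reusing the same p and q arrays from before) is with a scikit-learn pipeline that chains polynomial features with linear regression:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

p1 = np.linspace(p.min(), p.max(), 100)

for degree in [1, 4, 15]:
    # Expand p into polynomial features of the given degree, then fit a linear model
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(p.reshape(-1, 1), q)
    plt.plot(p1, model.predict(p1.reshape(-1, 1)), label=f"degree {degree}")

plt.scatter(p, q, edgecolor='r', s=20, label="Samples")
plt.legend(loc="best")
plt.show()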

Now to take a clearer look at how degrees of the polynomial affect our model, check out this graph:

From the above graph comparing degrees, we can observe that a polynomial of degree 1 isn’t enough to capture the relationship; it has high bias and hence is an “underfit”. The graph with degree 4 has the best fit; it approximates the true relationship between the dependent and independent features near perfectly for the test data. For degrees higher than the best fit, the model turns out to be an “overfit”, as we can see with degree 15. This overfit model has learned the noise in the data and has high variance.

Here’s a flowchart that summarises everything discussed above.

We now know what overfitting and underfitting is, but how do we detect it?

Detecting Overfitting and Underfitting:

Can you detect these conditions before testing the model on unseen data? No, it’s almost impossible. We have to test it.

One common and effective method is to split the dataset into two parts (this can be done in various ways — refer to train_test_split of sklearn) namely the training and testing part. If our model does a good job on the training data set and achieves an accuracy of say 90% but only an accuracy of around 50–55% on the test dataset, then the model is likely overfitting.
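As a rough sketch of this check (assuming a generic feature matrix X, target y, and some scikit-learn estimator called model, none of which come from the earlier example):

from sklearn.model_selection import train_test_split

# Hold out 25% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model.fit(X_train, y_train)

# .score() returns accuracy for classifiers and R^2 for regressors
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# A large gap (say ~0.90 on train but ~0.50 on test) suggests overfitting;
# low scores on both suggest underfitting.
print(f"train: {train_score:.2f}, test: {test_score:.2f}")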

An underfit model, on the other hand, isn’t one that does better on the test set than on the training set; it simply performs poorly on both the training and test sets.

If it does well on both datasets, then it is a good fit, or close to the best fit.

How to prevent overfitting and underfitting:

Preventing Overfitting:

1. Using Cross validation:

Cross validation is a very powerful preventive measure against overfitting, with a clever idea. Create multiple mini train-validation splits within the original train data, and use these to tune your model.

We have a standard way of doing so, called “k-fold cross validation”. What we do here is partition the dataset into k subsets, which we call “folds”.

Then we iteratively train our model on k - 1 folds, keeping the remaining fold aside as a validation dataset. The fold that is kept aside is called the “holdout fold”.

Using cross validation, you can tune your hyperparameters using only your original training dataset. This way, you can keep the test set aside as a completely unseen dataset for final evaluation.

K-fold cross validation might not completely remove overfitting, so we can reshuffle the folds every now and then, or use repeated k-fold cross validation.
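Here’s a minimal sketch of 5-fold cross validation with scikit-learn (X and y are again a hypothetical feature matrix and target, not from the earlier example):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# 5 folds: the model is trained on 4 folds and validated on the remaining one,
# repeated 5 times so every fold gets a turn as the holdout fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)         # one validation score per fold
print(scores.mean())  # the average, a more stable estimate of performance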

2. Train with more data:

The algorithm will detect the signal better if we train the model with more data. This doesn’t work every time, though; if we just add more noisy data, this technique won’t help.

3. Remove Features

In algorithms that don’t have built-in feature selection, generalization can be improved manually by removing some irrelevant or unimportant features. Why is removing features helpful? Sometimes the model fails to generalize simply because the data is too complex and the model misses the patterns it should have detected; removing irrelevant features makes those patterns easier to find.
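If you want something more automated than dropping columns by hand, scikit-learn has simple feature selectors. A sketch (SelectKBest and k=5 are illustrative choices, not a recommendation from the article, and X must have at least 5 columns):

from sklearn.feature_selection import SelectKBest, f_regression

# Keep only the 5 features most strongly related to the target
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(X, y)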

4. Regularization:

As we discussed earlier, overfitting can be a consequence of the model being too complex. Can we forcefully make it simpler? Yes! Regularization is the term for a range of techniques used to force your model to be simpler. The techniques used to regularize a model depend on the model itself: you could prune a decision tree, use dropout in a neural network, or add a penalty parameter to the cost function in regression. Confusing terms, eh? Google them!
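As one concrete example of that last option, here’s a sketch of ridge regression, which adds an L2 penalty on the coefficients to the cost function (X_train, y_train and X_test, y_test are the hypothetical split from earlier):

from sklearn.linear_model import Ridge

# Larger alpha means a stronger penalty on the coefficients,
# i.e. a simpler, smoother model that is less likely to overfit
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))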

Preventing Underfitting:

1. Increasing the Complexity of the model:

A probable cause of underfitting is that the model is not complex enough to capture the underlying patterns in the data. Switching from a linear model to a non-linear one, or adding more hidden layers to your existing neural network, are ways to make the model more complex, which in turn can help remove underfitting.
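For instance, here’s a sketch of the neural-network version of this idea (the layer sizes are arbitrary assumptions, purely for illustration):

from sklearn.neural_network import MLPRegressor

# A small network with a single narrow hidden layer ...
small_net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)

# ... versus a more complex one with three wider hidden layers,
# which can help when the smaller model underfits
bigger_net = MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=2000)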

2. Reducing Regularization:

Well, underfitting is somewhat close to the opposite of overfitting. As we read earlier, regularization can help solve the overfitting problem, so reducing it can help solve the underfitting problem. Some algorithms include regularization parameters by default, meant to suppress overfitting; sometimes these also hinder the learning of the algorithm, and decreasing their values usually makes a difference. You might be thinking that if underfitting is almost the opposite of overfitting, then maybe adding more features or data would help solve the problem. No! If the dataset lacks the decisive, important features that could help your model detect patterns, you can multiply the training data set by 2, 5 or even 10, but it will not make your algorithm better. It is a common notion that throwing more data at a problem will solve it, but as stated earlier, it might just jeopardize the project.
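Continuing the ridge regression sketch from above, reducing regularization is often as simple as turning down the penalty strength (alpha=1.0 is Ridge’s default; 0.01 here is just an illustrative smaller value):

from sklearn.linear_model import Ridge

# A smaller alpha weakens the penalty, letting the model fit
# the training data more closely if it was underfitting before
less_regularized = Ridge(alpha=0.01)
less_regularized.fit(X_train, y_train)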

Good fit in a Statistical Model:

After reading all about overfitting, underfitting, and their preventive measures, I’m sure you’ve got a rough idea of what a “good fit” is.

A good fit is the sweet spot between an underfit and an overfit model, which is generally a little difficult to achieve in practice. To find it, we judge the performance of the algorithm over time as it continues to learn the training data.

Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the testing data. But if we train for too long, the error on the training dataset may keep decreasing because the model is overfitting and learning the noise in the training data. At the same time, the error on the test set begins to rise again as the model’s ability to generalize diminishes.

The ideal point is just before the error on the test set starts to increase, when the model performs well on both the training set and the unseen set.

So this was all about overfitting and underfitting. I will be coming up with more such articles in the future.

Till then, follow me for more!
