#04TheNotSoToughML | “Let’s minimize the error” — but is that enough?

Anushree Chatterjee · Published in Analytics Vidhya · Jul 18, 2021 · 12 min read

“All that glitters is not gold.” — William Shakespeare

What’s with the #(hashtag)?

A while back I started a new series dedicated to just what its name says: a series where I ease out some of the gaps you might have around an algorithm or concept by explaining the intuition behind it, rather than giving out the math straightaway. It's an attempt to make you realize that ML isn't tough. It's mostly intuition, proven algorithmically.

If you’d like to read the other articles in the series, refer to the ones starting with #TheNotSoToughML.

In the last article, we talked about error functions, gradient descent and how these two concepts relate to each other.

We will now extend those concepts to tackle the question:

Q. How do we know that our model works?

For this purpose, we will need to understand two very crucial concepts in ML:

Underfitting and Overfitting.

These concepts come into play when we are working with ML models and suddenly realize that while we were building the model the results seemed to make sense, but once we put it into production, time proved the results quite wrong and the model didn't do a good job of making predictions.

Of course, there are a zillion factors contributing to such “wrong” results but the two important complexities that are quite common in such scenarios are: Underfitting and Overfitting.

In the current article of the series, we will explore these two complexities at length and find ways to address them. While there are many techniques available, we will dive mainly into the following:

  1. Testing and validation of the model
  2. Using a Model Complexity graph
  3. Regularization (next article).

But first,

Underfitting vs Overfitting — What are they?

Though these can be interpreted in multiple ways, I like to think of them as a problem between over-simplification and over-complication.

How?

Well, let's take an example.

A situation of over-simplification.

Say, you have a task at hand to fight Godzilla. What if you go to the battleground just with a fly swatter? This is an example of over-simplification.

The approach won’t go well for us, because we underestimated the problem and came unprepared. This is underfitting: When our dataset is complex, and we come to model it equipped with nothing but a very simple model. The model will simply not be able to capture the complexities of the dataset.

Next, let's look at another example.

A situation of over-complication.

In contrast, if our task is to kill a small fly, and we get a bazooka to do the job, that is an example of an over-complication. Yes, we may kill the fly, but we’ll also destroy everything at hand and put ourselves at risk. We overestimated the problem and our solution wasn’t good. This is overfitting: When our data is simple, but we try to fit it with a model that is too complex. The model will be able to fit our data, but it’ll actually memorize it instead of learning it.

Yes, overfitting doesn't seem like as serious an issue as underfitting, but the biggest roadblock we hit with models that overfit appears when we come across unseen data. The predictions, most likely, will be horrible! We will get a glimpse of this later in the article.

Every machine learning model has hyperparameters, which are the knobs that we twist and turn before training the model. Setting the right hyperparameters for our model is of extreme importance. If we set some of them wrong, we are prone to underfit or overfit.

Let’s look at some data now.

Understanding underfitting-overfitting using the example of Polynomial Regression

In all of the previous articles, we learned how to find the best-fitting line for our data, assuming our data closely resembles a line. But what happens if it doesn't? In that case, a powerful extension of linear regression called polynomial regression helps us deal with cases in which the data is more complex.

Picture Credits: Grokking Machine Learning

The linear regression models we've been talking about till now are all polynomials of degree 1.

We define the degree of the polynomial as the exponent of the highest power in the expression of the polynomial. For example, the polynomial y = 2x³ + 8x² − 40 has degree 3, since 3 is the highest exponent that the variable x is raised to. Notice that in the example above, the polynomials have degree 0, 1, 2, and 3. A polynomial of degree 0 is always a constant, and a polynomial of degree 1 is a linear equation.
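As a small illustration (not from the book, just a quick check with numpy), here is how we could write down and evaluate that example polynomial:

```python
import numpy as np

# Coefficients of y = 2x^3 + 8x^2 - 40, from the highest power down to the constant term
p = np.array([2, 8, 0, -40])

print(np.polyval(p, 2))  # 2*2**3 + 8*2**2 + 0 - 40 = 16 + 32 - 40 = 8
```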

The graph of a polynomial looks a lot like a curve that oscillates several times. The number of times it oscillates is related to the degree of the polynomial. If a polynomial has degree d, then the graph of that polynomial is a curve that oscillates at most d-1 times (for d>1).

Because of this feature, polynomial regression is the best algorithm for seeing how under- and overfitting affect how a model learns from the data and, ultimately, affect our predictions too. We will do this by tuning the most crucial hyperparameter in a polynomial regression: its degree.

Say our data looks like this —

As humans, it's easy for us to figure out, just from the visual above, that this looks like a parabola (kind of like a sad face). For a computer, however, identifying it as a parabola won't be as straightforward.

Let’s say the computer tries different polynomial degrees to fit this data in the following way —

Notice that Model 1 is too simple, as it is a line trying to fit a quadratic dataset. There is no way we’ll find a good line to fit this dataset, because the dataset simply does not look like a line. Therefore, model 1 is a clear example of underfitting.

Model 2, in contrast, fits the data pretty well. This model neither overfits nor underfits.

Model 3 fits the data extremely well, but it completely misses the point. The data is meant to look like a parabola with a bit of noise, and the model draws a very complicated polynomial of degree 10 that manages to go through each one of the points, but it doesn’t capture the essence of the data. Model 3 is a clear example of overfitting.
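To make the three models concrete, here is a minimal sketch with numpy and matplotlib. The noisy, parabola-shaped data below is synthetic and generated purely for illustration; it is not the book's dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "sad face" data: a downward parabola plus a bit of noise
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 20)
y = -(x ** 2) + 4 + rng.normal(scale=1.0, size=x.shape)

x_grid = np.linspace(-3, 3, 200)
for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, deg=degree)  # least-squares fit of a polynomial of this degree
    plt.plot(x_grid, np.polyval(coeffs, x_grid), label=f"Model of degree {degree}")

plt.scatter(x, y, color="black", label="data")
plt.legend()
plt.show()
```

If you run this, the degree-1 line misses the curvature, the degree-2 curve tracks the parabola, and the degree-10 curve wiggles its way through the individual points.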

Now think of a situation —

From our last article, we know that for the computer to decide which model is best, it just needs to select the one with the least error. But if that were the case, the computer would end up selecting Model 3, since the points are closest to the curve in Model 3 (they are on the curve itself!). Yet we know the best model of the three is Model 2!

What do we do?

We need to tell the computer that the best model is Model 2, and that Model 3 is overfitting.

Solution 1: By Testing & Validating

Testing a model consists of picking a small set of the points in the dataset, and choosing not to use them for training the model, but for testing the model’s performance. This set of points is called the testing set. The remaining set of points (the majority), which we use for training the model, is called the training set. Once we’ve trained the model on the training set, we use the testing set to evaluate the model. In this way we make sure that the model is good at generalizing to unseen data, as opposed to memorizing the training set.
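In code, this split is usually a single call; here is a sketch with scikit-learn's train_test_split, assuming x and y are the arrays from the snippet above:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the points as the testing set; train only on the remaining 80%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
```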

Now let's see what this method looks like with our dataset and our models. Notice that the real problem with Model 3 is not that it doesn't fit the data; it's that it doesn't generalize well to new data. In other words, if you trained Model 3 on that dataset and some new points appeared, would you trust the model to make good predictions on those new points? Probably not, as the model merely memorized the entire dataset without capturing its essence. In this case, the essence of the dataset is that it looks like a downward parabola.

Let’s try visualizing the training and testing sets for the above models.

We can use this table to decide how complex we want our model to be.

Picture Credits: Grokking Machine Learning

The columns represent the three models of degree 1, 2, and 10, along with their training and testing errors.

The solid circles are the training set and the white triangles are the testing set.

The errors at each point can be seen as the vertical lines from the point to the curve. The error of each model is the mean absolute error (just for simplicity) given by the average of these vertical lengths.

Notice that the training error goes down as the complexity of the model increases. However, the testing error goes down and then back up as the complexity increases. From this table, we conclude that out of these three models, the best one is model 2, as it gives us a low testing error.

Thus, the way to tell if a model underfits, overfits, or is good, is to look at the training and testing errors. If both errors are high, then it underfits. If both errors are low, then it is a good model. If the training error is low and the testing error is high, then it overfits.
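A rough sketch of that diagnosis in code, reusing the train/test arrays from the split above and using the mean absolute error, as in the table:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mae = mean_absolute_error(y_train, np.polyval(coeffs, x_train))
    test_mae = mean_absolute_error(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree}: training MAE = {train_mae:.2f}, testing MAE = {test_mae:.2f}")

# High training and testing error        -> underfitting (degree 1)
# Low training error, high testing error -> overfitting  (degree 10)
# Both errors low                        -> a good model (degree 2)
```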

This brings us to a GOLDEN RULE we should never break when working with ML models:

“Thou shalt never use your testing data for training.”

When we split our data into training and testing, we should use the training data for training the model, and for absolutely no reason should we touch the testing data while training the model or making decisions on the model’s hyperparameters. Failure to do so is very likely to result in overfitting, even if it’s not noticeable by a human.

Now that you know the golden rule, and I tell you we already broke it in this article, can you figure out where and how?

This is where another set comes into the picture: the validation set.

The validation set

Recall that we had three polynomial regression models: one of degree 1, one of degree 2, and one of degree 10, and we didn’t know which one to pick. We used our training data to train the three models, and then we used the testing data to decide which model to pick. We are not supposed to use the testing data to train our model or to make any decisions on the model or its hyperparameters. Once we do this, we are potentially overfitting!

What can we do, then? The solution is simple: we split our dataset even further. We introduce a new set, the validation set, which is then used to make decisions about our models. In summary, we break our dataset into the following three sets:

  • Training set: Used for training all our models.
  • Validation set: Used for making decisions on which model to use.
  • Testing set: Used to check how well our model did.

Thus, in our example, we would have two more points used as validation, and looking at the validation error should help us decide that the best model to use is Model 2. The testing set should be used only at the very end, to see how well our model does. If the model is not good, I suggest we throw everything away and start from scratch.

The sizes of these sets depend on many factors, including the size of the dataset itself, but the most common splits are a 60–20–20 split or an 80–10–10 split. In other words, 60% training, 20% validation, 20% testing, or 80% training, 10% validation, 10% testing. These numbers are somewhat arbitrary, but they tend to work well, as they leave most of the data for training while still giving us a test set that is big enough.
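One way to get, say, a 60–20–20 split in practice is to call train_test_split twice; a sketch, again assuming the arrays x and y from before:

```python
from sklearn.model_selection import train_test_split

# First carve off 20% as the testing set, then split the remaining 80% into 75/25,
# which leaves 60% training, 20% validation, and 20% testing overall.
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=0)
```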

Solution 2: By using a Model Complexity Graph

Imagine that we have a different and much more complex dataset, and we are trying to build a polynomial regression model to fit it. We want to decide the degree of our model among the numbers between 0 and 10 (inclusive). As we saw in the previous section, the way to decide which model to use is to pick the one that has the smallest validation error.

The model complexity graph is a very effective tool for determining the ideal complexity of a model in order to avoid both underfitting and overfitting.

In the above graph, the horizontal axis represents the degree of several polynomial regression models, from 0 to 10 (i.e., the complexity of the model). The vertical axis represents the error, which in this case is given by the mean absolute error.

Notice that the training error starts large and decreases as we move to the right. This is because the more complex our model is, the better it is able to fit the training data. The validation error, however, starts large, then decreases, and then increases again. This is because very simple models can’t fit our data well (they underfit), while very complex models fit our training data but not our validation data, as they overfit.

There is a happy point in the middle where our model neither underfits nor overfits, and we can find it using the model complexity graph.

The lowest value for the validation error occurs at degree 4, which means that for this particular dataset, the best fitting model (among the ones we are considering) is a polynomial regression model of degree 4. Looking at the left of the graph we can see that when the degree of the polynomial is very small, both the training and the validation error are large, which implies that the models underfit. Looking at the right of the graph we can see that the training error gets smaller and smaller but the validation error gets larger and larger, which implies that the models overfit. The sweet spot happens around 4, which is the model we pick.

One benefit of the model complexity graph is that no matter how large our dataset is or how many different models we try, it always looks like two curves, one that always goes down (the training error) and one that goes down and then back up (the validation error). Of course, in a large and complex dataset these curves may oscillate and the behavior may be harder to spot. However, the model complexity graph is always a useful tool for data scientists to find a good spot in this graph and decide how complex their models should be in order to avoid both underfitting and overfitting.
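Here is a sketch of how such a graph can be built: sweep the degree from 0 to 10, record the training and validation mean absolute errors for each degree, and plot both curves. This reuses the train/validation arrays from the three-way split shown earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

degrees = range(0, 11)
train_errors, val_errors = [], []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_errors.append(mean_absolute_error(y_train, np.polyval(coeffs, x_train)))
    val_errors.append(mean_absolute_error(y_val, np.polyval(coeffs, x_val)))

plt.plot(degrees, train_errors, marker="o", label="training error")
plt.plot(degrees, val_errors, marker="o", label="validation error")
plt.xlabel("degree of the polynomial (model complexity)")
plt.ylabel("mean absolute error")
plt.legend()
plt.show()

# Pick the degree with the lowest validation error, then report the testing error once at the very end.
```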

What next?

Next would be Solution 3: Regularization

This is yet another solution to prevent overfitting, and a more efficient one, but it needs a full article dedicated to it, so we will take it up in the next one.

This article (and many more to come) has been inspired by the latest book I explored: Grokking Machine Learning by Luis Serrano. The book is yet to be released, but I bought early access to it, and I think it was a wise choice. Trust me, his books and materials definitely deserve to be read by anyone who wants to get the true idea behind algorithms and how models work.

I will be writing a review of the book soon, but if you’re keen on checking out the book already, you can go through its content here.

If you’d like to connect with me on LinkedIn, feel free to send me a note or a request here.

Of course, you can drop your comments here as well! I’d be happy to take any questions too.

Until next time, keep DSing, MLing and AIing. Most importantly, keep learning :)
