Evaluating Your Hypothesis and Understanding Bias vs Variance

Ali H Khanafer
Published in Geek Culture · Jun 6, 2021

This is part five of a series I’m working on, in which we’ll discuss and define introductory machine learning algorithms and concepts. At the very end of this article, you’ll find all the previous pieces of the series. I suggest you read them in sequence, since I introduce concepts there that are key to understanding the notions discussed in this article, and I’ll be referring back to them on numerous occasions.

Up to this point, we’ve looked at data preprocessing, as well as three supervised learning algorithms: linear regression, logistic regression, and neural networks. Today we’ll look at how we can evaluate our model, as well as discuss the notion of bias versus variance.

Let’s get right into it.

Evaluating Your Hypothesis

On many occasions in the previous posts, we’ve used terms such as “good fit” or “properly trained”. But what does it really mean for a model to be properly trained? Consider the following points, for example:

Figure 1: Points On A Graph

And the following lines used to fit these points:

Figure 2: Overfitted Model
Figure 3: Right Model
Figure 4: Underfitted Model

Which model do you think is the right one to go with? The answer is the model drawn in figure 3. To understand this, assume now that we add a new point. This point is represented in red in the following figure:

Figure 5: Same Graph with Random Point Inserted

What graph will yield the smallest error after the addition of this point? No need for any equations: visually, we can see that the distance from the point to the line is smallest when using the line shown in figure 3 (blue line):

Figure 6: All Graphs Compared

The concept shown here is that of overfitting, underfitting, and generalizing. In figure 2, our model is overfitted: it fits our training points perfectly, leaving us with zero error produced by the cost function. Some might think that this is a good thing. We saw, however, in figure 6 that this type of model doesn’t work with future data. After all, the whole point of training our model is so that we can then use it to predict the output of new, never before seen, data. In figure 4, our model is underfitted: the line does a very poor job of describing our data points, leaving us with a large error calculated by the cost function. The goal is to have a line that describes the general behavior of our training data. Figure 3 accomplishes just that: it describes the behavior of our training data well enough, while still adapting to new information.
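To make this concrete, here is a minimal sketch using NumPy, with made-up points and a made-up held-out point; the degrees and values are only illustrative assumptions, not the exact curves from the figures.

```python
import numpy as np

# Made-up training points, roughly following a line (illustrative only).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.2])

# A new, never-before-seen point, playing the role of the red point in figure 5.
x_new, y_new = 2.5, 2.4

for degree in (1, 3, 5):  # degree 5 interpolates all six points, i.e. it overfits
    coeffs = np.polyfit(x_train, y_train, degree)             # least-squares polynomial fit
    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_error = float((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree={degree}: training error={train_error:.4f}, new-point error={new_error:.4f}")
```

The degree-5 fit drives the training error to essentially zero; whether that translates into a small error on the new point is a separate question, which is precisely the comparison figure 6 makes visually.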

Visually, this makes sense. But how do we perform such an analysis on more complex problems? Problems where, unlike the example we just saw, our data points are described by more than two features, or our hypothesis is of a higher order than two.

Training Set vs Test Set vs Validation Set

You have the following functions:

Figure 7: Three Different Possible Functions

and you want to decide which h is the best to use to fit your data points. We’ve seen in the past how we use our data to train these models. Training on the data will give us the Theta vector that minimizes our cost function. Something we haven’t mentioned, however, is how to check that our trained model will work on new data.

We generally don’t need the entire dataset to train our model. Most of the time, we split our dataset into three different parts:

  1. Training Set: This data is used for exactly what we’ve been doing this entire time — Train our model so that its Theta vector minimizes the cost function. We normally use 60% of our dataset to train our model.
  2. Validation Set: This set is used to tune and decide on the parameters to be used by our model, often called hyperparameters. Don’t confuse them with our Theta vector. They’re not the same. The hyperparameters we’re referring to here are any inputs to your model that change its behavior. An example is the learning rate alpha we saw for gradient descent, or the regularization term lambda we saw when working with neural networks. Both can be tuned to change the behavior of our model. We normally use 20% of our dataset as validation data.
  3. Testing Set: This is the set used to test how well our model performs on new, never before seen data. We normally use 20% of our dataset as test data. A minimal splitting sketch follows this list.
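As a rough illustration of the split, here is a minimal sketch that applies scikit-learn’s train_test_split twice; the array contents are placeholders, and the 60/20/20 proportions are simply the rule of thumb mentioned above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 examples with 3 features each (values are arbitrary).
X = np.random.rand(100, 3)
y = np.random.rand(100)

# First carve off 60% for training, then split the remaining 40%
# evenly into a validation set (20%) and a test set (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```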

A lot of the time, people will use the terms “validation” and “test” interchangeably. It should be made clear that the test and validation sets are not the same. The validation set should not be used to test your model, otherwise you’ll be introducing bias, which we’ll see later.

Now that we know the difference between the different types of datasets, here’s the general framework for choosing between different models, as in figure 7 (a short sketch of the procedure follows the list):

  1. Train your models using your training set
  2. Cross-validate using the validation set, i.e. use your validation set to calculate the error produced by each of your trained models. The model that outputs the smallest error is the one you select for testing
  3. Test the model with the test set to see how well it performs with new data
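Here is a short sketch of those three steps for polynomial hypotheses like the ones in figure 7, using NumPy’s polyfit as the trainer and mean squared error as the cost. The data generator and the candidate degrees are assumptions made purely for illustration.

```python
import numpy as np

def mse(coeffs, x, y):
    """Mean squared error of a polynomial hypothesis on a dataset."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Placeholder data standing in for the 60/20/20 split described above.
rng = np.random.default_rng(0)
def make_set(n):
    x = rng.uniform(0, 5, n)
    return x, 1.5 * x + 2 + rng.normal(0, 0.4, n)   # roughly linear ground truth

x_train, y_train = make_set(60)
x_val, y_val = make_set(20)
x_test, y_test = make_set(20)

# Step 1: train one candidate hypothesis per polynomial degree.
candidates = {d: np.polyfit(x_train, y_train, d) for d in (1, 2, 3)}

# Step 2: cross-validate, keeping the degree with the smallest validation error.
best_degree = min(candidates, key=lambda d: mse(candidates[d], x_val, y_val))

# Step 3: report how the chosen model performs on never-before-seen test data.
print("chosen degree:", best_degree)
print("test error:", mse(candidates[best_degree], x_test, y_test))
```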

Let’s say now, that you’ve gone through the first two steps highlighted above. The time has come for you to test your model, but to your surprise, it doesn’t perform well on your test set. What options do you have? Here are a few:

  • Getting more training examples
  • Trying smaller sets of features
  • Trying additional features
  • Trying polynomial features
  • Increasing or decreasing λ (regularization term)

Arbitrarily deciding which of these steps to take can be very time-consuming. With an understanding of bias vs variance, we can get a better idea as to why our model isn’t performing well enough and, as a result, select the right step to take next.

Bias vs Variance

When our graph is underfitted, we say we have a high bias. With a high bias, the value of our cost function J will be high for all our datasets, be it training, validation, or testing. Figure 4 is an example of a graph with a high bias.

When our graph is overfitted, we say we have a high variance. With a high variance, the model will perform well on our training data but poorly on our validation and testing data. This should be pretty intuitive if you understood the notion of overfitting. Since our model fits our training points perfectly, there is no reason for our cost function J to output a high value on the training data. For our validation and testing data, however, the overfitted model won’t be generalized enough, and so our cost function will output high values, much higher than it did for our training set. Figure 2 is an example of a graph with high variance. Bias and variance are controlled by the complexity of the function we’re using: with functions of very high order, we run the risk of overfitting; conversely, with functions of very low order, we run the risk of underfitting.
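One way to put this into practice is a simple rule of thumb that compares the training error to the validation error. The sketch below does exactly that; the thresholds are arbitrary assumptions for illustration, not standard values.

```python
def diagnose(train_error, val_error, acceptable_error=0.1, gap_ratio=3.0):
    """Rough read of the bias/variance situation from two error values.

    acceptable_error and gap_ratio are illustrative assumptions; pick
    values that make sense for your own problem and cost function.
    """
    if train_error > acceptable_error:
        return "high bias (underfitting): even the training error is large"
    if val_error > gap_ratio * train_error:
        return "high variance (overfitting): validation error far exceeds training error"
    return "reasonable fit: both errors are low and close together"

print(diagnose(train_error=0.50, val_error=0.55))   # underfitted, like figure 4
print(diagnose(train_error=0.01, val_error=0.40))   # overfitted, like figure 2
print(diagnose(train_error=0.05, val_error=0.07))   # generalizes, like figure 3
```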

By understanding the impact of certain model characteristics on the bias-variance trade-off, we can get a better idea of what steps to take in order to get better results from our model. All the graphs we’re about to analyze were taken from Andrew Ng’s introduction to machine learning course on Coursera.

Polynomial Degree

First, let’s compare the error we get to our function’s polynomial degree, for the training and validation sets:

Figure 8: Cost vs Polynomial Degree

For lower-order functions, the graph shows very high values for the error, no matter the dataset. We’ve mentioned before that, when this is the case, we have an underfitted curve. As the degree of the function increases, the error for the training set diminishes. Makes sense, right? With a more complex curve, our model can fit the training points more closely. But once the degree gets high enough for the model to fit the training points almost perfectly, overfitting occurs, as shown in the graph. As a result of our model being overfitted, it won’t react properly to new data, and the error on our validation set will increase for very high function orders. We can draw two conclusions (the sketch after them sweeps the degree to reproduce this behavior):

  1. If your model presents high bias, one solution is to increase the order of your function
  2. If your model presents high variance, one solution is to diminish the order of your function
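Here is that sketch: it sweeps the polynomial degree on a small, made-up dataset and prints the training and validation errors side by side. The sine-shaped ground truth and the sizes are assumptions chosen only so that both underfitting and overfitting have a chance to show up.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_set(n):
    x = rng.uniform(0, 5, n)
    return x, np.sin(x) + rng.normal(0, 0.2, n)     # non-linear ground truth plus noise

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_set(12)     # deliberately small so high degrees overfit
x_val, y_val = make_set(30)

# Low degrees should show high error on both sets (high bias); very high
# degrees should show a tiny training error but a growing validation error (high variance).
for degree in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree={degree:2d}  train={mse(coeffs, x_train, y_train):.4f}  "
          f"val={mse(coeffs, x_val, y_val):.4f}")
```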

Regularization Parameter

In the last article, we very quickly described the need for a regularization parameter in our cost function, to avoid overfitting. This parameter works by adding a penalty on the size of our Thetas. The gradient descent algorithm will work to diminish this added penalty, leaving us with Theta values small enough to prevent overfitting. One caveat to this approach is that, if our regularization parameter is too large, gradient descent will shrink the parameters too aggressively, resulting in an underfitted curve. If lambda is too small, we’ll fail to reduce the weight of large Thetas, resulting in an overfitted curve. We need to find just the right lambda. We can compare the error we get to the regularization parameter, for the training and validation sets:

Figure 9: Cost vs Regularization Parameter

The information presented by this graph aligns exactly with what we just described about the regularization parameter. For too large of a lambda, our model will be underfitted, causing high bias and a high error on every set. For too small of a lambda, our model will be overfitted, in turn causing high variance: a small training error but a large validation error. Meaning (the sketch after the next two points sweeps lambda to show both regimes):

  1. If our model is highly biased, we can try diminishing our regularization parameter
  2. If our model presents high variance, we can try to increase our regularization parameter
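That sketch uses regularized linear regression on polynomial features, solved in closed form as (XᵀX + λI)⁻¹Xᵀy. Everything here is an assumption for illustration: the sine ground truth, the degree-8 features, and the fact that, to keep the code short, the bias term is regularized along with the rest, which a careful implementation usually avoids.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, degree=8):
    """Map a 1-D input to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def make_set(n):
    x = rng.uniform(0, 5, n)
    return x, np.sin(x) + rng.normal(0, 0.2, n)

x_train, y_train = make_set(20)
x_val, y_val = make_set(20)
X_train, X_val = poly_features(x_train), poly_features(x_val)

for lam in (0.0, 0.01, 0.1, 1.0, 10.0, 100.0):
    # Closed-form regularized least squares: theta = (X^T X + lambda*I)^-1 X^T y.
    theta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                            X_train.T @ y_train)
    train_err = np.mean((X_train @ theta - y_train) ** 2)
    val_err = np.mean((X_val @ theta - y_val) ** 2)
    print(f"lambda={lam:7.2f}  train={train_err:.4f}  val={val_err:.4f}")
```

Look for the pattern figure 9 describes: both errors climbing at large lambda, and a gap opening between training and validation error as lambda shrinks toward zero.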

Learning Curves

One final thing we can look at when dealing with a poor model is the size of our dataset. How can we conclude whether or not our model is performing poorly because it wasn’t provided enough training data? Let us look at two possible scenarios: one where the model has high bias, and the other where the model has high variance.

The following learning curve presents the impact of the training set’s size on our model’s error, in the case where our model has a high bias:

Figure 10: Cost vs Training Set Size For High Bias Model

Figure 10 shows clearly that increasing the training set’s size will in no way help with our model’s performance. This should make sense by now. If our model has a high bias, then we know that it will perform poorly on both the training and the test/validation sets. Increasing the training set’s size will not change that: although the error on the test set decreases at first, it flattens out well above our desired performance.

The next learning curve presents the impact of the training set’s size on our model’s error, in the case where our model has high variance:

Figure 11: Cost vs Training Set Size For High Variance Model

In this case, increasing the training set’s size will improve our performance. Although our training error increases, our test error will decrease, leaving us with a model that generalizes better. We conclude (the sketch after these two points reproduces both learning curves):

  1. For models with a high bias, adding training samples will not improve our performance
  2. For models with high variance, adding training samples will improve our performance
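The sketch below trains on increasingly large slices of a made-up training set and prints the training error (measured on the slice used for training) next to the validation error. The degree-1 hypothesis against a sine-shaped ground truth is an assumption chosen to produce the high-bias pattern of figure 10; swapping in a much higher degree, say 12, tends to give the high-variance pattern of figure 11.

```python
import numpy as np

rng = np.random.default_rng(3)
x_all = rng.uniform(0, 5, 200)
y_all = np.sin(x_all) + rng.normal(0, 0.2, 200)
x_train, y_train = x_all[:150], y_all[:150]
x_val, y_val = x_all[150:], y_all[150:]

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

degree = 1   # a straight line cannot capture the sine: high bias (try 12 for high variance)
for m in (5, 10, 25, 50, 100, 150):
    coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
    print(f"m={m:3d}  train={mse(coeffs, x_train[:m], y_train[:m]):.4f}  "
          f"val={mse(coeffs, x_val, y_val):.4f}")
```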

Conclusion

In this article, we saw how one can evaluate the performance of a model. We looked into the different datasets used to make sure we’re selecting the best possible model: one that doesn’t overfit nor underfit our data, but instead generalizes in a way that it can perform well on new, never before seen, information. We also looked at the notions of bias and variance, and how one can tune hyperparameters to strike the best bias-variance trade-off. Finally, we discussed how we can use all these concepts to decide on the best steps to take when dealing with a poorly performing model.

In the upcoming article, we will conclude the section on supervised learning, by introducing support vector machines. Until then, I leave you with the following points to ponder upon:

  • We discussed how using the validation set can introduce bias to your model. Why?
  • Study the two learning curves we presented thoroughly. Do they make sense? Why, for example, does adding more training data increase the error of our training set when we have a high bias? Try fitting a line through a dataset with one training example, then add another training example, and another, until you notice the point we’re trying to make.
  • Does our dataset always need to be split in the 60%-20%-20% way we showed? When should we add/remove from the different splits?

Past Articles

  1. Part One: Data Pre-Processing
  2. Part Two: Linear Regression Using Gradient Descent: Intuition and Implementation
  3. Part Three: Logistic Regression Using Gradient Descent: Intuition and Implementation
  4. Part Four — 1: Neural Networks Part 1: Terminology, Motivation, and Intuition
  5. Part Four — 2: Neural Networks Part 2: Backpropagation and Gradient Checking

References

  1. Andrew Ng’s Machine Learning Coursera Course
  2. Machine Learning Mastery: What is the Difference Between Test and Validation Datasets?
