Part II. Model Evaluation: Cross Validation, Bias and Variance Tradeoff and How to Diagnose Overfitting

Ani Karenovna K
10 min read · Apr 14, 2019


In my previous post, I laid out a conceptual framework for building and evaluating a simple predictive model. I discussed the need for doing a train-test split and retrieved the training and testing scores using the R-squared value as an evaluation metric.

But a simple train-test split has some limitations which prevent us from getting a reliable understanding of how well the model will generalize to unseen data. In this post I will discuss these limitations and introduce a remedy: cross validation. I will also discuss the bias and variance tradeoff and dive deeper into what it means to evaluate a model.

Limitations of a Simple Train-test Split

Imagine a scenario in which a random train-test split just happens to place most of the outliers in the train set. In this case, our model will train on data full of outliers, meaning it will train on a sample that does not represent the true population well. When we run this model on our test set, or deploy it on any out of sample data, its predictions will likely be very poor. More generally, the scores we use to evaluate the model can be highly variable depending on which observations end up in the train set and which end up in the test set. Consider the graph below:

This figure is borrowed from An Introduction to Statistical Learning with Applications in R written by G. James, D. Witten, T. Hastie and R. Tibshirani (figure 5.2, pg. 178).

The x-axis represents the degree of the polynomial included in the model (the equation for a simple linear regression is of degree 1, but if we include quadratic terms we have an equation of degree 2, and so on). The y-axis represents the mean squared error (MSE). Note: in this case, the evaluation metric is the MSE and not the R-squared value. The graph depicts ten MSE curves, one for each of 10 random train-test splits, computed over models of increasing polynomial degree. While the curves all share roughly the same shape, indicating that the MSE is lowest for a model which includes quadratic terms, their heights vary considerably. This confirms that different train-test splits will result in different MSEs. So when we do a single train-test split, we get only one score among the many we could get from other random train-test splits, which doesn't tell us much about the overall model performance.
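
To make this concrete, here is a minimal sketch that fits polynomial regressions of a few degrees on several random train-test splits of a synthetic dataset; the data, seeds and degrees are purely illustrative, but the test MSE changes noticeably from split to split.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

for split_seed in range(5):  # five different random train-test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=split_seed
    )
    for degree in (1, 2, 5):  # a few polynomial degrees
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        print(f"split {split_seed}, degree {degree}: test MSE = {mse:.2f}")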

Now consider another point. When we do a train-test split, depending on the size of the data, we often have to reserve about 30–40% of the data for the test set in order to have enough observations to evaluate predictions on. This is a significantly large subset of the data that is not being used to train the model, which is not ideal: training a model on fewer observations generally means the model is not going to be very good.

The limitations of a single train-test split:

1. The scores for evaluating the model are highly variable, depending on which observations end up in the train set and which end up in the test set of a single train-test split.

2. Reserving much of the data for a single test set reduces the number of observations we can use to train the model.

The Remedy — k-Fold Cross Validation

k-fold cross validation is a resampling method that is essentially a train-test split on steroids: we randomly divide the data into k groups (folds) of equal size. The first fold becomes the test set and the remaining k-1 folds combined become the train set. A model is trained on the train set and evaluated on the test set with an R-squared score (assuming a linear regression model), much like in a simple train-test split. However, this procedure is repeated k times, and each time a different fold serves as the test set. In total, k models are built and evaluated, resulting in k different R-squared scores. The average of these scores is the cross validation score. Here is how to retrieve the cross validation score in scikit-learn:

from sklearn.model_selection import cross_val_score

cv_score = cross_val_score(model, X, y, cv=5).mean()

where model is the instantiated model we want to fit and evaluate, X is the data containing predictor features and y is the target feature we wish to predict.
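
For completeness, here is a self-contained sketch of that call, assuming a linear regression fit on synthetic data; the data and model are illustrative, and any estimator would work the same way.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with three predictor features (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=150)

model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5)  # five R-squared scores, one per fold
print("fold scores:", np.round(scores, 3))
print("cross validation score:", scores.mean())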

The visual example below shows how the data is partitioned 4 times in a 4-fold cross validation, where for any given fold the test set is 25% and the train set is 75% of the data.

In a nutshell, cross validation allows us to use the one dataset available to us as k different samples to train our model on. Every single observation in the dataset ends up being used exactly once in a test set when we cross validate. It's much like repeating the experiment of building a model k times on k different datasets and then averaging the results.
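
A quick way to see this is to print the indices that scikit-learn's KFold produces. This small sketch, with 8 toy observations and 4 folds to mirror the example above, shows that every observation lands in a test set exactly once:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(8).reshape(-1, 1)  # 8 toy observations
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train indices {train_idx}, test indices {test_idx}")
# The four test index sets are disjoint and together cover all 8 observations.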

This is why the cross validation score is a much more reliable metric in helping us understand how the model can be expected to perform on unseen data. We could not draw such expectations from the train score and the test score resulting from a simple train-test split alone.

This figure is borrowed from An Introduction to Statistical Learning with Applications in R written by G. James, D. Witten, T. Hastie and R. Tibshirani (figure 5.4, pg. 180).

The figure above shows 10-fold cross validation run 10 separate times, each time with a different random split of the data into ten folds. Each run produces one cross validation error estimate per model, giving 10 slightly different cross validation error curves across the polynomial degrees. Note: here, too, the evaluation metric is the MSE and not the R-squared value. The curves differ, but not nearly as drastically as they did in the case of 10 different train-test splits: the results from cross validation are far less variable.
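
To see this reduced variability for yourself, one option is to repeat k-fold cross validation with different random shuffles of the data, as in this sketch on synthetic data (the data and seeds are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()
for run in range(10):
    cv = KFold(n_splits=10, shuffle=True, random_state=run)  # a new random partition each run
    score = cross_val_score(model, X, y, cv=cv).mean()
    print(f"run {run}: cross validation score = {score:.3f}")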

Additionally, cross validation allows us to reserve fewer observations for the test set in any given fold, and therefore more observations are used in training the model.

The cross validation score gives us a more reliable and general insight into how the model is expected to perform on out of sample data.

Variance

So far the term ‘variability’ or ‘variance’ has come up a couple of times. This is actually a very important term in model evaluation and is closely tied to the notion of bias. Have you ever heard the aphorism “All models are wrong but some are useful”? A model is a mathematical function which is a simplified representation of some phenomenon or process that’s captured in our data. A good model is one that has just the right amount of complexity: it is simple enough that it excludes all the unnecessary and irrelevant details we don’t care about, but complex enough to accurately represent the thing we wish to model.

Let’s say we want to build a linear regression model and we have 100 predictor features in our data that we can use to train the model. Some features are good predictors for our target. They provide the signal in the data. Their relationship to the target is the function that we wish to capture with the model.

However, it’s likely that among all those 100 features, many of them are not really useful or relevant at all. They don’t predict the target feature. Such features are just additional information that happens to be in our data and they add noise.

If we were to include all 100 features to train and build a model, the model would use all of them without any discrimination between relevant and irrelevant features. It would not differentiate between signal and noise but rather it would fit itself on everything that’s provided to it in the train set.

When we have too many features, the model is prone to overfitting: the phenomenon in which a predictive model fits itself to noise in the data. Models that overfit are too complex. They tend to perform really well on the training set, because the large number of features provides more information for the model to train on, but they fail to generalize to out of sample data. It’s as if the model is customized for the observations in the train set. Such a model is said to have high variance. If we were to train a high variance model on a different set of observations, it would likely end up being a very different model.

Variance refers to the amount by which the predictions of a model would change if the model was trained using a different train set.
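
As a toy sketch of high variance (the data, degree and seed here are illustrative): an overly flexible model, such as a degree-15 polynomial fit to noisy data that carries only a simple linear signal, will typically score much better on its own train set than on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)

print("train R-squared:", overfit.score(X_train, y_train))  # typically high
print("test R-squared:", overfit.score(X_test, y_test))     # typically much lower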

Bias

On the other hand, if we pass too few predictor features to our model and neglect to include some that are actually relevant for predicting the target, then we risk building a model that’s too simple. Such a model is likely to be insufficiently complex to capture the signal in our data and generally ends up being an inaccurate model. In this case we say that the model has high bias. A biased model is also said to be underfit.

Bias refers to the error that is introduced by approximating a real-life complex problem with a much simpler model that doesn’t contain all the relevant information necessary to represent the underlying signal in the data.
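
And a toy sketch of high bias (again with illustrative synthetic data): a straight-line model fit to clearly quadratic data misses the signal, so it scores poorly on the train and test data alike.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a quadratic signal plus noise (illustrative only).
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
underfit = LinearRegression().fit(X_train, y_train)

print("train R-squared:", underfit.score(X_train, y_train))  # low on both sets:
print("test R-squared:", underfit.score(X_test, y_test))     # the model is too simple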

Here is a useful visualization:


As a general rule, more complex models will have high variance, and models that are too simple will suffer from high bias.

Bias and Variance Tradeoff

Both bias and variance contribute to the errors the model makes on unseen data, and therefore affect its generalizability. Our objective is to minimize both. This poses a challenge because the two are in tension: reducing variance usually means simplifying the model, which increases bias, while reducing bias usually means adding complexity, which increases variance.


The figure above represents the relationship between error and the bias and variance tradeoff. The best model is the one which achieves the lowest error by minimizing bias and variance simultaneously. This is the model with just the right complexity, represented by the dashed vertical line. Such a model will have good accuracy scores on the train data as well as on the test data, which means it will generalize well.
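
One way to reproduce a curve like this is scikit-learn's validation_curve, sketched below on synthetic data: as the polynomial degree grows, the train error keeps falling while the cross-validated error bottoms out at an intermediate degree and often creeps back up. The dataset, degree range and seed here are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a nonlinear signal plus noise (illustrative only).
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=60)

degrees = np.arange(1, 13)
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree", param_range=degrees,
    cv=5, scoring="neg_mean_squared_error",
)

for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree {d:2d}: train MSE = {tr:.3f}, cross validation MSE = {va:.3f}")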

How Does This All Relate to Cross Validation?

It turns out that we can use the cross validation score together with the train score and test score obtained from a simple train-test split to diagnose bias and variance issues with a model. Here is a table that summarizes the different scenarios for these scores, assuming accuracy scores as the evaluation metric:

                            Train score    Test score    Cross validation score
High variance (overfit)     high           low           low
High bias (underfit)        low            low           low
Good fit (generalizable)    high           high          high

If we were to choose an evaluation metric for error, rather than accuracy to assess the performance of our model, then the ‘high’ and the ‘low’ in the table would be reversed.

It’s important to remember the evaluation metric we choose: if we evaluate a model on accuracy, then the goal is to maximize it. If we evaluate a model on error, then the goal is to minimize it.

Baseline Accuracy

One other component that should be accounted for in model evaluation is the baseline score. A baseline score is the score we obtain from a null model: a model that always predicts the mean of the target in the case of a regression problem, and the majority class in the case of a classification problem. Such a model has no predictive power. Prior to any model evaluation, a baseline score should always be obtained to establish the lowest bar for our model. The train score, test score and cross validation score must all beat the baseline score in order for us to be confident that the model has any predictive power at all. If these scores are roughly the same as the baseline score, then the model is no better than guessing and we should discard it.
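
scikit-learn ships such null models as DummyRegressor and DummyClassifier. Here is a minimal sketch on synthetic data (the data and names are illustrative):

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Regression baseline: always predicts the mean of y_train (R-squared near 0).
Xr = rng.normal(size=(200, 2))
yr = Xr @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=200)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
baseline_reg = DummyRegressor(strategy="mean").fit(Xr_train, yr_train)
print("regression baseline R-squared:", baseline_reg.score(Xr_test, yr_test))

# Classification baseline: always predicts the majority class of y_train.
Xc = rng.normal(size=(200, 2))
yc = (rng.uniform(size=200) < 0.7).astype(int)  # imbalanced binary labels
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
baseline_clf = DummyClassifier(strategy="most_frequent").fit(Xc_train, yc_train)
print("classification baseline accuracy:", baseline_clf.score(Xc_test, yc_test))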

Let’s Review

A simple train-test split has certain limitations, and we can use cross validation to remedy them. Cross validation gives us a score which more reliably tells us how well we can expect our model to perform on out of sample data. Using the cross validation score in combination with the train score and test score can be very informative for diagnosing bias and variance issues our model may have. A good generalizable model will have high scores in all three cases (cross validation, train and test scores), which is an indication that we have achieved the sweet spot that balances and minimizes both the bias and the variance.


Ani Karenovna K

Data Scientist | Graduate Student in Applied Maths — Demystifying the Math in ML Algorithms https://www.linkedin.com/in/ani-k-karenovna/