Overfitting review and the Validation intervention
The main challenge when designing a ML algorithm is to avoid overfitting, a phenomenon that causes poor performance. With validation, one can keep track of the quality of the model, notice when it overfits and avoid the resulting bad solution.
This article belongs to a series devoted to figuring out a way to tackle a Machine Learning Competition: the first set of articles deals with Data Preprocessing and Features Engineering (if you have missed it, check out the first story here).
In this article we review the basic but essential concept of overfitting. Then, we will introduce validation and some useful strategies you can use in practice to perform this important step.
Overfitting: your worst enemy
Overfitting means fitting the data more than is warranted.
It is the phenomenon where fitting the observed data well no longer indicates that we will get good performance, and may actually lead to the opposite effect: in fact, many Machine Learning models are able to memorize the training set, which can lead to poor performance on unseen examples.
In those cases, we obtain a model that perfectly approximates our data, but doesn’t have the ability to generalize to new examples.
The most important part when you design a model is recognizing overfitting and understanding the underlying causes, in order to combat it.
Underfitting is the opposite problem: it occurs when the model doesn’t fit the data well enough, leading to a bad approximation in addition to poor generalization.
To choose the best model, we want to avoid underfitting on the one side and overfitting on the other.
Overfitting causes
Let’s start by identifying the causes of overfitting:
- Dataset size is the first culprit of overfitting: the fewer examples available for training, the more models can fit them. Consider an extreme case, a sample with only one training point: any model will be able to “explain” it! As the training set size increases, fewer models are able to explain the data, and overfitting decreases.
- Stochastic noise is due to measurement errors or random fluctuations in your data. It is stochastic in the sense that it is random: each time we generate the sample, the stochastic noise is different. We cannot directly intervene to limit this kind of noise, since it doesn’t depend on the model and therefore on our choices.
- Deterministic noise appears when the phenomenon being modeled is too complex for our model to capture. In contrast with stochastic noise, deterministic noise is fixed for a given sample, but in this case too we can’t capture or model it.
One might think to counteract noise (stochastic and deterministic) by using a more complex model. But is that really the case? No!
If the level of noise is high and the dataset is small, using a complex model makes things worse: a complex model uses its additional degrees of freedom to fit the noise, which results in overfitting.
Let’s try to understand this concept with a very simple example: a one-dimensional regression problem with polynomial features. (For further details about the code, take a look at the notebook here.)
Imagine we have noisy samples (the blue points in the plot) from a real function we want to approximate, in this case the sine function (the orange curve).
Then, we make use of three regression models with polynomial features of different degrees: 1, 3 and 50.
In the picture above we can visualize the results of our models’ predictions.
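If you want to reproduce a similar comparison yourself, here is a minimal sketch with scikit-learn (the exact code is in the linked notebook; the data generation, noise level and plotting details below are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 20))           # 20 sample points
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)   # noisy sine (use scale=0 for the noiseless case)

x_plot = np.linspace(0, 2 * np.pi, 200)
for degree in (1, 3, 50):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x[:, None], y)                          # fit a polynomial of the given degree
    plt.plot(x_plot, model.predict(x_plot[:, None]), label=f"degree {degree}")

plt.scatter(x, y, label="noisy samples")
plt.plot(x_plot, np.sin(x_plot), label="true sine")
plt.legend()
plt.show()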
What we see is that a model that is too simple leads to underfitting: it can’t capture the underlying relationships and we will get poor results.
A complex model, instead, causes overfitting: the fitted curve passes exactly through each data point, but it oscillates wildly and gives a very poor representation of our target function!
The compromise is the simple model of degree 3: it doesn’t fit the target function perfectly, but it performs significantly better than the complex model of degree 50, which fits the noise and ends up far from our target function.
But does this mean that a simple model is always preferable to a complex one? Clearly not: if you are dealing with a large dataset without noise, there is no reason to avoid a complex model.
The same procedure, applied to a dataset without noise, leads to different results…
As you can see, in this case a complex model performs better than a simple one!
How to combat overfitting?
There are two ways to cure overfitting, often used together: regularization and validation.
- With regularization one can put constraints on the learning process to improve the generalization performance of the model, and therefore its quality. The need for and the amount of regularization depend on the quality and the quantity of your data: too much regularization can lead to underfitting, because the model is not free enough to fit the data well. In most of the models you will use when diving into scikit-learn, you will find the regularization term as a hyperparameter.
- Validation allows you to keep an eye on the model performance, in order to prevent overfitting. This will be the central theme of the next section.
Notice that the setting of optimal regularization parameters is done through validation!
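For instance, in scikit-learn the regularization strength typically appears directly as an estimator hyperparameter (the values below are just the library defaults, shown for illustration):

from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.svm import SVC

Ridge(alpha=1.0)           # larger alpha = stronger regularization
LogisticRegression(C=1.0)  # smaller C = stronger regularization
SVC(C=1.0)                 # smaller C = stronger regularization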
Validation
To combat overfitting we need to correctly assess the quality of our model, i.e. we want to check whether the model gives the expected results on unseen data.
The problem is that the performance of our model can differ between the training data (which is available) and the test data (which often is not), because the learning algorithm could simply memorize all the examples from the training data and be completely useless on unknown data.
So, usually, we divide the data we have into two parts, a training part and a validation part. We fit our model on the training part and check its quality on the validation part.
Only in this way can we select the model that is expected to achieve the best quality on the test data, and adjust its hyperparameters (regularization is often represented by one of these parameters): this could mean the choice between a linear or non-linear model, the choice of the degree in a polynomial model, or the choice of the value of a regularization parameter.
This step is called model selection and it is of paramount importance when you design a ML model: the goal is to select the best model and the optimal parameters for that model.
Train, Validation, Test sets
In this section we will show how validation works in a practical and interactive way.
Let’s introduce a useful dataset from sklearn.datasets: the breast cancer wisconsin dataset, a classic and very easy binary classification dataset. It is a collection of features computed from medical images of breast masses, together with the corresponding diagnosis. The task consists in building a classifier that is able to recognize whether a patient has a benign or a malignant breast tumor.
from sklearn import datasets

dataset = datasets.load_breast_cancer()  # load dataset
X, y = dataset.data, dataset.target      # create input and target
Let’s create our training and test set, splitting the dataset into two random parts:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=50)
Remember that the training set is used to learn the model, while the test set is used to assess the quality of your model; in many ML competitions it is hidden from participants.
Now, let’s try to design our ML model: to perform the binary classification task I have chosen a Support Vector Machine. Is it a good choice?
from sklearn.svm import SVC

SV = SVC(random_state=0).fit(Xtrain, ytrain)  # train the model
SV.score(Xtrain, ytrain)  # calculate the score
>>> 1.0  # max score!
Our model has learned the training examples perfectly! But is that really the case? Have we settled on a perfect classification scheme? No!
To accurately measure the performance, we need to use a separate set of examples, which the model has not yet seen. Of course, this is the role of the test set…
SV.score(Xtest, ytest)
>>> 0.6293706293706294  # low score
The model is not able to generalize to new examples: this is the well known OVERFITTING!
In this case we learn two important things.
First, the fact that the model gives accurate results on the training examples (score 1.0) only means that it has completely learned the training examples, i.e. it fits our data perfectly, and that is fine.
But it doesn’t mean it will work equally well on new examples: it doesn’t guarantee a good generalization (score 0.63). It’s our job to find the right balance between approximation and generalization!
Second, the main rule you should know: never use training examples to measure the quality of a model! It gives a misleading estimate of the quality of your work.
But… that’s what validation is for!
With validation, you can immediately notice whether your model is overfitting, and correct it in time.
The basic idea is to split the labelled data into two parts: one is used for the training itself, the other for validation. The validation set tries to mimic the test set, but this time you can use the resulting information to correct the model (generally you can’t do this with the real test set, which is often unknown, especially in ML competitions).
An important thing to keep in mind when splitting your dataset into training and validation parts is that samples in the two parts must not overlap: if we evaluate our model on an example it has already seen, the measured performance increases, but we know very well that this is not a good thing. This can happen when there are repeated samples in the data: be sure to exclude duplicates before starting the validation steps!
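As a quick safeguard, you could drop duplicated rows before splitting; here is a minimal sketch with NumPy (not part of the original notebook, just an illustration):

import numpy as np

# keep only unique rows, so the same sample cannot end up
# in both the training and the validation part
_, unique_idx = np.unique(X, axis=0, return_index=True)
X_unique, y_unique = X[unique_idx], y[unique_idx]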
And now, the time has come to create our validation set!
We apply the train_test_split function once again, but this time to divide the training set into training and validation parts:
train_x, valid_x, train_y, valid_y = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=20)
Now, we can train our model on the ‘new’ training set and test its performance on the validation set:
SV=SVC(random_state=0).fit(train_x, train_y) #train
SV.score(valid_x, valid_y)  # calculate the score
>>> 0.627906976744186
We notice in time that this model is not very good, and we can avoid using it!
Let’s try with other models… (this is the model selection step)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

RF = RandomForestClassifier(random_state=0).fit(train_x, train_y)
RF.score(valid_x, valid_y)
>>> 0.9767441860465116

clf = LogisticRegression(random_state=0).fit(train_x, train_y)
clf.score(valid_x, valid_y)
>>> 0.9534883720930233

DT = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
DT.score(valid_x, valid_y)
>>> 0.9418604651162791

NB = GaussianNB().fit(train_x, train_y)
NB.score(valid_x, valid_y)
>>> 0.9418604651162791

MLP = MLPClassifier(random_state=0).fit(train_x, train_y)
MLP.score(valid_x, valid_y)
>>> 0.9302325581395349
And the winner is… the Random Forest Classifier! We can now safely use this classifier to perform our binary classification on the breast cancer dataset.
Notice that these validation schemes are meant to estimate the quality of the model. When you’ve found the right model and hyperparameters and want to get test predictions, don’t forget to retrain your model using all the training data.
# retrain the model with the entire, original training set
final_model = RandomForestClassifier(random_state=0).fit(Xtrain, ytrain)
final_model.score(Xtest, ytest)  # the final evaluation
>>> 0.9300699300699301
We have (almost) corrected overfitting by using a different model.
(Another crucial step in model selection is hyperparameter tuning, which consists in adjusting the intrinsic parameters that characterize each model. With it, we can combat overfitting even more effectively! You will see this theme in detail in a future article.)
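As a small preview, here is a minimal sketch of what such tuning could look like with the validation set we already have (n_estimators and the candidate values are illustrative choices, not taken from the notebook):

from sklearn.ensemble import RandomForestClassifier

best_score, best_n = 0, None
for n in (10, 50, 100, 200):                   # candidate values to try
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(train_x, train_y)
    score = model.score(valid_x, valid_y)      # evaluate each candidate on the validation set
    if score > best_score:
        best_score, best_n = score, n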
This validation strategy is called Hold Out and it is the first method you can use to perform the validation step. It is useful when you’re dealing with big datasets and in situations in which different splits lead to similar validation scores. But what happens if the dataset size is limited? Or when different ways of splitting the dataset suggest different models?
Check out the next article to discover more validation strategies and what to do in such cases!
You can find my notebook with the entire code here.
References:
- ‘Learning From Data: A Short Course’, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
- Underfitting vs. Overfitting