Machine Learning — What you need to know about “Model Selection and Evaluation”

Alaa Khaled
Jul 27, 2017 · 6 min read

To illustrate the model selection task, I will consider the problem of learning a one-dimensional regression function. Suppose that the training set can be plotted as in the following figure:

Training data set

We can consider fitting a polynomial to the data. However, we might be uncertain regarding which degree d would give the best results for our data set: A small degree may not fit the data well (i.e., it will have a large approximation error), whereas a high degree may lead to overfitting (i.e., it will have a large estimation error). In the following we depict the result of fitting a polynomial of degrees 2, 3, and 10.
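The shrinking empirical error is easy to see in code. The following is a small sketch using a noisy synthetic data set standing in for the figure above (the data and noise level are illustrative assumptions, not the article's actual data):

```python
# Fit polynomials of degree 2, 3, and 10 to a noisy synthetic training
# set and compare their training (empirical) errors.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

def training_mse(degree):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    preds = np.polyval(coeffs, x)
    return float(np.mean((preds - y) ** 2))  # empirical (training) error

mse = {d: training_mse(d) for d in (2, 3, 10)}
# Because the degree-d polynomials are nested inside the degree-(d+1)
# polynomials, the training error can only decrease as the degree grows.
```

That monotone decrease is exactly why training error alone cannot be used to pick the degree.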

Although it seems that the empirical error decreases as we increase the degree, our intuition may tell us that setting the degree to 3 may be better than setting it to 10.

In model selection tasks, we try to find the right balance between approximation and estimation errors. More generally, if our learning algorithm fails to find a predictor with a small risk, it is important to understand whether we suffer from overfitting or underfitting.

There are two approaches for model selection.

1. The first approach is based on ‘Structural Risk Minimization (SRM)’: it is useful when the learning algorithm depends on a parameter that controls the bias-complexity tradeoff (such as the degree of the fitted polynomial in the preceding example).

2. The second approach relies on the concept of ‘Validation’: the basic idea is to partition the training set into two sets, one used for training each of the candidate models and the other used for deciding which of them yields the best results.

Let’s discuss the two approaches in more detail:

Model Selection Using (SRM) —

In any ML problem we specify a hypothesis class H, which we believe includes a good predictor for the learning task at hand. In the SRM paradigm, we specify a weight function that assigns a weight to each hypothesis class, such that a higher weight reflects a stronger preference for that class. For example, in the polynomial regression problem mentioned above, we can take Hd to be the class of polynomials of degree at most d.
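In spirit, SRM picks the class Hd that minimizes training error plus a complexity penalty. A minimal sketch, assuming the nested classes Hd are polynomials of degree at most d; the sqrt((d + 1) / m) penalty here is an illustrative stand-in for the bound-derived SRM term, not the exact bound:

```python
# Illustrative SRM-style selection over nested polynomial classes H_d:
# pick the degree minimizing empirical error plus a complexity penalty.
import numpy as np

def srm_select(x, y, max_degree=10):
    m = len(x)
    best_d, best_score = 1, float("inf")
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        emp_err = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
        score = emp_err + np.sqrt((d + 1) / m)  # training error + penalty
        if score < best_score:
            best_d, best_score = d, score
    return best_d

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)
chosen = srm_select(x, y)
```

Unlike minimizing training error alone, the penalty stops the selected degree from growing without bound.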

Validation can be divided into the following five approaches:

1. Hold out set:

This is done by sampling an additional set of examples, independent of the training set, and using the empirical error on this validation set as our estimator of the true error.
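Because the validation examples are independent of training, the average loss on them is an unbiased estimate of a predictor's true error. A pure-Python sketch (the threshold classifier and the labelled points are hypothetical):

```python
# Hold-out estimate: average loss of a fixed predictor on an
# independent validation set.
def holdout_error(predictor, validation_set):
    """Empirical 0-1 error of `predictor` on the validation set."""
    mistakes = sum(1 for x, y in validation_set if predictor(x) != y)
    return mistakes / len(validation_set)

# Toy usage: a threshold classifier on hypothetical labelled points.
validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
err = holdout_error(lambda x: int(x > 0.5), validation)
```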

2. Validation for model selection:

Validation can be used for model selection as follows. We first train different algorithms (or the same algorithm with different parameters) on the given training set. For example, in the case of training polynomial regressors, each hr would be the output of polynomial regression of degree r. Then, to choose a single predictor from among the candidates, we sample a fresh validation set and pick the predictor that minimizes the error over it. To illustrate how validation is useful for model selection, the following figure depicts the same training set shown above with polynomials of degree 2, 3, and 10, but this time with an additional validation set (marked as red, unfilled circles). The polynomial of degree 10 has the minimal training error, yet the polynomial of degree 3 has the minimal validation error, and hence it is chosen as the best model.
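The selection step itself is a one-liner once each candidate's validation error has been computed. A sketch with synthetic training and validation samples (illustrative assumptions, not the article's figure data):

```python
# Model selection via a fresh validation set: train one polynomial per
# candidate degree, keep the degree with the lowest validation error.
import numpy as np

rng = np.random.default_rng(1)
x_tr = np.sort(rng.uniform(0, 1, 25))
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(scale=0.2, size=25)
x_va = np.sort(rng.uniform(0, 1, 25))                      # fresh validation sample
y_va = np.sin(2 * np.pi * x_va) + rng.normal(scale=0.2, size=25)

val_err = {}
for d in (2, 3, 10):
    coeffs = np.polyfit(x_tr, y_tr, d)                     # train on training set only
    val_err[d] = float(np.mean((np.polyval(coeffs, x_va) - y_va) ** 2))

best_degree = min(val_err, key=val_err.get)                # smallest validation error wins
```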

3. Model Selection Curve:

For the polynomial fitting problem, the model selection curve shows the training error and validation error as a function of the complexity of the model (here, the polynomial degree).

Model selection curve

As can be seen, the training error decreases as we increase the polynomial degree (the complexity of the model in our case). On the other hand, the validation error first decreases but then starts to increase, which indicates that we are starting to overfit.

4. K-fold Cross Validation:

In some applications, data is scarce and we do not want to “waste” any of it on validation. The k-fold cross validation technique is designed to give an accurate estimate of the true error without wasting too much data.

In k-fold cross validation the original training set is partitioned into k subsets (folds) of size m/k (for simplicity, assume that m/k is an integer). For each fold, the algorithm is trained on the union of the other folds and then the error of its output is estimated using the fold. Finally, the average of all these errors is the estimation of the true error. The special case k = m, where m is the number of examples, is called leave-one-out (LOO).

k-Fold cross validation is often used for model selection (or parameter tuning).

K-fold cross validation pseudo code
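The pseudo code can be rendered as a short pure-Python sketch. Here `train` and `loss` are hypothetical stand-ins for your learning algorithm and loss function; for simplicity it assumes the number of examples is divisible by k:

```python
# k-fold cross validation: average the validation error over k folds.
def k_fold_cv_error(data, k, train, loss):
    """Estimate of the true error; assumes len(data) % k == 0."""
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        fold = data[i * fold_size:(i + 1) * fold_size]          # held-out fold
        rest = data[:i * fold_size] + data[(i + 1) * fold_size:]
        predictor = train(rest)           # train on the union of the other folds
        errors.append(sum(loss(predictor, ex) for ex in fold) / len(fold))
    return sum(errors) / k                # average over all folds

# Toy usage: a constant predictor (mean of training labels), squared loss.
def train(rows):
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m

def loss(pred, example):
    x, y = example
    return (pred(x) - y) ** 2

data = [(0, 0), (1, 1), (2, 2), (3, 3)]
loo_err = k_fold_cv_error(data, k=4, train=train, loss=loss)  # k = m: leave-one-out
```

Setting k equal to the number of examples, as in the toy usage, gives the leave-one-out special case mentioned above.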

5. Train-Validation-Test Split:

We split the available examples into three sets. The first set is used for training our algorithm and the second is used as a validation set for model selection. After we select the best model, we test the performance of the output predictor on the third set, which is often called the “test set.”
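A minimal stdlib sketch of the three-way split; the 60/20/20 proportions and the fixed seed are common but arbitrary choices, not something the source prescribes:

```python
# Three-way split: shuffle once, then carve out test / validation / train.
import random

def train_val_test_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    rows = list(examples)
    random.Random(seed).shuffle(rows)       # one fixed shuffle for reproducibility
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]       # used only for model selection
    train = rows[n_test + n_val:]           # used only for training
    return train, val, test

tr, va, te = train_val_test_split(range(100))
```

The test portion must be touched exactly once, after the best model has already been selected on the validation set.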

What to do if your model fails?

There are many elements that can be “fixed.” The main approaches are listed in the following:

  • Get a larger sample
  • Change the hypothesis class by:

— Enlarging it

— Reducing it

— Completely changing it

— Changing the parameters you consider

  • Change the feature representation of the data
  • Change the optimization algorithm used to apply your learning rule

To understand the cause of the bad performance, we have to understand that the true error decomposes into approximation error and estimation error.

The approximation error of the class does not depend on the sample size or on the algorithm being used. It only depends on the distribution D and on the hypothesis class H. Therefore, if the approximation error is large, it will not help us to enlarge the training set size, and it also does not make sense to reduce the hypothesis class. What can be beneficial in this case is to enlarge the hypothesis class or completely change it (if we have some alternative prior knowledge in the form of a different hypothesis class). We can also consider applying the same hypothesis class but on a different feature representation of the data.

The estimation error of the class does depend on the sample size. Therefore, if we have a large estimation error we can make an effort to obtain more training examples. We can also consider reducing the hypothesis class. However, it doesn’t make sense to enlarge the hypothesis class in that case.

To Summarize:

1. If learning involves parameter tuning, plot the model-selection curve to make sure that you tuned the parameters appropriately.

2. If the training error is excessively large, consider enlarging the hypothesis class, completely changing it, or changing the feature representation of the data.

3. If the training error is small, plot learning curves and try to deduce from them whether the problem is estimation error or approximation error.

4. If the approximation error seems to be small enough, try to obtain more data. If this is not possible, consider reducing the complexity of the hypothesis class.

5. If the approximation error seems to be large as well, try to change the hypothesis class or the feature representation of the data completely.

References:

Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.

P.S. I will publish a post on ‘Classification Models Performance Evaluation’ very soon.
