Different methods to estimate Test Errors for a Classifier

Moving beyond the Validation set

Mehul Gupta
Data Science in your pocket

--

Assume you train a model ‘A’ on some training dataset. As the training error looks low, you now wish to deploy it to production. What would you do to check its performance on unseen data?

Pretty Simple

Prepare a validation set & calculate the error on this set. If the error is still low enough, you are ready to go !!

At least this is what I have been doing for the last 2 years 😀
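For reference, a minimal sketch of this hold-out workflow with scikit-learn (the dataset, model & split ratio below are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset, purely for illustration
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data as a validation set (the ratio is arbitrary here)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Validation error = 1 - accuracy on the held-out samples
val_error = 1 - accuracy_score(y_val, model.predict(X_val))
print(f"Validation error: {val_error:.3f}")
```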

But I have a few questions !!

1. Does this method (using a validation set) have any loopholes? If yes, what are they?

2. Are there other methodologies to estimate test error?

Answering the 1st part:

Yes !!

1. Assume you have just 100 samples to train on. Dividing such scanty data into training & validation sets creates two problems:
  • As the training data is already very small, the model might not learn anything if it is divided further. We would rather train the model on the entire data available.
  • Even if we did divide it, the validation set would have just 20–30 samples at most. Should we trust error rates based on such a small set?

I feel no.

Such problems are pretty common when data has to be hand-annotated & hence you might end up with a small training set.

2. If the error rates for two potential models (say A & B) are approximately the same (the difference is statistically insignificant), which model should be picked for production?

According to the Principle of Parsimony as applied to Machine Learning:

If two models have the same error rate, the model with lower complexity should be chosen.

Unfortunately, the Validation set method doesn’t have a clue about this.

3. The Validation set method doesn’t take into account many important factors like model size, latency, etc. which play a prominent role in model deployment. You must have heard of the famous case of the competition organized by Netflix for a recommendation system, where the winning solution never got deployed !!

If not, check it out here

So, now that we are pretty clear about our motive, let’s explore the different methods by which a model’s performance can be estimated.

Resubstitution Error Estimate

The most naive approach, when we can’t afford a validation set, is to treat the training error itself as the validation/test error.

Simple.

Hence, whichever model has the lowest training error should be chosen.
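A minimal sketch of what that looks like in practice, fitting & scoring on the very same data (dataset & model are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Resubstitution: fit on all the available data, then score on that same data
model = DecisionTreeClassifier(random_state=0).fit(X, y)
resub_error = 1 - accuracy_score(y, model.predict(X))
print(f"Resubstitution (training) error: {resub_error:.3f}")  # usually overly optimistic
```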

But this is hyper-optimistic: training error is mostly a very poor estimate of test error & this method should be avoided. This post does a much better job of explaining why.

Pessimistic Error Estimate

Now this is quite interesting, as it addresses both of the loopholes I mentioned for the validation set method:

  • You don’t need a validation set in the first place.
  • It incorporates model complexity into the final error estimate.

For this, we need to set a Penalty_Constant, say 0.5 for now.

Test error using the Pessimistic Error estimate = (e + Ω) / N

The final error term here has two major components: the training error (e) & a penalty term (Ω). N is the total number of training samples.

This penalty term is the one representing model complexity.

How is this calculated?

This is model dependent. For Decision Trees, it can be:

number of nodes in the tree (Pn) X Penalty_Constant (Pc)

For a neural network, you can go with:

number of nodes in the entire network (Pn) X Penalty_Constant (Pc)

Hence, for a Decision Tree with

e(T) = 3,

nodes (Pn) = 7,

training samples (N) = 100,

Penalty_Constant (Pc) = 0.5,

using the above formula for the pessimistic error:

Final error = (3 + 7 X 0.5) / 100 = 0.065
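To make this concrete, here is a minimal sketch of the same calculation for a scikit-learn decision tree. The dataset, tree depth & the Penalty_Constant of 0.5 are illustrative choices, and the node count is used as the complexity measure, as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

e = (model.predict(X) != y).sum()   # e(T): number of misclassified training samples
Pn = model.tree_.node_count         # Pn: number of nodes in the fitted tree
Pc = 0.5                            # Penalty_Constant
N = len(y)                          # total training samples

pessimistic_error = (e + Pn * Pc) / N
print(f"Training error rate: {e / N:.3f}")
print(f"Pessimistic error estimate: {pessimistic_error:.3f}")
```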

A few drawbacks though,

  • As nothing from the unseen samples is considered in the test error, this estimate can go wayward just like the resubstitution estimate.
  • Determining the complexity of some models can be tricky. Like in the case of KNN, what should Pn be? Should it be set to the value of K?

Minimum Description Length Principle

This principle borrows ideas from information theory to estimate a final error term.

Assume I have 2 systems, X & Y.

  • X has N training samples & their target values, while Y has just the N training samples without the target values.
  • As Y also needs the target value corresponding to each sample, it requests X to transmit them.
  • Now, X could send all the target values, but that would cost N bytes of information for N records, which is expensive. So X thinks of an alternative: it trains a model ‘A’, encodes it & sends it to Y. As Y already has the training samples, it can predict the targets itself.
  • But what if model ‘A’ isn’t 100% accurate? Then Y will get some wrong labels. How to sort this out? X sends the raw target values for the misclassified samples as well, alongside the encoded model ‘A’.

Hence, the total cost of transmission from X to Y =

Cost(Encoding Model) + Cost(Samples misclassified by the model) < N

A few notes

  • The 1st term is directly proportional to model complexity: the simpler the model, the easier it is to encode. Other factors, like model size, can also be incorporated into the final cost, though they are often ignored.
  • The 2nd term can be calculated using either 1) validation set accuracy, if possible, or 2) training accuracy.
  • If the model is 100% correct, the 2nd term = 0.
  • Whichever potential model has the lowest cost of transmission should be chosen.
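A rough sketch of how this comparison could look in code, for two candidate decision trees. The per-node & per-label bit costs below are arbitrary assumptions just to make the idea concrete, not a standard encoding scheme:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative (assumed) encoding costs, not a standard scheme
BITS_PER_NODE = 16    # cost to encode one node of the model
BITS_PER_LABEL = 8    # cost to send one raw target value for a misclassified sample

def transmission_cost(model, X, y):
    cost_model = model.tree_.node_count * BITS_PER_NODE    # Cost(Encoding Model)
    misclassified = (model.predict(X) != y).sum()
    cost_errors = misclassified * BITS_PER_LABEL           # Cost(Misclassified samples)
    return cost_model + cost_errors

# Compare a simple candidate with a more complex one; pick the lower total cost
model_a = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
model_b = DecisionTreeClassifier(random_state=0).fit(X, y)

print("Cost(A):", transmission_cost(model_a, X, y))
print("Cost(B):", transmission_cost(model_b, X, y))
```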

Estimating statistical bounds

As the training error is mostly a poor estimate of the test error, what we can do is apply a correction to it. Since the test error is usually larger than the training error, a correction term is added (& not subtracted), making it more of an upper bound, so that the test error estimate becomes more accurate. The formulae for these statistical bounds differ from model to model, so I won’t cover them all here. For the C4.5 Decision Tree, for example, the upper bound is calculated from the following quantities:

Where:

e = Training error

N = Total training samples

α = Confidence level

Zα = Z-score corresponding to α in the Standard Normal Distribution Table

As the formula itself looks monstrous, I will skip it for now.
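If you are curious anyway, here is a small sketch of that bound as code rather than math. It follows the upper confidence-limit form commonly quoted for C4.5 pruning, with e taken as the training error rate; treat the exact expression & the one-sided z-score convention as assumptions on my part:

```python
import math
from scipy.stats import norm

def c45_upper_bound(e, N, alpha=0.25):
    """Upper confidence bound on the true error rate (C4.5-style).

    e     -- observed training error rate (misclassified samples / N)
    N     -- total training samples
    alpha -- confidence level (0.25 is the pruning CF usually quoted for C4.5)
    """
    z = norm.ppf(1 - alpha)  # Z-score for the chosen confidence level
    num = e + z**2 / (2 * N) + z * math.sqrt(e / N - e**2 / N + z**2 / (4 * N**2))
    return num / (1 + z**2 / N)

# e.g. 3 misclassified samples out of 100 training samples
print(c45_upper_bound(e=3 / 100, N=100))
```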

Having discussed the above methods, I still believe that estimating the test error using a Validation set is the best choice for most (if not all) problems. Also, the Validation Set itself can be used in different ways, like Hold Out, Cross Validation, Bootstrap, etc., making the error estimation more accurate.
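For example, a minimal sketch of the Cross Validation flavour (the dataset, model & number of folds are arbitrary placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: every sample is used for validation exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
cv_error = 1 - scores.mean()
print(f"Cross-validated error estimate: {cv_error:.3f}")
```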

That’s all for today
