Understanding Model Validation for Classification

Pushkar Pushp · Published in Analytics Vidhya · Apr 22, 2020 · 5 min read

In this blog we will walk through different techniques to validate the performance of a classification model.

Define the problem: predict whether it will rain tomorrow or not.

In the previous blogs you have seen different supervised algorithms applied to this problem.

Consider any supervised algorithm, say one as simple as logistic regression.

Dataset description: the rainfall data contains 118 features and one dependent variable (y_test) indicating whether it will rain or not.

Sample of X_test

Features
Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
'Humidity3pm',
...
'NNW.2', 'NW.2', 'S.2', 'SE.2', 'SSE.2', 'SSW.2', 'SW.2', 'W.2',
'WNW.2', 'WSW.2'],
dtype='object', length=118)

Sample of y_test

0    No
1    No
2    No
3    No
4    No
5    No
6    Yes
7    Yes
8    No
9    No
Name: RainFall, dtype: object

It is best practice to save the model so that it can be used directly for prediction in the future.

The code snippet below can be used to save the model.
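A minimal sketch using joblib (the variable name logreg and the file name are illustrative; pickle would work just as well):

import joblib

# persist the fitted logistic regression model to disk
joblib.dump(logreg, 'rain_logreg.pkl')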

Let’s load the model.
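A matching sketch to restore it (same assumed file name):

import joblib

# restore the model object exactly as it was saved
logreg = joblib.load('rain_logreg.pkl')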

It’s time for prediction.
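Assuming X_test is the prepared feature matrix shown above, the hard class labels can be obtained as:

# 'Yes'/'No' predictions for the held-out data
y_pred = logreg.predict(X_test)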

We now have all the ingredients to cook up our various evaluation dishes.

A metric is a technique to evaluate the performance of the model.

Here is the list of metrics we will cover in this blog:

  1. Accuracy
  2. Null Accuracy
  3. Precision
  4. Recall
  5. f1 score
  6. ROC — AUC

1. Accuracy : This is the most naive and most commonly used metric in the context of classification. It is simply the fraction of correct predictions.

We define a simple function to calculate the accuracy and evaluate it against our test data.
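One possible version of such a function (the variable names are assumptions carried over from the prediction step above):

import numpy as np

def accuracy(y_true, y_pred):
    # fraction of predictions that exactly match the true labels
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

print('accuracy =', accuracy(y_test, y_pred))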

accuracy = 0.850170540455009

Instead of this, one can also use sklearn's built-in score function to evaluate the accuracy. So let's calculate it with sklearn and verify the accuracy we obtained from the function above.
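Something along these lines, using accuracy_score from sklearn.metrics:

from sklearn.metrics import accuracy_score

# should agree with the hand-rolled function above
print('accuracy =', accuracy_score(y_test, y_pred))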

Both results match.

2. Null Accuracy : This is defined as the accuracy obtained by always predicting the most frequent class. It is a useful baseline for judging whether the model's accuracy actually means something.
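One way to compute it, assuming y_test is the pandas Series shown earlier:

# frequency of the most common class = accuracy of always predicting it
null_accuracy = y_test.value_counts().max() / len(y_test)
print('null accuracy =', null_accuracy)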

Null accuracy turns out to be 0.7759414888005908, which is lower than the model accuracy, so we are good.

The question that immediately pops up is whether this is complete information about the model's goodness. The answer is quite subjective; for example, we may care specifically about making fewer false predictions of rain.

Any classification model divides the prediction space into various subspaces.

The best way to conceptualise this is via the confusion matrix.

Let me draw a confusion matrix for our binary classification problem.
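A sketch using scikit-learn; the label order is passed explicitly so that the layout is unambiguous ('Yes' meaning rain is an assumption about the encoding):

from sklearn.metrics import confusion_matrix

# rows are actual classes, columns are predicted classes;
# with this ordering, cm[0, 0] counts actual-rain, predicted-rain cases
cm = confusion_matrix(y_test, y_pred, labels=['Yes', 'No'])
print(cm)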

20892 is the number of cases where we predicted it would rain and it actually rained. This is called a true positive. Let's quickly define the other quantities:

  • True Positives (TP): Actual Rain and Predicted Rain
  • True Negatives (TN): Actual No_Rain and Predicted No_Rain
  • False Positives (FP): Actual No_Rain and Predicted Rain
  • False Negatives (FN): Actual Rain and Predicted No_Rain

In our case

  • TP = 20892
  • TN = 3286
  • FP = 1175
  • FN = 3086

accuracy = (total correct predictions) / (total predictions)

= (TP + TN) / (TP + TN + FP + FN)

We leave it to the reader to verify that this matches the accuracy we calculated earlier. Let's move on to some other metrics.

3. Precision : This is defined as the proportion of correctly predicted positive outcomes among all positive predictions. In other words, of all the predicted positive outcomes, how many are actually positive?

Therefore,

Precision = TP /(TP + FP)

In our case precision = 20892/(20892 + 1175) = 0.9467530701953143
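scikit-learn computes this directly; pos_label names the class treated as positive ('Yes' is assumed here):

from sklearn.metrics import precision_score

print('precision =', precision_score(y_test, y_pred, pos_label='Yes'))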

This is quite vital in a medical scenario: if a 👩‍⚕️ prescribes medicine for a disease to a healthy patient, it can lead to severe health hazards.

To elaborate: precision matters when we want to minimise FP, and in the 👩‍⚕️'s case an FP is falsely predicting the disease.

4. Recall : This is defined as the proportion of correctly predicted positive outcomes among all actual positives. In other words, of all the actual positive outcomes, how many were we able to predict as positive?

Recall = TP/(TP + FN)

In our case recall = 20892/(20892 + 3086) = 0.8712986904662607
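The corresponding scikit-learn call (same assumption about the positive label):

from sklearn.metrics import recall_score

print('recall =', recall_score(y_test, y_pred, pos_label='Yes'))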

Recall is quite important when you want to minimise FN.

Consider a test to detect the coronavirus 🦠: it is critically important not to miss the case where an individual is positive but the test fails to detect it.

Again,

Precision = TP /(TP + FP)

Recall = TP/(TP + FN)

If the sample size is fixed, say n, then

TP + TN + FP + FN = n

In order to have both high precision and high recall, FP and FN should both be as low as possible. There is a constraint, though: driving both down at the same time is the ideal scenario, which is rarely achievable in practice, so there is a trade-off between the two.

How do we combine the two?

Before we move on:

Recall is also called the true positive rate or sensitivity.

Precision is also called the positive predictive value; it should not be confused with specificity, which is the true negative rate, TN/(TN + FP).

5. f1 score : This is the harmonic mean of precision and recall. The obvious question is why the harmonic mean (HM) and not the arithmetic or geometric mean or some other transformation; I have written a separate blog explaining why the HM is used to combine these two metrics.

f1 = 2*precision*recall/(precision + recall)

In our case, f1 = 2*0.9467*0.8713/(0.9467 + 0.8713) = 0.9074
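And the scikit-learn one-liner (same positive-label assumption as before):

from sklearn.metrics import f1_score

print('f1 =', f1_score(y_test, y_pred, pos_label='Yes'))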

This measure is more informative than accuracy, but it needs to be explained properly, unlike accuracy, which is easily interpretable.

Summary of the above measures:

In Python, sklearn provides classification_report, which generates all of these measures at once.

Support is the number of observations in each class.
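A minimal usage sketch:

from sklearn.metrics import classification_report

# precision, recall, f1-score and support for each class in one table
print(classification_report(y_test, y_pred))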

6. ROC : The receiver operating characteristic curve plots the TPR against the FPR for different threshold values. In other words, it shows model performance at every threshold level.

To understand this, we need to look at the output of the trained classifier.

Idx   No_rain_prob   Rain_prob
0     0.987328       0.012672
1     0.840071       0.159929
2     0.976743       0.023257
3     0.798552       0.201448
4     0.307343       0.692657

So the output of logistic regression, like that of most classifiers, is a probability for each class.
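A table like the one above can be produced with predict_proba; the column order follows logreg.classes_:

import pandas as pd

# one column per class, each row summing to 1
proba = logreg.predict_proba(X_test)
print(pd.DataFrame(proba, columns=logreg.classes_).head())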

There is a decision threshold (0.5 by default) based on which this probability is converted into 'No' or 'Yes'.

The classes are encoded as:

1 or 'No' : no rain

0 or 'Yes' : rain

Applying the threshold to the no-rain probability of the five rows above:

If the threshold is 0.5,

predict_label = [No, No, No, No, Yes] == [1, 1, 1, 1, 0]

If the threshold is 0.8,

predict_label = [No, No, No, Yes, Yes] == [1, 1, 1, 0, 0]
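A sketch of applying a custom threshold to the no-rain probabilities (the threshold value and the 'No' label are illustrative assumptions):

import numpy as np

threshold = 0.8
# column of predict_proba corresponding to the 'No' (no rain) class
no_rain_prob = logreg.predict_proba(X_test)[:, list(logreg.classes_).index('No')]

# predict 'No' only when the model is at least `threshold` confident of no rain
custom_pred = np.where(no_rain_prob >= threshold, 'No', 'Yes')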

The ROC curve is generated by plotting the TPR against the FPR for each such threshold.

ROC AUC (Receiver Operating Characteristic, Area Under the Curve) measures the area under this curve.

The higher the value, the better the model; the best possible value is 1.
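A sketch using roc_curve and roc_auc_score; the scores must be probabilities of the positive class, so the 'Yes' labels are binarised first (the mapping is an assumption about the encoding):

from sklearn.metrics import roc_curve, roc_auc_score

# probability of rain ('Yes') for every test sample
rain_prob = logreg.predict_proba(X_test)[:, list(logreg.classes_).index('Yes')]
y_true = (y_test == 'Yes').astype(int)

fpr, tpr, thresholds = roc_curve(y_true, rain_prob)   # points along the ROC curve
print('ROC AUC =', roc_auc_score(y_true, rain_prob))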

The ROC AUC score here is 0.8729.

Hope this was helpful; feel free to comment.
