Machine Learning Model Performance Evaluation | Classification

Sayed Ahmed
Let’s Deploy Data.
7 min read · Aug 5, 2020

This article discusses performance evaluation for machine learning classification models. It highlights the different metrics and graphs that are used to evaluate the performance of a classifier.

Topics: Classification, Confusion matrix, Accuracy, F1-score, Precision, Recall, ROC curve, AUC, Precision-Recall curve

CLASSIFICATION

First, let's get introduced to classification algorithms in the briefest and simplest way possible.

Classification Algorithms

Classification algorithms are supervised machine learning algorithms, meaning you train them with labeled input, and the outputs are categorical or discrete.

Example:

You might want to classify emails as spam or not spam; each of these is a discrete category.

Three types of classification algorithms:

  • Binary classification: has only two classes, e.g. anomaly/fraud detection.
  • Multi-class, single-label classification: has more than two classes, for instance recognizing a digit in an image (classes 0–9).
  • Multi-class, multi-label classification: the output can belong to multiple classes at once, for instance text tagging, where you assign multiple tags.

Now let’s move to the main topic of this blog…

METRICS AND GRAPHS USED TO EVALUATE A CLASSIFIER

Here I will introduce you to some of the metrics and graphs used to understand the performance of a classifier.

1. Confusion Matrix

A confusion matrix helps us to understand whether the model is getting confused and classifying the data wrongly.

Suppose that we have trained a simple binary classification model. Given an image, this model will indicate whether it is a picture of a cat or a picture of a dog.

Such a table of predicted versus actual labels is known as a confusion matrix.

Some terms to remember:

  • True positives are the positive cases that are correctly predicted as positive by the model
  • False positives are the negative cases that are incorrectly predicted as positive by the model
  • True negatives are the negative cases that are correctly predicted as negative by the model
  • False negatives are the positive cases that are incorrectly predicted as negative by the model

In the above image, we can see that the off-diagonal cells (FP and FN) represent the number of wrong predictions made by the model.

Confusion matrix for the cat vs. dog classifier

In the above image,

  • the actual CAT images = 50 + 2 = 52 ( TP + FN )
  • the actual DOG images = 70 + 3 = 73 ( FP + TN )

  • total predicted CAT images = 50 + 3 = 53 ( TP + FP )
  • total predicted DOG images = 70 + 2 = 72 ( FN + TN )

The model correctly identified 50 + 70 (TP + TN) = 120 images.

The false predictions made by the model = 3 + 2 (FP + FN) = 5.
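A hedged sketch (assuming scikit-learn is installed; the label lists below are reconstructed purely from the counts quoted above) of how the same matrix can be computed:

```python
# Minimal sketch, assuming scikit-learn is available: rebuild the cat-vs-dog
# confusion matrix from the counts quoted above (50 TP, 2 FN, 3 FP, 70 TN).
from sklearn.metrics import confusion_matrix

y_true = ["cat"] * 52 + ["dog"] * 73          # 52 actual cats, 73 actual dogs
y_pred = (["cat"] * 50 + ["dog"] * 2          # 50 cats found, 2 missed (FN)
          + ["cat"] * 3 + ["dog"] * 70)       # 3 dogs called cat (FP), 70 correct (TN)

# Rows = actual class, columns = predicted class (order fixed by `labels`).
print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
# [[50  2]
#  [ 3 70]]
```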

2. Precision

The formula for precision is: Precision = TP / (TP + FP) = correctly predicted positives / all positive predictions

So, precision is the proportion of positive predictions that were actually correct; the closer it gets to 1, the better.

If precision is 1, it means that every case the model labeled as positive really was positive, i.e. there were no false positives, which is pretty good. However, precision alone cannot provide sufficient information about the classifier's performance, because it only looks at the positive predictions; we also want to know how many of the actual positive and negative cases were identified correctly.

WHEN TO USE PRECISION ?

As mentioned above, precision alone is not enough to tell us about the performance of the classifier, but in some cases the value of precision is more significant. For instance:

Suppose you are building a model to recommend products to customers. In this case, precise recommendations are required, otherwise the customer will get annoyed. So here, reducing false positives is more significant, and false negatives are less of a concern.
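A minimal sketch, reusing the cat-vs-dog counts from the confusion matrix above and treating cat as the positive class:

```python
# Sketch: precision from the cat-vs-dog counts above (cat = positive class).
TP, FP = 50, 3                      # correct cat predictions / dogs wrongly predicted as cat
precision = TP / (TP + FP)
print(round(precision, 3))          # 0.943 -> about 94% of the "cat" predictions were right
```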

3. Recall

The formula for recall is: Recall = TP / (TP + FN) = correctly predicted positives / total actual positive cases

Recall is the same as sensitivity.

Sensitivity = True Positives / (True Positives + False Negatives)

So, recall is the proportion of actual positives that were identified correctly; the closer it gets to 1, the better.

If recall is 1, it means that every actual positive case was classified as positive and the model did not miss any of them (FN = 0). On the other hand, if it is closer to zero, the number of false negatives is increasing and the model is failing to identify the positive data correctly.

However, like precision, recall alone cannot give the full picture of the classifier's performance. You can have a recall of 1 (FN = 0) and still be misled, since the model may produce more false positives than true positives.

WHEN TO USE RECALL?

Suppose you are building a model to diagnose a disease from some sample. In this case, we don't want to miss a cancer patient (a false negative). Labeling a non-cancer patient as a cancer patient (a false positive) may not be as harmful as the previous case, so recall is more important than precision here.
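A matching sketch for recall, again with the cat-vs-dog counts and cat as the positive class:

```python
# Sketch: recall from the same cat-vs-dog counts (cat = positive class).
TP, FN = 50, 2                      # cats found / cats missed
recall = TP / (TP + FN)
print(round(recall, 3))             # 0.962 -> about 96% of the actual cats were found
```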

4. Accuracy

The formula for accuracy: Accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy is the proportion of correct predictions among the total number of cases examined.

WHEN TO USE ACCURACY ?

Accuracy is a good choice if the problem is balanced and not skewed.

Besides, accuracy may sometimes not be a good choice. For instance, suppose we want to predict whether the Earth and the Moon will collide. A model that predicts NO all the time has precision and recall of 0, yet its accuracy is more than 99%, which is misleading in this case. So if the target class is very sparse, accuracy can become worthless.
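A small sketch of that trap (assuming scikit-learn; the 1-in-1000 positive rate is invented purely for illustration):

```python
# Sketch of the skewed-class trap, assuming scikit-learn is available.
# One positive out of 1000 cases (an arbitrary ratio chosen for illustration);
# the "model" predicts the negative class every single time.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] + [0] * 999            # one rare positive, 999 negatives
y_pred = [0] * 1000                 # always predict "no"

print(accuracy_score(y_true, y_pred))                    # 0.999 -> looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0   -> no positives predicted
print(recall_score(y_true, y_pred))                      # 0.0   -> the one positive is missed
```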

5. F1 Score

The formula for the F1-score: F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall.

WHEN TO USE F1-SCORE?

I'm going to use the same example given above: suppose we want to predict whether the Earth and the Moon will collide, and the model predicts NO all the time. Here precision and recall are 0, so the accuracy is more than 99%, but the F1 score is 0, and that is what we are looking for.

So, the F1 score maintains a balance between precision and recall: if either of them gets lower, so does the F1 score.
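A quick sketch that plugs the cat-vs-dog precision and recall computed earlier into the formula:

```python
# Sketch: F1 from the precision and recall of the cat-vs-dog example above.
precision, recall = 50 / 53, 50 / 52
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))                 # 0.952 -> high because both precision and recall are high
```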

6. ROC Curve (Receiver Operating Characteristic) and AUC (Area Under the Curve)

You can visualize your model's performance through the ROC curve and derive the AUC (area under the curve) metric from it; there are reference values that will help you understand how good your model is.

Here, the y-axis represents the true positive rate (TP / (TP + FN)) and the x-axis represents the false positive rate (FP / (FP + TN)), with both axes ranging from 0 to 1.

The diagonal line (the blue dotted line) represents random guessing. The area under this diagonal is the area of a right triangle, so the random-guessing line has AUC = (1 × 1) / 2 = 0.5.

Your target should be to keep the AUC close to 1, which indicates that your model separates the classes almost perfectly.

If your curve falls below the diagonal, meaning AUC < 0.5, it indicates that your model is worse than random guessing.
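A minimal sketch of how the curve points and the AUC are typically obtained (assuming scikit-learn; the labels and scores below are made up):

```python
# Minimal sketch, assuming scikit-learn: ROC curve points and AUC from
# made-up true labels and predicted probabilities (purely illustrative).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # model's predicted P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # x = FPR, y = TPR
print(roc_auc_score(y_true, y_score))                 # 0.875 here; 1.0 is perfect, 0.5 ~ random
```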

7. Precision-Recall Curve (PRC)

Unlike the ROC curve, the precision-recall curve does not use the number of true negatives.

In problems with a huge class imbalance, the PR curve can be better than the ROC curve. Precision-recall is very useful in cases where the number of TN >>> TP, so the PRC can be useful in fraud detection, where non-fraud transactions (TN) >>> fraud transactions (TP).

A random classifier shows a horizontal line at P / (P + N). For instance, the line is y = 0.5 when the ratio of positives to negatives is 1:1, and 0.25 when the ratio is 1:3. (https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/)

So how can we evaluate the classifier's performance from the PR curve?

Our target should be to keep the curve above the random line. Depending on the ratio of the positive and negative classes, the random line can change: the right image shows the random line for P:N = 1:3, and the left shows the random line for P:N = 1:1.
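A short sketch (assuming scikit-learn; the toy labels and scores are invented, with a 1:3 positive-to-negative split to match the example above):

```python
# Sketch, assuming scikit-learn: precision-recall curve for an invented,
# imbalanced toy set (positives : negatives = 1 : 3, as in the example above).
from sklearn.metrics import precision_recall_curve

y_true  = [1, 0, 0, 0, 1, 0, 0, 0]                    # P = 2, N = 6 -> 1:3 ratio
y_score = [0.9, 0.6, 0.2, 0.1, 0.7, 0.4, 0.3, 0.5]    # model's predicted P(positive)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
baseline = sum(y_true) / len(y_true)                  # random-classifier line: P / (P + N)
print(baseline)                                       # 0.25 for a 1:3 split
```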
