Data Science and ML

Evaluation Metrics — Classification

Priyansh Soni
10 min read · Feb 16, 2022

Understand the Easy Way

Evaluation metrics are what reveal how well (or how badly) a Machine learning model is really doing under the hood.

Well, that being said, evaluation metrics for classification are quite different from those used for regression.
The rest of this article assumes you are already familiar with the regression evaluation metrics.

OUTLINE:

  1. What are evaluation metrics?
  2. Why are evaluation metrics for Classification different from those for Regression?
  3. What is a confusion matrix?
  4. What are the most commonly used classification metrics?
  5. Classification Metrics using Python

1. What are Evaluation Metrics?

Evaluation metrics, as the name suggests, are metrics used to evaluate the performance of our model. They give us a measure of how well our model is performing.
Evaluation metrics can vary for different types of problems — Regression, Classification, Clustering, etc.

For example, when we deal with Regression, the evaluation metric is the error made by our model. By error, I mean the difference between the predicted points and the original points. In simpler terms, this means how accurate we were while saying ‘hey, this point will give an output of 12.5 (or some other arbitrary value).’
Now there are several ways to determine this difference — MAE, MSE, RMSE, etc., which depend upon the problem type and ease of evaluation and calculation.

Similarly, for Classification, the evaluation metric is generally the accuracy in predicting the label class. By accuracy, I mean how many of our predictions were correct out of all the predictions made. In simpler terms, this means how accurate we were while saying ‘hey, this point falls in category A and not B.’

2. Why are evaluation metrics of Classification different from that of Regression?

Well, the answer to that is quite a simple one.

Since in regression, we predict a continuous value, the original value is continuous and the prediction is also continuous. Now, since all the values are continuous, the error in prediction will also be continuous.

Take it this way: suppose my y-predicted (ŷ) is 10.5, and my y-original (y) is 11.7.
Now the error in my prediction is the difference between my original value and my predicted value, which is y − ŷ, i.e. 11.7 − 10.5 = 1.2.
This value, 1.2, is my error in predicting the value of y, and it is a continuous value.

Well, in the case of classification, since we are required to predict the category/class/label of the test point, the prediction is never continuous. In fact, readers familiar with classification already know that classification predicts a discrete value whereas regression predicts a continuous one.

Take it this way: suppose my y-predicted (ŷ) is class B (denoted by 0) and my y-original (y) is class A (denoted by 1). Then the error in my prediction is y − ŷ = 1 − 0 = 1.
Does it make any sense?
No!!!

If we use regression metrics for classification, the error in predicting a discrete variable (a label, class, or category) can only mean the prediction was fully wrong or fully correct. There is no such thing as a partially correct or partially incorrect prediction (a continuous amount of error), which is often the case with regression.

And therefore, the evaluation metrics used with regression cannot be used with classification.
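To make that concrete, here is a tiny sketch in plain Python, with made-up labels, showing that a regression-style “error” on class labels collapses into a simple right-or-wrong count:

    # Made-up labels for illustration: 1 = class A, 0 = class B
    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 1, 1, 0, 0]

    # The per-point "error" is either 0 (fully correct) or 1 (fully wrong);
    # there is no "slightly off" like in regression.
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    print(errors)                          # [0, 1, 0, 1, 0]
    print(1 - sum(errors) / len(errors))   # 0.6, which is just accuracy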

Buckle up, there's more coming!

3. What is a Confusion Matrix?

Well, this part can be a bit of a brain-teaser for someone who isn't familiar with it already, so go drink some coffee, have a look outside for an eye break, and then sit back!

A confusion matrix is used by practically everyone who has gotten their hands dirty with machine learning. It is the most popular performance-measurement tool for classification.

But, what is it?
It is a matrix that displays the predictions made by our model in a fancy way, such that we can read all the evaluation metrics off one image. The metrics we can determine from a confusion matrix include the accuracy, true positive rate, true negative rate, false positive rate, error rate (misclassification rate), and much more.
The image below is how you'd mostly see a confusion matrix in practice. It is filled with arbitrary values.

Most boring people use “Blues” as the color map. Don’t be one of them.
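If you want to draw one of these yourself, here is a minimal sketch using scikit-learn and matplotlib. The y_true and y_pred lists are made-up stand-ins for your own labels and predictions:

    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    # Made-up labels for illustration: 1 = spam, 0 = not spam
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["not spam", "spam"])
    disp.plot(cmap="viridis")   # pick any colormap you like; it doesn't have to be Blues
    plt.show()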

So, let's take spam classification as an example.
We have to create a model that classifies emails as spam or not spam. Consider an email that is spam as positive (1) and an email that is not spam as negative (0).

Confusion Matrix for binary classification

This can also be shown as below:

This image is copied. Well, I think education should be free…

Don’t freak out much, lemme state the terms used!

  • TP — True Positive — the email was actually spam (1), and we predicted it as spam (1) — Truly Spam — Truly Positive — True Positive — TP.
  • TN — True Negative — the email was actually not spam (0), and we predicted it as not spam (0) — Truly Not-spam — Truly Negative — True Negative — TN.
  • FP — False Positive — the email was actually not spam (0), and we predicted it as spam (1) — Falsely Spam — Falsely Positive — False Positive — FP.
  • FN — False Negative — the email was actually spam (1), and we predicted it as not spam (0) — Falsely Not-spam — Falsely Negative — False Negative — FN.
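If you are using scikit-learn, these four counts can be read straight off the confusion matrix. Here is a small sketch with made-up labels; note that scikit-learn puts the negative class first, so ravel() hands back TN, FP, FN, TP in that order:

    from sklearn.metrics import confusion_matrix

    # Made-up labels for illustration: 1 = spam, 0 = not spam
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

    # The 2x2 matrix is laid out as [[TN, FP], [FN, TP]]; ravel() flattens it row by row
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, tn, fp, fn)   # 4 3 2 1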

Moving on…

So, the spam classification results can just be brought into this matrix and then we can calculate the model performance evaluation metrics. Look at the image below:

Looking at the picture, we can see that the total number of emails we have is 300. So now let's say we trained a fancy-ass model and it predicted some emails as spam and some as not spam. The entries are filled into the matrix.

Now you might wanna write down the numbers below somewhere, so we stay on the same track.

  • Looking at the matrix above, we can see that the model has predicted a total of 30 emails correctly as spam and 250 correctly as not spam. These are the TPs and TNs, respectively.
  • We can also see that the model has predicted 12 emails as spam, whereas they actually were not-spam. And, it has also predicted 8 emails as not spam, whereas they actually were spam. These are the FPs and FNs, respectively.
  • Out of all the predictions, 42 (30+12) were predicted as spam and 258 (8+250) were predicted as not spam (PREDICTED sums). Whereas in reality, 38 (30+8) were spam and 262 (12+250) were not spam (ACTUAL sums).

Now from this data, we can calculate how accurate our model was in predicting the emails as spam and not spam. This is nothing but the accuracy of our model, and can be calculated as:

Accuracy = Total number of correct predictions / total number of predictions made overall

Accuracy = (30+250)/(30+250+8+12) = 280/300 = 0.9333

Our model has an accuracy of 93.33%. And, this is just calculated from the confusion matrix without any other measures required. Like this, we can also calculate the True Positive Rate (how many were truly positive out of all the actual positives), error rate, false-positive rate, and much more.
Well, the next section is for that part only.
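As a quick check, here is the same accuracy arithmetic in Python, using the counts from the matrix above:

    # Counts from the spam example above
    tp, tn, fp, fn = 30, 250, 12, 8

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(accuracy)   # 0.9333...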

A smart note:
The cells where the row-heading is the same as the column-heading are correctly predicted cells, and the rest are incorrect predictions.

Or, just remember this: for a 2D confusion matrix (binary classification), the main diagonal holds the True Positives and True Negatives, and the anti-diagonal holds the False Positives and False Negatives.

Confusion matrix for multi-class classification with random samples

For more than two classes, everything off the main diagonal is what the model predicted wrong. Look at the tiny image above. The cells in red are wrong predictions.

Go through it again if it's not clear, because the next section is nothing but fancy names given to the basic math we do on this matrix.

4. What are the most commonly used Classification Evaluation Metrics?

Some of the most commonly used classification evaluation metrics are stated below:

  1. Accuracy — The measure of correct prediction among all the predictions
  2. Recall — (True Positive Rate) — The measure of correct positive predictions among all the actual positives
  3. Precision — The measure of correct positive predictions among all the predicted positives
  4. F1-score — An optimal blend of Recall and Precision

Consider this example again.

And for people who are high, ‘#’ means ‘number of’.

1. Accuracy

Accuracy is the measure of how correctly our model predicts. It is the measure of how many correct predictions we made out of all the predictions. Since TP and TN account for the number of correct predictions, the formula for Accuracy would be:

ACCURACY = # correct predictions / all predictions

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For the above spam classification, the accuracy score can be given by : Accuracy = (30+250)/(30+250+8+12) = 280/300 = 0.9333
Accuracy = 0.9333 = 93.3%

It means that, overall, the model classified 93.3% of the emails correctly (positives as positives and negatives as negatives).

2. Recall

Recall is the measure of correctly predicted positives out of all the actual positives. It is the measure of how many of the actual spam emails we correctly caught as spam. Since the correct positive predictions are only TP, and the actual positive labels are TP and FN, the formula for recall would be:

RECALL = # correct predicted positives / # actual positives

Recall = TP / (TP + FN)

Recall can also be stated as how sensitive our positive predictions were to the correct positive labels. Hence it is also called Sensitivity or True Positive Rate(TPR) — The rate of correct positives(TP) out of actual positives.

For the above spam classification, the recall can be given by:
Recall = 30/(30+8) = 30/38 = 0.7894
Recall = 0.7894 = 78.94%

It means that out of all the actual positive labels, only 78.94% were predicted as positive.

3. Precision

Precision is the measure of correctly predicted positives out of all the predicted positives. It is the measure of how precisely we identified the actual spams among all the emails that were predicted as spam. The correct positive predictions are only TP, while all positive predictions are TP and FP (emails that were actually negative but predicted positive). Hence the formula for Precision would be:

PRECISION = # correct predicted positives / total # predicted positives

Precision = TP / (TP + FP)

For the above spam classification, the precision can be given by:
Precision = 30/(30+12) = 30/42 = 0.7142
Precision = 0.7142 = 71.42%

It means that out of all the positive predictions, only 71.42% positive predictions were correct.

4. F1-Score

F1-score is an optimal blend of precision and recall. For problems where we need to reduce both the FPs and the FNs, we use the F1-score, which provides a balance between recall and precision. The F1-score is the harmonic mean (HM) of precision and recall. The harmonic mean is chosen because, compared to the arithmetic mean (AM) and the geometric mean (GM), the HM penalizes the model the most when even one of precision and recall is low.

F1-SCORE = Harmonic mean of Precision and Recall

F1-Score = 2*(Precision*Recall) / (Precision + Recall)

For the above spam classification, the F1-score can be given by:
F1-Score = 2*0.7142*0.7894 / (0.7142+0.7894) = 1.1276/1.5036 = 0.75
F1-Score = 0.75 = 75%

It means that our model has a balanced precision-recall score of 75%.
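Here are the recall, precision, and F1 calculations from the last three sections as a quick Python check, using the same counts from the spam example:

    # Counts from the spam example above
    tp, tn, fp, fn = 30, 250, 12, 8

    recall = tp / (tp + fn)                              # 30/38 = 0.7894...
    precision = tp / (tp + fp)                           # 30/42 = 0.7142...
    f1 = 2 * precision * recall / (precision + recall)   # 0.75
    print(round(recall, 4), round(precision, 4), round(f1, 4))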

For the ones confused between Precision and Recall, here’s a hack:

Precision — What proportion of positive identifications were actually correct?

Recall — What proportion of actual positives were identified correctly?

Some other evaluation metrics:

  • Misclassification rate / Error rate — the total number of wrong predictions out of all the predictions

Error rate = (FP + FN) / (TP + FP + TN + FN)

  • False Positive Rate — incorrectly predicted positives (were actually negative but predicted positive) out of all the actual negative labels

FPR = FP / (FP + TN)

  • True Negative Rate / Specificity — correct negative predictions out of all the actual negative labels

TNR = TN / (TN + FP)

The latter ones are not used as often on their own. Accuracy, precision, recall, and F1 are what you will mostly use to check a model's performance, while rates like the TPR and FPR come back into play when plotting beautiful curves like the AUC-ROC curve and the Precision-Recall curve.
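And once more with the same counts from the spam example, the three rates above in Python:

    # Counts from the spam example above
    tp, tn, fp, fn = 30, 250, 12, 8

    error_rate = (fp + fn) / (tp + fp + tn + fn)   # 20/300  = 0.0666... (i.e. 1 - accuracy)
    fpr = fp / (fp + tn)                           # 12/262  = 0.0458...
    tnr = tn / (tn + fp)                           # 250/262 = 0.9541... (i.e. 1 - FPR)
    print(round(error_rate, 4), round(fpr, 4), round(tnr, 4))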

5. Evaluation Metrics in Python

The code for calling any classification metric in Python is just plain English. Well, see for yourself:
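Here is a minimal sketch with scikit-learn. The y_true and y_pred lists are made-up placeholders; in practice they would be your test labels and your model's predictions:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score)

    # Made-up labels for illustration: 1 = spam, 0 = not spam
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))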

We can use the above code to get any metric we want as per our needs. But there is a clever way of displaying all the metrics for our results in just one line of code:
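Something along these lines, again with placeholder labels:

    from sklearn.metrics import classification_report

    # Made-up labels for illustration: 1 = spam, 0 = not spam
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

    print(classification_report(y_true, y_pred, target_names=["not spam", "spam"]))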

The above code will print the entire classification report for our model's results in tabular form, representing the precision, recall, and f1-score for each class (along with the overall accuracy), like this:

Well, there are many more cool functions inside the sklearn.metrics API and I would recommend checking the documentation here.

~And that’s it for this one.
