Evaluation Metrics Part 1

For Classification Models!

Siladittya Manna
The Owl
4 min read · Jun 20, 2020



When building a statistical or machine learning model, it is very important to evaluate it. Evaluation metrics measure the quality of a machine learning model and are an essential part of any project. To improve a model and reach the desired performance, we need such metrics to get feedback and to form an idea of the robustness and generalization capability of the model.

Let us first look at some basic quantities which are often used to express the other metrics concisely.

True Positive (TP)

Number of samples which are predicted positive and are also labeled as positive.

True Negative (TN)

Number of samples which are predicted negative and are also labeled as negative.

False Positive (FP)

Type-I error

Number of samples which are predicted positive but are actually labeled negative.

False Negative (FN)

Type-II error

Number of samples which are predicted negative but are actually labeled positive.

PICTORIAL REPRESENTATION

Confusion Matrix


The confusion matrix is a table used to describe the performance of a classification model. It is composed of four parts, TP, TN, FP and FN (which have already been discussed), and it is very useful for computing the evaluation metrics below.
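
To make the four counts concrete, here is a minimal sketch (not from the original post) that reads them off scikit-learn's confusion_matrix, assuming binary labels encoded as 0/1; the labels and predictions below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For labels ordered [0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=3, FP=1, FN=2
```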

Accuracy

Accuracy is the proportion of samples that are predicted correctly. It is one of the most commonly used metrics. Accuracy can be expressed in terms of the four quantities above as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, this metric does not prove to be very useful on an imbalanced dataset. Suppose a dataset contains positive and negative samples in the ratio 83 : 17, and the model you build predicts positive for all samples. Then the TP, TN, FP and FN values for your model are 83, 0, 17 and 0 respectively, and its accuracy works out to 83%. However, this model does not predict any of the negative samples correctly, which is not at all desirable.
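
To see the pitfall in action, a small sketch with made-up data matching the 83 : 17 example above (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Imbalanced toy dataset: 83 positive samples and 17 negative samples
y_true = np.array([1] * 83 + [0] * 17)

# A degenerate "model" that predicts positive for every sample
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.83, even though no negative sample is ever detected
```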

Two metrics which resolve this issue with accuracy, by providing a different outlook on the problem at hand, are Precision and Recall.

So, what is this different outlook?

Precision

Positive Predictive Value

Precision is the proportion of truly positive samples out of all the positively predicted samples. That is, out of everything the model predicted as positive, it tells us the fraction that is actually positive.

Precision = TP / (TP + FP)

Recall

Probability of Detection

Recall is the proportion of samples correctly predicted positive out of all the actually positive samples. That is, it gives us the fraction of the positively labeled samples which have been correctly identified by the model.

Recall = TP / (TP + FN)

Sensitivity

True Positive Rate, Power

Sensitivity is the same as Recall.

Sensitivity = TP / (TP + FN)

Specificity

True Negative Rate

Specificity is the proportion of negatively labeled samples which are predicted correctly, that is, predicted negative.

Specificity = TN / (TN + FP)
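
Putting the last few definitions together, here is a small helper sketch (my own illustration, not from the post) that computes these ratios directly from the four counts, evaluated on the all-positive predictor from the 83 : 17 example:

```python
def precision(tp, fp):
    # Positive predictive value: TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Sensitivity / true positive rate: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate: TN / (TN + FP)
    return tn / (tn + fp)

# All-positive predictor on the 83 : 17 dataset: TP=83, TN=0, FP=17, FN=0
print(precision(83, 17))   # 0.83
print(recall(83, 0))       # 1.0  -- looks perfect...
print(specificity(0, 17))  # 0.0  -- ...but no negative sample is ever identified
```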

F1 Score

F1 score is the harmonic mean of Precision and Recall (the more general F-score is a weighted harmonic mean of the two). For an imbalanced dataset, the F1 score is a good evaluation metric to use, as it takes FP and FN into account along with TP; note that TN does not enter into it at all.

The general formula for the F-score (Fβ) is

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

The F1-score, or balanced F-score, is the β = 1 case:

F1 = 2 · Precision · Recall / (Precision + Recall)
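
As a quick sketch of the two formulas (assuming precision and recall have already been computed):

```python
def f_beta(precision, recall, beta=1.0):
    # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the F1 score
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Using the precision (0.83) and recall (1.0) of the all-positive predictor above
print(f_beta(0.83, 1.0))  # ~0.907
```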

Precision-Recall Curve

A precision-recall curve shows the relation between precision and recall at every possible decision threshold. Recall is plotted along the x-axis and Precision along the y-axis. One important note about the PR curve is that TN is not used anywhere in its construction.
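
If you want to reproduce such a curve yourself, scikit-learn's precision_recall_curve computes precision and recall at every threshold of a model's scores; the sketch below uses synthetic labels and scores purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic binary labels and continuous scores standing in for a real classifier's output
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.4 * y_true + rng.normal(0.3, 0.25, size=200), 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)   # recall on the x-axis, precision on the y-axis
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()
```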


Check out Parts 2, 3 and 4 for more on evaluation metrics and how to measure the uncertainty associated with them.
