Evaluation Metrics Part 1
For Classification Models!
When building a statistical or machine learning model, it is very important to evaluate it. Evaluation metrics measure the quality of a model and are an essential part of any project. To improve a model and reach the desired performance, we need such metrics to get feedback and to gauge the robustness and generalization capability of the model.
Let us first look at some basic quantities which are often used to express other metrics easily.
True Positive (TP)
Number of samples which are predicted positive and are also labeled as positive.
True Negative (TN)
Number of samples which are predicted negative and are also labeled as negative.
False Positive (FP)
Type-I error
Number of samples which are predicted positive but are actually labeled negative.
False Negative (FN)
Type-II error
Number of samples which are predicted negative but are actually labeled positive.
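The four quantities above can be counted directly from a list of labels and a list of predictions. Here is a minimal sketch in plain Python; the labels and predictions are made-up illustrative values.

```python
# Count TP, TN, FP, FN from 0/1 labels and predictions (illustrative data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted 1, labeled 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted 0, labeled 0
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type-I error
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type-II error

print(tp, tn, fp, fn)  # 3 3 1 1
```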
Confusion Matrix
A confusion matrix is a table used to describe the performance of a classification model. It is composed of four parts, TP, TN, FP, and FN (which have already been discussed), and it is very useful for computing the evaluation metrics that follow.
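As a sketch, the matrix can be built as a 2x2 table indexed by (actual, predicted); the layout assumed here is [[TN, FP], [FN, TP]], and the data is illustrative.

```python
# Build a 2x2 confusion matrix: rows = actual label, columns = predicted label.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# matrix[0][0] = TN, matrix[0][1] = FP, matrix[1][0] = FN, matrix[1][1] = TP
print(matrix)  # [[3, 1], [1, 3]]
```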
Accuracy
Accuracy is the proportion of the samples predicted correctly. It is one of the most commonly used metrics. In terms of the four quantities above:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
However, this metric doesn’t prove to be very useful in the case of an imbalanced dataset. Suppose a dataset contains positive and negative samples in the ratio 83 : 17, and the model you build predicts positive for all samples. Then TP, TN, FP, and FN are 83, 0, 17, and 0 respectively, and the accuracy of the model works out to 83%. Yet this model does not predict a single negative sample correctly, which is not at all desirable.
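The 83 : 17 example can be reproduced in a few lines; the labels below simply encode that scenario, with a model that predicts positive for everything.

```python
# Imbalanced data: 83 positives, 17 negatives; model predicts positive always.
y_true = [1] * 83 + [0] * 17
y_pred = [1] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 83
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 0

# Accuracy = (TP + TN) / total looks high despite a useless model.
accuracy = (tp + tn) / len(y_true)
print(accuracy)  # 0.83
```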
Two metrics that can resolve this issue with accuracy, by providing a different outlook on the problem at hand, are Precision and Recall.
So, what is this different outlook?
Precision
Positive Predictive Value
Precision is the proportion of truly positive samples among all samples predicted positive. In other words, it is the fraction of positive predictions that are correct.
Recall
Probability of Detection
Recall is the proportion of samples correctly predicted positive out of all samples actually labeled positive. That is, it gives us the fraction of the positively labeled samples which have been correctly predicted by the model.
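Both definitions translate directly into small helper functions. This is a minimal sketch; the zero-denominator guards are a common convention, not part of the definitions themselves. The numbers plugged in are from the all-positive model in the imbalanced example (TP = 83, FP = 17, FN = 0).

```python
def precision(tp, fp):
    # Fraction of positive predictions that are actually positive.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Fraction of actual positives that the model found.
    return tp / (tp + fn) if tp + fn else 0.0

print(precision(83, 17))  # 0.83 - only 83% of positive predictions are right
print(recall(83, 0))      # 1.0  - every actual positive was found
```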
Sensitivity
True Positive Rate, Power
Sensitivity is the same as Recall.
Specificity
True Negative Rate
Specificity is the proportion of negatively labeled samples that are predicted correctly, that is, predicted negative.
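A sketch of specificity as a function, using the same guard convention as before. For a model that predicts positive on every sample, TN = 0, so specificity collapses to 0 even while accuracy looks high.

```python
def specificity(tn, fp):
    # Fraction of actual negatives correctly predicted negative.
    return tn / (tn + fp) if tn + fp else 0.0

# All-positive model from the imbalanced example: TN = 0, FP = 17.
print(specificity(0, 17))  # 0.0
```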
F1 Score
The F1 score is the harmonic mean of Precision and Recall (the general F-score is a weighted harmonic mean). In the case of an imbalanced dataset, the F1 score is a good evaluation metric to use, as it takes into account FP and FN along with TP; note that TN does not appear in it at all.
The general formula for the F-score is

F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)

The F1-score, or balanced F-score (beta = 1), is expressed as

F1 = 2 * (Precision * Recall) / (Precision + Recall)
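The weighted harmonic mean can be sketched as one function, with beta = 1 recovering the balanced F1 score. The precision/recall values plugged in below (0.83 and 1.0) come from the all-positive model in the imbalanced example.

```python
def f_beta(precision, recall, beta=1.0):
    # Weighted harmonic mean: beta > 1 favours recall, beta < 1 favours precision.
    num = (1 + beta**2) * precision * recall
    den = beta**2 * precision + recall
    return num / den if den else 0.0

print(round(f_beta(0.83, 1.0), 3))          # 0.907 (F1, balanced)
print(round(f_beta(0.83, 1.0, beta=2), 3))  # 0.961 (F2, recall-weighted)
```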
Precision-Recall Curve
A precision-recall curve shows the relation between precision and recall for every possible threshold value. Recall is plotted along the x-axis and precision along the y-axis. One important note about the PR curve is that TN is never used in constructing it.
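The curve can be sketched by sweeping a threshold over predicted scores and recomputing precision and recall at each step; the scores below are hypothetical, and notice that TN never appears in the computation.

```python
# Sweep thresholds over hypothetical scores to trace a precision-recall curve.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = []
for thr in sorted(set(scores)):
    preds = [1 if s >= thr else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    prec = tp / (tp + fp)  # at least one sample is predicted positive here
    rec = tp / (tp + fn)
    points.append((rec, prec))  # recall on the x-axis, precision on the y-axis

for r, p in points:
    print(f"recall={r:.2f} precision={p:.2f}")
```

Plotting the (recall, precision) pairs, e.g. with matplotlib, gives the PR curve; lowering the threshold pushes recall up while precision tends to fall.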