Relevant Metrics for Classification Problems

Jayanta Parida
Sep 6, 2022


This post covers some of the relevant metrics that can be used to evaluate the performance of a machine learning model on classification tasks. Before going through the metrics, we first have to prepare a confusion matrix.

Confusion Matrix — A confusion matrix is a 2×2 (or larger) matrix showing the number of correctly and incorrectly predicted samples for each class. For binary classification it has four entries, defined below (a small code sketch follows the definitions).

TN (True Negatives) — This is the number of samples whose true class is 0 and the model has correctly classified them as such.

FP (False Positives) — This is the number of samples whose true class is 0 but the model has incorrectly classified them as 1s.

FN (False Negatives) — This is the number of samples whose true class is 1 but the model has incorrectly classified them as 0s.

TP (True Positives) — This is the number of samples whose true class is 1 and the model has correctly classified them as such.
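
Here is a minimal sketch of how these four counts can be computed, with made-up NumPy arrays standing in for the true labels and the model's predictions:

```python
# A minimal sketch of the four confusion-matrix counts, using made-up labels.
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])  # hypothetical ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0])  # hypothetical model predictions

tp = np.sum((y_true == 1) & (y_pred == 1))   # positives predicted as positives
tn = np.sum((y_true == 0) & (y_pred == 0))   # negatives predicted as negatives
fp = np.sum((y_true == 0) & (y_pred == 1))   # negatives predicted as positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # positives predicted as negatives

# Rows are true classes, columns are predicted classes
# (the same layout scikit-learn's confusion_matrix uses).
confusion = np.array([[tn, fp],
                      [fn, tp]])
print(confusion)
```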

Using the above confusion matrix, we will look at four useful metrics for evaluating the performance of a classifier.

Accuracy is the ratio between the number of correctly predicted samples and the total number of samples: Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision is the ratio between the number of true positives and the number of all samples classified as positive: Precision = TP / (TP + FP).

Recall is the ratio between the number of true positives and the number of all samples whose true class is positive: Recall = TP / (TP + FN). This is also referred to as the True Positive Rate (TPR).

Similarly, we have the False Positive Rate (FPR), the fraction of truly negative samples that are incorrectly classified as positive: FPR = FP / (FP + TN).

F1 Score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall). It combines precision and recall into a single metric and penalizes low values of either more heavily than a simple average would.
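
As a quick sketch, all five quantities can be computed directly from the four confusion-matrix counts (the counts below are the hypothetical ones from the earlier example; scikit-learn's accuracy_score, precision_score, recall_score, and f1_score would give the same values from the raw labels):

```python
# Minimal sketch using hypothetical counts; in practice these come
# from the confusion matrix computed above.
tp, tn, fp, fn = 3, 3, 1, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction of all samples predicted correctly
precision = tp / (tp + fp)                    # how many predicted positives are truly positive
recall    = tp / (tp + fn)                    # how many true positives were found (TPR)
fpr       = fp / (fp + tn)                    # how many true negatives were flagged as positive
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, fpr, f1)
```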

We generally use Accuracy, Precision, Recall, and F1 when the binary classification problem has a roughly equal number of positive and negative samples. There are a few more metrics that can be used to evaluate classification models, depending on the characteristics of the problem. These are:

Area Under Curve (AUC) — When we plot the TPR against the FPR at different classification thresholds, we get the ROC curve, and the area under this curve is referred to as the AUC (a short code sketch follows the reference values below).

AUC = 1 means a perfect model

AUC = 0 means a perfectly wrong model (all of its predictions are inverted)

AUC = 0.5 means the predictions are no better than random
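
A rough sketch of computing the AUC with scikit-learn's roc_auc_score; the labels and scores below are made up, and the model is assumed to output a probability or score rather than a hard 0/1 label:

```python
# Minimal sketch: AUC needs scores/probabilities, not hard 0/1 predictions.
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]                # hypothetical ground truth
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # hypothetical predicted probabilities of class 1

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 1.0 would be perfect, 0.5 is no better than random
```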

Log Loss — This is another metric computed from the predicted probabilities; it penalizes an incorrect or far-off prediction heavily, especially when the model is confident about it.
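
A minimal sketch using scikit-learn's log_loss, with made-up labels and probabilities:

```python
# Minimal sketch: log loss compares predicted probabilities to the true labels.
from sklearn.metrics import log_loss

y_true  = [0, 1, 1, 0]            # hypothetical ground truth
y_proba = [0.1, 0.9, 0.8, 0.35]   # hypothetical predicted probabilities of class 1

print(log_loss(y_true, y_proba))  # lower is better; confident wrong answers blow this up
```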

The metrics we have seen so far are for binary classification. In multi-label classification, each sample can have one or more classes associated with it, so the metrics for this type of problem are different. Some commonly used metrics are:

Precision at K (P@K) — It is the number of hits in the predicted list, considering only the top-k predictions, divided by k.

Average Precision at K (AP@K) — It is based on P@K. To calculate AP@K for a sample, we calculate P@1, P@2, …, P@K and divide their sum by k.

Mean Average Precision at K (MAP@K) — It is simply the average of AP@K over all samples.

P@K, AP@K, and MAP@K all range from 0 to 1, with 1 being the best value. A small sketch of all three follows below.
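
Here is a minimal sketch of these three metrics, following the simplified definitions above; the helper names and the data are hypothetical:

```python
# Minimal sketch of P@K, AP@K, and MAP@K for multi-label predictions.

def p_at_k(true_labels, predicted_labels, k):
    """Hits in the top-k predictions, divided by k."""
    top_k = predicted_labels[:k]
    hits = len(set(top_k) & set(true_labels))
    return hits / k

def ap_at_k(true_labels, predicted_labels, k):
    """Average of P@1 ... P@k for one sample."""
    return sum(p_at_k(true_labels, predicted_labels, i) for i in range(1, k + 1)) / k

def map_at_k(all_true, all_pred, k):
    """Mean of AP@K over all samples."""
    scores = [ap_at_k(t, p, k) for t, p in zip(all_true, all_pred)]
    return sum(scores) / len(scores)

# Hypothetical example: two samples, each with its true label set
# and a ranked list of predicted labels.
all_true = [[1, 2, 3], [0, 2]]
all_pred = [[2, 3, 4, 1], [1, 0, 2, 3]]
print(map_at_k(all_true, all_pred, k=3))
```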

These metrics cover almost all classification problems. Once we are confident about which metric to use for a given problem, we can look into the models in more depth for improvements.
