Classification Model Evaluation Metrics In A Nutshell: Part I

sanjukta sarker
4 min read · Jan 12, 2024


Evaluation metrics are crucial for assessing a classification model's performance and judging how accurate its predictions are. However, each metric has its pros and cons, and for an optimal evaluation it's important to know which metrics to use and how to use them.

In this article I discuss three important metrics for evaluating classification performance, explaining why and when to use them.

Topics discussed:

  • Accuracy
  • Confusion matrix
  • AUC/ROC

Accuracy

Accuracy is the ratio of correct predictions to all predictions made. The formula is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

A model with an accuracy of 1 identifies all the True Positives and True Negatives correctly, i.e., it is 100% accurate. Generally speaking, a model should have at least 70% accuracy, though the acceptable threshold varies depending on the application. The higher the accuracy, the better.
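
As a quick illustration, here is a minimal sketch using scikit-learn's accuracy_score (the labels below are made-up toy data, not from any real dataset):

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# accuracy = (TP + TN) / total predictions
print(accuracy_score(y_true, y_pred))  # 0.8 -> 8 out of 10 predictions are correct
```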

Pros:

  • It is easy to calculate
  • Easy to understand at a glance

It is commonly used as a first-pass evaluation metric.

Issues to consider

  • It is sensitive to imbalanced data. For example, in a binary classification problem where 30% of the outcomes are negative and 70% are positive, the data is imbalanced. You should check the class balance ratio of the dataset, as in the sketch after this list.
  • It is not robust to outliers. It is important to handle outliers, or their presence may significantly distort the reported accuracy.
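
A minimal sketch of such a class-balance check, assuming the labels live in a pandas Series named y (a hypothetical variable for illustration):

```python
import pandas as pd

# Hypothetical label column: 1 = positive, 0 = negative
y = pd.Series([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

# Proportion of each class; a large gap signals an imbalanced dataset
print(y.value_counts(normalize=True))
# 1    0.7
# 0    0.3
```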

Confusion matrix

A confusion matrix gives a numerical breakdown of the correct and incorrect predictions made by the model, split into four types.

Let’s consider an example: suppose we have to predict whether a patient is diabetic. If a patient is diabetic, the label is positive; if not, it is negative.

TP (True Positive): TP represents the correct positive predictions made by the model. These are the cases where the value is positive in the dataset and the model also predicts positive. For the diabetes dataset, TP counts the diabetic patients correctly identified as diabetic.

TN (True Negative): This is the number of correct predictions where a patient is non-diabetic. That is, the value is actually negative in the dataset and the model predicts negative.

FP (False Positive): These are the cases where the model labels a non-diabetic patient as diabetic. That is, FP counts the cases where the value is negative in the dataset but the model predicts positive, which is an incorrect prediction.

FN (False Negative): It is the opposite of FP. These are the cases where the model labels a diabetic patient as non-diabetic. So, FN counts the cases where the value is actually positive in the dataset but the model predicts negative. Another incorrect prediction.
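
As a quick sketch, here is how the four counts can be computed with scikit-learn's confusion_matrix on made-up diabetes labels (1 = diabetic, 0 = non-diabetic):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = diabetic (positive), 0 = non-diabetic (negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes;
# ravel() unpacks the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=4, FP=1, FN=1
```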

Pros:

  • Provides a detailed breakdown of prediction performance.
  • Highlights a model’s ability to correctly identify positive instances, which is crucial in certain applications (e.g., medical diagnoses).
  • It highlights where the model needs improvement; for example, some models may struggle to detect negative values, in which case balancing the dataset or removing outliers may improve performance.

Issues to consider

  • An imbalanced dataset may result in misleading counts and under-performance; in such cases you might notice an abundance of True Negative cases.

AUC/ROC

AUC/ROC is one of the most popular evaluation metrics and is widely used to summarize a model's performance. AUC stands for Area Under the Curve, and ROC stands for Receiver Operating Characteristic.

The ROC curve plots the True Positive Rate (Sensitivity) against the False Positive Rate (1 − Specificity) at different threshold values. It shows the ability of a model to distinguish between classes.

AUC is a single number representing the area under the ROC curve. AUC values range from 0 to 1, where 0.5 corresponds to a model that performs no better than random guessing.

For a perfect model the AUC is 1 and the ROC curve hugs the top-left corner of the plot. As the score distributions of the two classes overlap more, model performance decreases, and the worst case is reached when they overlap completely and the AUC drops to 0.5.
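
A minimal sketch computing the ROC curve and AUC with scikit-learn, assuming the model outputs probability scores for the positive class (the arrays below are toy values):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground-truth labels and predicted probabilities for the positive class
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.1, 0.6, 0.85, 0.25]

# False positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Area under the ROC curve; closer to 1.0 means better class separation
print(roc_auc_score(y_true, y_scores))
```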

Why should you use AUC/ROC?

  • AUC is threshold-independent, which makes it less susceptible to class imbalance. So even if the dataset is slightly imbalanced, AUC/ROC can still provide a good measure of model performance.
  • The ROC curve explicitly visualizes the trade-off between sensitivity (true positive rate) and specificity (true negative rate) at different thresholds.

Issues to consider

  • AUC/ROC is designed for binary classification problems and may not be directly applicable to multiclass classification scenarios.
  • A heavily imbalanced dataset introduces the risk of bias in the ROC curve, which can result in a misleading model evaluation.
