MACHINE LEARNING

Advanced Evaluation Metrics for Imbalanced Classification Models

Measure Imbalanced Models the right way!

Rajneesh Tiwari
CueNex


Tour of ML Metrics for Imbalanced Dataset

Recap

In the last blog (link), we put together various strategies for Imbalanced Learning. We also looked at their advantages and disadvantages in detail, along with code for more obscure techniques.

With the basics about Imbalanced Learning in place, we will now delve deeper into a specific aspect of ML training, i.e. choosing the right Metrics for an Imbalanced Data use case.

In this blog, we will focus mostly on Imbalanced Classification use cases, and the metrics will also be in the same context.

What is an Evaluation Metric in Machine Learning?

When you train Machine Learning models, there are specific objectives you want to achieve with them. These objectives are business decisions that depend on factors such as risk appetite, the growth stage of the business, the growth strategy, and so on.

For example, in Fraud Detection use cases, we often work with established/large banks to flag Fraudulent Transactions. In most cases, large banks have a greater risk/loss appetite, meaning they care more about customer experience and want to limit false positives.

This essentially means they want us to flag Fraud cases only when the model is highly confident in its predictions, even if this means the model incorrectly flags a few fraudulent cases as genuine ones. So a tradeoff is established based on business rules.

This business decision impacts how we train our Machine Learning models, and we often set this expectation at the start of our engagements by deciding on a Machine Learning model Metric.

Confusion Matrix, the Building Block

The Confusion Matrix provides a good baseline view of how to construct a variety of Evaluation Metrics.

A confusion matrix is a performance measurement tool, often used for machine learning classification tasks where the output of the model could be 2 or more classes (i.e. binary classification and multiclass classification). The confusion matrix is especially useful when measuring recall, precision, specificity, accuracy, and the AUC of a classification model.

To conceptualize the confusion matrix better, it’s best to grasp the intuitions of its use for a binary classification problem. Without any annotations, the confusion matrix would look as follows:

Confusion Matrix

For a binary classification problem, the confusion matrix is a 2 x 2 matrix containing 4 types of outcomes:

  • TP — True Positive
  • TN — True Negative
  • FP — False Positive
  • FN — False Negative

Here TP and TN denote the number of examples classified correctly by the classifier as positive and negative respectively, while FN and FP indicate the number of misclassified positive and negative examples respectively.
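To make this concrete, here is a minimal sketch using scikit-learn's confusion_matrix; the labels and predictions below are made up for illustration.

from sklearn.metrics import confusion_matrix

# hypothetical ground truth and thresholded predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# for binary labels, ravel() unpacks the 2 x 2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")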

Weighted Balanced Accuracy

The first important metric for Imbalanced Data cases is Weighted Balanced Accuracy (WBA). This metric adjusts the Accuracy metric using class weights, wherein the rarer classes receive higher weights.

Note that WBA is a thresholded metric, meaning we first have to apply a threshold to the model's scores to get actual binarized predictions.

Let's look at an example below:

Numerical Example for Balanced Accuracy

In the above example, we have correctly classified 4 out of 8 samples, hence the accuracy is 50%.

However, the balanced accuracy is 58%, which takes the class imbalance into account. Balanced accuracy is the average of the per-class recalls; for the binary case it is (Sensitivity + Specificity) / 2.

In general, we can formulate Weighted Balanced Accuracy as a generalized form of Balanced Accuracy, where each class's recall is multiplied by a class weight:

Weighted Balanced Accuracy

Typically, we want to select class weights such that each weight is within [0,1], and across classes these sum to 1.

Criteria for selecting class weights

In most imbalanced data use cases, the rare class is the more important one, so we generally choose weights to reflect this. One very commonly used formulation is the Normalized Inverse Class Frequency.

Normalized Inverse Class Frequency

The Weighted Balanced Accuracy reaches its optimal value at 1 and its worst value at 0.
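Putting the pieces together, here is a minimal sketch of Weighted Balanced Accuracy with normalized inverse class frequency weights. The function name and the example labels are my own for illustration, not part of a standard library; with equal class weights it reduces to scikit-learn's balanced_accuracy_score.

import numpy as np

def weighted_balanced_accuracy(y_true, y_pred, class_weights=None):
    # Weighted average of per-class recall. If class_weights is None,
    # normalized inverse class frequencies are used, so rarer classes
    # receive higher weights (weights sum to 1).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)

    if class_weights is None:
        inv_freq = np.array([1.0 / np.sum(y_true == c) for c in classes])
        class_weights = inv_freq / inv_freq.sum()

    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.sum(class_weights * recalls))

# hypothetical imbalanced example: 6 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]
print(weighted_balanced_accuracy(y_true, y_pred))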

F-beta Score

The F-beta score is a robust scoring mechanism for both balanced and imbalanced use cases. It is the generalized form of the commonly used F1 score, which corresponds to a β value of 1. F-beta takes into account both Precision and Recall and combines them as a weighted harmonic mean.

F-beta is also a thresholded metric, meaning we first have to apply a threshold to get actual binarized predictions.

Note that Precision is the fraction of positive predictions that are actually positive, i.e. TP / (TP + FP).

Recall is the fraction of actual positives that the model correctly identifies, i.e. TP / (TP + FN).

The F1 score is calculated as the simple harmonic mean of Precision and Recall. The generalized form, the F-beta score, weighs the relative contribution of Precision and Recall via the β parameter.

FBeta Formula

Considering the same numerical example as above:

Numerical Example for F-Beta

We can calculate the following based on the TP, FP, TN, and FN counts:

  • Precision = 0.75
  • Recall = 0.6
  • F1 = 0.66
  • F2 = 0.625

A larger value of beta gives more weight to Recall, while a smaller value of beta gives more weight to Precision; for example, F2 emphasizes Recall and F0.5 emphasizes Precision.

The F-beta score reaches its optimal value at 1 and its worst value at 0.
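As a quick check of these numbers, we can use scikit-learn's fbeta_score. The labels below assume a confusion matrix with TP = 3, FP = 1, FN = 2, TN = 2, one combination consistent with the precision of 0.75 and recall of 0.6 quoted above.

from sklearn.metrics import fbeta_score, precision_score, recall_score

# hypothetical labels consistent with TP=3, FP=1, FN=2, TN=2
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

print(precision_score(y_true, y_pred))      # 0.75
print(recall_score(y_true, y_pred))         # 0.6
print(fbeta_score(y_true, y_pred, beta=1))  # F1 ~ 0.667
print(fbeta_score(y_true, y_pred, beta=2))  # F2 = 0.625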

Probabilistic F Score

Owing to the obvious drawback of having to choose a thresholding function for the traditional F-scores, this extension of the traditional F score accepts probabilities instead of binary classifications.

It was first proposed in this (link) paper as an extension of Precision, Recall, and F1 scores for a more thorough evaluation of Classification Models.

It has a few advantages vis-a-vis the thresholded F-Scores.

  1. Lower likelihood of being undefined (NaN)
  2. Sensitive to the model’s prediction confidence scores
  3. Lower variance, so the point estimates generalize better to test datasets
  4. In the majority of cases, they rank candidate models the same way as their threshold-based counterparts’ population values

Here is the math behind the function:

Let M be the matrix of class confidence scores for an m-class problem with classes C_1, …, C_m. Then, for each observation, the class with the highest confidence is the model’s prediction.

Model confidence Matix

By applying the model M to the full dataset S, we obtain a confidence score M(x_i, C_j) for each i ∈ {1, …, n} and j ∈ {1, …, m}. Suppose S_j denotes the set of samples with true class C_j. We can build a probabilistic confusion matrix pCM as follows:

Probabilistic Confusion Matrix

Intuitively, each cell (j_ref, j_hyp) of the confusion matrix corresponds to the total confidence score assigned by the model to hypothesis j_hyp for the samples whose true class is j_ref. It is very similar to the usual definition of a confusion matrix, apart from the fact that we leverage all confidence scores as quantitative values rather than just the highest-scoring class as a qualitative value.

From this probabilistic confusion matrix, cRecall and cPrecision are calculated in the same way that Recall and Precision are from the non-probabilistic (regular) confusion matrix.
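Here is a minimal NumPy sketch of this construction, assuming M is an (n_samples, n_classes) array of confidence scores and y_true holds integer class indices; both are made up here for illustration.

import numpy as np

# hypothetical confidence scores for 4 samples and 3 classes (rows sum to 1)
M = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 1])  # true class index of each sample

n_classes = M.shape[1]
# pCM[j_ref, j_hyp] = total confidence given to class j_hyp
# over the samples whose true class is j_ref
pCM = np.zeros((n_classes, n_classes))
for j_ref in range(n_classes):
    pCM[j_ref] = M[y_true == j_ref].sum(axis=0)

print(pCM)

cPrecision for class j is then pCM[j, j] divided by the sum of column j, and cRecall is pCM[j, j] divided by the sum of row j, mirroring the usual definitions.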

Here is a Python implementation of the probabilistic F Score.

def pfbeta(labels, predictions, beta):
    # Probabilistic F-beta: precision and recall are computed from the
    # raw predicted probabilities instead of thresholded predictions.
    y_true_count = 0  # number of positive samples
    ctp = 0           # "continuous" true positives
    cfp = 0           # "continuous" false positives

    for idx in range(len(labels)):
        # clip the prediction into [0, 1]
        prediction = min(max(predictions[idx], 0), 1)
        if labels[idx]:
            y_true_count += 1
            ctp += prediction
        else:
            cfp += prediction

    # guard against division by zero when there are no positive labels
    # or the model assigns zero confidence everywhere
    if y_true_count == 0 or (ctp + cfp) == 0:
        return 0

    beta_squared = beta * beta
    c_precision = ctp / (ctp + cfp)
    c_recall = ctp / y_true_count
    if c_precision > 0 and c_recall > 0:
        return (1 + beta_squared) * (c_precision * c_recall) / (beta_squared * c_precision + c_recall)
    return 0

#credits: https://www.kaggle.com/code/sohier/probabilistic-f-score/notebook
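As a quick usage example, the function can be called directly on a list of binary labels and predicted probabilities; the values below are made up for illustration.

labels = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
predictions = [0.9, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6, 0.8]  # hypothetical probabilities

print(pfbeta(labels, predictions, beta=1))  # probabilistic F1
print(pfbeta(labels, predictions, beta=2))  # probabilistic F2 (recall-weighted)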

Consider the numerical example below, which is based on the Python implementation shown above:

Numerical Example for pF1 calculation

From the above calculations, we obtain:

  • pPrecision = 0.685
  • pRecall = 0.659
  • pF1 = 0.672

Note that pFScore is based on probabilities and is NOT a thresholded metric.

Precision-Recall Curve (AUC-PR)

Precision-Recall Curve is a very commonly used measure of prediction efficacy when the classes are very imbalanced. Precision measures relevancy and is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that were retrieved.

The precision-recall curve depicts the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

Example PR Curve (Source: sklearn)

Note that PR Curve is based on probabilities and is NOT a thresholded metric.

There is a similar metric, the AUC-ROC curve, which works similarly to AUC-PR but is based on the TPR and FPR.

However, the AUC-ROC curve is not preferred under severe imbalance, as it can produce overly optimistic results, especially if the number of rare-class samples is very small. Generally, for imbalanced scenarios, we prefer PR-AUC as the metric of choice.
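Here is a minimal sketch with scikit-learn, using made-up labels and scores; average_precision_score is one common way to summarize the PR curve as a single number.

from sklearn.metrics import precision_recall_curve, average_precision_score

# hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 0, 0, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.2, 0.15, 0.3, 0.45, 0.8, 0.65, 0.5, 0.35, 0.05]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(average_precision_score(y_true, y_scores))  # PR-AUC style summary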

How to choose the right Metric for Imbalanced Data Scenarios?

At CueNex, we know there is no single metric that is right for every case. We often discuss with our clients their error preferences. Specifically, we try to understand whether the business is more sensitive toward False Positives or False Negatives.

Generally, if the positive/rare class is more important, then we use PR-AUC in combination with F1/F2/F0.5/pF scores.

  • If both False Negatives and False Positives are equally important then we use F1-Score
  • If False Positives are more important than False Negatives then we use F0.5-Score
  • If False Negatives are more important than False Positives then we use F2-Score
  • If the output is measured via Probabilities, then we use pF1/pF2/pF0.5 as per the scenarios above

Hope you liked this blog. We will delve much deeper into case studies in the coming set of blogs, where we take an in-depth look at the data and how we solve fraud at CueNex.
