
How to handle an imbalanced dataset


Machine Learning models learn best when they are given a similar number of examples for each label class in a dataset. Many real-world problems are not nearly as balanced, however. Take, for example, a fraud detection use case, where we’re building a model to identify fraudulent credit card transactions. Fraudulent transactions are much rarer than regular transactions, so there is far less data on fraud cases available to train a model.

In such cases, it is important to evaluate the performance of the model using appropriate metrics that consider the class imbalance. Precision, recall, and F1 score are commonly used metrics for imbalanced datasets. It is also important to carefully choose the threshold for classification to optimize the performance of the model for the minority class.
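As a quick illustration of threshold choice, here is a minimal sketch of how lowering the default 0.5 cutoff trades precision for recall on the minority class. The synthetic dataset and the logistic regression model are my own illustrative assumptions, not something from this article:

```python
# Minimal sketch: lowering the decision threshold trades precision for recall.
# Synthetic imbalanced data; all names and numbers are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]          # probability of the minority (positive) class

for threshold in (0.5, 0.3, 0.1):              # default cutoff vs. lower cutoffs
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          "precision:", round(precision_score(y_te, y_pred, zero_division=0), 2),
          "recall:", round(recall_score(y_te, y_pred), 2))
```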

We will dedicate more attention to metrics later. First, let’s look at why accuracy can be misleading.

MISLEADING ACCURACY

A common mistake in training models with imbalanced label classes is relying on misleading accuracy values for model evaluation. If we train a fraud detection model on a dataset where only 5% of the transactions are fraudulent, chances are that our model will reach 95% accuracy. That figure is technically correct, but there’s a good chance the model is simply guessing the majority class, which means it isn’t learning anything about how to distinguish the minority class from the other examples in our dataset.
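To see this in numbers, here is a small sketch using scikit-learn’s DummyClassifier on synthetic 95/5 data. This setup is an illustrative assumption, not the article’s actual model:

```python
# Sketch: a "model" that always predicts the majority class still scores ~95% accuracy
# on a 95/5 dataset, while catching zero fraud cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.05).astype(int)   # ~5% fraudulent (positive) labels
X = rng.normal(size=(10_000, 3))              # features are irrelevant to the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

print("accuracy:", accuracy_score(y, y_pred))   # ~0.95, looks great
print("recall:  ", recall_score(y, y_pred))     # 0.0 -- never catches a fraud case
```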

Using the model’s predicted values and the corresponding actual values, we can create a confusion matrix:

[Figure: confusion matrix]

In this example, the model correctly guesses the majority class 95% of the time but only guesses the minority class correctly 12% of the time.
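If you want to build such a matrix yourself, scikit-learn’s confusion_matrix is the usual tool. The labels below are toy values, not the ones behind the figure above:

```python
# Sketch: building a confusion matrix with scikit-learn (toy labels).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # 1 = minority (fraud) class
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```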

Since accuracy can be misleading on imbalanced datasets, it’s important to choose an appropriate evaluation metric when building the model. There are also techniques for handling inherently imbalanced datasets at the data level, the two most common being down-sampling and up-sampling.

Down-sampling is a technique used to balance an imbalanced dataset by randomly removing instances from the majority class, thus reducing the imbalance between the two classes. The aim of down-sampling is to create a subset of the majority class that is similar in size to the minority class, which can help prevent the model from being biased toward the majority class.

One common way of down-sampling is to randomly select a subset of instances from the majority class that is equal in size to the minority class. Another way is to remove instances from the majority class until the two classes are balanced. However, down-sampling can cause loss of information from the original dataset, and the model may not learn from the full range of instances.
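A minimal sketch of random down-sampling, assuming a pandas DataFrame with an illustrative "label" column (1 = fraud) and using scikit-learn’s resample utility; the data itself is made up for the example:

```python
# Sketch of random down-sampling: keep only as many majority rows as there are minority rows.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "amount": range(1000),
    "label": [1] * 50 + [0] * 950,   # 1 = fraud (minority), 0 = legitimate (majority)
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Randomly select a majority subset equal in size to the minority class
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)

balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)  # shuffle
print(balanced["label"].value_counts())
```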

Up-sampling is another technique used to balance an imbalanced dataset by increasing the number of instances in the minority class, thus making it closer in size to the majority class. Up-sampling is done by replicating instances from the minority class, either randomly or using a more sophisticated technique like SMOTE (Synthetic Minority Over-sampling Technique). Up-sampling can help prevent the model from being biased towards the majority class and can improve the model’s ability to identify the minority class. However, it can also result in over-fitting, where the model becomes too specialized on the minority class and does not generalize well to new data.
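Below is a rough sketch of both variants: simple replication with scikit-learn’s resample, and SMOTE via the third-party imbalanced-learn package (assumed to be installed). The synthetic dataset is purely illustrative:

```python
# Sketch of up-sampling: (1) replicate minority rows, (2) synthesise new ones with SMOTE.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE   # third-party package: imbalanced-learn

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# Option 1: random over-sampling by replicating minority rows with replacement
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=(y == 0).sum(), random_state=42)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])

# Option 2: SMOTE creates synthetic minority samples instead of exact copies
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

print(np.bincount(y_bal), np.bincount(y_sm))   # both roughly balanced
```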

CHOOSING THE EVALUATION METRIC

As noted above, accuracy alone can be misleading on imbalanced datasets, so choosing an appropriate evaluation metric is essential. Imbalanced datasets can be handled at both the dataset level (resampling, as described above) and the model level (metrics and decision thresholds).

The best practice is to use metrics like precision, recall, or F-score to get a complete picture of how the model is performing.

Precision, recall, and F-score are commonly used metrics to evaluate the performance of a classification model.

Precision measures the proportion of true positive predictions among the total number of positive predictions made by the model. In other words, precision indicates how many of the positive predictions are actually correct. It is calculated as:

Precision = TP / (TP + FP)

where TP is the number of true positives (i.e., the number of cases where the model correctly predicted a positive class) and FP is the number of false positives (i.e., the number of cases where the model incorrectly predicted a positive class).

Recall, on the other hand, measures the proportion of true positive predictions among the total number of actual positive cases in the dataset. In other words, recall indicates how many of the positive cases in the dataset were correctly identified by the model. It is calculated as:

Recall = TP / (TP + FN)

where FN is the number of false negatives (i.e., the number of cases where the model incorrectly predicted a negative class, but the actual class was positive).

F-score is a combined metric that takes into account both precision and recall. It is calculated as the harmonic mean of precision and recall:

F-score = 2 * (precision * recall) / (precision + recall)

F-score provides a balanced view of the model’s performance, taking into account both false positives and false negatives. A higher F-score indicates a better performance of the model in terms of both precision and recall.
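In practice you rarely compute these by hand; scikit-learn provides all three. The toy labels below are chosen only to make the arithmetic easy to follow and are not from the article’s example:

```python
# Sketch: precision, recall, and F-score with scikit-learn (toy labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # TP=2, FN=1, FP=1, TN=4

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) ≈ 0.67
print(f1_score(y_true, y_pred))         # harmonic mean ≈ 0.67
```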

A perfect model would have both precision and recall of 1.0, but in practice, these two measures are often at odds with each other.

Let’s return to the fraud detection use case to see how each of these metrics plays out in practice. For this example, let’s say our test set contains a total of 1000 examples, 50 of which should be labeled as fraudulent transactions. For these examples, the model predicts 930/950 non-fraudulent examples correctly and 15/50 fraudulent examples correctly. Let’s visualize that:

[Figure: predictions for the fraud detection model]

In this case, our model’s precision is 15/35 (roughly 43%), recall is 15/50 (30%), and F-score is about 35%. These metrics do a much better job of capturing the model’s inability to correctly identify fraudulent transactions than accuracy does, which is 945/1000 (94.5%). Therefore, for models trained on imbalanced datasets, metrics other than accuracy are preferred.
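A quick sanity check of those numbers in plain Python, using the counts taken directly from the example above:

```python
# TP=15, FP=20 (950 - 930), FN=35 (50 - 15), TN=930, out of 1000 examples.
tp, fp, fn, tn = 15, 20, 35, 930

precision = tp / (tp + fp)                                  # 15 / 35  ≈ 0.43
recall    = tp / (tp + fn)                                  # 15 / 50  = 0.30
f_score   = 2 * precision * recall / (precision + recall)   # ≈ 0.35
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 945 / 1000 = 0.945

print(round(precision, 2), round(recall, 2), round(f_score, 2), accuracy)
```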

Note: when evaluating models trained on imbalanced datasets, we need to use data that has not been resampled when calculating the metrics. In other words, our test set should have roughly the same class balance as the original dataset.
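One way to respect this in code is to stratify the train/test split and resample only the training portion. The sketch below assumes a synthetic dataset and the imbalanced-learn package, both illustrative choices:

```python
# Sketch: resample the training split only; keep the test split at the original class balance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE   # third-party package: imbalanced-learn

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# stratify=y keeps the ~95/5 class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Balance the training data only; the test set stays untouched for honest metrics
X_tr_bal, y_tr_bal = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

print("test fraud rate:", y_te.mean(), "| resampled train fraud rate:", y_tr_bal.mean())
```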

CONCLUSION

If you are looking for a metric that captures the performance of the model across all thresholds, average precision (the area under the precision-recall curve) is a more informative metric than the area under the ROC curve for model evaluation. This is because average precision places more emphasis on how many predictions the model got right out of the total number it assigned to the positive class. This gives more weight to the positive class, which is important for imbalanced datasets. ROC AUC, on the other hand, treats both classes equally and is less sensitive to improvements on the minority class, which is not optimal in situations with imbalanced data.
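For completeness, a short sketch comparing the two metrics with scikit-learn on a synthetic, heavily imbalanced dataset (an illustrative setup, not a benchmark):

```python
# Sketch: average precision (area under the PR curve) vs. ROC AUC on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, roc_auc_score

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC AUC often looks comfortably high on skewed data, while average precision
# more directly reflects performance on the rare positive class.
print("ROC AUC:          ", round(roc_auc_score(y_te, proba), 3))
print("Average precision:", round(average_precision_score(y_te, proba), 3))
```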
