Balancing classes for highly imbalanced data

Avani Sharma
2 min read · Jan 9, 2018


In fraud detection, the vast majority of your data belongs to a single class, which makes it hard for the model to learn the patterns of the minority class.

To achieve class balance, there are a few common approaches to sampling the training data for highly imbalanced binary classification problems:

  • under sample the negative class
  • oversample the positive class
  • simulate the positive class
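The first approach, undersampling the negative class, can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data (the array shapes and 5% positive rate are assumptions, not from the article): keep every positive example and randomly sample an equal number of negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up imbalanced dataset: 50 positives out of 1000 (5% positive rate)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(1000, 3))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# Keep all positives; sample an equal number of negatives without replacement
neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_sample])
rng.shuffle(keep)

X_bal, y_bal = X[keep], y[keep]  # balanced 50/50 training set
```

In practice you would tune the sampling ratio rather than always going to 50/50; libraries such as imbalanced-learn wrap this same idea.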

When working with highly imbalanced classes, the common industry practice is to undersample the negative class so the model can learn adequately. In the fraud detection problems from my previous work, I undersampled the negative class in about 95% of cases, because:

  • there is more data than the available resources can handle
  • there are enough negative samples that a downsampled set still represents the data well

Rarely have I used the oversampling approach successfully. Recently, when modeling a small segment of users, the bad rate was very low (<1%) and the dataset was small (<500K records). I was reluctant to downsample the negative examples lest I lose any negative-class patterns.

To get an effect similar to oversampling the positive class, I tried the scale_pos_weight parameter of the XGBoost algorithm (via its scikit-learn-compatible API) to control the balance between positive and negative classes. The suggested setting for this parameter is (# of negative records / # of positive records).
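Computing the suggested value is a one-liner. A minimal sketch (the counts below are made up for illustration); the XGBoost call is shown as a comment since the exact model setup will vary:

```python
import numpy as np

# Made-up training labels: 970 negatives, 30 positives
y_train = np.array([0] * 970 + [1] * 30)

neg, pos = np.bincount(y_train)
scale_pos_weight = neg / pos  # (# negatives) / (# positives)

# Then pass it to XGBoost's scikit-learn API, e.g.:
# model = xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)
# model.fit(X_train, y_train)
```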

Advantages of this approach:

  • It avoids modifying the training data, yet changes the weight of positive observations.
  • It is similar to a cost-sensitive approach, where the model learns to classify the rare event better because its incorrect predictions are penalized more heavily.

I did a grid search over the parameter and set it accordingly. Below are the results on the test set that I thought were worth sharing:

You can see that scale_pos_weight = 30 is significantly better than the default value.
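The grid search over the weight can be sketched as follows. To keep the example self-contained I illustrate it with scikit-learn's class_weight on a logistic regression, which is the same cost-sensitive knob as scale_pos_weight; the candidate values and synthetic data are assumptions, not the article's actual grid or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data: ~3% positive class
X, y = make_classification(n_samples=2000, weights=[0.97, 0.03], random_state=0)

# Sweep candidate positive-class weights, scoring by average precision
scores = {}
for w in [1, 10, 30, 100]:
    clf = LogisticRegression(class_weight={0: 1, 1: w}, max_iter=1000)
    scores[w] = cross_val_score(clf, X, y, scoring="average_precision", cv=3).mean()

best_w = max(scores, key=scores.get)
```

With XGBoost you would put scale_pos_weight in the param grid instead and use GridSearchCV in the same way.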

Conclusion:

For each problem's dataset, it is important to experiment with various parameters and optimize the algorithm for that feature space.

Tips:

  • Do not oversample or undersample a single class in the test data! Test data should be a true representation of the real-world population. We've all been there :)
  • For highly imbalanced classes, use precision-recall curves for model evaluation rather than ROC-AUC. You care about predicting true positives correctly, so precision is an important metric.
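The second tip can be sketched with scikit-learn's metrics. A minimal example on synthetic imbalanced data (the dataset and classifier are stand-ins, not from the article): compute both the precision-recall curve and ROC-AUC so you can compare them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic data with ~1% positive rate
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, probs)
ap = average_precision_score(y_te, probs)   # summarizes the PR curve
auc = roc_auc_score(y_te, probs)            # ROC-AUC, often optimistic here
```

On heavily imbalanced data ROC-AUC can look flattering because the huge negative class dominates the false-positive rate, while average precision reflects how well the rare positives are actually ranked.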

