MACHINE LEARNING

Imbalanced Learning in Fraud Prevention

Views and Solutions from the trenches

Rajneesh Tiwari
CueNex

--

What is Imbalanced Learning in ML?

Imbalanced Learning refers to Supervised ML modeling scenarios in which the target is overwhelmingly dominated by a single label. This is very common in many real-world scenarios such as Credit Card Fraud, Insurance Claims Fraud, Rare Disease Diagnosis, Anomaly Detection, Manufacturing Defect Detection, Churn Prediction, and so on!

For example, in Breast Cancer Detection, cancerous cases (the positive label) typically make up no more than 1% of instances. Similarly, in Anomaly Detection, Credit Card Fraud Detection, and P&C Claims Fraud, true positives rarely constitute more than 1% of instances.

In fact, you’d be surprised by how common label imbalance is across the business use cases we solve at CueNex. In our Fraud Detection work, for instance, one of the most challenging tasks after Feature Engineering is accounting for such huge class imbalances in our datasets.

Typically, at CueNex, we classify any use case with a class distribution more severe than 95%-5% (majority-minority) as an Imbalanced Dataset. This lets us route these use cases through our imbalanced ML pipelines and solve them more efficiently, rather than treating them as generic ML use cases.

More generally, for a k class classification problem, we define imbalance severity as:

Imbalance Severity for k class classification
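
The original post renders this definition as an image. A common formulation (stated here as an assumption, not necessarily the exact expression in the original figure) is the ratio of the largest to the smallest class count, where n_i is the number of instances of class i:

Imbalance Severity = max(n_1, ..., n_k) / min(n_1, ..., n_k)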

Why is Imbalanced Learning so important?

Imbalanced datasets pose distinct challenges for both classical Machine Learning and Deep Learning.

Challenges for Classical Machine Learning:

  1. Classical ML algorithms tend to learn decision boundaries biased toward the majority class
  2. Classical ML models tend to have poor probability calibration in Imbalanced Data Scenarios
  3. The learned signal overwhelmingly comes from the majority class, so minority-class patterns are under-represented
  4. This leads to models that overfit the majority class and generalize poorly on the minority class

Challenges for Deep Learning:

  1. Deep Neural Networks tend to memorize labels in Imbalanced Data Scenarios
  2. Explicit regularization techniques such as augmentations and dropout don't work well
  3. Implicit regularization methods such as SGD and residual connections don't work well either. Recall that SGD converges to the solution with the smallest norm
  4. This leads to models that don't generalize at all on test sets

Imbalanced Learning Approaches

Broadly, approaches to handling imbalanced data scenarios fall into the following families:

Data-level Approaches: These methods alter the underlying dataset to make it balanced (e.g., by oversampling or undersampling), which makes them classifier-agnostic. They focus on resampling or on learning more robust representations.

Algorithm-level Approaches: These methods modify the training procedure itself to make classifiers robust to skewed distributions. They are tied to specific learning models, which often makes them more specialized but less flexible than their data-level counterparts. Typical algorithm-level modifications include identifying and correcting the mechanisms that suffer from class imbalance, cost-sensitive learning, and one-class classification.

Taxonomy of Imbalanced Approaches

Popular Imbalanced Learning Methods

Random Undersampling

In this data-based approach, we undersample the Majority class such that the final dataset has almost equal representation of both/all classes.

This method ensures more or less equal representation of both classes but ends up losing quite a bit of information by throwing out a lot of Majority class samples.

# Import Imblearn which is a convenient package for sampling
from imblearn.under_sampling import RandomUnderSampler

# Define the undersampling strategy and initialize the sampler
undersample = RandomUnderSampler(sampling_strategy='majority')

# fit and apply the transform
X_undersampled, y_undersampled = undersample.fit_resample(X, y)

Random Oversampling

In this data-based approach, we randomly oversample the Minority class, such that the final dataset has an almost equal representation of both/all classes.

This method leads to a lot of redundant training examples as we simply copy/duplicate Minority class observations.

# Import Imblearn which is a convenient package for sampling
from imblearn.over_sampling import RandomOverSampler

# Define the oversampling strategy and initialize the sampler
oversample = RandomOverSampler(sampling_strategy='minority')

# fit and apply the transform
X_oversampled, y_oversampled = oversample.fit_resample(X, y)

SMOTE

SMOTE stands for Synthetic Minority Oversampling Technique. It is a data oversampling technique that uses intelligent data augmentation to oversample the minority class without simply duplicating existing examples, thereby avoiding the redundancy of Random Oversampling.

Consider a case where X1 is a nearest neighbor of a minority-class sample X0. SMOTE then creates a new synthetic sample on the line segment between the two:

X_new = X0 + λ × (X1 − X0), where λ is drawn uniformly at random from [0, 1].
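
For intuition, here is a minimal NumPy sketch of that interpolation step (a toy illustration with made-up values, not the imblearn implementation):

# minimal sketch of SMOTE's interpolation step (illustrative values only)
import numpy as np

x0 = np.array([1.0, 2.0])        # a minority-class sample
x1 = np.array([1.5, 2.5])        # one of its nearest minority-class neighbours
lam = np.random.uniform(0, 1)    # random interpolation factor in [0, 1]
x_new = x0 + lam * (x1 - x0)     # synthetic sample on the segment between x0 and x1

In practice, we simply use imblearn's implementation: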
#import SMOTE from imblearn package
from imblearn.over_sampling import SMOTE

# initialize the SMOTE oversampler
oversample = SMOTE()

# fit and apply the transform
X_smote, y_smote = oversample.fit_resample(X, y)

Cost Sensitive Training

Cost-Sensitive or Class Weighted training aims to tackle imbalanced datasets by applying uneven penalties or costs when making predictions. The idea is to apply higher penalties for mistakes in predicting minority classes.

In many cases, we can weight the cost function based on the labels. For example, we can assign a weight W_c to all instances of class c.

This can be achieved by simply passing a class_weight dictionary to popular sklearn training APIs, or by setting pos_weight in Pytorch-based loss functions.

# Scikit Learn Example of Class Weighted Learning
# Import LogisticRegression from sklearn linear models
from sklearn.linear_model import LogisticRegression

# Fit the model, passing the class weights via the class_weight argument
# Note that we use 0.5 and 2 as weights for the majority and minority classes resp.
logreg = LogisticRegression(C=1e9, class_weight={0: 0.5, 1: 2}).fit(X, y)

# Pytorch Example of a Class Weighted Loss
import torch
import torch.nn.functional as F

# Note that we use pos_weight=5 for the rare (positive) class
loss = F.binary_cross_entropy_with_logits(logits, target, pos_weight=torch.tensor(5.0))
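
How large should pos_weight be? A common heuristic, and the one suggested in the PyTorch documentation for BCEWithLogitsLoss, is the ratio of negative to positive samples. A minimal sketch, assuming a binary 0/1 label array y:

# derive pos_weight from the label distribution (assumes a binary 0/1 label array y)
n_pos = (y == 1).sum()
n_neg = (y == 0).sum()
pos_weight = torch.tensor(n_neg / n_pos, dtype=torch.float32)  # e.g. a 99:1 split gives 99.0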

Focal Loss

Focal Loss (introduced by Lin et al. in the 2017 paper "Focal Loss for Dense Object Detection") addresses class imbalance during training in tasks like classification and object detection.

Focal loss applies a modulating term to the cross entropy loss in order to focus learning on hard misclassified examples. It is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples.

Let p_t be the model's predicted probability of the ground-truth class. Then, for predefined α and γ, Focal Loss is defined as:

FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t)
Focal Loss Curve

What is the role of Alpha and Gamma in the above equation?

γ controls the shape of the Focal Loss curve. The higher the value of γ, the lower the loss for well-classified examples, turning the model's attention more towards hard-to-classify examples.

α allocates a higher weight to the rare class and a lower weight to the majority class.

Here is an example of Focal loss for Binary Classification written in Pytorch.

# import torch, the nn module, and the functional API
import torch
import torch.nn.functional as F
from torch import nn

# define the focal loss class for binary classification
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        # per-class weights: alpha for class 0, (1 - alpha) for class 1
        # (.cuda() assumes a GPU; drop it for CPU-only training)
        self.alpha = torch.tensor([alpha, 1 - alpha]).cuda()
        self.gamma = gamma

    def forward(self, inputs, targets):
        # per-sample BCE, kept unreduced so it can be re-weighted
        BCE_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        targets = targets.type(torch.long)
        # pick the alpha weight that matches each sample's label
        at = self.alpha.gather(0, targets.data.view(-1))
        # pt is the predicted probability of the true class
        pt = torch.exp(-BCE_loss)
        # the modulating factor (1 - pt)**gamma down-weights easy examples
        F_loss = at * (1 - pt) ** self.gamma * BCE_loss
        return F_loss.mean()
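
A quick usage sketch with hypothetical tensors (the inputs are placed on the GPU because the class above keeps its alpha weights there):

# hypothetical batch of 16 logits and binary labels
criterion = FocalLoss(alpha=0.25, gamma=2)
logits = torch.randn(16, requires_grad=True).cuda()   # raw model outputs
targets = torch.randint(0, 2, (16,)).float().cuda()   # binary 0/1 labels as floats
loss = criterion(logits, targets)
loss.backward()   # behaves like any other PyTorch loss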

In-batch Data Sampling for Deep Learning

When training Deep Learning models, we use Data Samplers to continuously feed batches of data to the model.

Here, we can perform oversampling or undersampling by assigning per-sample weights in the Pytorch DataLoader.

The sampler below assigns 1/class_count as the weight of each training sample, which effectively means that rare-class instances are much more likely to be picked by the data loader and seen by the neural network.

# import torch and the weighted random sampler
import torch
from torch.utils.data import WeightedRandomSampler

# the function below yields a WeightedRandomSampler built from an integer label tensor
def class_imbalance_sampler(labels):
    # number of samples per class
    class_count = torch.bincount(labels.squeeze())
    # weight of each class is 1 / class_count
    class_weighting = 1. / class_count
    # per-sample weight, looked up from the sample's class
    sample_weights = class_weighting[labels.squeeze()]
    sampler = WeightedRandomSampler(sample_weights, len(labels))
    return sampler

# now use the sampler in the DataLoader like below:
loader_train = torch.utils.data.DataLoader(dataset_train,
                                           batch_size=batch_size,
                                           sampler=class_imbalance_sampler(train_labels))

A big disadvantage of this kind of weighted sampling is that we cannot ensure any batch-level consistency in the mix of positive and negative classes.

For example, one batch of size 16 might have 8 positive instances, while the next batch might have 11. This can lead to poor learning and generalization in neural networks.

A Better Balanced Sampler for Deep Learning

Most of the time, we want to impose constraints on how many samples of a particular class go into any batch, while still conveying to our network that one of these classes is rare. So what should we do when we want to ensure that k observations in a batch of size b come from the minority class?

Well, we can use a custom-defined sampler like below:

import numpy as np
import torch

class BalanceSampler(torch.utils.data.Sampler):
    # yields indices so that every consecutive group of `ratio` samples
    # contains exactly 1 positive and (ratio - 1) negatives
    def __init__(self, dataset, ratio=8):  # ratio=8 is an illustrative default
        self.r = ratio - 1
        self.dataset = dataset
        self.pos_index = np.where(dataset.df.target > 0)[0]
        self.neg_index = np.where(dataset.df.target == 0)[0]

        # number of negatives used per epoch (truncated to a multiple of r)
        self.length = self.r * int(np.floor(len(self.neg_index) / self.r))
        # total samples yielded per epoch: negatives plus one positive per group
        self.ds_len = self.length + (self.length // self.r)

    def __iter__(self):
        pos_index = self.pos_index.copy()
        neg_index = self.neg_index.copy()
        np.random.shuffle(pos_index)
        np.random.shuffle(neg_index)

        # arrange negatives in rows of r, and draw one positive per row
        # (positives are drawn with replacement, i.e. oversampled)
        neg_index = neg_index[:self.length].reshape(-1, self.r)
        pos_index = np.random.choice(pos_index, self.length // self.r).reshape(-1, 1)

        # interleave: each row becomes [1 positive, r negatives]
        index = np.concatenate([pos_index, neg_index], -1).reshape(-1)
        return iter(index)

    def __len__(self):
        return self.ds_len

### Code credit: https://www.kaggle.com/hengck23
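
A usage sketch under the same assumptions as the code above (a dataset object exposing a df.target column, which is hypothetical here). Choosing a batch_size that is a multiple of ratio guarantees the same class mix in every batch:

# sketch: plug the balanced sampler into a DataLoader
# with ratio=8 and batch_size=32, every full batch holds exactly 32 / 8 = 4 positives
sampler = BalanceSampler(dataset_train, ratio=8)
loader_train = torch.utils.data.DataLoader(dataset_train,
                                           batch_size=32,
                                           sampler=sampler,
                                           drop_last=True)   # drop the ragged final batch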

Conclusion

In this blog, we discussed the imbalanced class scenario and how it affects rare-event classification problems such as Fraud Detection. The degree of imbalance in class distributions may vary, but severe imbalance is more challenging to model and requires experience and specialized techniques.

At CueNex, we utilize the best-suited method or combination of the methods listed above to address unique challenges posed by the imbalanced class distribution.

Another very important aspect while dealing with Imbalanced datasets is the choice of Metric. We will delve deeper into the choice of metrics in the next blog.

See you at the next one! 👋
