Imbalanced Learning in Banking

Amirpasha Shirazinia
Swedbank AI
Apr 9, 2020 · 10 min read


Much like anomaly detection, imbalanced learning is a common practice in banking. In this post, we will discuss a number of methods that assist us in our daily work at Swedbank.

Introduction

In previous posts, we have discussed anomaly detection and its importance in banking analytics and machine learning applications. In this post, we will look at a closely related area in which anomaly detection can be regarded as the extreme case.

Having a data set with imbalanced classes is a frequent problem in binary classification. An imbalanced data set is one where the size of one class (the majority class) is much larger than that of another class (the minority class). A clear example is churn analysis: churners (the minority class) tend to make up a very small portion of the customer base.
Imbalanced data is tightly connected to rare-event phenomena. Such rare events usually convey key information, so incorporating them adequately in models carries significant weight. According to Google AI, the different degrees of imbalance in data can be summarized as:
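  • Mild: the minority class makes up 20–40% of the data set
  • Moderate: 1–20% of the data set
  • Extreme: less than 1% of the data set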

A wide variety of the classification problems we encounter in daily banking, such as churn prediction or fraud detection, deal with imbalanced data sets. Quite often we face extreme cases where the proportion of the minority class lies between 0.1% and 0.2%.

In ordinary classification problems, a disparity between class frequencies can have a disadvantageous impact on model fitting. Machine learning techniques naturally tend to learn more from the majority class and hence become biased in their predictions. As a result, such models are inclined to predict almost all future data points as the majority class. This is usually a consequence of the default design of ordinary ML methods, which are set up to maximize accuracy: in a heavily imbalanced data set, classifying all observations as the majority class always guarantees high accuracy.

Methodologies to handle imbalanced data

The different strategies that can be employed to handle imbalance fall into two main categories:

  1. Data manipulation: modifying data sets
  2. Model manipulation: cost sensitive learning

In the remainder of this post we will go through some of the common approaches in each category.

Modifying data sets

The main idea behind the techniques in this category is to balance the data set (e.g. 50–50) by resampling the original data, by synthesizing new samples for the minority class, or by a hybrid approach that combines both.

1. Resampling strategies (down- & up-sampling)

The most common ways to balance an imbalanced data set are to either randomly remove samples from the majority class, i.e. down-sampling, or randomly sample (with replacement) from the minority class to reach the same size as the majority class, i.e. up-sampling.

Down-sampling and up-sampling have their own advantages and disadvantages. Both have low implementation complexity. However, the main risk with down-sampling is the loss of potentially useful data. The main drawbacks of up-sampling are overfitting, since data points from the minority class are duplicated, and increased training time due to the larger data set.
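As a minimal sketch of both strategies in base R, assuming a data frame df with a binary factor column y whose minority level is "pos":

pos <- df[df$y == "pos", ]
neg <- df[df$y == "neg", ]
down <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])                  # down-sample the majority class
up   <- rbind(neg, pos[sample(nrow(pos), nrow(neg), replace = TRUE), ])  # up-sample the minority class with replacement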

2. Strategies for generation of synthetic data

The abovementioned methods focus on randomly removing or repeating samples. To avoid the potential problems that can occur after resampling, it is possible to instead synthesize artificial samples for the minority class. This can be done in a systematic way through different techniques. We will review some of the most well-known methodologies.

2a. SMOTE (Synthetic Minority Oversampling Technique)

SMOTE [1] is an up-sampling method which creates synthetic samples from the minority class using the following approach:

  1. Randomly choose a sample from the minority class (red circle in the figure below),
  2. Using K-nearest-neighbors (KNN), select its K nearest samples from the minority class (K = 5 in the figure),
  3. Randomly select one sample within that neighborhood,
  4. Draw a line between the two samples and synthesize a new sample along that line (orange star).

An illustration of this method is shown in the figure below:

Note that SMOTE and SMOTE-like algorithms are also often combined with down-sampling of the majority class.

While the SMOTE algorithm is rather straightforward to implement, its main drawback is that it gives the same weight to all samples from the minority class, so samples that are harder to learn may remain out of reach in specific cases.
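To make the four steps concrete, here is a minimal R sketch that synthesizes one new sample; X_min (a numeric matrix containing only minority-class rows) and k are assumptions for this example:

smote_one <- function(X_min, k = 5) {
  i   <- sample(nrow(X_min), 1)          # step 1: pick a random minority sample
  d   <- as.matrix(dist(X_min))[i, ]     # distances from sample i to all minority samples
  nn  <- order(d)[2:(k + 1)]             # step 2: its k nearest neighbors (skipping itself)
  j   <- sample(nn, 1)                   # step 3: pick a random neighbor
  gap <- runif(1)                        # step 4: interpolate along the connecting line
  X_min[i, ] + gap * (X_min[j, ] - X_min[i, ])
}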

2b. ADASYN (Adaptive Synthetic Sampling)

ADASYN [2] is based on the idea of adaptively generating minority samples according to their distribution. In some cases, it potentially addresses the main drawback of the SMOTE algorithm by concentrating the synthetic generation on minority-class samples that are non-trivial to learn. The following figure demonstrates how ADASYN differs from SMOTE in terms of synthetic data generation:

The minority-class samples around the corners, which are harder to learn, are generated more frequently by the ADASYN algorithm.
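Both algorithms are available in R through the smotefamily package (also used in our experiments below). A minimal usage sketch, assuming a numeric feature data frame X and a binary target vector y:

library(smotefamily)
sm <- SMOTE(X, y, K = 5)   # SMOTE with 5 nearest neighbors
ad <- ADAS(X, y, K = 5)    # ADASYN: adaptive, density-driven generation
balanced <- sm$data        # original plus synthetic samples; target in column "class"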

2c. ROSE (Random Over-Sampling Examples)

ROSE [3] is a bootstrap-based technique that aids the task of binary classification in the presence of rare classes. In short, the bootstrap and the smoothed bootstrap serve as methods for estimating properties of the unknown class distributions.

ROSE handles both continuous and categorical data by generating synthetic examples from conditional density estimates of the two classes. Based on these estimates, artificial samples are synthesized in the neighborhood of the minority class using a user-defined kernel, e.g. a Gaussian kernel (with estimated mean vector and covariance matrix). The figure below shows the original data points on the left; on the right, we see samples for the minority class synthesized by the ROSE algorithm (red circles: majority, blue circles: minority; figure taken from the original paper [3]).

The ROSE algorithm is easy to implement; however, the samples it synthesizes can be inaccurate, as the algorithm relies on kernel density estimates that are symmetric by default and whose smoothing parameters need tuning.
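A minimal usage sketch with the ROSE package in R, assuming a data frame df with a binary factor column y:

library(ROSE)
rose_df <- ROSE(y ~ ., data = df, seed = 42)$data  # balanced sample drawn via the smoothed bootstrap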

Cost-sensitive learning

The methods discussed above aim at engineering the original imbalanced data set. An alternative approach is to instead adjust the loss/cost function. Using cost-sensitive learning, we can assign different weights to the majority and minority classes directly in the cost function. By doing so, the training process puts a larger emphasis on correctly learning the samples from the minority class. An example of a weighted cross-entropy cost function is given below (weights are shown by the coefficients C_{i,j}):
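With N training samples and classes indexed by j, one common form is

L = -\sum_{i=1}^{N} \sum_{j} C_{i,j} \, y_{i,j} \log p_{i,j}

where y_{i,j} indicates whether sample i belongs to class j, and p_{i,j} is the model's predicted probability of that event.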

By letting the coefficients C_{i,j} equal one and the index j run over a set of unit cardinality (i.e., no class imbalance is considered), the formulation above boils down to the ordinary cross-entropy cost function.

Here, for the purpose of demonstration, we focus on the XGBoost algorithm, whose R and Python implementations expose parameters related to its cost function.

The following snippet shows how the parameters of the XGBoost model can be adjusted in R to deal with imbalanced data sets. The parameter 'scale_pos_weight' can be tuned for this purpose.

# sumwpos / sumwneg: total weight (here, count) of positive / negative samples,
# e.g. sumwpos <- sum(labels == 1); sumwneg <- sum(labels == 0)
weighting_XGB.pars = list("objective" = "binary:logistic",
                          "scale_pos_weight" = sqrt(sumwneg / sumwpos),  # up-weights the minority (positive) class
                          "eta" = 1e-2,
                          "gamma" = 0.1,
                          "subsample" = 0.7,
                          "colsample_bytree" = 0.7,
                          "max_depth" = 7,
                          "lambda" = 0.1,
                          "alpha" = 0.1,
                          "eval_metric" = "auc",
                          "silent" = 1,
                          "nthread" = 8)

Experimental results

We compare the abovementioned methodologies for handling the imbalanced data problem by studying their performance and complexity on one of our in-house use cases (UC) with the following class-imbalance properties:

UC1: 102K records, 31 features, minority/majority ratio: 0.00674

The data set contains numerical features and is split into training and test sets with a 0.8/0.2 ratio. We use XGBoost to fit the engineered data sets with 5-fold cross-validation (CV), except for the cost-sensitive learning method, where we use weighted XGBoost. We did not run a grid search for hyperparameter tuning; the XGBoost parameters were instead chosen so that they provide the highest accuracy.

We use the R programming language on a local machine with 32 GB of RAM, a Core i7 CPU at 3.40 GHz, and the 64-bit Windows 10 operating system. For SMOTE and ROSE we used the caret package, and for XGBoost we used the xgboost package. In the results below, we also show the performance of a few other methods, including DBSMOTE, BLSMOTE, ANS, RSLS and SLS, which are all variations of the SMOTE algorithm; these algorithms, as well as ADASYN, are part of the smotefamily package. Lastly, we perform down- and up-sampling as well as SMOTE and ROSE both inside and outside the cross-validation folds. The suffixes 'inside' and 'outside' in the plots below refer to these two scheduling approaches.
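Sampling inside the CV folds can be scheduled through caret's trainControl. A minimal sketch, assuming a training data frame train_df with a binary factor outcome Class:

library(caret)
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary,
                     sampling = "down")            # or "up", "smote", "rose"
fit <- train(Class ~ ., data = train_df, method = "xgbTree",
             metric = "ROC", trControl = ctrl)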

Performance

For performance evaluation, we calculate the AUCs on the test data (first bar plot) while keeping an eye on the differences between training and test AUCs (second bar plot) in order to check the degree of overfitting for the different methods.

Note that the percentage on top of each bar in the plot is relative to a baseline method: XGBoost without any resampling or cost-sensitive learning (shown in grey).
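A minimal sketch of this evaluation with the pROC package, assuming vectors of true labels and predicted probabilities for the training and test sets (all four names are assumptions):

library(pROC)
auc_train <- auc(roc(y_train, p_train))
auc_test  <- auc(roc(y_test,  p_test))
auc_train - auc_test    # this gap indicates the degree of overfitting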

The higher the AUC, the better the performance. Comparing the different methods, we see that a slight majority perform better than the baseline model. Among the studied methods, down-sampling, weighted XGBoost and SMOTE outperform the others. It is also interesting that performing the sampling inside the CV folds improves the down-sampling method, whereas for SMOTE the opposite is the case. As expected, up-sampling suffers critically from overfitting. Weighted XGBoost also shows a high degree of overfitting. However, from our experiments, an overfitting degree of around 10–15% can be acceptable in some of our use cases.

Complexity analysis

Analysis of model complexity in terms of computational expenditure needs a thorough study, which we may take up in a future post. Here, we provide model runtimes for the top three candidates: down-sampling (inside/outside), SMOTE (inside/outside) and weighted XGBoost for one use case.

Findings and takeaways

Generally, there are several reasons for using resampling/synthetic data generation methods over cost-sensitive learning methods:

  • Not all machine learning algorithms come with implementations of cost-sensitive loss functions.
  • Sampling methods are often less complex.
  • Many highly skewed data sets tend to be large, and the size of the training set must be reduced to make training feasible. In such cases, down-sampling is a fast and reasonable strategy.

Conversely, there are reasons for using cost-sensitive learning methods over resampling/synthetic data generation methods:

  • Cost-sensitive learning requires no pre-processing, i.e. no selection and tuning of an algorithm to generate artificial samples.
  • Cost-sensitive learning often provides better performance compared to resampling/synthetic data generation methods.

More specifically, what we can conclude from the results above is that there is no single best-performing technique. Given the results of our study (test performance, overfitting, complexity and implementation), the following methods are worth considering for handling imbalanced data sets:

  1. Weighted XGBoost (package available in R and Python)
  2. Down-sampling (outside CV is less complex; inside CV is available in the caret package in R)
  3. SMOTE (outside CV is less complex; inside CV is available in the caret package in R)

Further discussion

  1. Most of the resampling and synthetic data generation algorithms work only with numerical features. One way to treat categorical features is to one-hot encode them and consider the resulting dummies as numerical values (see the sketch after this list). Later, during synthetic data generation, we should make sure the generated categorical samples have the same distributional characteristics as the original ones.
  2. One-hot encoding of categorical features may result in sparse data sets. This can in turn lead to sparsely represented categories with zero or near-zero variance. In these situations, sampling strategies designed for sparse data sets can be of use, e.g. conditional random sampling.
  3. Due to differences in how the methods and strategies above are implemented, results may differ between runs in different packages or programming languages.
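Regarding the first point, a minimal sketch of one-hot encoding with base R's model.matrix, assuming a data frame df whose target column is y:

X_num <- model.matrix(~ . - 1, data = df[, setdiff(names(df), "y")])  # dummy-codes all categorical columns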

References

[1] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP, SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002, 16: 341–378.

[2] Haibo He, Yang Bai, E. A. Garcia and Shutao Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, 2008, pp. 1322–1328.

[3] Lunardon N, Menardi G, Torelli N, ROSE: a Package for Binary Imbalanced Learning. The R Journal. 2014, 6:82–92.


Amirpasha Shirazinia is a Senior Data Scientist at Analytics & AI, Swedbank.