How to Handle Imbalanced Data in Classification Problems

Hoang Minh
9 min read · Oct 10, 2018


An example of imbalanced data set — Source: More (2016)

If you have been working on classification problems for some time, there is a very high chance that you have already encountered data with imbalanced classes.

The name speaks for itself: an imbalanced data set occurs when there is an unequal representation of classes.

Imbalanced data is not always a bad thing, and in real data sets, there is always some degree of imbalance. That said, there should not be any big impact on your model performance if the level of imbalance is relatively low.

However, in certain areas such as fraud detection, medical diagnosis and risk management, severely imbalanced class distributions are relatively common, and therefore a concerning problem. For instance, at James, we use Data Science to help banks and financial institutions lower default rates, and we often confront severely imbalanced data sets where the minority class accounts for only 2–5% of observations.

Challenges

There are 3 main problems posed by data with an unequal class distribution. They are as follows:

  1. The machine problem: Machine learning (ML) algorithms are built to minimize errors. Since the probability of an instance belonging to the majority class is significantly high in an imbalanced data set, the algorithms are much more likely to assign new observations to the majority class. For example, in a loan portfolio with an average default rate of 5%, the algorithm has an incentive to classify new loan applications as non-default, since it would be correct 95% of the time.
  2. The intrinsic problem: In real life, the cost of a False Negative is usually much larger than that of a False Positive, yet ML algorithms penalize both equally. Let’s take credit scoring as an example: if your model predicts that a loan will default yet it turns out not to be the case, the maximum loss you are exposed to is the profit you would have made by issuing that loan. On the other hand, if your model classifies a defaulting loan as safe, the cost is substantially larger. Apart from the foregone profit, you will also lose, in the worst case, the entire issued amount.
  3. The human problem: In credit risk, common practices are often established by experts rather than empirical studies (Crone and Finlay, 2012). This is surely not optimal, given that your population might be very different from another bank’s population. Therefore, what works in a certain loan portfolio might not work in yours.

There are several articles addressing the issue of imbalanced data. However, they are mostly theoretical and do not provide much practical guidance on how to actually solve the problem. In this blog post, we aim to walk you through the most common solutions, as well as how to implement them using the Imbalanced-learn library. The methodology and pipeline will be carefully explained so that you can replicate the experiments on your own data set.

Solutions

There have been two different approaches to addressing imbalanced data: the algorithm-level approach and the data-level approach.

Algorithm approach: As mentioned above, ML algorithms penalize False Positives and False Negatives equally. A way to counter that is to modify the algorithm itself to boost predictive performance on the minority class. This can be done through either recognition-based learning or cost-sensitive learning. Feel free to check Drummond & Holte (2003), Elkan (2001), and Manevitz & Yousef (2001) in case you want to learn more about the topic.
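
As a minimal illustration of the cost-sensitive idea, the sketch below uses scikit-learn’s class_weight parameter to penalize minority-class errors more heavily. The 1:10 cost ratio is purely an illustrative assumption, not a value from our experiments.

    from sklearn.linear_model import LogisticRegression

    # Cost-sensitive learning: misclassifying the minority class (label 1) is
    # penalized ten times more heavily than misclassifying the majority class.
    # The 1:10 cost ratio is an illustrative assumption.
    cost_sensitive_clf = LogisticRegression(
        solver="liblinear",
        class_weight={0: 1, 1: 10},
    )
    # cost_sensitive_clf.fit(X_train, y_train) would then train on your own data.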

Data approach: This consists of re-sampling the data in order to mitigate the effect caused by class imbalance. The data approach has gained popular acceptance among practitioners as it is more flexible and allows for the use of the latest algorithms. The two most common techniques are over-sampling and under-sampling.

  1. Over-sampling increases the number of minority class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept. On the other hand, it is prone to overfitting.
  2. Under-sampling, in contrast to over-sampling, reduces the number of majority samples to balance the class distribution. Since it removes observations from the original data set, it might discard useful information. A minimal sketch of both techniques follows this list.
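
Here is that sketch, using Imbalanced-learn’s random samplers on a synthetic data set that stands in for real data (the ~5% minority ratio and random_state values are arbitrary choices for illustration):

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # Toy imbalanced data set (~5% minority class), standing in for a real one.
    X, y = make_classification(n_samples=10000, weights=[0.95], random_state=42)

    # Over-sampling: replicate minority observations until the classes are balanced.
    X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)

    # Under-sampling: discard majority observations until the classes are balanced.
    # (On imbalanced-learn versions before 0.4 the method is fit_sample instead.)
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

    print(Counter(y), Counter(y_over), Counter(y_under))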

Data set

We conducted experiments on two data sets, as described below:

  • The UCI data set with 30,000 observations and 24 features contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. With an imbalance ratio of almost 1:4, this data set is quite imbalanced and was selected for the experiments.
  • James’ data set with 56,098 observations and 18 features is constructed with synthetic samples yet still has the characteristics of a real one. It includes information on default loan payments with various features such as demographic factors, personal information, credit data, monthly income, etc. With a ratio of 1:50, this data set is severely imbalanced.

Methodology

Pipeline for the experiments

After the pre-processing step, the sampling techniques were applied. To ensure the model learns from all observations, one run of 10-fold cross-validation was executed for each experiment. In terms of algorithms, we chose Logistic Regression, as it has the class_weight hyper-parameter, which can be set to “balanced” and thus used as an algorithm-level technique. Its simplicity and common application in credit risk are also a big plus.
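
A minimal sketch of that evaluation loop is shown below. The synthetic data set is only a stand-in for a pre-processed credit portfolio, and the use of StratifiedKFold is our assumption about the splitter rather than a detail taken from the original pipeline.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Stand-in for a pre-processed credit data set (~5% minority class).
    X, y = make_classification(n_samples=10000, weights=[0.95], random_state=42)

    # E1: Logistic Regression with the "balanced" class_weight.
    model = LogisticRegression(solver="liblinear", class_weight="balanced")

    # One run of 10-fold cross-validation, scored with ROC AUC.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print("Mean ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))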

At the end, results were compared against the baseline model to test for significant differences. With 9 experiments in total, we opted for the Friedman and Nemenyi post-hoc tests over t-tests to minimize statistical error. Essentially, every time we conduct a t-test, we are prone to Type I error (normally around 5%). The more t-tests we run, the more Type I error we accumulate. Taking that to the extreme, if we run enough tests, almost any outrageous statement can be “proved”, such as green beans being linked to acne! With the combination of the Friedman and Nemenyi post-hoc tests, we can keep the Type I error at 5%.
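
One way to run this pair of tests in Python is sketched below, using scipy together with the scikit-posthocs package (which was not part of our original pipeline; it is simply one convenient implementation). The score matrix is filled with random placeholders, not our actual fold results.

    import numpy as np
    from scipy.stats import friedmanchisquare
    import scikit_posthocs as sp

    # Illustrative matrix of per-fold ROC AUC scores: rows are the 10 CV folds,
    # columns are the experiments (baseline, E1, ..., E9). Random placeholders.
    rng = np.random.default_rng(0)
    scores = rng.uniform(0.70, 0.80, size=(10, 10))

    # Friedman test: do all experiments come from the same distribution?
    stat, p_value = friedmanchisquare(*scores.T)
    print("Friedman chi-square = %.2f, p = %.4f" % (stat, p_value))

    # Nemenyi post-hoc test: pairwise comparisons that keep Type I error at 5%.
    print(sp.posthoc_nemenyi_friedman(scores))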

Experiments

A total of 9 experiments were carried out, using both data-level and algorithm-level imbalanced techniques. They are as follows:

  • “Balanced” class_weight hyper-parameter (E1) re-balances the data set by putting less weight on majority class instances. Specifically, the weight is inversely proportional to class frequency:

n_samples / (n_classes * np.bincount(y))

ENN and NearMiss-3 Visualization — Source: More (2016)
  • EditedNearestNeighbours under-sampling technique (E2_ENN): The ENN method was proposed by Wilson (1972), in which a majority instance is removed if its class label does not agree with those of its K nearest neighbors. The ENN method tends to remove noisy and borderline instances, which therefore enhances the accuracy of the decision boundary.
  • NearMiss 3 under-sampling technique (E3_NM): NearMiss-3 belongs to the NearMiss family, which under-samples the majority class based on its distance to the minority class. NearMiss-3 in particular first keeps, for each minority instance, its nearest majority neighbors, and then retains only those majority samples whose average distance to their nearest minority neighbors is the largest.
  • SMOTE over-sampling technique (E4_SMT): SMOTE first finds the K nearest neighbors of each minority instance. It then generates new synthetic data points along the line segments in feature space that join the instance to those neighbors.
  • ADASYN over-sampling technique (E5_ADS): Very similar to SMOTE, ADASYN also creates synthetic data points along feature-space vectors. The difference is that ADASYN adaptively generates more synthetic points for minority instances that are harder to learn, and adds a small random offset to the new points to allow for some variance, since observations are not perfectly correlated in real life.
  • EditedNearestNeighbours & “Balanced” class_weight (E6_ENN)
  • NearMiss 3 & “Balanced” class_weight (E7_NM)
  • SMOTE & “Balanced” class_weight (E8_SMT)
  • ADASYN & “Balanced” class_weight (E9_ADS)

All of these 9 techniques and combinations of techniques are available in the Imbalanced-learn library, making the development process much easier!
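
For instance, the samplers behind E2 to E9 can be wired together with a classifier as sketched below; the default hyper-parameters are an assumption for illustration, not our exact experimental settings.

    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE, ADASYN
    from imblearn.under_sampling import EditedNearestNeighbours, NearMiss
    from sklearn.linear_model import LogisticRegression

    # Samplers used in E2-E5 (default neighbour settings, for illustration only).
    samplers = {
        "E2_ENN": EditedNearestNeighbours(),
        "E3_NM": NearMiss(version=3),
        "E4_SMT": SMOTE(random_state=42),
        "E5_ADS": ADASYN(random_state=42),
    }

    # E6-E9 combine each sampler with the "balanced" class_weight from E1.
    pipelines = {
        name: Pipeline([
            ("sampler", sampler),
            ("clf", LogisticRegression(solver="liblinear", class_weight="balanced")),
        ])
        for name, sampler in samplers.items()
    }

Wrapping the sampler and the classifier in Imbalanced-learn’s Pipeline is not just a convenience: during cross-validation, the resampling is fitted on the training folds only, so the validation folds stay untouched and the scores are not optimistically biased.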

Findings & Discussions

UCI data set

Average ROC Scores across imbalanced techniques on UCI data set

In the UCI data set, 5 out of 9 techniques performed better than the baseline model, including “Balanced” class_weight (E1), SMOTE (E4_SMT), SMOTE & “Balanced” class_weight (E8_SMT), and ADASYN & “Balanced” class_weight (E9_ADS).

Here are some additional insights we can draw from the tests:

  1. Imbalanced techniques can improve model performance (5 out of 9 imbalanced techniques outperformed the baseline model). The result is consistent with studies by Japkowicz and Stephen (2002), which showed that imbalanced data sets reduce performance, and with Marques, Garcia and Sánchez (2012), which demonstrated the gains of using re-sampling techniques on imbalanced data sets.
  2. There is no single best-performing technique. Even though E1 has the highest average ROC score, the difference is not statistically significant compared with the other methods.
  3. Over-sampling techniques performed better than under-sampling techniques in this particular data set. In fact, this result is in line with similar experiments conducted by Crone and Finlay (2012) and by Marques, Garcia and Sánchez (2012). This might be due to the fact that under-sampling discards potentially useful information about the majority class, which is a well-known drawback of the technique.

James’ data set

Average ROC Scores across imbalanced techniques on James’ data set

Interestingly, the results from James’ data set are markedly different from those of the UCI data set. While 5 techniques performed better than the baseline model in the UCI data set, none did in our data set. In fact, 2 techniques performed worse, as can be seen in the chart above.

A possible explanation is the discrepancy in imbalance ratio (IR). James’ data set is severely imbalanced, with an imbalance ratio of 2%, whereas the imbalance ratio of the UCI data set is 25%. A hypothesis is that, ceteris paribus, sampling techniques perform differently under different levels of imbalance.

What’s next?

The next step for us would be testing the same experiments on more data sets. As you can see, the results from the UCI and James’ data sets are so divergent that it is not possible to draw any general conclusion. In addition, it is also worth testing imbalanced techniques under a range of imbalance ratios (e.g. 15%, 10%, 5%, 2% and 1%).
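
One way to set up that second experiment is to generate synthetic data sets across a grid of imbalance ratios, for example with scikit-learn’s make_classification; the sample size and feature count below are placeholders, not a commitment to any particular design.

    from sklearn.datasets import make_classification

    # Synthetic data sets at decreasing minority-class ratios, as suggested above.
    ratios = [0.15, 0.10, 0.05, 0.02, 0.01]
    datasets = {
        ratio: make_classification(
            n_samples=50000,
            n_features=18,
            weights=[1 - ratio],
            random_state=42,
        )
        for ratio in ratios
    }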

These two suggested experiments will allow us to understand how imbalanced techniques perform under different conditions. Ideally, we’d hope to be able to identify the most effective technique given the circumstances, and help our clients achieve the lowest default rates.

This work is one among many steps we have taken at the R&D team to challenge the status quo and established practices in the industry.

This is, however, easier said than done, as it is very tempting to do something just because everyone else does it.

That’s why, at James, we try to always question our thinking, and use data to make informed decisions.

If this is something you also strive for, don’t forget to subscribe to James Tech Blog and be a part of this exciting journey!

References

Beckmann, Ebecken, F. and Pires de Lima, D. (2015) A KNN Undersampling Approach for Data Balancing

Brown, I., Mues, C. (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications 39(3) 3446–3453

Crone, S. and Finlay, S. (2012) Instance sampling in credit scoring: An empirical study of sample size and balancing

Drummond, C. and Holte, R. C. (2003) Cost-sensitive Classifier Evaluation using Cost Curves

Elkan, C. (2001) The Foundations of Cost-Sensitive Learning

More, A. (2016) Survey of resampling techniques for improving classification performance in unbalanced datasets

Vicente Garcia, Ana I. Marques, and J. Salvador Sanchez (2012) Improving Risk Predictions by Preprocessing Imbalanced Credit Data

Wilson, D.L. (1972) Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2, 408–421
