Treating Imbalanced Data for Model Building
1. Introduction
In machine learning, class imbalance refers to a disparity between the class frequencies (e.g. 1 and 0 in a binary classification problem) in the target variable. The problem of class imbalance is not rare in classification: it is common in credit card fraud detection, anti-money laundering, e-wallet fraud, disease detection, spam filtering, and similar domains. In most of these scenarios, the defective class (fraudulent transactions, spam emails, etc.), represented as ‘1’ in the target variable, is the minority class. Although there is no general cut-off rule, if the minority class makes up less than 5% of the data and the majority class more than 95%, the problem can be considered an imbalanced classification problem. Studies show that balancing the data before model building can boost the performance of the model on unseen data. In this article, some of the re-sampling techniques used to tackle the imbalance problem are discussed in detail.
2. Methods to Treat Imbalanced Data
There are many ways to treat class imbalance, but some methods are more effective than others. This article discusses the common and effective re-sampling techniques in detail in the sections below.
2.1 Random Under-Sampling
One way to balance the classes is to randomly remove some of the samples from the majority class. Say we have 1,000 bank transactions, 10 of which are fraudulent. This dataset has a very high class imbalance of 99:1. Using random under-sampling, we can randomly keep 1 out of every 10 majority-class samples while preserving all the minority data points. This yields a final dataset of 99 samples from the majority class and 10 samples from the minority (fraud) class. There is still an imbalance in the data, but it is sharply reduced from 99:1 to roughly 10:1. The method can be applied in Python with a few lines of code, given below.
from imblearn.under_sampling import RandomUnderSampler
# Option 1: under-sample the majority class down to the minority class size
undersample = RandomUnderSampler(sampling_strategy='majority')
# Option 2: under-sample to a target minority:majority ratio (here 1:2)
undersample = RandomUnderSampler(sampling_strategy=0.5)
X_train, y_train = undersample.fit_resample(X_train, y_train)
Although this method helps treat the disparity in the data, it also leads to information loss. In the above example, we end up removing 90% of the majority-class data. This is especially problematic when the dataset is small, which motivates an approach that treats imbalance without discarding information.
2.2 Random Over-Sampling
In random over-sampling, the instances of the minority class are duplicated n times to make their count comparable to the majority class. When the resampled data is fed into the model, the model gives more weight to the minority class than it would without over-sampling. This method can work well when you do not have enough samples from the minority class.
This technique doesn’t result in any information loss, as no samples are eliminated. One of its disadvantages is that it can lead to overfitting, as the model can become biased towards the minority class. Also, since the data points are simply duplicated, the method adds no new information for the model. This motivates the more principled over-sampling technique discussed in the next section. The Python implementation of the method is given below.
from imblearn.over_sampling import RandomOverSampler
# Option 1: over-sample the minority class up to the majority class size
oversample = RandomOverSampler(sampling_strategy='minority')
# Option 2: over-sample to a target minority:majority ratio (here 1:2)
oversample = RandomOverSampler(sampling_strategy=0.5)
X_train, y_train = oversample.fit_resample(X_train, y_train)
2.3 Synthetic Minority Over-Sampling Technique (SMOTE)
Chawla et al. (2002) introduced the synthetic minority over-sampling technique (SMOTE). Since then, SMOTE has been widely used by ML practitioners with effective results. In SMOTE, new synthetic samples are created for the minority class using k-nearest neighbors. For any minority point p, its k nearest minority neighbors are identified. One of these neighbors is chosen at random, and a new synthetic point is created at a random position on the line segment between p and the chosen neighbor. The process repeats until the target class is balanced. The value of k can be specified as a hyperparameter of the method.
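The interpolation step can be sketched from scratch to make it concrete. The function below is illustrative only (its name and structure are not from any library; for real work, use imblearn's SMOTE) and implements the single-point generation described above:

```python
# A from-scratch sketch of SMOTE's core interpolation step.
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic point from a minority-class array."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1. Pick a random minority point p
    p = minority[rng.integers(len(minority))]
    # 2. Find its k nearest minority neighbours (brute-force distances;
    #    index 0 is p itself, so it is skipped)
    dists = np.linalg.norm(minority - p, axis=1)
    neighbours = minority[np.argsort(dists)[1:k + 1]]
    # 3. Choose one neighbour and interpolate at a random fraction
    #    along the line segment between p and that neighbour
    nn = neighbours[rng.integers(len(neighbours))]
    return p + rng.random() * (nn - p)

minority = np.random.default_rng(1).random((20, 2))
new_point = smote_sample(minority)
# new_point lies on the segment between p and one of its 5 neighbours
```

Because each synthetic point is a convex combination of two existing minority points, it always falls inside the region already occupied by the minority class.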
The table below, taken from the original study, shows that SMOTE improves the AUC across various datasets. The labels 50, 100, 200, 300, 400, and 500 SMOTE refer to the degree of over-sampling applied (for example, 100% over-sampling). The bold AUCs indicate the best performance for each dataset. It can be inferred that applying SMOTE on top of under-sampling boosts the AUC for almost all the datasets.
The method can be implemented in Python using the imblearn module. The sample code below illustrates the Python implementation of SMOTE.
from imblearn.over_sampling import SMOTE
smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)
Please note that SMOTE should be applied only to the training dataset, never to the test (unseen) dataset. In addition, SMOTE should be applied last, after all the pre-processing steps such as data cleaning, treating collinearity and multicollinearity, scaling, normalization, etc.
3. Evaluation metrics
It is crucial to choose the right evaluation metrics while working with imbalanced datasets. For instance, in the example discussed in section 2.1, the class imbalance is 99:1. So if accuracy is chosen as the evaluation metric, a model that classifies every sample as non-fraudulent (0) still achieves 99% accuracy, a figure that completely hides the severity of the problem. Metrics like precision, recall, F-score, sensitivity, and specificity give a more honest picture of performance on such problems.
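The 99% accuracy trap can be demonstrated directly. The snippet below, assuming scikit-learn is installed, scores a deliberately useless "model" that predicts 0 for everything:

```python
# Why accuracy misleads at 99:1: an all-zero classifier looks excellent.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = np.array([0] * 990 + [1] * 10)   # 99:1 imbalance
y_pred = np.zeros(1000, dtype=int)        # predict "non-fraud" always

print(accuracy_score(y_true, y_pred))               # 0.99 -> looks great
print(recall_score(y_true, y_pred))                 # 0.0  -> catches no fraud
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Recall and precision immediately expose what accuracy hides: not a single fraudulent transaction is detected.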
4. Conclusion
Three sampling techniques for treating imbalanced datasets are discussed in detail: majority-class under-sampling, minority-class over-sampling, and the synthetic minority over-sampling technique (SMOTE). The limitations of each method and the motivation for the next are covered.
SMOTE outperforms the other methods, as it does not lead to the problems of information loss and over-fitting. SMOTE has been applied to many ML systems by researchers and has been seen to improve performance.
References
Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., (2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp.321–357.
Jayawardena, N., (2020) How to Deal with Imbalanced Data: A Step-by-Step Guide. Towards Data Science. [online] Available at: https://towardsdatascience.com/how-to-deal-with-imbalanced-data-34ab7db9b100 [Accessed 23 Oct. 2021].
Upasana, (2017) Class Imbalance | Handling Imbalanced Data Using Python. [online] Available at: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-data-classification/ [Accessed 23 Oct. 2021].