EG-SMOTE: Enhanced Geometric SMOTE Resampling Algorithm
Data imbalance exists when the data points in one class far outnumber those in the other classes. Consider, for example, a dataset of around 1,000 transactions in which only 10% are fraudulent. This kind of imbalance is now common in machine learning problems, and it can significantly reduce the accuracy of a model. Most models perform better when the dataset is evenly distributed, since they are designed to minimize error and maximize accuracy. On an imbalanced dataset, a model may achieve high accuracy simply by predicting the majority class, while failing to predict the minority classes.
Resampling is a widely used approach for coping with highly imbalanced datasets. It can be categorized into over-sampling and under-sampling. Over-sampling adds new samples to the minority class, while under-sampling removes samples from the majority class. Despite their advantages, these methods have drawbacks of their own: excessive over-sampling leads to over-fitting and the creation of noisy samples, while excessive under-sampling leads to under-fitting by discarding important information.
Techniques like SMOTE, Borderline-SMOTE, and G-SMOTE are used to handle data imbalance. EG-SMOTE is an extension of the G-SMOTE algorithm designed to handle imbalanced data more effectively, helping the model perform better than with the other algorithms.
I’ll try to explain the key concepts behind EG-SMOTE in this article. If you are new to resampling algorithms, it is worth getting a basic idea of SMOTE first.
EG-SMOTE
G-SMOTE defines a geometric region around each selected minority sample in which new samples are generated; you can refer to the G-SMOTE paper to explore it further. EG-SMOTE introduces some specific modifications to the G-SMOTE algorithm.
Let’s go through the algorithm step by step to understand the basic concepts behind EG-SMOTE.
The main parameters of the algorithm are S(maj) and S(min), the sets of majority and minority samples in the dataset; the remaining parameters are explained in the following sections. The algorithm is explained here in six steps.
β, α(trunc) (the truncation factor) and α(def) (the deformation factor) are defined initially. The algorithm then determines the number of samples to be generated using equation (1). To reduce excessive generation of new samples skewed toward a specific cluster, EG-SMOTE defines an upper limit on the number of samples that can be generated from each cluster of minority samples (equation (2)). For a detailed explanation of the symbols used, refer to the original paper.
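As a rough illustration only, here is a minimal Python sketch of how equations (1) and (2) are commonly formulated in SMOTE-family methods, where β ∈ [0, 1] controls how close the classes get to a 1:1 balance. The function names, the β-based formula, and the even split across clusters are my assumptions; refer to the paper for the exact equations.

```python
# Minimal sketch of equations (1) and (2), assuming the common SMOTE-family
# convention; see the original paper for the exact forms.

def n_samples_to_generate(n_maj, n_min, beta=1.0):
    # Equation (1)-style rule: beta = 1 aims for a fully balanced dataset,
    # smaller beta generates proportionally fewer synthetic samples.
    return int((n_maj - n_min) * beta)

def per_cluster_limit(n_to_generate, n_clusters):
    # Equation (2)-style cap: spread generation across the K-Means clusters
    # of minority samples so that no single cluster is over-sampled.
    return n_to_generate // n_clusters

G = n_samples_to_generate(n_maj=900, n_min=100, beta=1.0)  # -> 800 new samples
cap = per_cluster_limit(G, n_clusters=8)                   # -> at most 100 each
```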
EG-SMOTE handles each minority sample differently, based on the category of the selected sample. The strategy is to consider the ratio of majority to minority neighbors among the k nearest neighbors of the sample. Let m be the number of majority samples among those k nearest neighbors, and consider the following conditions (a small sketch of this case selection follows the list):
Case 1: If m=k, the sample is treated as a noisy point and the algorithm does not synthesize new samples from it.
Case 2: If 3k/4<m<k, the sample is still noisy, but the algorithm defines a safe geometric region (away from the majority samples) in which new samples can be created.
Case 3: If m=0, it is safe to create new samples around it.
Case 4: If 0<m≤3k/4, it is safe to create new samples within a geometric region defined away from the majority clusters.
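To make this case selection concrete, here is a hedged Python sketch using scikit-learn's NearestNeighbors. The function name and the category labels are illustrative, not taken from the paper.

```python
# Sketch of the case selection above: count majority neighbours (m) among
# the k nearest neighbours of each minority sample, then categorize it.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def categorize_minority(X, y, minority_label, k=5):
    # k+1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    categories = {}
    for i in np.where(y == minority_label)[0]:
        _, idx = nn.kneighbors(X[i].reshape(1, -1))
        neighbors = idx[0][1:]                       # drop the sample itself
        m = np.sum(y[neighbors] != minority_label)   # majority neighbours
        if m == k:
            categories[i] = "case 1: noisy, skip"
        elif m > 3 * k / 4:
            categories[i] = "case 2: noisy, safe region"
        elif m == 0:
            categories[i] = "case 3: safe"
        else:
            categories[i] = "case 4: safe, geometric region"
    return categories
```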
EG-SMOTE addresses the following issues that commonly occur in resampling algorithms.
1. Handling the generation of noisy samples
When a randomly selected minority point lies among the majority samples, resampling around it would introduce additional noisy points into the data. Therefore, EG-SMOTE does not resample such points, based on the conditions explained in step 6.
2. Handling the generation of samples between minority and majority clusters
When the selected sample lies among both minority and majority samples, EG-SMOTE defines a hypersphere within which the new samples can be created, rather than introducing noisy points, as sketched below.
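Here is a simplified sketch of one way such a "safe" hypersphere can be realized: the radius around the selected minority sample is capped by the distance to its nearest majority sample, so generated points stay away from the majority class. This omits G-SMOTE's truncation and deformation (α(trunc), α(def)) and is an assumption-laden simplification, not the paper's exact procedure.

```python
# Generate a point uniformly inside a hypersphere around a minority sample,
# with radius limited by the nearest majority sample (simplified sketch).
import numpy as np

def sample_in_safe_sphere(x_center, X_majority, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Radius: distance to the nearest majority sample, so the new point
    # cannot land inside the majority region.
    radius = np.min(np.linalg.norm(X_majority - x_center, axis=1))
    direction = rng.normal(size=x_center.shape)
    direction /= np.linalg.norm(direction)             # uniform on unit sphere
    r = radius * rng.uniform() ** (1 / x_center.size)  # uniform inside the ball
    return x_center + r * direction
```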
3. Reducing overfitting
Excessive generation of new samples around the same region would introduce additional noise into the data. Therefore, the algorithm uses K-Means clustering to group the minority samples and defines an upper limit on the number of samples that can be created within each region/cluster, as sketched below.
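A hedged sketch of how this cap could be enforced during generation, assuming scikit-learn's KMeans; the function name and the exact stopping rule are illustrative, not the paper's.

```python
# Cluster the minority class, then draw seed points for resampling while
# skipping clusters that have already reached their cap (equation (2)).
import numpy as np
from sklearn.cluster import KMeans

def pick_seed_samples(X_min, n_clusters, per_cluster_cap, n_total, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    counts = np.zeros(n_clusters, dtype=int)
    seeds = []
    # Never ask for more seeds than the caps allow, to avoid an endless loop.
    budget = min(n_total, n_clusters * per_cluster_cap)
    while len(seeds) < budget:
        i = rng.integers(len(X_min))
        c = labels[i]
        if counts[c] < per_cluster_cap:
            counts[c] += 1
            seeds.append(i)
    return seeds
```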
This article gave a brief summary of the algorithm and highlighted some of its key features. You can refer to the paper for a more technical treatment and for its application to anomaly detection, and you can refer to this GitHub repository to view the actual implementation and the experiments.
References
- EG-SMOTE: Minority Resampling Boosted Unsupervised Learning With Hyper-dimensional Computing for Threat Detection at the Edge of Internet of Things. https://ieeexplore.ieee.org/abstract/document/9530655