Using Over-Sampling Techniques for Extremely Imbalanced Data
In the previous post “Using Under-Sampling Techniques for Extremely Imbalanced Data”, I described several under-sampling techniques for dealing with extremely imbalanced data. In this post, I describe over-sampling techniques that address the same issue.
I have written articles on a variety of data science topics. For ease of use, you can bookmark my summary post “Dataman Learning Paths — Build Your Skills, Drive Your Career” which lists the links to all articles.
Oversampling increases the weight of the minority class by replicating minority class examples. Because it adds no new information, it raises the risk of over-fitting, which makes the model too specific to the training data. It may well be the case that accuracy on the training set is high, yet performance on new data is worse.
(1) Random oversampling for the minority class
Random oversampling simply replicates minority class examples at random until the classes are more balanced. It is known to increase the likelihood of overfitting, since the model sees exact copies of the same observations. In contrast, the major drawback of random undersampling is that it can discard useful data.
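As a quick illustration, here is a minimal sketch of random oversampling using the imbalanced-learn library's RandomOverSampler. The synthetic dataset, class ratio, and variable names are illustrative choices, not part of any particular project.

```python
# A minimal sketch of random oversampling with imbalanced-learn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Create an illustrative, highly imbalanced binary dataset (~5% minority class).
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.95, 0.05],
    random_state=42,
)
print("Before oversampling:", Counter(y))

# RandomOverSampler duplicates minority-class examples at random
# until both classes have the same number of observations.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print("After oversampling:", Counter(y_resampled))
```

Because the resampled set contains exact duplicates of minority examples, any evaluation should be done on a hold-out set drawn before resampling; otherwise the overfitting described above can go unnoticed.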
(2) Synthetic Minority Oversampling Technique (SMOTE)