How to Handle the Class Imbalance Problem
Some tips and machine learning methods to deal with imbalanced datasets
The class imbalance problem affects the quality and reliability of results in machine learning tasks and, for this reason, it should be handled with specific techniques and quality measures.
Interest in this area has grown in recent years as machine learning is applied in more and more fields. In the literature and in real applications, the class imbalance problem is handled with three main approaches:
- Data preprocessing approach
- Algorithmic approach
- Feature selection approach
But first of all… what is class imbalance, and when does it become a problem?
Imbalanced datasets are characterized by a rare class, which represents a small portion of the entire population (1 out of 1000, 1 out of 10000, or even less). Class imbalance can be intrinsic to the problem, which is imbalanced by its own nature, or it can be caused by limits on data collection, for economic or privacy reasons.
The minority class is scarce, and so are examples of its characteristics and patterns, yet this information is extremely important for the trained model to tell the few rare samples apart from the crowd. Standard classification algorithms that don’t take the class distribution into account are overwhelmed by the large class and ignore or misclassify the minority one: there aren’t enough examples to recognize the patterns and properties of the rare class. Such models act like Zero Rule algorithms, because their output is simply the most frequent class in the dataset.
If your goal is to detect rare but important diseases in a medical dataset, spam emails, fraudulent behavior, or rare categories in text classification, algorithms that behave like Zero Rule can’t be considered a solution.
Don’t slip on the evaluation metrics!
Suppose we train a machine learning model to tell non-spam emails from spam emails. The dataset is composed of 44 emails: 40 non-spam and 4 spam. The model is a standard algorithm that doesn’t take the class distribution into account. The result achieved is the following:
The model obtains 90.9% accuracy (40 of the 44 emails are classified correctly). Great result! But is it a good model? Obviously not! The model acts like a Zero Rule model: only the majority class is found, while the rare class, which is the more interesting one, is ignored.
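To make the arithmetic concrete, here is a minimal sketch that reproduces the example with a Zero Rule predictor. The label encoding (0 for non-spam, 1 for spam) and the variable names are assumptions made for illustration.

```python
# Labels from the example: 40 non-spam (0) and 4 spam (1) emails.
y_true = [0] * 40 + [1] * 4

# A Zero Rule model always predicts the most frequent class.
majority_class = max(set(y_true), key=y_true.count)
y_pred = [majority_class] * len(y_true)

# Accuracy: share of all emails classified correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the rare class: share of spam emails that were actually found.
spam_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
spam_recall = spam_found / y_true.count(1)

print(f"accuracy: {accuracy:.1%}")        # accuracy: 90.9%
print(f"spam recall: {spam_recall:.1%}")  # spam recall: 0.0%
```

The accuracy looks excellent even though not a single spam email is caught.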
Accuracy treats every record as equally important, so it is dominated by the majority class; that’s why it can’t be used as a measure of goodness for models working on imbalanced datasets.
Other metrics are necessary, such as:
- F1 Measure
- ROC curve and AUC
Which one is better? There isn’t a single best metric. It depends on many factors, such as the goal, the context, and the cost function: is it better to correctly classify one more unit of the rare class at the price of more False Positive errors (classifying non-spam emails as spam), or to misclassify some units of the rare class while reducing False Positive errors?
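As a rough illustration of the metrics listed above, the sketch below computes precision, recall, and F1 from hard predictions, and ROC AUC from scores using its rank interpretation (the probability that a random positive is scored above a random negative). The function names and the predictions in `y_pred` are hypothetical, chosen only to show the trade-off on the 44-email example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores, positive=1):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0] * 40 + [1] * 4
# Hypothetical model: 2 non-spam emails flagged as spam, 3 of 4 spam emails caught.
y_pred = [1, 1] + [0] * 38 + [1, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.60 recall=0.75 f1=0.67

# A Zero Rule model gets F1 = 0 and, with a constant score, AUC = 0.5.
zero_rule_f1 = precision_recall_f1(y_true, [0] * 44)[2]
zero_rule_auc = roc_auc(y_true, [0.0] * 44)
```

Unlike accuracy, both F1 and AUC expose the Zero Rule model as uninformative.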
Resampling is a preprocessing method, so it’s applied before training the learning model. It aims to change the class distribution in order to approach the optimal one. There are two types of resampling:
📌Natural Resampling: the main goal is to collect more data for the minority class. It’s simple in principle, but it’s not always possible. In datasets where the imbalance is part of the problem’s own nature, it’s hard to collect as much data as for the majority class.
📌Artificial Resampling: it can be accomplished by:
- undersampling, by reducing data of the majority class
- oversampling, by replicating the minority class
- SMOTE (Synthetic Minority Oversampling TEchnique), which creates synthetic data points by interpolating between each minority sample and its nearest minority-class neighbours.
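The SMOTE idea can be sketched in a few lines. This is a simplified illustration, not a reference implementation: it assumes numeric features and Euclidean distance, and the function name `smote` and its parameters `n_new`, `k`, and `seed` are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (x itself is excluded).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote(minority, n_new=6)
print(len(synthetic))  # 6
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.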
Random sampling is the easiest way to apply undersampling and oversampling. The final class distribution obtained by resampling can be fixed to a fully balanced distribution or parameterized to any ratio.
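A minimal sketch of random undersampling and oversampling follows. It assumes `ratio` means the desired minority-to-majority size ratio (1.0 is fully balanced); the function names and that parameter are illustrative choices, not taken from a specific library.

```python
import random


def random_undersample(majority, minority, ratio=1.0, seed=0):
    """Drop random majority records until minority/majority is about `ratio`."""
    rng = random.Random(seed)
    n_keep = min(len(majority), round(len(minority) / ratio))
    return rng.sample(majority, n_keep), list(minority)


def random_oversample(majority, minority, ratio=1.0, seed=0):
    """Replicate random minority records until minority/majority is about `ratio`."""
    rng = random.Random(seed)
    n_target = max(len(minority), round(len(majority) * ratio))
    extra = rng.choices(minority, k=n_target - len(minority))
    return list(majority), list(minority) + extra


majority = list(range(40))       # stand-ins for 40 non-spam records
minority = list(range(100, 104)) # stand-ins for 4 spam records

under_major, under_minor = random_undersample(majority, minority)        # balanced
over_major, over_minor = random_oversample(majority, minority, ratio=0.5)
print(len(under_major), len(under_minor))  # 4 4
print(len(over_major), len(over_minor))    # 40 20
```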
Undersampling and oversampling are easy to use in real applications, but they have some cons:
- Undersampling can cut out important and valuable information from the dataset. To overcome this loss of information, a cluster-based undersampling approach can be applied. It uses the k-means algorithm: the clusters it creates have low variance within each cluster and high variance between clusters, so the records inside a cluster share similar characteristics. The majority class is undersampled by keeping only the centroids of the clusters.
- Oversampling, on the other hand, can lead to overfitting. It has been shown that adjusting the class distribution to the optimal one can drastically improve performance, but finding the best distribution is really difficult. Some datasets respond best to a fully balanced class distribution, while others perform better with a distribution that is only less skewed. The researcher has to find the best solution by trial and error and some heuristics.
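The cluster-based undersampling described above can be sketched with a plain k-means loop (Lloyd's algorithm). This is an illustrative sketch under assumptions: numeric features, Euclidean distance, a fixed number of iterations, and the number of clusters set equal to the minority class size. The function name `kmeans_centroids` and the synthetic data are hypothetical.

```python
import math
import random


def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Run a basic k-means and return the k cluster centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random records
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids


rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority_size = 4

# The 40 majority records are replaced by 4 centroids, one per cluster.
reduced_majority = kmeans_centroids(majority, k=minority_size)
print(len(reduced_majority))  # 4
```

Note that centroids are averages, so the reduced majority class consists of representative points rather than original records.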
The algorithmic approach offers another solution to the class imbalance problem.
📌 Cost-sensitive learning method. It builds misclassification costs into the learning algorithm. The main goal is to minimize a loss function in which errors on the minority class cost more, which encourages the algorithm to favor the minority class.
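One common way to set the costs, shown here only as a sketch, is to weight each class inversely to its frequency and plug the weights into the loss. The weighting formula n_samples / (n_classes * class_count) is an assumption borrowed from the usual "balanced" heuristic; the function names and the constant prediction of 0.05 are hypothetical.

```python
import math


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes = sorted(set(y))
    return {c: len(y) / (len(classes) * y.count(c)) for c in classes}


def weighted_log_loss(y_true, p_pos, weights):
    """Binary cross-entropy with each record scaled by the weight of its true class."""
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        total += -weights[t] * (math.log(p) if t == 1 else math.log(1 - p))
    return total / sum(weights[t] for t in y_true)


y_true = [0] * 40 + [1] * 4
weights = balanced_class_weights(y_true)  # roughly {0: 0.55, 1: 5.5}

# A near Zero Rule model: every email gets a 5% spam probability.
p_pos = [0.05] * len(y_true)
plain_loss = weighted_log_loss(y_true, p_pos, {0: 1.0, 1: 1.0})
cost_sensitive_loss = weighted_log_loss(y_true, p_pos, weights)
print(cost_sensitive_loss > plain_loss)  # True
```

With the weights in place, ignoring the minority class is penalized much more heavily, so a learner minimizing this loss is pushed away from the Zero Rule behavior.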
📌One-class classification. It differs from other classification methods because the classification target is changed to a single class. It takes into account only one class, the new target, and treats records of the other classes as outliers. This method is generally used to single out errors or unwanted objects, but it can also deal with imbalanced datasets.
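The one-class idea can be illustrated with a deliberately simple stand-in: describe the target class by the mean and standard deviation of each feature, and flag anything too far away as an outlier. This is not a specific published one-class algorithm; the z-score distance, the threshold of 3.0, and the function names are all assumptions made for this sketch.

```python
import math
import statistics


def fit_one_class(target_samples):
    """Learn per-feature mean and standard deviation from the target class only."""
    columns = list(zip(*target_samples))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    return means, stds


def is_outlier(x, model, threshold=3.0):
    """Flag x when its standardized distance from the target class exceeds the threshold."""
    means, stds = model
    z = math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)))
    return z > threshold


# Only records of the target class are used for training.
target = [(1.0, 2.0), (1.1, 1.9), (0.9, 2.1), (1.0, 2.0), (1.05, 1.95)]
model = fit_one_class(target)

print(is_outlier((1.0, 2.0), model))   # False
print(is_outlier((5.0, -3.0), model))  # True
```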
📌Ensemble method. It combines the predictions of different classifiers. Each individual classifier is trained on a random subset of the data. There are two main resampling methods, bagging and boosting.
- Bagging trains each classifier on a different bootstrap of the dataset. A bootstrap is a random sample of N records drawn with replacement, so the same record can appear more than once. Once the models are trained, the final output is the majority vote of the classifiers.
- Boosting trains the classifiers in sequence and focuses each one on the records that are hardest to predict. The first classifier is trained on a random sample of the dataset. The records it misclassifies get a higher chance of being drawn for the next sample, so the class with more misclassified records becomes more represented in it. And so on.
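The bagging procedure can be sketched as follows. The base learner here is a nearest-centroid classifier, chosen only because it fits in a few lines; it, the function names, and the toy data are assumptions for illustration.

```python
import math
import random
from collections import Counter


def train_nearest_centroid(X, y):
    """Hypothetical base learner: store one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = tuple(sum(c) / len(rows) for c in zip(*rows))
    return centroids


def predict_one(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def bagging_fit(X, y, n_models=11, seed=0):
    """Train each model on a bootstrap: N draws with replacement from the data."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_nearest_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Final output: majority vote over the individual classifiers."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: 40 majority records near the origin, 4 minority records near (5, 5).
X = [(i % 8 * 0.1, i // 8 * 0.1) for i in range(40)]
X += [(5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (5.2, 5.2)]
y = [0] * 40 + [1] * 4

models = bagging_fit(X, y)
print(bagging_predict(models, (0.2, 0.2)), bagging_predict(models, (5.0, 5.0)))
```

With a rare class, some bootstraps may contain no minority record at all; the vote across many models is what keeps the ensemble stable.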
Boosting generally reaches better performance than bagging but, on the other hand, its performance is not always better than that of a single classifier.
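To show how boosting shifts attention to hard records, here is a sketch of a single reweighting round in the style of AdaBoost. AdaBoost is named here as one well-known boosting variant, not necessarily the one the text has in mind; the function name `reweight` is hypothetical, and the sketch assumes a binary problem.

```python
import math


def reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: raise the weight of misclassified records."""
    # Weighted error of the current classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)  # say of this classifier in the final vote
    new = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(new)
    return [w / total for w in new], alpha


y_true = [0] * 40 + [1] * 4
y_pred = [0] * 44                      # first classifier behaves like Zero Rule
weights = [1 / 44] * 44                # start with uniform weights

new_weights, alpha = reweight(weights, y_true, y_pred)
spam_share = sum(new_weights[40:])     # total weight now carried by the 4 spam emails
print(f"{spam_share:.2f}")             # 0.50
```

After one round the four missed spam emails carry half of the total weight, so the next classifier can no longer afford to ignore them.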
Feature Selection method
High dimensionality in a dataset can badly affect the model’s performance, so it becomes important to select the most valuable features and exclude the noisy ones. The main types of feature selection methods are:
📌metric, it ranks features by their individual contribution to classifying the records.
📌wrapper, it trains the learning model on subsets of features. The model’s performance is used to choose the most useful subset. The model has to be trained many times, once per candidate subset, which can be a problem if the dataset has high dimensionality.
📌embedded, it chooses the best subset of features as part of training the model itself. The recursive feature elimination algorithm in SVM is an example.
📌RELIEF, it is an algorithmic approach. It uses the nearest neighbour method to find the features whose values best set an instance apart from nearby points of a different class.
Embedded and wrapper methods take into account the interactions among features, but with high-dimensional datasets the run-time increases drastically. The metric method is the best suited to high-dimensional datasets, even though it doesn’t capture the relations among features, only their individual contribution.
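A metric-style ranking can be sketched by scoring every feature on its own. The score used here, the gap between class means divided by the sum of class standard deviations, is just one possible choice and is an assumption of this example, as are the function name `rank_features` and the toy data.

```python
import statistics


def rank_features(X, y):
    """Rank features by |mean_pos - mean_neg| / (std_pos + std_neg), best first."""
    scores = []
    for j, column in enumerate(zip(*X)):
        pos = [v for v, t in zip(column, y) if t == 1]
        neg = [v for v, t in zip(column, y) if t == 0]
        gap = abs(statistics.fmean(pos) - statistics.fmean(neg))
        spread = statistics.pstdev(pos) + statistics.pstdev(neg)
        score = gap / spread if spread > 0 else gap
        scores.append((score, j))
    return sorted(scores, reverse=True)


# Feature 0 separates the classes; feature 1 is noise with equal class means.
X = [(0.1, 5), (0.2, 3), (0.0, 4), (0.15, 6), (2.0, 4), (2.1, 5)]
y = [0, 0, 0, 0, 1, 1]

ranking = rank_features(X, y)
best_feature = ranking[0][1]
print(best_feature)  # 0
```

Each feature is scored in a single pass over the data, which is why this family scales well to high-dimensional datasets, at the price of ignoring interactions among features.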
In this post, we have seen some tips and methods to handle imbalanced datasets. Here is a brief recap:
- Standard accuracy can’t be the measure of goodness; other metrics are recommended.
- The choice of the right metric depends on the goal and on the cost of misclassified records.
- The methods outlined fall into three main groups: algorithmic, data preprocessing, and feature selection. All of them have pros and cons and an optimal approach does not exist yet, so researchers should find the method that best fits the goal and the characteristics of the dataset.