Strategies for Imbalanced Classes
Many real-world problems involve imbalanced classes, but most classification algorithms don’t handle them well out of the box. Sometimes they may even completely fail to predict any instances of the less common class.
So how can we approach these problems?
1. Use a custom threshold
One of the simplest strategies is to use a custom probability threshold for the under-represented class. For binary models the default is typically 0.5, but this is rarely an appropriate threshold for rare classes. A Precision-Recall Curve makes it easy to set a custom threshold: precision and recall are calculated at a range of probability thresholds, and you can then select the threshold that gives acceptable metric values for the target class, or that meets some other business criterion (such as flagging the top X% of predictions).
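For illustration, here is a minimal sketch using scikit-learn's precision_recall_curve to pick a threshold; the toy dataset and the 80% recall target are placeholder assumptions, so substitute your own model and business criteria.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (5% positive class) for demonstration purposes
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the rare class

precision, recall, thresholds = precision_recall_curve(y_test, scores)

# Assumed business rule: the highest threshold that still achieves 80% recall
# (recall[:-1] aligns elementwise with thresholds)
eligible = thresholds[recall[:-1] >= 0.80]
threshold = eligible.max() if eligible.size else 0.5

y_pred = (scores >= threshold).astype(int)  # replaces the default 0.5 cutoff
```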
All of the techniques that follow are frequently used in combination with custom thresholds.
2. Oversample the smaller class(es)
Increasing the frequency of the smaller class in the data set is another simple strategy. There are a variety of techniques available for oversampling, but two key ones to understand are:
- Random oversampling: a naive approach in which random instances of the smaller class are duplicated. This can lead to overfitting.
- Synthetic Minority Oversampling Technique (SMOTE): new data points are created by interpolating between existing ones. The algorithm takes a minority-class point and one of its nearest neighbors, and creates a new point on the line segment between them. Disadvantages of SMOTE include the fact that the neighbors are chosen blindly, without regard to the majority class, so the synthetic points can increase the noise in the data.
Several variations of SMOTE exist based on KMeans, SVM, the adaptive synthetic (ADASYN) algorithm, and other advanced strategies. For Python, they are available in the imbalanced-learn library.
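As a quick illustration, here is how random oversampling and SMOTE can be applied with imbalanced-learn (the toy dataset is assumed for demonstration):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # the majority class dominates

# Naive duplication of randomly chosen minority instances
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: interpolate between minority points and their k nearest neighbors
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_sm))  # classes are now balanced
```

Note that resampling should be applied only to the training split; oversampling before the train/test split leaks copies (or near-copies) of test points into the training data.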
3. Undersample the larger class(es)
As with oversampling, there are a few basic approaches to undersampling the classes that have more data:
- Random undersampling: random instances are dropped from the larger class, either with replacement (bootstrapping) or without. This approach risks losing information.
- NearMiss (keep approach): uses nearest neighbors to keep the majority-class points that are closest to the underrepresented class instances and are thus harder to learn.
- Tomek Links (delete approach): identifies cross-class pairs of points that are each other's nearest neighbors and removes the one from the larger class.
- One-Sided Selection (OSS): combines keep and delete approaches to remove ambiguous examples as well as ones that are too far away to be useful.
Machine Learning Mastery has a thorough overview of these specific techniques that I recommend if you want to learn more.
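The same imbalanced-learn library implements each of the undersampling approaches above; here is a brief sketch on toy data:

```python
from collections import Counter

from imblearn.under_sampling import (NearMiss, OneSidedSelection,
                                     RandomUnderSampler, TomekLinks)
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Random: drop majority instances at random (replacement=True to bootstrap)
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Keep: retain the majority points closest to the minority class
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)

# Delete: remove the majority member of each cross-class nearest-neighbor pair
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Combined keep + delete
X_oss, y_oss = OneSidedSelection(random_state=0).fit_resample(X, y)

print(Counter(y), Counter(y_rus), Counter(y_tl))
```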
4. Data augmentation and synthetic data
SMOTE is itself a synthetic data technique; however, there are other approaches worth highlighting that did not originate as oversampling algorithms. These include:
- If the distribution of the data is known or can easily be modeled, the distribution can be used to generate new examples (see the sketch after this list).
- Using GANs and variational autoencoders to generate data.
- The image augmentation techniques common in computer vision problems such as rotating images, adding noise, and cropping.
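As a minimal sketch of the first approach, assuming a multivariate normal happens to be a reasonable model for the minority class (a strong assumption in practice), you can fit its parameters and sample new points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the minority-class rows of your feature matrix, i.e. X[y == 1]
X_min = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(50, 2))

# Fit the mean and covariance of the assumed distribution...
mean = X_min.mean(axis=0)
cov = np.cov(X_min, rowvar=False)

# ...and draw as many synthetic minority examples as needed
X_synthetic = rng.multivariate_normal(mean, cov, size=200)
```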
5. Class weighting
Some classification algorithms support class weighting. This involves giving more weight to the underrepresented class when calculating the loss function (or split purity, for decision trees). Common weighting schemes include balanced weighting, in which each class's weight is proportional to the inverse of its frequency, or providing a custom {class: weight} mapping.
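In scikit-learn, for example, both schemes are set through the class_weight parameter (the 10x weight below is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Balanced weighting: each class is weighted by the inverse of its frequency,
# i.e. n_samples / (n_classes * class_count)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Custom {class: weight} mapping: errors on class 1 cost ten times more
forest = RandomForestClassifier(class_weight={0: 1, 1: 10}, random_state=0).fit(X, y)
```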
6. Use different metrics
Accuracy and ROC Curves are common classification metrics, but they can be misleading when used with imbalanced classes. For example, if a model predicts that 100% of the data belongs to the larger class, and the smaller class has only a 2% frequency, the model achieves 98% overall accuracy despite 0% recall on the smaller class.
The Area Under the Precision-Recall Curve (AUCPR) and the F1 score give better representations of model performance for imbalanced classes. AUCPR is more sensitive to false positives than the ROC curve's AUC, because precision penalizes every false positive directly, while the false positive rate is diluted by the large number of true negatives. The F1 score is the harmonic mean of precision and recall.
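A short comparison on a toy 2%-minority dataset, where average_precision_score serves as the standard scikit-learn summary of the precision-recall curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

print("AUCPR:  ", average_precision_score(y_test, scores))
print("ROC AUC:", roc_auc_score(y_test, scores))  # often looks deceptively good
print("F1:     ", f1_score(y_test, (scores >= 0.5).astype(int)))
```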
7. Use custom metrics
In some cases you may want to create a custom metric that weights the classes differently or combines metrics in a way that is appropriate to the specific use case. As an example, this Kaggle competition on predicting credit defaults ranks customers by their likelihood of defaulting using a custom metric that combines a normalized Gini coefficient with the recall in the top 4% of predictions (similar to the custom threshold approach mentioned earlier).
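Sketching what such a metric might look like: below is a simplified, unweighted blend of a normalized Gini coefficient and top-4% recall. The function names and the equal weighting are illustrative assumptions, not the competition's exact metric (which also reweights the negative class).

```python
import numpy as np

def top_k_recall(y_true, y_scores, k=0.04):
    """Fraction of all positives captured in the top k share of predictions."""
    order = np.argsort(y_scores)[::-1]  # highest scores first
    cutoff = int(len(y_true) * k)
    return y_true[order][:cutoff].sum() / y_true.sum()

def normalized_gini(y_true, y_scores):
    """Gini of the model's ranking divided by the Gini of a perfect ranking."""
    def gini(actual, pred):
        ordered = actual[np.argsort(pred)[::-1]]
        cum = np.cumsum(ordered) / ordered.sum()
        n = len(actual)
        return cum.sum() / n - (n + 1) / (2 * n)
    return gini(y_true, y_scores) / gini(y_true, y_true)

def combined_metric(y_true, y_scores):
    # Equal-weighted blend of overall ranking quality and recall among
    # the top 4% of predictions (an illustrative simplification)
    return 0.5 * (normalized_gini(y_true, y_scores) + top_k_recall(y_true, y_scores))

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.04).astype(float)
y_scores = 0.5 * y_true + 0.5 * rng.random(1000)  # noisy but informative scores
print(combined_metric(y_true, y_scores))
```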
Conclusion
As with machine learning algorithms in general, there is no single best strategy for imbalanced data; each has its advantages and disadvantages. You will have to experiment to determine which ones give the best results for your data set, and very often you will end up using a combination of them.
Imbalanced classes and other key concepts for working with models in real-world settings are covered in my Machine Learning Flashcards: Modeling Core Concepts deck. Check it out on Etsy!