Effective strategies for dealing with imbalanced datasets

Daniele Santiago
5 min read · Jun 1, 2023


Data imbalance is a critical problem in machine learning and data mining. It arises when one class has significantly fewer instances than another, which can impair the performance of most machine learning algorithms, since they are designed to work with balanced datasets.

The goal of this article is to clarify how this problem occurs and present approaches to address it. By addressing this problem, we can improve the generalization capability of machine learning models and make them more practical.

How does data become imbalanced, and what are the consequences?

Data imbalance can be an inherent characteristic of some problems, such as credit card fraud analysis, where there is expected to be a larger number of legitimate transactions than fraudulent ones. However, in other cases, the inequality in data distribution may be related to issues in data collection, such as cost, privacy, and other factors.

Regardless of its cause, imbalance can introduce significant challenges in the training and evaluation of machine learning models. For example, overall accuracy is dominated by the majority class, so a model can score well while systematically failing on the minority class, leading to misleadingly optimistic conclusions.
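To make this concrete, here is a minimal sketch on made-up data (990 legitimate and 10 fraudulent transactions, an assumed split): a baseline that always predicts the majority class reaches 99% accuracy while detecting none of the fraud.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant for this illustration

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # 0.99: looks excellent
print(recall_score(y, pred))    # 0.0: not a single fraud detected
```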

Additionally, machine learning algorithms like decision trees, discriminant analysis, and neural networks are designed to work with balanced datasets, which can result in a decision boundary biased towards the majority class and increase the probability of misclassifying instances from the minority classes (Nguyen et al., 2009).

Another problem related to imbalance is the impact of noise on minority classes, as machine learning algorithms may tend to treat these instances as noise and thus underestimate or discard them during model training. Moreover, the size of the dataset is an important factor to consider when building a good classifier, as the lack of examples can make it difficult to discover regularities in small classes. Generally, the more training data available, the less sensitive classifiers will be to differences between classes.

To address data imbalance, several approaches have been proposed, such as data resampling, adjusting weights, using algorithms specialized in handling imbalanced data, and other techniques. These approaches aim to improve the model’s quality by increasing the accuracy in classifying instances from minority classes and will be discussed next.

Approaches

Recognition-based

The recognition-based approach is an alternative solution to deal with data imbalance where the classifier is modeled on examples from a single class, typically the minority class, in the absence of examples from the majority class. This approach can be useful when the minority class is of special interest, such as in cases of fraud detection, anomaly detection, or rare medical diagnoses.

However, it is important to note that the recognition-based approach cannot be applied to many machine learning algorithms such as decision trees, Naive Bayes, and associative classification, as these classifiers are not built solely from samples of one class. Additionally, the recognition-based approach may lead to unsatisfactory performance in problems with imbalanced classes where the minority class is complex.
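As an illustration, scikit-learn's OneClassSVM is one learner that can be trained this way. The sketch below uses made-up two-dimensional data standing in for the single modeled class; at prediction time, +1 means the point resembles that class and -1 means it does not.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical data: X_single_class holds examples of the one modeled class
# (e.g., the minority class of interest); no other class is used for training.
rng = np.random.default_rng(0)
X_single_class = rng.normal(loc=5.0, scale=0.5, size=(50, 2))

# Fit the one-class model on that class alone
occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_single_class)

# +1: looks like the modeled class; -1: treated as "everything else"
X_new = np.array([[5.1, 4.9], [0.0, 0.0]])
print(occ.predict(X_new))  # expected: [ 1 -1 ]
```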

Cost-sensitive approach

Cost-sensitive learning is an approach that takes into account the cost of misclassification during the construction of a machine learning model. The aim is to produce a classifier that minimizes the total cost of classification, rather than simply maximizing overall accuracy. It leverages the fact that different kinds of misclassification carry different costs: a false negative (missing an instance of the positive, usually minority, class) is typically more costly than a false positive.

A common technique in cost-sensitive learning is the cost-sensitive decision tree. In this technique, each node is split based on the cost of misclassification rather than purely on the purity of the split. This can produce more balanced decision trees and better performance on the minority class.
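scikit-learn's decision tree does not accept a full cost matrix, but class weights are a common practical proxy: errors on the heavily weighted class count more in the impurity criterion used to choose splits. A minimal sketch on synthetic data follows, with the 10:1 cost ratio being an assumed value, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic imbalanced data: roughly 5% of instances belong to the minority class (label 1)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Misclassifying a minority instance is treated as 10x as costly as the reverse
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=42)
tree.fit(X_tr, y_tr)

print(classification_report(y_te, tree.predict(X_te)))
```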

However, it is important to note that cost-sensitive learning can lead to overfitting during training (Weiss, 2004), especially when the cost of misclassification is based on training data. Additionally, the proper definition of misclassification costs can be challenging and vary depending on the application context.

Sampling

Sampling is a commonly used technique to address data imbalance in machine learning problems. This technique aims to preprocess the training data to minimize the discrepancy between classes by modifying the distributions in the training set.

There are two broad families of sampling techniques: undersampling and oversampling. Undersampling selects a smaller subset of majority-class instances while preserving all minority-class instances. It is well suited to large-scale applications where the number of majority samples is very high. However, discarding data can remove informative majority-class instances and degrade classifier performance.

Oversampling, on the other hand, increases the number of minority-class instances by replicating them. No information is lost, but the enlarged training set raises computational costs (Chawla et al., 2004). Additionally, if some minority-class samples are noisy or mislabeled, replicating them amplifies those errors and hurts classification performance on the minority class.

Various sampling algorithms can be applied in different contexts, depending on the characteristics of the dataset and the learning objectives. One example is Cluster Centroids, an undersampling technique that clusters the majority class (typically with k-means) and replaces each cluster with its centroid until the dataset is balanced. Another is SMOTE (Synthetic Minority Over-sampling Technique), a widely used oversampling technique that creates new synthetic minority examples by interpolating between existing minority instances and their nearest neighbors.
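Both algorithms are available in the imbalanced-learn library. A minimal sketch on synthetic data (the library and the 90/10 split are assumptions chosen for illustration):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for a real dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))  # roughly 900 majority vs. 100 minority examples

# Undersampling: majority-class clusters are replaced by their k-means centroids
X_cc, y_cc = ClusterCentroids(random_state=42).fit_resample(X, y)
print(Counter(y_cc))  # balanced, with a reduced majority class

# Oversampling: synthetic minority examples are interpolated between neighbors
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_sm))  # balanced, with an enlarged minority class
```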

It is important to emphasize that the choice of the best class distribution will depend on the specific performance measures of the problem and may vary from one dataset to another. Additionally, a combination of undersampling and oversampling can be a viable option to balance the dataset and improve model performance in imbalanced class problems.
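One way to combine the two, sketched below with imbalanced-learn's Pipeline so that resampling only touches the training folds, is to oversample the minority class part of the way with SMOTE and then randomly undersample the majority class; the specific ratios here are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Oversample the minority class to 10% of the majority, then undersample the
# majority class down to twice the minority class (assumed, illustrative ratios).
pipe = Pipeline(steps=[
    ("smote", SMOTE(sampling_strategy=0.1, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Resampling happens inside each training fold; F1 focuses on the minority class
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
```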

Conclusion

In this article, we have explored how data imbalance occurs and its consequences in datasets. We have examined the main challenges related to classifiers and discussed useful approaches to address or mitigate this issue, such as the recognition-based approach, cost-sensitive learning, and sampling.

Did you find this article helpful?


References

Nguyen, G. H., Bouzerdoum, A., & Phung, S. (2009). Learning pattern classification tasks with imbalanced data sets. In P. Yin (Ed.), Pattern recognition (pp. 193–208). Vukovar, Croatia: In-Teh.
