Imbalanced Data: an extensive guide on how to deal with imbalanced classification problems

Lavinia Guadagnolo
Published in Eni digiTALKS
May 3, 2022

An in-depth analysis on data-level, algorithm-level, and hybrid approaches to face imbalanced classification problems.

Source https://stock.adobe.com

Imbalanced classification problems and the accuracy paradox

Anyone who is familiar with machine learning has certainly come across the problem of imbalanced classification. By definition, imbalanced classification occurs when one or more classes have very low proportions in the training data as compared to the other classes. When the distribution of examples is uneven by a large amount (a ratio of 1:100 or worse), there is a severe imbalance.

Going into detail, there are two main aspects of imbalanced classification problems: minority interest and rarity. Minority interest refers to the fact that the rare instances are the most interesting ones: in problems such as fraud detection or churn prevention, the minority class is the one to be addressed. Rarity, on the other hand, refers to the fact that data belonging to a particular class are represented in low proportions with respect to the other classes. Most imbalanced classification problems arise from a combination of these two factors, such as predicting rare diseases. In such situations, most common machine learning algorithms fail, since they are designed to maximize accuracy: on an imbalanced dataset a model can reach high accuracy while performing poorly on the minority class. So how can we face this challenge? There are several approaches, which can be grouped into three categories:

  • data-level methods:
    - oversampling techniques
    - undersampling techniques
  • algorithm-level methods
  • hybrid methods

Choosing the right approach is crucial: a wrong choice might cause information loss or lead to overfitting.

In this article we will focus on different approaches to avoid these risks. First, we will present SMOTE and its extensions, techniques aimed at resampling the dataset; then we will give a brief overview of two families of algorithms that are suitable for imbalanced datasets, and finally we will present possible hybrid approaches combining the previous ones.

1. Data-level methods

Data-level approaches aim at rebalancing the training dataset before applying machine learning algorithms. This can be done in two different ways:

  • Oversampling: creating new instances of the minority class
  • Undersampling: deleting instances of the majority class

The two approaches are explained in Figure 1.

Figure 1 — Undersampling and oversampling effects in rebalancing.

There are several data-level methods, ranging from random oversampling / undersampling to more complex approaches. In the following paragraphs we will focus on SMOTE and its extensions.
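Before moving to SMOTE, random resampling itself is easy to sketch. The snippet below is a minimal pure-Python toy, not production code; in practice, libraries such as imbalanced-learn provide RandomOverSampler and RandomUnderSampler for exactly this:

```python
import random

random.seed(0)

# Toy imbalanced dataset: 10 majority samples vs. 3 minority samples.
majority = [[float(i), 0.0] for i in range(10)]
minority = [[float(i), 1.0] for i in range(3)]

# Random oversampling: duplicate randomly chosen minority samples
# (with replacement) until the two classes have the same size.
oversampled_minority = minority + random.choices(minority, k=len(majority) - len(minority))

# Random undersampling: keep only a random subset of the majority
# class, matching the minority class size.
undersampled_majority = random.sample(majority, k=len(minority))

print(len(oversampled_minority), len(undersampled_majority))  # 10 3
```

Both variants have the drawbacks mentioned above: pure duplication encourages overfitting, while pure deletion discards information, which motivates the smarter techniques that follow.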

1.1 SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique first presented in 2002 (Nitesh V. Chawla, 2002). It is based on the idea of generating new minority class samples by linear interpolation between existing ones.

Let’s see this approach in detail:

  • An 𝑥ᵢ minority class instance is selected as root sample for new synthetic samples
  • 𝑥ᵢ ’s 𝐾 nearest neighbors are obtained
  • 𝑛 of the 𝐾 instances are randomly chosen to compute the new instances by interpolation
  • The difference between 𝑥ᵢ and each selected neighbor is considered
  • 𝑛 synthetic samples are generated by the following formula:

𝑥_𝑛𝑒𝑤(𝑗) = 𝑥ᵢ(𝑗) + 𝑔𝑎𝑝(𝑗) × (𝑥̂ᵢ(𝑗) − 𝑥ᵢ(𝑗))

where 𝑥̂ᵢ is one of the selected neighbors, 𝑔𝑎𝑝(𝑗) is a uniformly distributed random variable on (0,1) for the j-th feature, and 𝑛 is the amount of oversampling.
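The interpolation step above can be sketched in a few lines of pure Python. This is an illustrative toy in which the neighbors are given by hand rather than computed with a k-NN search; a real project would use imbalanced-learn's SMOTE class instead:

```python
import random

random.seed(42)

def smote_samples(x_i, neighbors, n):
    """Generate n synthetic points by interpolating between the root
    sample x_i and randomly chosen neighbors, drawing one gap(j) per
    feature as described above."""
    synthetic = []
    for _ in range(n):
        x_hat = random.choice(neighbors)  # one of the selected neighbors
        # new(j) = x_i(j) + gap(j) * (x_hat(j) - x_i(j))
        new = [x_i[j] + random.random() * (x_hat[j] - x_i[j])
               for j in range(len(x_i))]
        synthetic.append(new)
    return synthetic

# Tiny illustration: a minority root sample and two of its neighbors.
x_i = [0.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0]]
new_points = smote_samples(x_i, neighbors, n=3)

# Each synthetic point lies on a segment between x_i and one neighbor,
# so every coordinate here stays within [0, 1].
for p in new_points:
    print(p)
```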

Figure 2 — Example of an imbalanced dataset before and after SMOTE is applied (Python implementation adapted from https://machinelearningmastery.com)

As is apparent from the plots, after applying SMOTE the two classes are more balanced, and the newly generated samples lie close to the original minority class samples.

1.2 SMOTE Extensions

Looking more closely at the first plot, we can identify three different areas:

Figure 3 — Example of imbalanced dataset highlighting different regions
  • the green circled area, in which most of the instances belong to the minority class;
  • the yellow circled area, where instances of the majority and minority classes appear in almost equal proportions;
  • the red circled area, where most of the points belong to the majority class.

We define the second one as the borderline area. Instances located in this area are the most difficult to predict. Therefore, several extensions of SMOTE have been developed, based on the idea of selecting target samples before generating new ones.

1.2.1 Borderline SMOTE

Borderline SMOTE is one of these extensions; it focuses on the “danger” region. The approach is the following:

  • An 𝑥ᵢ minority class instance is selected as root sample for new synthetic samples
  • 𝑥ᵢ ’s 𝑚 nearest neighbors are obtained
  • The number of majority samples 𝑚′ among the nearest neighbors of 𝑥ᵢ is calculated:
    - If 𝑚′ = 𝑚, all neighbors belong to the majority class → the sample is considered NOISE
    - If 𝑚/2 ≤ 𝑚′ < 𝑚, the majority neighbors outnumber the minority ones → the sample is considered DANGER
    - If 0 ≤ 𝑚′ < 𝑚/2, the majority neighbors are fewer than the minority ones → the sample is considered SAFE

Let 𝑝𝑛𝑢𝑚 be the number of minority instances and 𝑑𝑛𝑢𝑚 the number of instances in DANGER. Then, for each sample 𝑝ᵢ′ in DANGER:

  • 𝑝ᵢ′ is selected as root sample for new synthetic samples
  • 𝑝ᵢ′ ’s 𝑘 nearest neighbors among the minority class are obtained
  • 𝑠 of the 𝑘 instances are randomly chosen to compute the new instances by interpolation
  • The difference between 𝑝ᵢ′ and each selected neighbor is considered
  • 𝑠 synthetic samples per root (𝑠 × 𝑑𝑛𝑢𝑚 in total) are generated by the following formula:

𝑠𝑦𝑛𝑡ℎ𝑒𝑡𝑖𝑐ⱼ = 𝑝ᵢ′ + 𝑟ⱼ × 𝑑𝑖𝑓ⱼ

where 𝑑𝑖𝑓ⱼ is the difference between 𝑝ᵢ′ and its j-th nearest neighbor, 𝑟ⱼ is a uniformly distributed random variable on (0,1), 𝑠 is the amount of oversampling and 𝑘 is the number of nearest neighbors. The result of borderline SMOTE is shown in the following plots:
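The NOISE/DANGER/SAFE rule can be sketched as a small helper. This is a toy illustration assuming the class labels of the 𝑚 nearest neighbors have already been retrieved (0 = majority, 1 = minority); imbalanced-learn's BorderlineSMOTE implements the full algorithm:

```python
def classify_minority(neighbor_labels, m):
    """Label a minority instance from the class labels of its m nearest
    neighbors (0 = majority, 1 = minority), following the rules above."""
    m_prime = neighbor_labels.count(0)  # number of majority neighbors
    if m_prime == m:
        return "NOISE"                  # surrounded only by the majority class
    if m / 2 <= m_prime < m:
        return "DANGER"                 # borderline: majority neighbors prevail
    return "SAFE"                       # minority neighbors prevail

print(classify_minority([0, 0, 0, 0], m=4))  # NOISE
print(classify_minority([0, 0, 0, 1], m=4))  # DANGER
print(classify_minority([1, 1, 1, 0], m=4))  # SAFE
```

Only the samples labeled DANGER are used as roots for the interpolation step, which is what concentrates the new points along the borderline.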

Figure 4 — Example of imbalanced dataset before and after borderline SMOTE is applied

As in SMOTE, new samples are generated by linear interpolation (it is an oversampling technique). But, in contrast with the previous approach, new samples are generated almost exclusively along the borderline area. In this way more samples are created in the most critical region, while the areas labeled as “safe” or “noise” are ignored.

1.2.2 ADASYN

However, in some cases the noise area also needs to be taken into account. The ADASYN approach (Haibo He) focuses on this problem: the idea is to generate more minority samples (oversampling) in the areas where their density is lower. The procedure is the following:

Let 𝑚ₛ be the number of minority class examples, 𝑚ₗ be the number of majority class examples, 𝑑ₜₕ be the threshold for the maximum tolerated degree of class imbalance, and 𝛽 a parameter used to specify the desired balance.

Calculate the degree of class imbalance 𝑑 = 𝑚ₛ ⁄ 𝑚ₗ

  • If 𝑑 < 𝑑ₜₕ:
    - Calculate the number of synthetic data examples that need to be generated: 𝐺 = (𝑚ₗ − 𝑚ₛ) × 𝛽
    - For each example 𝑥ᵢ in the minority class, find its 𝐾 nearest neighbors and calculate 𝑟ᵢ = ∆ᵢ ⁄ 𝐾, 𝑖 = 1, …, 𝑚ₛ, where ∆ᵢ is the number of examples among the 𝐾 nearest neighbors of 𝑥ᵢ that belong to the majority class
    - Normalize the 𝑟ᵢ as 𝑟̂ᵢ = 𝑟ᵢ ⁄ ∑ᵢ 𝑟ᵢ, so that the 𝑟̂ᵢ form a density distribution (∑ᵢ 𝑟̂ᵢ = 1)
    - Calculate the number of synthetic data examples to generate for each minority example: 𝑔ᵢ = 𝑟̂ᵢ × 𝐺

For each minority class example 𝑥ᵢ, generate 𝑔ᵢ synthetic samples by interpolation, as in SMOTE:

𝑠 = 𝑥ᵢ + 𝜆 × (𝑥ᵢ′ − 𝑥ᵢ)

where 𝑥ᵢ′ is a minority example randomly chosen among the 𝐾 nearest neighbors of 𝑥ᵢ and 𝜆 is a uniformly distributed random variable on (0,1).
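The allocation step can be traced numerically. In this hypothetical toy example the neighbor counts ∆ᵢ are given by hand; note how minority samples surrounded by more majority neighbors receive more synthetic samples:

```python
# Hypothetical setup: 4 minority and 20 majority samples, K = 5, and
# Delta_i (majority neighbors of each minority sample) counted already.
m_s, m_l = 4, 20
beta = 1.0                 # beta = 1 -> fully rebalanced dataset
K = 5
delta = [5, 4, 2, 1]       # Delta_i for each minority sample
d_th = 0.5                 # maximum tolerated degree of imbalance

d = m_s / m_l              # degree of class imbalance: 0.2 < d_th
if d < d_th:
    G = (m_l - m_s) * beta                   # total synthetic samples: 16
    r = [d_i / K for d_i in delta]           # r_i = Delta_i / K
    r_hat = [r_i / sum(r) for r_i in r]      # normalized density distribution
    g = [round(rh * G) for rh in r_hat]      # g_i = r_hat_i * G

print(g)  # [7, 5, 3, 1]: harder neighborhoods get more synthetic samples
```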

Figure 5 — Example of imbalanced dataset before and after Adasyn is applied

According to this approach, the examples with the most class overlap receive the most attention. Hence, on problems where these low-density examples are actually outliers, ADASYN may put too much emphasis on those areas of the feature space, which can hurt model performance. Removing outliers before applying the oversampling procedure may help, and is a useful heuristic more generally.

Finally, an interesting integration with undersampling techniques can further improve dataset rebalancing.

1.2.3 Integration with Tomek Link

Tomek link (Luo Ruisen) is an interesting undersampling technique which limits information loss. The approach is the following:

  • Two instances 𝑥ᵢ , 𝑥ⱼ belonging to different classes form a Tomek link if there is no instance 𝑥ₖ ∈ 𝑋 \ {𝑥ᵢ , 𝑥ⱼ } such that 𝑑(𝑥ᵢ , 𝑥ₖ) < 𝑑(𝑥ᵢ , 𝑥ⱼ) or 𝑑(𝑥ⱼ , 𝑥ₖ) < 𝑑(𝑥ᵢ , 𝑥ⱼ); in other words, if 𝑥ᵢ and 𝑥ⱼ are each other’s nearest neighbor
  • For every Tomek link found, the instance belonging to the majority class is removed

Therefore, instances belonging to the majority class which are close to instances of the minority class (and might be misleading for a classifier) are removed.
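The definition translates almost directly into code. Below is a brute-force sketch, quadratic in the dataset size; imbalanced-learn's TomekLinks class does the same job efficiently:

```python
import math

def is_tomek_link(i, j, X, y):
    """x_i and x_j form a Tomek link if they belong to different classes
    and no third point is closer to either of them than they are to each
    other, i.e. they are each other's nearest neighbor."""
    if y[i] == y[j]:
        return False
    d_ij = math.dist(X[i], X[j])
    for k in range(len(X)):
        if k not in (i, j) and (math.dist(X[i], X[k]) < d_ij
                                or math.dist(X[j], X[k]) < d_ij):
            return False
    return True

# Toy data: a minority point (class 1) right next to a majority point
# (class 0), plus a distant majority-only cluster.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
y = [1, 0, 0, 0]

print(is_tomek_link(0, 1, X, y))  # True  -> remove X[1] (majority side)
print(is_tomek_link(2, 3, X, y))  # False -> same class, not a Tomek link
```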

Figure 6 — Example of Tomek link application on an imbalanced dataset

Integrating this undersampling technique with SMOTE can be beneficial in rebalancing the dataset, since noise is reduced and new samples are created.

2. Algorithm-based approaches

Image from https://stock.adobe.com

After having resampled the data, it is crucial to apply the right algorithm according to the new distribution and according to the aim of the project. The most effective types of algorithms for imbalanced classification are ensemble techniques. The idea behind them is to train multiple models (weak learners) and combine their results to obtain predictions. Ensemble algorithms improve stability and accuracy of common machine learning algorithms.

Ensemble algorithms can be grouped in two categories: bagging and boosting techniques.

Bagging stands for Bootstrap Aggregating. In this technique, different training datasets are generated by random sampling with replacement from the original dataset, so some observations may be repeated across datasets. These new sets are used to train the same learning algorithm, producing different classifiers. The learners are trained independently and in parallel, and the final predictions are obtained by averaging the results of all the learners.

Similarly, in boosting, different training datasets are generated by sampling with replacement but, in contrast with bagging, each new classifier takes into account the previous classifiers’ success: misclassified samples get higher weight, so the models are trained sequentially and adaptively. Furthermore, in boosting techniques the final vote is a weighted average of the individual estimates.

Both techniques increase stability, but they address different problems: bagging mainly reduces variance, so it can mitigate overfitting, while boosting mainly aims at reducing bias. The table below sums up the strengths and weaknesses of both approaches.

Table 1 — Advantages and disadvantages of Algorithm based techniques
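As a minimal illustration of the bagging idea, the sketch below bootstraps a toy 1-D dataset and averages the votes of a trivial threshold learner (a stand-in for the decision trees usually used as weak learners):

```python
import random
import statistics

random.seed(7)

# 1-D toy data: class 0 clustered around 0, class 1 around 10.
data = [(random.gauss(0, 1), 0) for _ in range(20)] + \
       [(random.gauss(10, 1), 1) for _ in range(20)]

def train_stump(sample):
    """Weak learner: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in sample if y == 0)
    mean1 = statistics.mean(x for x, y in sample if y == 1)
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x > threshold else 0

# Bagging: draw bootstrap resamples (with replacement), train one weak
# learner per resample, then average the votes at prediction time.
learners = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def predict(x):
    votes = sum(clf(x) for clf in learners)
    return round(votes / len(learners))

print(predict(1.0), predict(9.0))  # 0 1
```

Boosting would differ in the loop: each resample (or sample weight) would depend on the errors of the learners trained so far, and the final vote would be weighted.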

3. Hybrid approaches

When there is a severe imbalance, it is advisable to apply hybrid techniques, combining data-level and algorithm-level approaches to leverage the potential of both. The following pictures illustrate two possible hybrid approaches.

Figure 7 — Hybrid approach integrating oversampling, undersampling techniques and boosting algorithm
Figure 8 — Hybrid approach integrating bagging, undersampling techniques and boosting algorithm

In Figure 7 two data-level approaches are applied to rebalance the dataset: borderline SMOTE for oversampling and Tomek link for undersampling. Then XGBoost is applied and scores are obtained. In Figure 8 different models are trained in parallel on different datasets, obtained by selecting all the instances of the minority class plus a subset of the instances of the majority class; in this way the degree of imbalance is reduced. After that, Tomek link is applied to further balance each subset, removing misleading samples, and finally an XGBoost model is trained on each subset. The probability score is obtained by averaging the results of the different models.
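The ensemble structure of the Figure 8 pipeline can be sketched in pure Python. This toy version omits the Tomek-link cleaning step and replaces XGBoost with a trivial distance-to-mean scorer (`fit_scorer` is a hypothetical stand-in), keeping only the core idea: balanced subsets trained in parallel, scores averaged:

```python
import random
import statistics

random.seed(3)

# Toy 1-D data: 6 minority (class 1) vs. 30 majority (class 0) samples.
minority = [(random.gauss(10, 1), 1) for _ in range(6)]
majority = [(random.gauss(0, 1), 0) for _ in range(30)]

def fit_scorer(subset):
    """Stand-in for XGBoost: score 1.0 if x is nearer the minority mean."""
    mean1 = statistics.mean(x for x, y in subset if y == 1)
    mean0 = statistics.mean(x for x, y in subset if y == 0)
    return lambda x: 1.0 if abs(x - mean1) < abs(x - mean0) else 0.0

# Each parallel subset keeps ALL minority samples plus a random
# minority-sized slice of the majority class (undersampling).
scorers = [fit_scorer(minority + random.sample(majority, k=len(minority)))
           for _ in range(5)]

def score(x):
    # Final probability score: average over the parallel models.
    return sum(s(x) for s in scorers) / len(scorers)

print(score(9.5), score(0.5))  # near 1.0 and near 0.0, respectively
```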

Conclusion

So, which sampling technique should we apply? Which algorithm should we choose? In this article we presented different possibilities, but there is no single right answer to these questions: it depends on the problem itself and on the severity of the imbalance. A careful preliminary and exploratory analysis is recommended, verifying the data distribution, checking for outliers and investigating their meaning. After narrowing down the panel of possible approaches based on these elements, it is possible to test and compare them in order to find the most suitable one for the problem.

References

Haibo He, Yang Bai, Edwardo A. Garcia, Shutao Li (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence).

Luo Ruisen, D. S. (n.d.). Bagging of Xgboost Classifiers with Random Under-sampling and Tomek Link for Noisy Label-imbalanced Data. IOP Conference Series: Materials Science and Engineering, Chengdu, China.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
