Imbalanced Data ML: SMOTE and its variants

An intuitive explanation of SMOTE and its common variants for imbalanced data

An Truong
TotalEnergies Digital Factory
8 min read · Jun 27, 2022


This is the first part of an upcoming series on the effective modelling of imbalanced data for AI-based solutions.

In real-life use cases, we often deal with imbalanced data, where the numbers of instances per class are significantly different: banking fraud detection, industrial failure detection, disease diagnosis, etc. When data is imbalanced, an ML model tends to overfit to the over-represented behaviours of the majority classes, and behaviours of the minority classes are ignored. As a result, running a vanilla classifier over imbalanced data often gives poor performance: missed detections of fraud or failure, or misdiagnosis of rare positive cases.

Imbalance is usually due to a lack of data for the minority classes. An obvious solution would thus be to collect more samples for those classes. In practice, this is rarely possible: it is costly to induce failures in a machine just to collect data, waiting for many failures or frauds to happen before modelling makes no sense, and inducing disease in the medical field is out of the question.

In the early days of modelling imbalanced data, random over-sampling of the minority classes or random under-sampling of the majority classes was used. These methods have drawbacks: oversampling the minority class can cause overfitting to the artificially over-represented samples, while undersampling the majority class can throw away critical information. To mitigate these problems, Chawla et al. [1] proposed in 2002 a new method that oversamples the minority class by interpolating between minority instances, called SMOTE.

What is SMOTE?

SMOTE is an abbreviation of Synthetic Minority Oversampling Technique. Its main hypothesis is that points interpolated between close neighbours of minority instances can be considered to belong to that minority class. Given a simple binary imbalanced dataset (Fig 1-a), SMOTE increases the number of samples in the minority class and thus helps balance the dataset (Fig 1-b).

Fig 1: Vanilla SMOTE.

Technically, SMOTE generates new minority points as follows (Fig 2):

a) For a minority sample, find its k nearest neighbours (e.g. k = 5).
b) Randomly select s of those neighbours, depending on the desired oversampling rate (e.g. to generate two-fold more samples, s = 2); call them the s-NN.
c) Generate new points between the sample and each of its s-NN: x_n = x_i + r*(x_j - x_i), where r is a random number in (0, 1) and x_j is in the s-NN.

Fig 2: SMOTE procedure.
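
To make the procedure concrete, here is a minimal from-scratch sketch of the interpolation step. It is a simplification for illustration, not the imblearn implementation; the function name smote_interpolate and the use of scikit-learn's NearestNeighbors are my own choices.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_interpolate(X_minority, n_new, k=5, random_state=0):
        """Generate n_new synthetic samples from X_minority (array of shape (n, n_features))."""
        rng = np.random.RandomState(random_state)
        # step a) find the k nearest minority neighbours of each minority sample
        # (+1 because each point is returned as its own nearest neighbour)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
        _, neighbours = nn.kneighbors(X_minority)

        synthetic = []
        for _ in range(n_new):
            i = rng.randint(len(X_minority))       # pick a minority sample x_i
            j = rng.choice(neighbours[i][1:])      # step b) pick one of its k neighbours x_j
            r = rng.rand()                         # step c) random factor in (0, 1)
            synthetic.append(X_minority[i] + r * (X_minority[j] - X_minority[i]))
        return np.array(synthetic)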

How to use SMOTE?

We can directly use the SMOTE class from the imblearn package. More info on imblearn can be found here.
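
A minimal usage sketch is shown below; the toy dataset generated with scikit-learn's make_classification is only an assumed approximation of the data shown in the figures.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # toy 2-D dataset with roughly a 140:60 class ratio (assumed, for illustration)
    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               weights=[0.7, 0.3], random_state=42)
    print("Before:", Counter(y))

    # oversample the minority class until both classes have the same number of samples
    X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print("After: ", Counter(y_res))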

The results are shown in Fig 3. Note how the new points (red) are generated on the lines connecting members of the minority class. As shown by the counter, the class count ratio changes from 140:60 to 140:140, the minority class now having as many samples as the majority class.

Fig 3 — Vanilla SMOTE

Problems with SMOTE

While SMOTE helps balance an imbalanced dataset more effectively than naive methods such as random oversampling, vanilla SMOTE suffers from some inherent problems.

As illustrated in Fig 4, SMOTE can generate noisy data when the borderlines between classes are not well defined, which is often the case in real-life datasets. Common cases of noisy minority data are:

  • an isolated point deep inside the majority class's region (Fig 4-a);
  • overlapping borderline points between the minority and majority classes when the boundary between them is not well defined (Fig 4-b);
  • a minority class composed of several distinct clusters, e.g. with feature values either very low or very high (Fig 4-c).

In these cases, the newly synthesised minority samples only add more noise (Fig 4-a', b', c') and thus make the classification task more difficult.

Fig 4: SMOTE’s problems

To deal with these problems, many variants of vanilla SMOTE have been proposed. The following sections focus on some common SMOTE variants and try to explain them in layman's terms.

SMOTETomek

SMOTETomek is a variant of SMOTE that combines oversampling with SMOTE and undersampling with Tomek links [2].

What is a Tomek link? Given two samples (S_i, S_j), if they are each other's closest neighbour and they belong to two different classes, then they form a Tomek link (Fig 5-a). These pairs of samples are interesting because they often lie on the ambiguous borderline between classes, where misclassification is most likely.
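
To make the definition concrete, here is a small illustrative sketch that finds Tomek links as mutual nearest neighbours of different classes. This is my own simplified code, not imblearn's; the library provides the same functionality in imblearn.under_sampling.TomekLinks.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def tomek_links(X, y):
        """Return index pairs (i, j) that form Tomek links: mutual 1-NN of different classes."""
        # nearest neighbour of every sample (column 0 is the sample itself, column 1 its 1-NN)
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        _, idx = nn.kneighbors(X)
        nearest = idx[:, 1]

        links = []
        for i, j in enumerate(nearest):
            # keep pairs that are mutual nearest neighbours and belong to different classes
            if nearest[j] == i and y[i] != y[j] and i < j:
                links.append((i, j))
        return links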

The idea behind SMOTETomek is to make the training dataset cleaner by removing the ambiguous points, the Tomek links, at the borderlines, as illustrated in Fig 5-b, b'.

Fig 5: SMOTETomek

How to use SMOTETomek?
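
As with vanilla SMOTE, imblearn provides a ready-made class. A minimal sketch, reusing the toy X, y from the SMOTE example above:

    from collections import Counter
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import TomekLinks

    # oversample with SMOTE, then remove the Tomek links
    resampler = SMOTETomek(smote=SMOTE(k_neighbors=5, random_state=42),
                           tomek=TomekLinks(), random_state=42)
    X_res, y_res = resampler.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))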

Please notice the empty green points (Fig 6, right) where Tomek links have been removed.

Fig 6 — SMOTETomek

SMOTEENN

Unlike SMOTETomek, SMOTEENN uses Wilson's Edited Nearest Neighbour (ENN) rule in the under-sampling step to remove instances of the majority class [3]. As illustrated in Fig 7, SMOTEENN works as follows:

  • For each sample S_i, find its k nearest neighbours (e.g. k = 3), called the kNN.
  • Compute the ratio r of majority-class samples among the kNN.
  • If S_i belongs to the minority class and r > 0.5, remove the majority instances among its kNN (Fig 7-a, a').
  • If S_i belongs to the majority class and r < 0.5, remove S_i (Fig 7-b, b').
Fig 7 — SMOTEENN

In general, SMOTEENN removes more samples than SMOTETomek.

How to use SMOTEENN?
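
A minimal sketch with imblearn, again reusing the toy X, y from the SMOTE example; the ENN neighbourhood size is set to k = 3 to match Fig 7:

    from collections import Counter
    from imblearn.combine import SMOTEENN
    from imblearn.under_sampling import EditedNearestNeighbours

    # oversample with SMOTE, then clean both classes with the ENN rule
    resampler = SMOTEENN(enn=EditedNearestNeighbours(n_neighbors=3), random_state=42)
    X_res, y_res = resampler.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))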

As mentioned, compared to SMOTETomek, SMOTEENN removes more samples (empty green circles), see Fig 8.

Fig 8 — SMOTEENN for simulated data

BorderlineSMOTE

In 2005, Han et al. proposed a new variant called BorderlineSMOTE [4]. Instead of removing the borderline area, they proposed to specifically add more samples in that region. Their reasoning is that samples in the borderline region are hard to classify, so synthesising more samples there helps the classifier learn to better separate this ambiguous region.

Basically, their method works as follows:

  • Use kNN to assign each minority sample to a noise, danger, or safe zone.
  • The zone is determined by the ratio r of majority instances among the kNN. Take the example shown in Fig 9-a, a minority sample S_i and its kNN: if r = 1, all its neighbours are majority, so S_i is considered noise; if r < 0.5, most of its neighbours are minority, so S_i is in a safe zone; if 0.5 ≤ r < 1, S_i is in the danger zone.
  • Oversample with SMOTE only the minority samples in the danger zone (Fig 9-b).
Fig 9 — BorderlineSMOTE

How to use BorderlineSMOTE?
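
A minimal sketch with imblearn's BorderlineSMOTE class, reusing the toy X, y from above; the parameter values are illustrative defaults, not necessarily those used for the figures:

    from collections import Counter
    from imblearn.over_sampling import BorderlineSMOTE

    # m_neighbors: neighbourhood used to decide the noise/danger/safe zones
    # k_neighbors: neighbourhood used for the SMOTE interpolation itself
    resampler = BorderlineSMOTE(kind="borderline-1", k_neighbors=5,
                                m_neighbors=10, random_state=42)
    X_res, y_res = resampler.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))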

Please note how the newly synthesised points for the minority class (red) appear only in the borderline zone.

Fig 10 — BorderlineSMOTE for simulated data

SVMSMOTE

Instead of using kNN to find the borderline zone, Nguyen et al. [5] proposed in 2011 to use an SVM classifier to find the support vectors, and thus the borderline zone. They also proposed to use extrapolation, in addition to interpolation, to extend the minority class region further.

SVMSMOTE works as follows:

  • Use an SVM to find the support vectors of the minority class (Fig 11-a).
  • For each minority support vector S_i, find its k nearest neighbours (kNN). If more than half of its kNN belong to the majority class, interpolate as in vanilla SMOTE to generate new minority samples; otherwise, if more than half belong to the minority class, extrapolate to generate new minority samples beyond S_i (Fig 11-b).
Fig 11 — SVMSMOTE

How to use SVMSMOTE?
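
A minimal sketch with imblearn's SVMSMOTE class, reusing the toy X, y from above; the linear SVC and the out_step value are illustrative choices, not necessarily those used for the figures:

    from collections import Counter
    from sklearn.svm import SVC
    from imblearn.over_sampling import SVMSMOTE

    # svm_estimator finds the support vectors; out_step controls how far extrapolation may go
    resampler = SVMSMOTE(svm_estimator=SVC(kernel="linear"), k_neighbors=5,
                         m_neighbors=10, out_step=0.5, random_state=42)
    X_res, y_res = resampler.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))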

Please note in Fig 12 that some of the synthesised minority samples (red) are extrapolated instead of interpolated.

Fig 12 — SVMSMOTE for simulated data

KmeansSMOTE

So far, all the discussed SMOTE variants have only used the samples' relative positions in the feature space to tackle SMOTE's weakness of generating noisy samples. KMeansSMOTE instead also takes the overall sample density into account [6].

As illustrated in Fig 13 for a simple two-class dataset, KMeansSMOTE works as follows:

  • Cluster the dataset into k clusters; k is a hyperparameter that needs to be tuned together with the rest of the modelling pipeline.
  • For each cluster, calculate the imbalance ratio, defined as (Fig 13-b):
    r = (count(minority) + 1) / (count(majority) + 1)
  • Oversample with SMOTE only the clusters with r > 1 (Fig 13-c).
Fig 13 — KMeansSMOTE

KMeansSMOTE can be very suitable when the minority class is expected to form clusters. A simple example is anomalous-sensor detection, where anomalous values may be either very low or very high, sandwiching the majority class (whose values lie in the medium range).

How to use KMeansSMOTE?
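
A minimal sketch with imblearn's KMeansSMOTE class, reusing the toy X, y from above; the number of clusters and the cluster_balance_threshold are illustrative and usually need tuning for your own data:

    from collections import Counter
    from sklearn.cluster import KMeans
    from imblearn.over_sampling import KMeansSMOTE

    # kmeans_estimator sets the number of clusters (the k discussed above);
    # cluster_balance_threshold decides which clusters are minority-dominated enough to oversample.
    # If no cluster passes the filter, imblearn raises an error: lower the threshold or the cluster count.
    resampler = KMeansSMOTE(kmeans_estimator=KMeans(n_clusters=5, random_state=42),
                            cluster_balance_threshold=0.1, k_neighbors=2, random_state=42)
    X_res, y_res = resampler.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))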

You might notice that some noisy minority instances (deep inside the majority class) are not used to generate new points. This is thanks to the filtering step with k-means clustering.

Fig 14 — KMeansSMOTE for simulated data

Outlook

This post aims to give an overview of a few commonly used variants of SMOTE. The methodologies have been intentionally simplified to explain the underlying intuition from my personal perspective. Advanced readers are invited to dig deeper through the references below. Compared to the more than 100 SMOTE variants in the literature, we have clearly only started scratching the surface of the subject.

In the next part, I will cover how these SMOTE variants affect classification performance as a function of the classifier and the data distribution.

Stay tuned, and don't hesitate to share your thoughts on the subject.

References

[1] Chawla, N.V. et al., 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, pp. 321–357.
[2] G. Batista, R. Prati, and M. Monard. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. SIGKDD Explorations, 6(1):20–29, 2004.
[3] G. Batista, A. Bazzan, and M. Monard. Balancing Training Data for Automated Annotation of Keywords: a Case Study. Journal of artificial intelligence research, 3(2):15–20, 2003.
[4] Han, H., Wang, WY., Mao, BH. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Huang, DS., Zhang, XP., Huang, GB. (eds) Advances in Intelligent Computing. ICIC 2005. Lecture Notes in Computer Science, vol 3644. Springer, Berlin, Heidelberg.
[5] Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. 2011. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigm. 3, 1 (April 2011), 4–21.
[6] Last, Felix & Douzas, Georgios & Bação, Fernando. (2017). Oversampling for Imbalanced Learning Based on K-Means and SMOTE.


An Truong
TotalEnergies Digital Factory

Senior data scientist with a passion for code. Follow me for more practical data science tips from industry.