Handling Imbalanced Data by Oversampling with SMOTE and its Variants

Ansh Bordia · Analytics Vidhya · Feb 25, 2022

In this post I’ll explain oversampling/upsampling using SMOTE, SVM SMOTE, BorderlineSMOTE, K-Means SMOTE and SMOTE-NC. I’ll follow the explanations with a practical example where we apply these methods to solve an imbalanced Machine Learning problem to see their impact.


Introduction

When working on Machine Learning problems, one of the first things I check is the distribution of the target class in my data. This distribution informs certain aspects of how I go about solving the problem. I often see some sort of imbalance in the data, and sometimes this imbalance is not significant (for simplicity, assume 60:40 for a binary classification task) while sometimes it is (say, 98:2). When the imbalance is insignificant, my life is much easier; but I’ve learnt to accept that’s not always the case. For those who have worked with such datasets (e.g. churn modelling, propensity modelling, etc.), you’d know how poor the model performance can be at correctly identifying samples of the minority class. This is where it is essential to understand how you can deal with this type of problem to get the best possible outcome.

There are several ways you can go about solving this problem, and today we’ll talk about oversampling (a.k.a. upsampling). Simply put, oversampling is increasing the number of minority class samples. However, it’s not just about replicating samples of the minority class (which can be effective at times!) but instead using some more sophisticated methods. These sophisticated methods include the popular SMOTE and several useful variants (which I feel aren’t talked about much) derived from it.

I’ll first explain how these methods work and then we’ll cover a small practical example where we apply these methods and see how they perform. Let’s get started then!

Methods

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE [1] follows a very simple approach:

SMOTE (illustration)
  1. Select a sample, let’s call it O (for Origin), from the minority class at random
  2. Find the K-Nearest Neighbours of O that belong to the same class
  3. Connect O to each of these neighbours using a straight line
  4. For each connection, select a scaling factor ‘z’ in the range [0, 1] at random
  5. Place a new point on the line, (z*100)% of the way from O to the neighbour. These will be our synthetic samples (see the sketch after this list)
  6. Repeat this process until you get the desired number of synthetic samples
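
To make steps 4 and 5 concrete, here is a minimal NumPy sketch of the interpolation at the core of SMOTE. It’s a toy version for illustration only (not the imbalanced-learn implementation); X_minority is assumed to be an array holding only the minority class samples.

# Toy sketch of SMOTE-style interpolation (not the imbalanced-learn implementation)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, random_state=None):
    """Generate one synthetic sample by interpolating between a random
    minority point (O) and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each point is its own closest neighbour
    o = X_minority[rng.integers(len(X_minority))]             # step 1: pick O at random
    _, idx = nn.kneighbors(o.reshape(1, -1))                  # step 2: its nearest minority neighbours
    neighbour = X_minority[rng.choice(idx[0][1:])]            # pick one neighbour (skip O itself)
    z = rng.random()                                          # step 4: random scaling factor in [0, 1]
    return o + z * (neighbour - o)                            # step 5: point (z*100)% of the way to the neighbour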

While the idea of generating new synthetic samples rather than duplicating original samples is a step up, SMOTE has one main weakness. If the point(s) selected in either step 1 or step 2 lie in a region dominated by majority class samples, the synthetic points may be generated inside the majority class region (which can make classification harder!). To know more, Dr. Saptarsi Goswami and Aleksey Bilogur have provided informative visualizations of this phenomenon.

BorderlineSMOTE

BorderlineSMOTE [2] works similarly to traditional SMOTE but with a few caveats. In order to overcome the shortcoming of SMOTE, it identifies two sets of points: Noise and Border. What are these peculiar points, you ask? A point is called “Noise” if all its nearest neighbours belong to a different class (i.e. the majority). On the other hand, “Border” points are those that have a mix of majority and minority class points as their nearest neighbours.

When the sampling is done (Step 1 of SMOTE), only the Border points are used. Afterwards, when finding the nearest neighbours, the criterion of selecting only points belonging to the same class is relaxed to include points belonging to any class. This helps select points that are at risk of misclassification and generate new points closer to the boundary. Everything else is the same.
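
As a rough illustration, here is a sketch of how a single minority point might be labelled. The m/2 threshold follows the original paper [2], the majority class is assumed to be encoded as 0, and this is not the library implementation.

# Sketch: labelling a minority point as "noise", "border" or "safe"
import numpy as np
from sklearn.neighbors import NearestNeighbors

def label_minority_point(point, X_all, y_all, m=5):
    """Label a minority sample based on its m nearest neighbours in the full
    dataset (assumes majority class is encoded as 0 and that `point` itself
    is not contained in X_all)."""
    nn = NearestNeighbors(n_neighbors=m).fit(X_all)
    _, idx = nn.kneighbors(point.reshape(1, -1))
    n_majority = np.sum(y_all[idx[0]] == 0)
    if n_majority == m:
        return "noise"    # surrounded entirely by the majority class
    elif n_majority >= m / 2:
        return "border"   # mixed neighbourhood: at risk of misclassification, used for oversampling
    else:
        return "safe"     # mostly surrounded by its own class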

But this solution is not a silver bullet. Restricting sampling to just the border points and relaxing the neighbourhood selection criterion need not work in every scenario. However, that’s a topic for another day.

K-Means SMOTE

This is a fairly recent method [3] that aims to reduce the noisy synthetic points other oversampling methods generate. The way it works is straightforward:

  1. Do K-Means Clustering on the data. (What is K-Means Clustering?)
  2. Select clusters that have a high proportion (>50% or user-defined) of minority class samples.
  3. Apply conventional SMOTE to these selected clusters. Each cluster is assigned new synthetic points; how many depends on how sparse the minority class is within the cluster: the sparser the cluster, the more new points it receives.
K-Means SMOTE (illustration)

In essence, this method helps create clusters of the minority class (that are not greatly influenced by other classes). This can ultimately benefit the ML model. However, it inherits the weaknesses of the K-Means algorithm — such as finding the right K, among others.
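
If you use imbalanced-learn, these choices map onto a couple of constructor parameters (parameter names as per the imbalanced-learn docs at the time of writing; X_train and y_train are placeholders for your training data):

# Sketch: K-Means SMOTE with an explicit number of clusters and a minority-share threshold
from sklearn.cluster import KMeans
from imblearn.over_sampling import KMeansSMOTE

kmeans_smote = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=10, random_state=42),  # you still have to pick K
    cluster_balance_threshold=0.5,  # only oversample clusters with >50% minority samples
    random_state=42,
)
X_res, y_res = kmeans_smote.fit_resample(X_train, y_train)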

I guess now is a good time to chuckle at a sampling meme (which should make sense after reading the above methods).

Image generated using imgflip

SVM SMOTE

SVM SMOTE [4] focuses on increasing minority points along the decision boundary. The argument behind this is that instances around this boundary are critical for estimating the optimal decision boundary (which contrasts with the K-Means method we saw earlier but aligns with the Borderline variant).

So this is how this method works:

  1. Train an SVM on your data. This will give you the support vectors (we focus on the minority class support vectors). (What is an SVM?)
  2. We then use these support vectors to generate new samples. For each of the support vectors, we find its K-Nearest Neighbours and create samples along the line joining the support vector and the nearest neighbours using either interpolation or extrapolation.
  3. If less than half of the nearest neighbours belong to the majority class, then we do extrapolation. This helps expand the minority class area towards the majority area. If not, we do interpolation. The idea here is that since the majority of the neighbours belong to the majority class, we’d instead consolidate the current area of the minority class.

If you’re finding it hard to grasp point 3, try visualizing it in your head or drawing it on a piece of paper to see what it would look like; there is also a short sketch at the end of this section. This should make things more apparent. Have a look at the research paper [4] listed at the end of this post to validate your understanding.

Note: When finding the nearest neighbours we consider all the classes, but when joining the minority support vector to these neighbours, we consider only the minority ones.

This method works well when there is a low degree of overlap between the classes.
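
Here is a toy sketch of the decision in step 3 (my own illustration, not the actual SVM SMOTE code). Here sv is a minority support vector, neighbour is one of its minority nearest neighbours, and n_majority_neighbours counts how many of its k nearest neighbours come from the majority class.

# Sketch: interpolation vs extrapolation around a minority support vector
import numpy as np

def sample_from_support_vector(sv, neighbour, n_majority_neighbours, k, rng):
    """Create one synthetic sample from a minority support vector `sv` and one of
    its minority neighbours, following the interpolation/extrapolation rule above."""
    z = rng.random()  # random step size in [0, 1]
    if n_majority_neighbours < k / 2:
        # Few majority points nearby: extrapolate beyond the support vector
        # to expand the minority region towards the majority class.
        return sv + z * (sv - neighbour)
    # Majority-dominated neighbourhood: interpolate between the support vector
    # and the neighbour to consolidate the existing minority region.
    return sv + z * (neighbour - sv)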

SMOTE-NC

If you’re wondering how the above methods deal with categorical variables (without having to do some form of encoding) you’re on the right track. All the methods above work only with numerical data. While many problems are unanswered in this world, dealing with categorical data isn’t one of them.

SMOTE-NC (N for Nominal and C for Continuous) [1] can be used when we have a mixture of numerical (C) and categorical (N) data. To understand how this method works, I’ll be answering two questions. Pause for a moment and think about what they might be.

Question 1: Given that we have categorical variables, how do we compute the distance when finding the nearest neighbours?

Answer 1: We compute a constant M (think of this as a penalty term), which is the median of the standard deviations of the numerical features of the minority samples. When computing the Euclidean distance between two samples, M is then included as the “difference” for every categorical feature whose values don’t match. See the illustration below:

SMOTE-NC: Calculating Distance
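
Here is a toy sketch of that distance, following the description in the SMOTE paper [1]; the feature values below are made up for illustration.

# Sketch: distance between two mixed-type samples for SMOTE-NC
import numpy as np

def smote_nc_distance(a_num, a_cat, b_num, b_cat, M):
    """Euclidean distance over the numeric features, plus M^2 for every
    categorical feature that differs between the two samples."""
    sq_dist = np.sum((a_num - b_num) ** 2)
    sq_dist += (M ** 2) * np.sum(a_cat != b_cat)
    return np.sqrt(sq_dist)

# M = median of the standard deviations of the minority class' numeric features
minority_num = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 6.0]])
M = np.median(np.std(minority_num, axis=0))
d = smote_nc_distance(np.array([1.0, 5.0]), np.array(["A", "X"]),
                      np.array([2.0, 7.0]), np.array(["A", "Y"]), M)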

Question 2: How do we assign categories to new synthetic points?

Answer 2: To get the numeric features of a new synthetic sample, we use traditional SMOTE. For a categorical feature, we assign the most frequent value among the K-Nearest Neighbours (all belonging to the minority class).

So now we know how to find the nearest neighbours (Step 2 of SMOTE) and assign categories. The rest is basically the SMOTE algorithm.

Note: If your data has only categorical features, then you can use SMOTE-N. However, such a scenario should be rare.
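
In imbalanced-learn this corresponds to the SMOTENC class (and SMOTEN for the all-categorical case); you simply tell it which columns are categorical. The column indices below are placeholders.

# Sketch: SMOTE-NC via imbalanced-learn; categorical column positions are placeholders
from imblearn.over_sampling import SMOTENC

smote_nc = SMOTENC(categorical_features=[0, 3, 5], random_state=42)  # indices (or boolean mask) of categorical columns
X_resampled, y_resampled = smote_nc.fit_resample(X_train, y_train)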

Practical Example

Implementing these methods is relatively straightforward (although you’ll need to look at some finer details in the research papers listed at the end of this post). There is an excellent package called Imbalanced Learn that already implements all these methods, and it’s super easy to use. But if you have time, why not write your own implementation and see how it compares to the package?

So, coming to the example. We’ll use a telecom customer churn dataset since it is inherently imbalanced. This is a binary classification problem where a customer either leaves (i.e. churn) or stays.

We’ll compare the classifier performance for the following strategies:

  1. No oversampling
  2. Random Oversampling
  3. SMOTE
  4. BorderlineSMOTE
  5. SVM SMOTE
  6. K-Means SMOTE

Since this is an imbalanced problem, classifier accuracy is not the best metric for comparison. Instead, I am going to focus on recall on the minority class (churn). Recall lets us understand how good the classifier is at correctly identifying customers who churn. We can also focus on precision and F1 but that depends on the business requirements and what the focus is (minimizing false negatives or false positives or both). Read this post to know more about these metrics and get a basic intuition on when to choose what.

The entire notebook can be found here. I am only going to show the code relevant to oversampling below.

# Imports
from imblearn.over_sampling import (RandomOverSampler, SMOTE, BorderlineSMOTE,
                                    SVMSMOTE, KMeansSMOTE)

# Random Oversampling
random_os = RandomOverSampler(random_state=42)
X_random, y_random = random_os.fit_resample(X_train, y_train)

# SMOTE
smote_os = SMOTE(random_state=42)
X_smote, y_smote = smote_os.fit_resample(X_train, y_train)

# BorderlineSMOTE
smote_border = BorderlineSMOTE(random_state=42, kind='borderline-2')
X_smoteborder, y_smoteborder = smote_border.fit_resample(X_train, y_train)

# SVM SMOTE
smote_svm = SVMSMOTE(random_state=42)
X_smotesvm, y_smotesvm = smote_svm.fit_resample(X_train, y_train)

# K-Means SMOTE
smote_kmeans = KMeansSMOTE(random_state=42)
X_smotekmeans, y_smotekmeans = smote_kmeans.fit_resample(X_train, y_train)
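
For context, the evaluation boils down to something like the sketch below. The choice of classifier here is arbitrary, and X_test/y_test stand for the untouched test split; see the notebook for the actual details.

# Sketch: train on resampled data, evaluate recall for the churn class on the untouched test set
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

clf = RandomForestClassifier(random_state=42)
clf.fit(X_smoteborder, y_smoteborder)             # e.g. the BorderlineSMOTE-resampled training data
y_pred = clf.predict(X_test)                      # X_test/y_test: the held-out test split
print(recall_score(y_test, y_pred, pos_label=1))  # recall on the churn class (assumed encoded as 1)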

Exercise for the Readers:

The oversampling I’ve done brings the ratio between churn and non-churn to 1:1. But we can vary that, along with a lot of other things, as required, using the hyperparameters of the samplers (refer to these docs). You can play around with these hyperparameters and see how they alter the resultant sample.
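
For example, the sampling_strategy parameter controls the target ratio. The sketch below oversamples churners to only half the number of non-churners instead of matching them 1:1.

# Sketch: oversample churners to 50% of the non-churner count instead of a 1:1 ratio
from imblearn.over_sampling import SMOTE

smote_half = SMOTE(sampling_strategy=0.5, random_state=42)
X_half, y_half = smote_half.fit_resample(X_train, y_train)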

On completion, I got the following results:

Comparison of Oversampling on Recall

Conclusion:

1. Some form of oversampling is better than no oversampling for this problem.

2. BorderlineSMOTE beats the other methods by a good margin, while SMOTE, SVM SMOTE and Random Oversampling perform roughly the same. As I said before, Random Oversampling can lead to favourable results at times.

3. K-Means SMOTE gives the worst results of all the oversampling methods. However, you shouldn’t dismiss it for future use. Any method can outperform the others depending on the problem type and data distribution. Perhaps, in this case, the data we selected wasn’t well suited to K-Means.

Note: In the notebook, I’ve plotted the resultant data distribution from each of the compared methods. Have a look at that and try to form a visual intuition to explain why we get these results (especially for BorderlineSMOTE and K-Means SMOTE).

I hope this post helped you understand some basic and advanced methods of oversampling and how you can use them to tackle imbalanced data better. It is important to note that there are many other ways to tackle imbalanced data, such as undersampling (a.k.a. downsampling) and class weights. Good ML practices such as EDA, feature selection & engineering, model tuning, etc. also go a long way in solving these problems.

As I said previously, there is no silver bullet for the imbalance problem. If you understand the various ways you can tackle it, you’ll be able to make more informed decisions about what you can do and what you should do, saving time and ending up with a better-performing model.

That’s all folks, let’s end this post with a nice sampling meme (and lesson).

Image Generated using imgflip

Reach out to me on LinkedIn for any relevant questions or discussions related to this article.

Acknowledgement

This article was written on behalf of Intellify Australia — the company where I work as a Data Scientist.

At Intellify we use data, analytics and machine learning to help solve our clients’ most challenging business problems. Please visit our website to know more.

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, 16, 321–357, 2002.

[2] H. Han, W.-Y. Wang, B.-H. Mao, “Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning,” Advances in Intelligent Computing, 878–887, 2005.

[3] F. Last, G. Douzas, F. Bacao, “Oversampling for Imbalanced Learning Based on K-Means and SMOTE,” arXiv preprint, 2017.

[4] H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline Over-sampling for Imbalanced Data Classification,” International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), 4–21, 2009.
