Addressing imbalanced datasets with SMOTE


If you do not know how to ask the right question, you discover nothing. — W. Edwards Deming

There are many real-life applications in which we encounter datasets with an uneven distribution of samples across target labels. Since the minority class is usually the more important and underrepresented one, this article introduces a data-driven method, the Synthetic Minority Oversampling Technique (SMOTE), for handling minority groups. It is one of the most popular oversampling techniques for addressing imbalanced datasets.

Approaches for handling imbalance

Data level methods

Data-level methods reduce the effect of imbalance by modifying the training data itself. There are several sub-approaches under this technique.

Oversampling

Oversampling increases the number of minority instances until they match the number of majority instances. This can be done with the following approaches:

Random oversampling

Randomly duplicating existing minority instances until the classes are balanced.

Synthetic minority oversampling technique (SMOTE)

This technique creates synthetic samples for each minority sample by interpolating between it and its nearest neighbours. The number of synthetic samples generated depends on the size of the minority class: the fewer the minority samples, the more synthetic samples are generated from their nearest neighbours. The following steps summarize the process of generating a synthetic sample:

  1. Select a minority sample (feature vector) and find its k nearest neighbours
  2. Compute the difference between the sample and one of its neighbours
  3. Multiply the difference by a random number in the range [0,1]
  4. Add the result to the original sample
  5. The resulting point is the new synthetic sample
  6. Repeat for all minority feature vectors
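The steps above can be sketched in plain NumPy. This is an illustrative toy implementation of the interpolation at the heart of SMOTE, not the imbalanced-learn internals; the names (`minority`, `smote_one`) are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(42)

# a handful of 2-D minority samples
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_one(samples, i, k=2):
    """Generate one synthetic point for samples[i] from its k nearest neighbours."""
    # step 1: find the k nearest neighbours of samples[i] (brute force)
    dists = np.linalg.norm(samples - samples[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]      # skip the point itself
    nn = samples[rng.choice(neighbours)]         # pick one neighbour at random
    # steps 2-4: difference, scale by a random factor in [0, 1], add back
    gap = rng.random()
    return samples[i] + gap * (nn - samples[i])

synthetic = smote_one(minority, 0)
# the new point lies on the segment between samples[0] and one neighbour
```

Because the synthetic point is a convex combination of two real minority samples, it always falls on the line segment between them rather than being a mere copy.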

We will use SMOTE to handle the imbalance in a benchmark dataset, the red wine-quality dataset. The dataset is used to predict whether a wine is of good or bad quality based on its features.

Let’s import all the required libraries first.

# importing required libraries
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE

Now, we will understand the distribution of data into two classes — good and bad.

# load the dataset
df = pd.read_csv('winequality-red.csv')
# convert the last column 'quality' into categorical values [good, bad]
bins = (2, 6.5, 8)
labels = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=labels)
# map the categories to binary values for easy calculations
df['quality'] = df['quality'].map({'bad': 0, 'good': 1})
# split the dataframe into feature and target vectors:
# X holds all the columns except 'quality'
# y holds the target variable 'quality'
X = np.asarray(df.iloc[:, :-1])
y = np.asarray(df['quality'])
# count the instances per class and visualise them
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Scatter plot showing the distribution of the feature ‘quality’ in the red wine-quality dataset with binary classes [good, bad]. Blue represents the bad quality while orange represents the good quality

As we can see from the graph, there is a huge difference in the number of instances of the two classes.

Let’s now oversample the data with SMOTE. The imbalanced-learn library is widely used in Python for handling imbalanced data with SMOTE.

# create an instance of the SMOTE class
oversample = SMOTE()
# oversample the data with the fit_resample() function
# fit_resample(): takes the feature matrix and target vector and returns their oversampled versions
X, y = oversample.fit_resample(X, y)

Let’s visualize the distribution again.

# recompute the class counts and visualise the classes
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Scatter plot of the data after oversampling with SMOTE

Borderline-SMOTE

The Borderline-SMOTE method applies oversampling only to minority instances near the class border instead of applying it to all minority instances. A subset of these borderline minority samples, called the DANGER set, is created, and SMOTE is then applied to each instance in the DANGER set.
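To make the DANGER set concrete, here is a hedged sketch of one common selection rule (following the original Borderline-SMOTE idea, not the imbalanced-learn internals): a minority sample is borderline when at least half, but not all, of its m nearest neighbours belong to the majority class. The helper name `danger_mask` is hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_mask(X, y, minority_label=1, m=5):
    """Flag minority samples whose neighbourhoods are dominated, but not
    fully occupied, by the majority class."""
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    # drop each query point itself (column 0), count majority neighbours
    maj = (y[idx[:, 1:]] != minority_label).sum(axis=1)
    return (maj >= m / 2) & (maj < m)

# toy data: two overlapping clusters, 40 majority vs 10 minority samples
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
mask = danger_mask(X, y)
print(mask.sum(), "of 10 minority samples are borderline")
```

Samples with only majority neighbours (maj == m) are treated as noise and excluded, which is why the upper bound is strict.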

Borderline-SMOTE is achieved by using the BorderlineSMOTE() class in the imbalanced-learn library.

# create an instance of the BorderlineSMOTE class
from imblearn.over_sampling import BorderlineSMOTE
oversample = BorderlineSMOTE()
# oversample the data with the fit_resample() function
# fit_resample(): takes the feature matrix and target vector and returns their oversampled versions
X, y = oversample.fit_resample(X, y)
counter = Counter(y)
# visualise the classes
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Scatter plot to visualize the distribution of classes after applying Borderline-SMOTE

Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN)

ADASYN uses a density distribution to decide how many synthetic instances to generate for each minority instance: minority samples that are harder to learn (those surrounded by more majority-class neighbours) receive proportionally more synthetic samples.
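The weighting rule can be sketched as follows. For each minority sample i, compute the ratio r_i of majority-class points among its k nearest neighbours, then normalise the r_i so they sum to one; each minority sample's share of the synthetic points is its normalised r_i. This is an illustrative toy version (the helper name `adasyn_shares` is made up), not the imbalanced-learn internals:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_shares(X, y, minority_label=1, k=5):
    """Fraction of the synthetic samples that each minority point receives."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    # r_i: ratio of majority neighbours around minority sample i
    r = (y[idx[:, 1:]] != minority_label).sum(axis=1) / k
    total = r.sum()
    if total == 0:                      # every minority point is deep inside its class
        return np.full(len(r), 1 / len(r))
    return r / total

# toy data: overlapping clusters, 40 majority vs 10 minority samples
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(1, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
shares = adasyn_shares(X, y)
# shares sums to 1; points with more majority neighbours get larger shares
```

This is the sense in which ADASYN is adaptive: the generation effort concentrates on the minority samples closest to the decision boundary.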

The ADASYN() class in imbalanced-learn provides a direct Python implementation for oversampling the minority class.

# create an instance of the ADASYN class
from imblearn.over_sampling import ADASYN
oversample = ADASYN()
# oversample the data with the fit_resample() function
# fit_resample(): takes the feature matrix and target vector and returns their oversampled versions
X, y = oversample.fit_resample(X, y)
counter = Counter(y)
# visualise the classes
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Scatter plot to visualize the distribution of classes after applying ADASYN

Takeaways

Through this article we discussed how SMOTE is a useful method for improving recognition rates for samples belonging to minority groups, as observed in datasets with underrepresented samples: malicious attacks in computer security, life-threatening diseases in medical datasets, or suspicious behaviour in social networks and monitoring systems. The entire code for the oversampling techniques can be found here.

Do you have any questions?

Kindly ask your questions via email or comments and we will be happy to answer.


Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS| Lero| UL, Ireland), Gauri Vaidya (Intern|BDS). Each concept is accompanied by sample datasets and Python code.