Addressing imbalanced datasets with SMOTE
If you do not know how to ask the right question, you discover nothing. — W. Edwards Deming
Many real-life applications involve datasets with an uneven distribution of samples across the target labels. Because the minority class is usually the more important one, and by definition underrepresented, this article introduces a data-driven method, the Synthetic Minority Oversampling Technique (SMOTE), for handling minority groups. It is one of the most popular oversampling techniques for addressing imbalanced datasets.
Approaches for handling imbalance
Data level methods
Data-level methods mitigate the effect of imbalance by changing the data itself. There are several sub-approaches under this category.
Oversampling
Oversampling increases the number of minority instances until they match the majority instances. This can be done with the following approaches.
Random oversampling
Randomly duplicating existing minority instances to increase their count.
Synthetic minority oversampling technique (SMOTE)
This technique creates synthetic samples for each minority sample by interpolating between it and its nearest minority-class neighbours. The required amount of oversampling determines how many synthetic samples are drawn from those neighbours: the fewer the minority samples, the more synthetic samples are generated per sample. The following steps summarize the process of creating a synthetic sample:
- Take a minority feature vector and find its nearest minority neighbours
- Pick one neighbour and calculate the difference between the two vectors
- Multiply the difference by a random number in the range [0,1]
- Add the result to the original feature vector
- The resulting point is the synthetic sample
- Repeat for all minority vectors
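The steps above can be sketched in a few lines of NumPy and scikit-learn. This is a minimal illustration of the interpolation formula only, not a full SMOTE implementation; the toy data, the value of k, and the random seed are assumptions for demonstration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# toy minority-class feature vectors (hypothetical data)
X_min = rng.normal(size=(20, 2))

# find the k nearest minority neighbours of each minority sample
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)  # idx[:, 0] is the sample itself

synthetic = []
for i in range(len(X_min)):
    j = rng.choice(idx[i, 1:])               # pick one of the k neighbours
    diff = X_min[j] - X_min[i]               # difference between the two vectors
    gap = rng.random()                       # random number in [0, 1]
    synthetic.append(X_min[i] + gap * diff)  # point on the connecting segment

synthetic = np.array(synthetic)
print(synthetic.shape)  # (20, 2): one synthetic point per minority sample
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbours, which is why SMOTE stays inside the region already occupied by the minority class.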
We will use SMOTE to handle the imbalance in the benchmark wine-quality dataset, which is used to predict whether a wine is good or bad based on its features.
Let’s import all the required libraries first.
# importing required libraries
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from imblearn.over_sampling import SMOTE
Now, we will understand the distribution of data into two classes — good and bad.
df = pd.read_csv('winequality-red.csv')

# convert the last column quality into categorical values [good, bad]
bins = (2, 6.5, 8)
labels = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=labels)

# map the categories to binary values for easier calculations
df['quality'] = df['quality'].map({'bad': 0, 'good': 1})

# split the dataframe into feature and target vectors
# X holds all the columns except quality
# y holds the target variable 'quality'
X = np.asarray(df.iloc[:, :-1])
y = np.asarray(df['quality'])
# count the instances per class and visualise them
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
As we can see from the graph, there is a huge difference in the number of instances of the two classes.
Let’s now oversample the data with SMOTE. The imbalanced-learn library is widely used in Python for handling imbalanced data with SMOTE.
# create an instance of the SMOTE class
oversample = SMOTE()

# oversample the data with the fit_resample() function
# fit_resample(): takes the feature matrix X and target vector y
# and returns the resampled versions of both
X, y = oversample.fit_resample(X, y)
Let’s visualize the distribution again.
# recount the instances per class and visualise them
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Borderline-SMOTE
The Borderline-SMOTE method applies oversampling only to the minority instances near the class border instead of to all minority instances. A subset of these borderline samples, called the DANGER set, is created, and SMOTE is then applied to each instance of the DANGER set.
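A minimal sketch of how the DANGER set can be identified, assuming the usual rule that a minority sample is borderline when at least half, but not all, of its m nearest neighbours belong to the majority class. The toy data, m, and the seed are illustrative assumptions, not part of the wine example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# toy data: 100 majority (label 0) and 15 minority (label 1) samples
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(1, 1, size=(15, 2))])
y = np.array([0] * 100 + [1] * 15)

m = 5  # neighbours examined over the whole dataset
nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
_, idx = nn.kneighbors(X[y == 1])  # neighbours of each minority sample

danger = []
for i, neigh in enumerate(idx):
    # count majority-class neighbours (position 0 is the sample itself)
    n_maj = np.sum(y[neigh[1:]] == 0)
    # borderline: at least half, but not all, neighbours are majority
    if m / 2 <= n_maj < m:
        danger.append(i)

print(danger)  # indices of minority samples flagged as borderline
```

Samples whose neighbours are all majority are treated as noise and skipped, which is why the condition excludes n_maj equal to m.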
Borderline-SMOTE is available through the BorderlineSMOTE() class in the imbalanced-learn library.
from imblearn.over_sampling import BorderlineSMOTE

# create an instance of the BorderlineSMOTE class
oversample = BorderlineSMOTE()

# oversample the data with the fit_resample() function
X, y = oversample.fit_resample(X, y)

# count the instances per class and visualise them
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN)
ADASYN uses a density distribution to decide how many synthetic instances to generate for each minority instance: minority samples that are harder to learn, because they are surrounded by more majority neighbours, receive more synthetic samples.
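The density weighting can be sketched as follows: for each minority sample, compute the fraction of majority samples among its k nearest neighbours, normalize these fractions into a distribution, and allocate the total number of synthetic samples proportionally. The toy data, k, and the total G are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# toy data: 100 majority (label 0) and 20 minority (label 1) samples
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(1.5, 1, size=(20, 2))])
y = np.array([0] * 100 + [1] * 20)

k = 5
G = 80  # synthetic samples needed to balance the classes (100 - 20)

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[y == 1])

# r: fraction of majority-class neighbours around each minority sample
r = np.array([np.mean(y[neigh[1:]] == 0) for neigh in idx])
r_hat = r / r.sum()          # normalise into a density distribution
g = np.round(r_hat * G)      # synthetic samples per minority point

print(g)  # harder-to-learn samples get larger shares of G
```

Once the per-sample counts are known, each synthetic point is generated by the same neighbour-interpolation step as in plain SMOTE.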
The ADASYN() class in imbalanced-learn provides a direct implementation in Python for oversampling the minority class.
from imblearn.over_sampling import ADASYN

# create an instance of the ADASYN class
oversample = ADASYN()

# oversample the data with the fit_resample() function
X, y = oversample.fit_resample(X, y)

# count the instances per class and visualise them
counter = Counter(y)
for label, _ in counter.items():
    rows = np.where(y == label)[0]
    plt.scatter(X[rows, 0], X[rows, 1], label=str(label))
plt.legend()
plt.show()
Takeaways
In this article we discussed how SMOTE and its variants improve recognition rates for samples belonging to minority groups, as observed in datasets with underrepresented samples: malicious attacks in computer security, life-threatening diseases in medical datasets, or suspicious behaviour in social networks and monitoring systems. The entire code for the oversampling techniques can be found here.
Do you have any questions?
Kindly ask your questions via email or comments and we will be happy to answer.