How to Handle an Imbalanced Dataset?

r.aruna devi · Published in Analytics Vidhya · May 27, 2020

Before we proceed to the topic, let's define the terms. In a balanced dataset, the target classes A and B appear in roughly a 50:50 or 60:40 ratio.

When class A and class B appear in an 80:20 or 90:10 ratio, the dataset is considered imbalanced. If we train on such a dataset, the model becomes biased toward the majority class, which leads to overfitting.

To avoid this situation, we sample the dataset.
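A quick way to spot imbalance is to look at the target distribution. A minimal sketch, assuming a dataframe df with a target column named Target (the column name is just an example):

import pandas as pd

# Hypothetical example: 4 records of class 1, 1 record of class 0.
df = pd.DataFrame({'Target': [1, 1, 1, 1, 0]})

# Proportions make the skew obvious: here 1 is 80% and 0 is 20%.
print(df['Target'].value_counts(normalize=True))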

1. What is Sampling?

Sampling means increasing the minority class records or deleting majority class records in order to make the dataset balanced.

Sampling can be applied to both binary and multiclass classification problems.

2. What are the techniques in Sampling?

  • Oversampling
  • Undersampling
  • Combining oversampling and undersampling
  • Adding class weights

OVERSAMPLING:

  • RandomOverSampler : It duplicates the minority class records.
  • SMOTE : It synthesizes new minority class records.

RandomOverSampler :

  • It randomly duplicates records from the minority class.
  • Say the target has classes 1 and 0 with counts {1: 5, 0: 2}. After applying RandomOverSampler, the minority class 0 is duplicated until the counts become {1: 5, 0: 5}.
  • To understand this, let's take a dataframe with a few records and a target of 0/1.
import pandas as pd 
data = [['tom',10,1], ['nick',11, 1], ['juli',10, 0], ['tommy',11, 1], ['hilton',11, 1], ['Mark',10, 0], ['Ani',11, 1]]
df = pd.DataFrame(data, columns = ['Name','Class','Target'])
df
juli and Mark have target 0. Now 1 occurs 5 times and 0 occurs 2 times.

After applying RandomOverSampler:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

X = df.iloc[:, 0:2].values
y = df.iloc[:, -1].values
ros = RandomOverSampler(sampling_strategy='minority')
X_new, y_new = ros.fit_resample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))
********************** OUTPUT *****************
Original dataset shape Counter({1: 5, 0: 2})
Resampled dataset shape Counter({0: 5, 1: 5})

Class 0 now has 5 records after sampling.

  • RandomOverSampler(sampling_strategy='minority') resamples only the minority class. sampling_strategy can also be a float between 0 and 1: it is the desired ratio of minority to majority records after resampling, so 0.5 means the minority class is oversampled until it has half as many records as the majority class, and 0.8 means 80% as many.
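For example, on the toy dataframe above (counts {1: 5, 0: 2}), a sketch with sampling_strategy=0.8 grows the minority class to int(0.8 × 5) = 4 records:

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# Target a minority/majority ratio of 0.8 after resampling.
ros = RandomOverSampler(sampling_strategy=0.8, random_state=42)
X_new, y_new = ros.fit_resample(X, y)
print(Counter(y_new))  # expected: Counter({1: 5, 0: 4})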

SMOTE : Synthetic Minority Oversampling Technique

  • It synthesizes new examples from the minority class rather than duplicating records.
  • SMOTE picks a minority class record, finds its k nearest minority neighbors in feature space, draws a line between the record and a neighbor, and creates new points along that line.
(Figure: the red circles are the original points, the dotted lines connect the nearest neighbors, and the green circles are the new data points created along the dotted lines.)
from imblearn.over_sampling import SMOTE

# X_data must be numeric: SMOTE interpolates between points in feature space.
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_data, Y_data)
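Note that the toy dataframe above would not work here: SMOTE needs numeric features, and its default k_neighbors=5 needs at least 6 minority records. A runnable sketch on synthetic numeric data instead:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# 100 numeric samples with a roughly 90:10 class split.
X_num, y_num = make_classification(n_samples=100, n_features=4,
                                   weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y_num))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_num, y_num)
print('After :', Counter(y_res))  # both classes now match the majority count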

RandomUnderSampler:

  • It deletes records from the majority class until the count matches the minority class.
  • The limitation is that most of the data gets deleted. The deleted records might hold useful insights or different patterns, so by deleting them we can lose important information.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

X = df.iloc[:, 0:2].values
y = df.iloc[:, -1].values
rus = RandomUnderSampler(sampling_strategy='majority')
X_new, y_new = rus.fit_resample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))
************ OUTPUT ***************************
Original dataset shape Counter({1: 5, 0: 2})
Resampled dataset shape Counter({0: 2, 1: 2})

Combining OverSampling and UnderSampling :

  • It is often better to combine oversampling and undersampling.
  • First oversample the minority class part of the way, then undersample the majority class by 20 or 30%.
  • By doing so, we do not lose most of the majority datapoints; we lose only 20 or 30% of them. The two steps can also be chained in a pipeline, as shown after the snippet below.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Assumes a heavily imbalanced X, y (minority well under 10% of the majority).
# Oversample the minority class up to 10% of the majority class size.
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
# Undersample the majority class until the minority is 50% of its size.
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
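The same two steps can be chained with imblearn's Pipeline, which exposes a single fit_resample (a sketch using the same 0.1 and 0.5 ratios as above):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

resample = Pipeline(steps=[
    ('over', RandomOverSampler(sampling_strategy=0.1)),
    ('under', RandomUnderSampler(sampling_strategy=0.5)),
])
X_new, y_new = resample.fit_resample(X, y)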

Adding class weights :

Instead of adding or deleting datapoints, we can give more weight to the minority class labels.

We can set the weights explicitly, or simply specify class_weight='balanced'.

import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
print(np.unique(y), class_weights)
[0 1] [1.75 0.7 ]

This assigns weight 1.75 to class 0 and 0.7 to class 1: the 'balanced' heuristic is n_samples / (n_classes * class_count), so the rarer class gets the larger weight.
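One way to pass these computed weights to a model (a small sketch) is to zip them into a dict:

import numpy as np

# Map each class label to its computed weight, e.g. {0: 1.75, 1: 0.7}.
weight_dict = dict(zip(np.unique(y), class_weights))
print(weight_dict)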

Example : RandomForestClassifier

RandomForestClassifier(n_estimators=50, class_weight={0: 1.75, 1: 0.7})

(or)

RandomForestClassifier(n_estimators=50, class_weight='balanced')
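As a usage sketch on the toy dataframe above (Class is its only numeric feature, so this is purely illustrative):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50, class_weight='balanced')
clf.fit(df[['Class']], df['Target'])  # fit on the numeric column only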
