How to Handle an Imbalanced Dataset?
Before we proceed to the topic: a balanced dataset means the target column's class A and class B are in roughly a 50:50 or 60:40 ratio.
When class A and class B are in an 80:20 or 90:10 ratio, the dataset is considered imbalanced. With such a dataset, the model gets biased toward the majority class and largely ignores the minority class.
To avoid this situation, we resample the dataset.
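To see why this bias matters, here is a small sketch (the 90:10 split is made up): a classifier that always predicts the majority class scores 90% accuracy while never detecting the minority class.

```python
from collections import Counter

# Hypothetical 90:10 imbalanced target column.
y = [1] * 90 + [0] * 10

# A "model" that ignores the features and always predicts
# the majority class still looks impressive on accuracy.
majority_class = Counter(y).most_common(1)[0][0]
predictions = [majority_class] * len(y)

accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)  # 0.9 -- yet class 0 is never predicted
```

This is why accuracy alone is a misleading metric on imbalanced data.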
1. What is Sampling?
Sampling means increasing the minority class records or deleting majority class records in order to make the dataset a balanced dataset.
Sampling can be applied to binary or multiclass classification problems.
2. What are the techniques in Sampling ?
- Oversample
- Undersample
- Combining Oversample and Undersample
- Adding class weights
OVERSAMPLING:
- RandomOverSampler : It duplicates minority class records.
- SMOTE : It synthesizes new minority class records.
RandomOverSampler :
- It randomly duplicates records from the minority class.
- Say the target has 1 and 0 with counts {1: 5, 0: 2}. After applying RandomOverSampler, the minority class 0 is duplicated until the counts become {1: 5, 0: 5}.
- To understand, let's take a dataframe that has a few records with target 0/1.
import pandas as pd
data = [['tom',10,1], ['nick',11, 1], ['juli',10, 0], ['tommy',11, 1], ['hilton',11, 1], ['Mark',10, 0], ['Ani',11, 1]]
df = pd.DataFrame(data, columns = ['Name','Class','Target'])
df
After applying RandomOverSampler:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

X = df.iloc[:, 0:2].values
y = df.iloc[:, -1].values

ros = RandomOverSampler(sampling_strategy='minority')
X_new, y_new = ros.fit_resample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))
********************** OUTPUT **********************
Original dataset shape Counter({1: 5, 0: 2})
Resampled dataset shape Counter({0: 5, 1: 5})
Class 0 has 5 records now after sampling. Let's look at the dataframe.
- RandomOverSampler(sampling_strategy='minority') duplicates the minority class until it matches the majority class. You can also pass a float between 0 and 1: sampling_strategy=0.5 means the minority class is resampled until it has 50% as many records as the majority class; 0.8 means 80% as many.
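A quick sketch of that ratio arithmetic (the 90:10 counts are made up): with a float sampling_strategy, the minority class is grown until minority / majority equals the requested ratio.

```python
# Hypothetical class counts: 90 majority, 10 minority.
majority, minority = 90, 10

def resampled_minority_count(sampling_strategy):
    """Target minority count so that minority / majority == sampling_strategy."""
    return max(minority, int(majority * sampling_strategy))

print(resampled_minority_count(0.5))  # 45 -> minority grows from 10 to 45
print(resampled_minority_count(0.8))  # 72
print(resampled_minority_count(1.0))  # 90 -> fully balanced
```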
SMOTE : Synthetic Minority Oversampling Technique
- It synthesizes new examples from the minority class rather than duplicating records.
- For each minority sample, SMOTE finds its k nearest minority neighbors in feature space, draws a line to a randomly chosen neighbor, and creates a new synthetic point along that line.
from imblearn.over_sampling import SMOTE

# SMOTE interpolates in feature space, so the features must be numeric
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_data, Y_data)
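The interpolation idea can be sketched without imblearn (a toy version with made-up points, not the library's implementation): pick a minority sample, find its nearest minority neighbor, and place a new point a random fraction of the way between them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_sample(points, rng):
    """Create one synthetic point between a sample and its nearest neighbor."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from `base` to every minority point; exclude the point itself.
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf
    neighbor = points[np.argmin(dists)]
    # New point at a random position along the line base -> neighbor.
    return base + rng.random() * (neighbor - base)

new_point = smote_like_sample(minority, rng)
print(new_point)  # lies on the segment between two existing minority points
```

Because each synthetic point sits between two real minority points, SMOTE fills in the minority region of feature space instead of just stacking duplicates.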
RandomUnderSampler:
- It deletes records from the majority class until it matches the minority class.
- The limitation is that most of the data gets deleted. The deleted records might hold useful insights or different patterns, so by deleting we can lose important information.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

X = df.iloc[:, 0:2].values
y = df.iloc[:, -1].values

rus = RandomUnderSampler(sampling_strategy='majority')
X_new, y_new = rus.fit_resample(X, y)
print('Original dataset shape {}'.format(Counter(y)))
print('Resampled dataset shape {}'.format(Counter(y_new)))
********************** OUTPUT **********************
Original dataset shape Counter({1: 5, 0: 2})
Resampled dataset shape Counter({0: 2, 1: 2})
Combining OverSampling and UnderSampling :
- It's often better to combine oversampling and undersampling.
- First apply oversampling to grow the minority class toward the majority class, then apply undersampling to shrink the majority class partway toward the minority class.
- By doing so, we don't delete most of the majority datapoints; only a modest fraction is removed.
# Perform Over Sampling
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)

# Perform Under Sampling
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
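A sketch of the count arithmetic under those settings (10,000 majority vs. 100 minority records are made-up numbers): oversampling to a 0.1 ratio grows the minority class, then undersampling to a 0.5 ratio shrinks the majority class.

```python
majority, minority = 10_000, 100

# Step 1: oversample the minority until minority / majority == 0.1
minority = max(minority, int(majority * 0.1))
print(minority, majority)  # 1000 10000

# Step 2: undersample the majority until minority / majority == 0.5
majority = min(majority, int(minority / 0.5))
print(minority, majority)  # 1000 2000
```

We end up dropping 8,000 majority rows instead of the 9,900 that pure undersampling to a 1:1 ratio would delete.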
Adding class weights :
Instead of adding or deleting datapoints, we can give more weight to the minority class labels.
We can set the weights explicitly, or simply specify class_weight='balanced'.
import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
print(np.unique(y), class_weights)
********************** OUTPUT **********************
[0 1] [1.75 0.7 ]
This assigns weight 1.75 to class 0 and 0.7 to class 1.
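Those numbers come from scikit-learn's 'balanced' formula, n_samples / (n_classes * count_of_class). A quick check against the {1: 5, 0: 2} counts from the dataframe above:

```python
import numpy as np

y = np.array([1, 1, 0, 1, 1, 0, 1])  # the 5-vs-2 target from the dataframe

n_samples = len(y)
classes, counts = np.unique(y, return_counts=True)
n_classes = len(classes)

# 'balanced' weight for each class: n_samples / (n_classes * count)
weights = n_samples / (n_classes * counts)
print(classes, weights)  # [0 1] [1.75 0.7 ]
```

The rarer class 0 gets the larger weight, so misclassifying it costs the model more during training.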
Example : RandomForestClassifier
RandomForestClassifier(n_estimators=50, class_weight={0: 0.5, 1: 1})
(or)
RandomForestClassifier(n_estimators=50, class_weight='balanced')