Understanding Imbalanced Data: Effective Strategies for Handling Imbalanced Data

Anaskhan
2 min read · Jun 7, 2023


⁍ Imbalanced data is a problem in classification, both binary and multinomial.

⁍ As a rule of thumb, a dataset is considered imbalanced when the majority class exceeds roughly 70% of observations and the minority class falls below 30%. Model performance on the minority class will be severely affected.

⁍ With imbalanced data, the model's overall accuracy will look good, but sensitivity/recall, precision, and F1 score for the minority class will be poor. Even the AUC will be bad for a model trained on imbalanced data.

⁍ The classification report must be checked when working with imbalanced data, since it breaks these metrics down per class.
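Below is a minimal sketch of that check, assuming scikit-learn; the synthetic 95/5 dataset from make_classification and the logistic-regression model are illustrative stand-ins for real data:

# Illustrative only: a synthetic 95/5 imbalanced dataset showing how
# overall accuracy can mask poor minority-class metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, n_features=4,
                           weights=[0.95, 0.05],  # ~95% majority, ~5% minority
                           random_state=42)
baseline = LogisticRegression(max_iter=1000).fit(X, y)
print(baseline.score(X, y))                           # overall accuracy looks high
print(classification_report(y, baseline.predict(X)))  # minority recall/F1 are poor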

➼ Dealing with imbalanced data: sampling methods, namely oversampling and undersampling, must be used.

1) Random oversampling: observations from the minority class are duplicated.

2) Random undersampling: observations from the majority class are deleted.

Undersampling is used when the dataset is large, since enough observations remain after deleting majority-class rows.

Oversampling is used when the dataset is small and no observations can be spared.

[Figure source: researchgate.net]
# Using RandomUnderSampler: delete majority-class observations
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, RocCurveDisplay

rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)  # X, y: the original imbalanced data
logit_rus = LogisticRegression(max_iter=1000)
logit_rus_model = logit_rus.fit(X_rus, y_rus)
logit_rus_model.score(X_rus, y_rus)  # accuracy on the resampled data
logit_rus_pred = logit_rus_model.predict(X_rus)
print(classification_report(y_rus, logit_rus_pred))
# for a smoother ROC curve, pass predict_proba scores instead of hard labels
RocCurveDisplay.from_predictions(y_rus, logit_rus_pred)

# Using RandomOverSampler: duplicate minority-class observations
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
logit_ros = LogisticRegression(max_iter=1000)
logit_ros_model = logit_ros.fit(X_ros, y_ros)
logit_ros_model.score(X_ros, y_ros)
logit_ros_pred = logit_ros_model.predict(X_ros)
print(classification_report(y_ros, logit_ros_pred))
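To verify that the resamplers actually balanced the classes, the label counts before and after can be compared (a small usage sketch; pandas is assumed to be available):

import pandas as pd

print(pd.Series(y).value_counts())      # original imbalanced counts
print(pd.Series(y_rus).value_counts())  # undersampled: both classes at the minority size
print(pd.Series(y_ros).value_counts())  # oversampled: both classes at the majority size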

¤ SMOTE — Synthetic Minority Oversampling Technique

  • SMOTE uses the KNN algorithm (Euclidean distance) and creates artificial or synthetic data points that lie within the range of the existing minority-class observations.
  • No outliers are created, since each synthetic point falls between a real point and one of its neighbours.
  • SMOTE also uses a random number generator to draw interpolation weights between 0 and 1.

Ex: Two independent variables, X1 = Income and X2 = Age (the same arithmetic is repeated in the code sketch after this list):

  • X1: 2400, 2500, 2700, 2300, 2100, 2440 (the last value is the synthetic point computed below)
  • X2: 46, 34, 45, 28, 25, 41 (likewise, the last value is synthetic)
  • Choose a random weight between 0 and 1; say 0.60 is selected.
  • 2500 + 0.60 * (2400 - 2500) = 2440 (synthetic data point)
  • 34 + 0.60 * (46 - 34) = 41.20, or about 41 (synthetic data point)
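The interpolation rule is synthetic = x + weight * (neighbour - x); here it is written out as a quick sketch (variable names are illustrative):

# SMOTE-style interpolation between the point (2500, 34) and its
# nearest neighbour (2400, 46), using the randomly drawn weight 0.60
weight = 0.60
income, income_nn = 2500, 2400
age, age_nn = 34, 46
synthetic_income = income + weight * (income_nn - income)  # 2440.0
synthetic_age = age + weight * (age_nn - age)              # 41.2
print(synthetic_income, synthetic_age)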

SMOTE: works only on numeric data.

SMOTENC: for mixed data with both numeric and categorical features.

SMOTEN: only for non-numeric (categorical) data.
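For reference, the numeric and mixed-type variants are used the same way as SMOTEN below; a hedged sketch, in which the categorical column indices passed to SMOTENC are purely illustrative:

from imblearn.over_sampling import SMOTE, SMOTENC

# Numeric features only
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X, y)

# Mixed features: SMOTENC needs the indices of the categorical columns
smnc = SMOTENC(categorical_features=[0, 2], random_state=42)  # indices assumed
X_smnc, y_smnc = smnc.fit_resample(X, y)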

# Using SMOTEN: the SMOTE variant for categorical features
from imblearn.over_sampling import SMOTEN
import pandas as pd

smote = SMOTEN(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)  # assumes X holds (encoded) categorical features
pd.DataFrame(y_smote).value_counts()  # classes are now balanced
logit_smote = LogisticRegression(max_iter=1000)
logit_smote_model = logit_smote.fit(X_smote, y_smote)
logit_smote_model.score(X_smote, y_smote)
logit_smote_pred = logit_smote_model.predict(X_smote)
print(classification_report(y_smote, logit_smote_pred))

¤ Undersampling technique: Tomek links are pairs of observations from opposite classes that lie in close vicinity (each is the other's nearest neighbour).

  • In this algorithm, the majority-class observation of each Tomek link is deleted, which cleans up the class boundary and gives the classifier a better decision surface.
[Figure source: mlwhiz.com]
# Using TomekLinks: remove the majority-class point from each Tomek link
from imblearn.under_sampling import TomekLinks
import pandas as pd

tomek = TomekLinks(sampling_strategy="majority")
X_tomek, y_tomek = tomek.fit_resample(X, y)
pd.DataFrame(y_tomek).value_counts()  # only boundary points are removed, so counts stay close
logit_tomek = LogisticRegression(max_iter=3000)
logit_tomek_model = logit_tomek.fit(X_tomek, y_tomek)
logit_tomek_model.score(X_tomek, y_tomek)
logit_tomek_pred = logit_tomek_model.predict(X_tomek)
print(classification_report(y_tomek, logit_tomek_pred))
