How to Deal With Class Imbalance in a Dataset
Simple techniques for dealing with imbalanced data
What is class imbalance
Class imbalance is an unequal distribution of classes in a machine learning dataset, where one class has many more examples than the others. It can result from biased or undiverse sampling, for example when all of the data is collected from a single geographical area. It can also come from the nature of the problem: in fraud detection, fraudulent transactions make up only a small share of all transactions, so the fraud class is naturally rare. Imbalance also appears when the dataset contains minority groups, such as a race, ethnicity, or tribe, that are underrepresented.
Let us start by loading our dataset and printing the value counts of the Sex column.
import pandas as pd
df = pd.read_csv('titanic_train.csv')
df.Sex.value_counts()
Output [1]:
male 577
female 314
Name: Sex, dtype: int64
In the output above, the male class has almost twice as many rows as the female class (577 versus 314), so the Sex column is imbalanced.
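Raw counts can be easier to compare as proportions. The following is a minimal sketch that rebuilds a stand-in Sex column from the counts printed above (the Series itself is an assumption, since the CSV file is not included here) and computes each class's share and a simple majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical stand-in for df.Sex, rebuilt from the counts printed above
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()                      # rows per class
proportions = sex.value_counts(normalize=True)   # share of each class
imbalance_ratio = counts.max() / counts.min()    # majority count / minority count

print(proportions.round(3))
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 would indicate balanced classes; the further it is from 1, the stronger the imbalance.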
Dealing with class imbalance in a dataset
1- Over-sampling the minority class
2- Under-sampling the majority class
Over-sampling the minority class
In over-sampling, we add rows to the minority class until it matches the size of the majority class. With resample, the extra rows are copies of existing minority rows drawn at random with replacement, not newly synthesized data. This approach is recommended when there is little data to work with.
In [ 1]:
from sklearn.utils import resample
Male=df[df.Sex=='male']
Female=df[df.Sex=='female']
female_upsampled = resample(Female,
                            replace=True,          # sample with replacement
                            n_samples=len(Male),   # match the size of the majority class
                            random_state=20)       # make the result reproducible
df = pd.concat([Male, female_upsampled])
Let us now look at the Sex column again.
df.Sex.value_counts()
Output [2]:
female 577
male 577
Name: Sex, dtype: int64
Now the female and male classes have the same number of rows.
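It is worth checking what resample actually adds to the minority class. The sketch below uses a small hypothetical toy DataFrame (the PassengerId values and the 7-to-3 split are made up for illustration) and suggests that sampling with replacement repeats existing minority rows rather than creating new ones.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data standing in for the Titanic training set
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Sex": ["male"] * 7 + ["female"] * 3,
})

majority = toy[toy.Sex == "male"]
minority = toy[toy.Sex == "female"]

# Sampling with replacement draws the same minority rows more than once
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=20)
balanced = pd.concat([majority, minority_upsampled])

print(balanced.Sex.value_counts())
# At most the 3 original minority IDs can appear among the 7 upsampled rows
print(minority_upsampled.PassengerId.nunique(), "distinct IDs in",
      len(minority_upsampled), "upsampled rows")
```

Because the added rows are duplicates, a model may see the same minority example several times during training.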
Under-sampling the majority class
In this sampling method, we remove rows from the majority class until it matches the size of the minority class. Because rows are discarded, this can hurt model performance when there is little data to work with.
male_downsampled = resample(Male,
                            replace=False,           # sample without replacement so no row is repeated
                            n_samples=len(Female),   # match the size of the minority class
                            random_state=20)         # make the result reproducible
df = pd.concat([Female, male_downsampled])
Let us look at the Sex column once more.
df.Sex.value_counts()
Output [3]:
female 314
male 314
Name: Sex, dtype: int64
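The cost of under-sampling is the data that gets thrown away. As a minimal sketch on a hypothetical toy DataFrame (the PassengerId values and the 7-to-3 split are made up for illustration), the code below draws from the majority class without replacement, so every kept row is unique, and counts how many rows are dropped.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data standing in for the Titanic training set
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Sex": ["male"] * 7 + ["female"] * 3,
})

majority = toy[toy.Sex == "male"]
minority = toy[toy.Sex == "female"]

# replace=False keeps a random subset of the majority rows, each at most once
majority_downsampled = resample(majority,
                                replace=False,
                                n_samples=len(minority),
                                random_state=20)
balanced = pd.concat([minority, majority_downsampled])

dropped = len(toy) - len(balanced)   # majority rows discarded by under-sampling
print(balanced.Sex.value_counts())
print("rows dropped:", dropped)
```

On the Titanic counts shown earlier, the same arithmetic gives 577 - 314 = 263 male rows discarded.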
Conclusion
Over-sampling and under-sampling are two common and effective techniques for dealing with imbalanced data, but other techniques can also be used, such as the synthetic minority oversampling technique (SMOTE).
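To give a feel for how SMOTE differs from plain duplication, here is a simplified sketch of its core idea: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The function name smote_like_samples, the toy array X_min, and the parameter values are assumptions made for illustration; this is not the full algorithm, it works only on numeric features, and it assumes k is smaller than the number of minority points. Libraries such as imbalanced-learn provide a complete implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbours

    base = rng.integers(0, len(X), size=n_new)                   # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X[base] + gap * (X[picked] - X[base])

# Hypothetical minority-class feature matrix with two numeric features
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Unlike resampling with replacement, the generated points are new feature vectors that lie between existing minority examples rather than exact copies of them.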