How To Deal With Class Imbalance In Dataset

Alamin Musa Magaga

Published in

Analytics Vidhya

2 min read · May 19, 2021

Simple techniques for dealing with imbalanced data

What is class imbalance?

Class imbalance is an unequal distribution of examples across the classes in a machine learning task, where one class has many more examples than the others. It can arise from non-diverse or biased sampling, for example when data is collected from a single geographical area. It can also come from the nature of the problem itself: fraudulent transactions are rare compared with legitimate ones, so a fraud dataset is naturally skewed toward the non-fraud class. Imbalance also appears when the dataset contains minority groups, such as a particular race, ethnicity, or tribe.

Let us start by loading our dataset and then printing the value counts of the Sex column.

import pandas as pd

df = pd.read_csv('titanic_train.csv')  # load the Titanic training data
df.Sex.value_counts()  # count the rows in each category

Output [1]:

male      577
female    314
Name: Sex, dtype: int64

In the output above, the male class has more rows than the female class (577 versus 314), so the Sex column is imbalanced.
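Raw counts alone do not say how skewed the column is. As a minimal sketch, the class proportions and the majority-to-minority ratio can be computed as follows. The Series here is a stand-in rebuilt from the counts printed above, not the actual titanic_train.csv file, and the name imbalance_ratio is only an illustrative choice.

```python
import pandas as pd

# Stand-in for df.Sex, rebuilt from the counts in Output [1]
# (577 male, 314 female); the CSV file itself is not needed here.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()                      # absolute counts per class
proportions = sex.value_counts(normalize=True)   # share of each class

# Ratio of the largest class to the smallest class (assumed metric name)
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 means the classes are roughly balanced; the further it is from 1, the stronger the imbalance.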

Dealing with class imbalance in a dataset
1- Oversampling the minority class
2- Undersampling the majority class

Oversampling the minority class
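In oversampling, we add rows to the minority class until it matches the majority class. The resample function used here does this by drawing existing minority rows with replacement, so the added rows are duplicates of real rows rather than newly generated synthetic data. This approach is recommended when there is little data to work with.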

In [ 1]:

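from sklearn.utils import resample

# Split the data by class
Male = df[df.Sex == 'male']
Female = df[df.Sex == 'female']

# Sample the minority class with replacement until it matches the majority class
female_upsampled = resample(Female,
                            replace=True,
                            n_samples=len(Male),
                            random_state=20)

# Combine the majority class with the upsampled minority class
df = pd.concat([Male, female_upsampled])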

Let us now check the Sex column again.

df.Sex.value_counts()

Output [2]:

female    577
male      577
Name: Sex, dtype: int64

Now the female and male classes have the same number of rows.
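It can help to confirm what resample does when replace=True. The small sketch below uses a hypothetical three-row frame (the PassengerId values are made up for illustration) and checks that every upsampled row is a copy of an existing row, so no new passengers are created.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical minority-class frame with three unique rows
minority = pd.DataFrame({"PassengerId": [1, 2, 3],
                         "Sex": ["female"] * 3})

# Draw 10 rows with replacement, as in the oversampling step above
upsampled = resample(minority, replace=True, n_samples=10, random_state=20)

# The frame grew to 10 rows, but it still holds at most 3 distinct passengers
print(len(upsampled))
print(upsampled["PassengerId"].nunique())
print(set(upsampled["PassengerId"]) <= set(minority["PassengerId"]))
```

Because the extra rows are exact copies, a model trained on them sees the same minority examples several times, which is worth keeping in mind when judging its results.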

Undersampling the majority class
In this sampling method, we remove rows from the majority class until it matches the minority class. Because it discards data, it can hurt the performance of the model when there is little data to work with.

# Sample the majority class without replacement so that no row is repeated
male_downsampled = resample(Male,
                            replace=False,
                            n_samples=len(Female),
                            random_state=20)

# Combine the minority class with the downsampled majority class
df = pd.concat([Female, male_downsampled])

Let us check the Sex column once more.

df.Sex.value_counts()

Output [3]:

female    314
male      314
Name: Sex, dtype: int64
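Both resampling steps can be wrapped in one helper. The function below is a sketch, not part of scikit-learn or pandas: the name balance_by_column, its strategy argument, and the toy frame are assumptions made for illustration. It also shuffles the result, since pd.concat leaves the rows grouped by class.

```python
import pandas as pd
from sklearn.utils import resample


def balance_by_column(df, column, strategy="under", random_state=20):
    """Resample every class in `column` to the same size (hypothetical helper).

    strategy="under" shrinks each class to the smallest class size;
    strategy="over" grows each class to the largest class size.
    """
    counts = df[column].value_counts()
    if strategy == "under":
        target, replace = int(counts.min()), False
    elif strategy == "over":
        target, replace = int(counts.max()), True
    else:
        raise ValueError("strategy must be 'under' or 'over'")

    parts = []
    for value in counts.index:
        group = df[df[column] == value]
        if len(group) != target:
            # Only classes that differ from the target size are resampled
            group = resample(group, replace=replace,
                             n_samples=target, random_state=random_state)
        parts.append(group)

    # Shuffle so the classes are not stored in consecutive blocks
    balanced = pd.concat(parts)
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy frame with 5 male rows and 2 female rows (made-up values)
toy = pd.DataFrame({"Sex": ["male"] * 5 + ["female"] * 2,
                    "Age": list(range(7))})

under = balance_by_column(toy, "Sex", strategy="under")
over = balance_by_column(toy, "Sex", strategy="over")
print(under.Sex.value_counts())
print(over.Sex.value_counts())
```

With the toy frame, the "under" strategy leaves two rows per class and the "over" strategy leaves five rows per class.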

Conclusion

Oversampling and undersampling are two common and effective techniques for dealing with imbalanced data, but other techniques are available, such as the synthetic minority oversampling technique (SMOTE).
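To give a feel for how SMOTE differs from plain duplication, here is a simplified sketch of its core idea written with NumPy: each new point is placed on the line between a minority sample and one of its nearest minority neighbours. The function name smote_like, its parameters, and the toy points are assumptions for illustration; this is not the reference implementation, and in practice the imbalanced-learn library provides a ready-made SMOTE class. Note that the idea applies to numeric features, not to a categorical column such as Sex.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=20):
    """Create n_new synthetic points by interpolating between minority
    samples and their nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base towards the neighbour
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Four made-up minority points with two numeric features
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
synthetic = smote_like(X_min, n_new=6)
print(synthetic.shape)
```

Every synthetic point lies between two real minority points, so the new rows are similar to the existing ones without being exact copies.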

