How To Deal With Class Imbalance In a Dataset

Alamin Musa Magaga
Published in Analytics Vidhya
2 min read · May 19, 2021

Simple techniques for dealing with imbalanced data

Photo by Geert Pieters on Unsplash

What is class imbalance?

Class imbalance is an unequal distribution of classes in a machine learning task, where one class has far more examples than the others. It can result from biased or insufficiently diverse sampling, for example when the data are collected from a single geographical area. It can also come from the nature of the problem: in fraud detection, fraudulent transactions are rare compared with legitimate ones, so the fraud class is heavily under-represented. Imbalance also appears when a data set contains minority groups, such as a particular race, ethnicity, or tribe.

So let us start by loading our dataset and then printing the value counts of the Sex column.

import pandas as pd

df = pd.read_csv('titanic_train.csv')
df.Sex.value_counts()

Output [1]:

male      577
female    314
Name: Sex, dtype: int64

In the output above, the male class has 577 rows while the female class has only 314, so the Sex column is imbalanced.
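To put a number on the imbalance, the counts can be turned into class proportions and a majority-to-minority ratio. The sketch below is only illustrative: it rebuilds a stand-in Series from the counts printed above instead of reading `titanic_train.csv`, and the variable names are my own.

```python
import pandas as pd

# Stand-in for df.Sex, rebuilt from the counts shown in the output above
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()                       # absolute counts per class
proportions = sex.value_counts(normalize=True)    # share of each class
imbalance_ratio = counts.max() / counts.min()     # majority / minority

print(proportions.round(3))
print(f"Imbalance ratio (majority / minority): {imbalance_ratio:.2f}")
```

A ratio close to 1 means the classes are roughly balanced; the further it is from 1, the stronger the imbalance.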

Dealing with class imbalance in a dataset
1. Oversampling the minority class
2. Undersampling the majority class

Oversampling the minority class
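In oversampling, we create additional samples for the minority class until it matches the size of the majority class. Here we do this by drawing existing minority rows at random with replacement, so the extra rows are duplicates rather than new observations. This approach is recommended when there is little data to work with, because no rows are thrown away.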

from sklearn.utils import resample

# Split the data by class
Male = df[df.Sex == 'male']
Female = df[df.Sex == 'female']

# Draw rows from the minority class with replacement
female_upsampled = resample(Female,
                            replace=True,
                            n_samples=len(Male),  # Match the number with the majority class
                            random_state=20)
df = pd.concat([Male, female_upsampled])

Let us now check the Sex column again.

df.Sex.value_counts()

Output [2]:

female    577
male      577
Name: Sex, dtype: int64

Now the female and male classes have the same number of rows.
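The same idea can be wrapped in a small helper that works for any label column. This is a minimal sketch under a few assumptions: `oversample_to_majority` is a hypothetical name, the toy DataFrame is made up for illustration, and it uses pandas' `DataFrame.sample` instead of `resample`. Unlike the code above, it keeps every original row and only adds the duplicates needed to reach the majority size.

```python
import pandas as pd

def oversample_to_majority(df, column, random_state=20):
    """Upsample every class in `column` to the size of the largest class."""
    target = df[column].value_counts().max()
    parts = []
    for _, group in df.groupby(column):
        if len(group) < target:
            # Draw the missing rows with replacement from the same class
            extra = group.sample(n=target - len(group), replace=True,
                                 random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

# Toy data (made up): 6 male rows and 3 female rows
toy = pd.DataFrame({"Sex": ["male"] * 6 + ["female"] * 3, "Age": range(9)})
balanced = oversample_to_majority(toy, "Sex")
balanced_counts = balanced.Sex.value_counts()
print(balanced_counts)
```

Because the added rows are copies, it is usually safer to oversample only the training split, so that duplicated rows do not end up in both training and test data.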

Undersampling the majority class
In this sampling method, we remove some rows from the majority class until it matches the size of the minority class. Because it discards data, it can hurt the performance of the model when there is little data to work with.

# Draw rows from the majority class without replacement
male_downsampled = resample(Male,
                            replace=False,  # Sample without replacement
                            n_samples=len(Female),  # Match number with the minority class
                            random_state=20)
df = pd.concat([Female, male_downsampled])

Let us now check the Sex column again.

df.Sex.value_counts()

Output [3]:

female    314
male      314
Name: Sex, dtype: int64
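Undersampling can also be written with pandas alone. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`); `undersample_to_minority` is a hypothetical helper name and the toy DataFrame is made up. It samples without replacement, so no row appears twice in the result.

```python
import pandas as pd

def undersample_to_minority(df, column, random_state=20):
    """Downsample every class in `column` to the size of the smallest class."""
    smallest = df[column].value_counts().min()
    # Sample `smallest` rows from each class, without replacement
    return df.groupby(column).sample(n=smallest, random_state=random_state)

# Toy data (made up): 6 male rows and 3 female rows
toy = pd.DataFrame({"Sex": ["male"] * 6 + ["female"] * 3, "Age": range(9)})
reduced = undersample_to_minority(toy, "Sex")
reduced_counts = reduced.Sex.value_counts()
print(reduced_counts)
```

The result keeps the original index labels, which makes it easy to see which majority rows were kept and which were dropped.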

Conclusion

Oversampling and undersampling are among the most common and effective techniques for dealing with imbalanced data, but other techniques can also be used, such as the Synthetic Minority Oversampling Technique (SMOTE).
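To give a feel for how synthetic oversampling differs from duplicating rows, here is a simplified NumPy sketch of the interpolation idea behind it. It is not the reference implementation: `smote_like`, its parameters, and the toy points are assumptions for illustration, and it only handles numeric features. For real projects, the imbalanced-learn package provides a maintained version as `imblearn.over_sampling.SMOTE`.

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=20):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Requires k to be smaller than the number of minority points.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per point

    base = rng.integers(0, len(X), size=n_new)                  # random minority points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X[base] + gap * (X[chosen] - X[base])

# Toy minority class (made up): five points with two numeric features
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]])
synthetic = smote_like(minority, n_new=4)
print(synthetic.shape)
```

Each synthetic point lies on the line segment between a minority point and one of its nearest minority neighbours, so the new rows are close to the existing ones without being exact copies.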

