Published in

Analytics Vidhya

# How To Deal With Class Imbalance In Dataset

Simple techniques for dealing with imbalanced data

What is class imbalance?

Class imbalance is an unequal distribution of examples across the classes in a machine learning task, where one class has many more examples than the others. It can arise from biased or insufficiently diverse sampling, for example when data is collected from a single geographical area. It can also come from the nature of the problem: in fraud detection, fraudulent transactions are rare compared with legitimate ones, so the fraud class is naturally small. Imbalance also appears when a group such as a race, ethnicity, or tribe is a minority in the dataset.

Let us start by loading our dataset and printing the value counts of the Sex column.

```python
import pandas as pd

df = pd.read_csv('titanic_train.csv')
df.Sex.value_counts()
```

Output [1]:

```
male      577
female    314
Name: Sex, dtype: int64
```

In the output above, the male class has 577 rows and the female class has only 314, so the Sex column is imbalanced.
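To put a number on the imbalance, it can help to look at class proportions and the majority-to-minority ratio rather than raw counts alone. The sketch below is only an illustration: it rebuilds a stand-in `Sex` column from the counts shown above instead of reading `titanic_train.csv`, and the variable names are assumptions of this example.

```python
import pandas as pd

# Stand-in for the Titanic `Sex` column, built from the counts printed above
# (an assumption made so the example runs without the CSV file).
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()                      # rows per class
proportions = sex.value_counts(normalize=True)   # share of each class
imbalance_ratio = counts.max() / counts.min()    # majority size / minority size

print(proportions.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 means the classes are roughly balanced; the larger it gets, the more the majority class dominates.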

Dealing with class imbalance in a dataset
1. Oversampling the minority class
2. Undersampling the majority class

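Oversampling the minority class
In oversampling, we draw additional rows from the minority class, sampling with replacement, until it matches the size of the majority class. Note that `resample` duplicates existing rows rather than generating new synthetic ones. This approach is recommended when there is little data to work with, since no rows are discarded.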

In [1]:

```python
from sklearn.utils import resample

Male = df[df.Sex == 'male']
Female = df[df.Sex == 'female']

# Upsample the minority class by sampling with replacement
female_upsampled = resample(Female,
                            replace=True,
                            n_samples=len(Male),  # match the size of the majority class
                            random_state=20)

df = pd.concat([Male, female_upsampled])
```

Let us now look at the Sex column again.

`df.Sex.value_counts()`

Output [2]:

```
female    577
male      577
Name: Sex, dtype: int64
```

Now the female and male classes have the same number of rows.
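The same idea can be wrapped in a small helper that works for any label column. This is a minimal sketch using only pandas; the function name `oversample_to_majority` and the toy data are hypothetical. Unlike the `resample` call above, which redraws the whole minority class, this version keeps every original row and only adds the missing number of duplicates.

```python
import pandas as pd

def oversample_to_majority(df, column, random_state=20):
    """Grow every class in `column` to the size of the largest class
    by adding randomly chosen duplicate rows (hypothetical helper)."""
    target = df[column].value_counts().max()
    parts = []
    for _, group in df.groupby(column):
        if len(group) < target:
            # Sample with replacement so a small class can supply enough rows.
            extra = group.sample(n=target - len(group),
                                 replace=True,
                                 random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 6 male rows and 2 female rows.
toy = pd.DataFrame({"Sex": ["male"] * 6 + ["female"] * 2,
                    "Age": range(8)})
balanced = oversample_to_majority(toy, "Sex")
print(balanced.Sex.value_counts())
```

Because the added rows are exact copies, the model sees the same minority examples several times, which is worth keeping in mind when judging its performance.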

Undersampling the majority class
In this sampling method, we remove rows from the majority class until it matches the size of the minority class. Because this discards data, it may hurt the performance of the model when there is little data to work with.

```python
# Downsample the majority class by sampling without replacement
male_downsampled = resample(Male,
                            replace=False,
                            n_samples=len(Female),  # match the size of the minority class
                            random_state=20)

df = pd.concat([Female, male_downsampled])
```

Let us look at the Sex column once more.

`df.Sex.value_counts()`

Output [3]:

```
female    314
male      314
Name: Sex, dtype: int64
```
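Undersampling can be sketched as a reusable helper as well. The example below uses only pandas; `undersample_to_minority` and the toy data are hypothetical names chosen for illustration. It samples without replacement, so no row is picked twice and the result is a true subset of the original data.

```python
import pandas as pd

def undersample_to_minority(df, column, random_state=20):
    """Shrink every class in `column` to the size of the smallest class
    by keeping a random subset of its rows (hypothetical helper)."""
    target = df[column].value_counts().min()
    parts = [
        # replace=False guarantees each kept row appears only once.
        group.sample(n=target, replace=False, random_state=random_state)
        for _, group in df.groupby(column)
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 6 male rows and 2 female rows, each with a unique Age.
toy = pd.DataFrame({"Sex": ["male"] * 6 + ["female"] * 2,
                    "Age": range(8)})
reduced = undersample_to_minority(toy, "Sex")
print(reduced.Sex.value_counts())
```

Here four of the six male rows are dropped, which shows the cost of this method: the balanced dataset is only as large as twice the minority class.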

Conclusion

Oversampling and undersampling are among the most common and effective techniques for dealing with imbalanced data, but other techniques can also be used, such as the synthetic minority oversampling technique (SMOTE).
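To give a feel for how SMOTE differs from plain duplication, here is a simplified NumPy sketch of its core idea: a new point is placed at a random position on the line between a minority point and one of its nearest minority neighbours. The function `smote_like`, its parameters, and the sample points are assumptions of this example, not a full implementation. It only handles numeric features, and in practice a tested library implementation (for instance the `SMOTE` class in imbalanced-learn) is the safer choice.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=20):
    """Create `n_new` synthetic points by interpolating between minority
    points and their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)               # starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour each
    gap = rng.random((n_new, 1))               # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Five hypothetical minority-class points with two numeric features.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.5]])
synthetic = smote_like(X_min, n_new=4)
print(synthetic)
```

Each synthetic point lies between two real minority points, so the new rows are similar to the existing ones without being exact copies.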



## Alamin Musa Magaga

Data Scientist |Full stack developer |AI/ML Researcher | @AngelHack student ambassador |Robotics and IOT |founder @Magtech_Dihub | pythonista