Imbalanced vs Balanced Dataset in Machine Learning

Suvhradip Ghosh · Published in Open Datascience · 4 min read · Jul 2, 2019
Cover image source: https://bit.ly/303ZVyI

Balanced Dataset:

Before giving you the definition of a balanced dataset, let me start with an example. Assume I have a dataset with a thousand data points and I name it “N”, so N = 1000 data points. N has two classes, N1 and N2. N1 contains 580 data points and N2 contains 420. N1 holds the positive (+ve) data points and N2 holds the negative (-ve) ones. The number of data points in N1 and N2 is almost the same, so I can write N1 ~ N2, and N is a balanced dataset.
A balanced dataset is one that contains an equal, or almost equal, number of samples from the positive and negative classes.
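To make the definition concrete, here is a minimal sketch (the function name `class_balance` is my own, not from any library) that counts the fraction of data points in each class:

```python
from collections import Counter

def class_balance(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# 580 positive and 420 negative data points, as in the example above
labels = ["+"] * 580 + ["-"] * 420
print(class_balance(labels))  # {'+': 0.58, '-': 0.42} -> N1 ~ N2, balanced
```

When the fractions are close to each other, as here, the dataset is balanced.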

Source: https://bit.ly/2JpZnwi

Imbalanced Dataset:

Before giving you the definition of an imbalanced dataset, here is an example again. Assume a dataset “N” with a thousand data points, so N = 1000, split into two classes, N1 and N2. This time N1 contains 900 data points and N2 contains only 100. N1 holds the positive (+ve) data points and N2 holds the negative (-ve) ones. The class sizes are clearly not similar, so N1 ≠ N2, and N is an imbalanced dataset.

Handling imbalanced data distributions is an important part of the machine learning workflow. An imbalanced dataset is one in which one of the two classes has far more instances than the other; put another way, the number of observations is not the same for all the classes in a classification dataset.
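One simple way to quantify this is the ratio between the majority and minority class counts. A small sketch (the name `imbalance_ratio` is my own):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count; 1.0 is perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# 900 positive vs 100 negative data points, as in the example above
labels = ["+"] * 900 + ["-"] * 100
print(imbalance_ratio(labels))  # 9.0 -> heavily imbalanced
```

A ratio near 1 means a balanced dataset; the 9:1 ratio here is a strong imbalance.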

Fig 3: Imbalanced Dataset

How to handle an Imbalanced Dataset:

There are a few methods to handle an imbalanced dataset, each with its own problems; I will briefly explain them below. The two main methods are:

  1. Under-Sampling
  2. Over-Sampling

1. Under-Sampling:

Let us assume again a dataset “N” with 1000 data points and two classes, n1 and n2, holding positive and negative reviews respectively. n1 is the positive (+ve) class with 900 data points and n2 is the negative (-ve) class with 100 data points. We call n1 the majority class, because it has the larger number of data points, and n2 the minority class, because it has the smaller number. To handle this imbalanced dataset I create a new dataset called N′. I take all 100 n2 data points as they are, randomly pick 100 of the n1 data points, and put both into N′. This sampling trick is called Under-Sampling.
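The trick described above can be sketched in a few lines of Python (the function name `undersample` is my own; a fixed seed is used only to make the random pick reproducible):

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))  # random 100 of the 900 n1 points
    return kept + list(minority)                # all 100 n2 points, as they are

n1 = list(range(900))        # 900 majority (+ve) data points
n2 = list(range(900, 1000))  # 100 minority (-ve) data points
n_prime = undersample(n1, n2)
print(len(n_prime))  # 200 -> the new balanced dataset N'
```

The imbalanced-learn library offers a ready-made version of this idea (`RandomUnderSampler`) if you prefer not to roll your own.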

Fig 4: Under-Sampling

• Disadvantages of Under-Sampling:

Before under-sampling I had 1000 data points in N; after under-sampling I have only 200 data points in N′. I have thrown away about 80% of the data, which is bad for building a good model, because that 80% of the dataset also carried 80% of the information.
So we can write |N′| < |N|.
This is the disadvantage of under-sampling. To solve this problem, we introduce a second method called Over-Sampling.

2. Over-Sampling:

When one class is the underrepresented minority in the data sample, over-sampling techniques can duplicate or synthesize its examples so that training sees a more balanced number of samples from each class. Over-sampling is used when the amount of data collected for the minority class is insufficient. A popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between minority-class instances and their nearest neighbours.
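Here is a simplified, pure-Python sketch of the SMOTE idea (the function name `smote_like` is my own, and this version interpolates between two randomly chosen minority points rather than using the k-nearest-neighbour search that real SMOTE performs):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic minority points by interpolating between
    pairs of existing minority points. Real SMOTE picks the second point
    from the k nearest neighbours of the first."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=3)
print(len(new_points))  # 3 synthetic minority samples
```

For real work, the imbalanced-learn library provides a full SMOTE implementation (`imblearn.over_sampling.SMOTE`), which handles the neighbour search for you.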

Fig 5: Over-Sampling

References:

Fig 3 source https://bit.ly/2XkMBcl
Fig 4 source https://bit.ly/2Jj9XFn
Fig 5 source https://bit.ly/2FNQ7kV
