SMOTE and ADASYN ( Handling Imbalanced Data Set )

Indresh Bhattacharyya
Coinmonks
4 min read · Aug 3, 2018


Recently I was working on a project where the data set I had was completely imbalanced. It was a binary classification problem and the ratio of classes 0 and 1 was 99:1.

What happens when the data set is imbalanced?

If the data set is imbalanced, the model will be biased toward the majority class. Think of it like this: if you feed the model almost nothing but 0s, it learns that predicting 0 for every possible input is nearly always right.

How do we know if the data set is imbalanced and the model is biased?

  1. Check the counts of the dependent categorical values. As a rule of thumb, if the class ratio is worse than about 10:1, treat the data set as imbalanced.

2. Confusion matrix: after prediction is done, check the confusion matrix. Using the layout

[[TRUE POSITIVE] [FALSE POSITIVE]

[FALSE NEGATIVE] [TRUE NEGATIVE]]

if any of the cells becomes 0 (in particular the TRUE POSITIVE cell), your model is biased and your data set is most likely imbalanced. A model that always predicts the majority class never produces a true positive.

3. Not the most rigorous check, but I also look at the counts of the predicted classes on the test data. If the model predicts only a single class, it is biased.
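The biased-model scenario behind checks 2 and 3 can be reproduced in a few lines (a minimal sketch with made-up labels; the 99:1 split mirrors the ratio mentioned earlier):

```python
import numpy as np

# Hypothetical test labels: a severely imbalanced set, 99 zeros and one 1
y_true = np.array([0] * 99 + [1])
# A biased model that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

# Confusion matrix in the layout used above:
# [[TP, FP],
#  [FN, TN]]
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
cm = [[tp, fp], [fn, tn]]
print(cm)  # [[0, 0], [1, 99]] -- the TP cell is 0: the minority class is never caught
```

Note that this model still scores 99% accuracy, which is exactly why accuracy alone is misleading on imbalanced data.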

Ways to Handle Imbalanced data set:

1. Under sampling: in this method we basically downsize the majority class so that the ratio of the dependent categories comes within 10:1. Some of the ways to do under sampling are:

a. Condensed Nearest-Neighbor

b. One-Sided Selection

See the list for more details and more algorithms.

My personal opinion is not to use Under sampling.

Why?

The reason is that we are actually reducing the data set, giving the model less data to learn from. An example: we have a data set of 10,000 samples, of which only 100 are class 1 and the rest are class 0. After under sampling to a 10:1 ratio we are left with 1,100 samples, 1,000 of class 0 and 100 of class 1. So we are getting rid of almost 9,000 samples before feeding the model, and with that much less data the model is more prone to error. Different algorithms choose which rows to drop differently, but the end result is the same: fewer rows in the data set.
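The arithmetic above can be sketched as random under sampling in pure NumPy (a toy illustration on hypothetical data; a real project would use a library sampler such as imbalanced-learn's RandomUnderSampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matching the example: 10,000 rows, 100 of class 1
X = rng.normal(size=(10_000, 3))
y = np.array([1] * 100 + [0] * 9_900)

# Random under sampling: keep all minority rows, and only enough
# majority rows to reach the 10:1 ratio discussed above
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0),
                          size=10 * minority_idx.size, replace=False)
keep = np.concatenate([minority_idx, majority_idx])
X_res, y_res = X[keep], y[keep]

print(X_res.shape)  # (1100, 3) -- almost 9,000 majority rows were thrown away
```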

NOTE: I am only discussing the possibilities, not claiming one method is better than the other. This is my personal opinion, and we are not going to settle it here.

2. Over sampling: This method uses synthetic data generation to increase the number of samples in the data set.

Goal: Increase the minority class so that the data set becomes balanced by creating synthetic observations based upon the existing minority observations.

SMOTE:

What SMOTE does is simple. First, for each sample in the minority class it finds the k nearest neighbors within that class. Then it draws a line between the sample and each of those neighbors and generates random synthetic points on the lines.

Image source: https://www.researchgate.net/publication/287601878/figure/fig1/AS:316826589384744@1452548753581/The-schematic-of-NRSBoundary-SMOTE-algorithm.png

In the image above, SMOTE finds the 5 nearest neighbors of a sample point, draws a line to each of them, and then creates synthetic samples on those lines, all labeled with the minority class.
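The interpolation step just described can be sketched in a few lines of NumPy (a toy illustration, not the library implementation; the data and the helper name are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D minority-class samples
minority = rng.normal(size=(20, 2))
k = 5  # number of nearest neighbors, as in the figure

def smote_point(X, i, k, rng):
    """Generate one SMOTE-style synthetic point from sample i."""
    # Distances from sample i to every other minority sample
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                   # exclude the point itself
    neighbors = np.argsort(d)[:k]   # its k nearest minority neighbors
    j = rng.choice(neighbors)       # pick one neighbor at random
    gap = rng.random()              # random position on the line segment
    return X[i] + gap * (X[j] - X[i])

synthetic = np.array([smote_point(minority, i, k, rng)
                      for i in range(len(minority))])
```

Because each synthetic point lies on a segment between two real minority points, the new samples never leave the region the minority class already occupies.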

ADASYN:

It's an improved version of SMOTE. It generates synthetic points the same way, with two refinements. First, it is adaptive: minority samples that are harder to learn (those with more majority-class neighbors) get more synthetic samples, so generation concentrates near the class boundary. Second, after creating a sample it adds a small random offset, so instead of all the synthetic points being linearly correlated with the parent, they have a little more variance in them, i.e. they are a bit scattered and more realistic.
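ADASYN's adaptive part, deciding how many synthetic points each minority sample should get, can be sketched like this (toy data and names are assumptions for illustration, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data: a large majority cloud and a small minority cluster
majority = rng.normal(loc=0.0, size=(90, 2))
minority = rng.normal(loc=1.5, size=(10, 2))
X = np.vstack([majority, minority])
y = np.array([0] * 90 + [1] * 10)

k = 5
G = 80  # synthetic samples needed to balance the classes (90 - 10)

# For each minority sample, the fraction of majority points among
# its k nearest neighbors measures how "hard" it is to learn
ratios = []
for x in minority:
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[1:k + 1]      # skip the point itself
    ratios.append(np.sum(y[nn] == 0) / k)
ratios = np.array(ratios)

# Normalize and allocate: harder samples get more synthetic points
if ratios.sum() > 0:
    weights = ratios / ratios.sum()
else:
    weights = np.full(len(ratios), 1 / len(ratios))
counts = np.rint(weights * G).astype(int)
print(counts)  # per-sample allocation, concentrated near the class border
```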

OK! So it's implementation time: first we have to run k-NN, then create the synthetic points. JUST JOKING! THERE IS ALREADY A PACKAGE FOR IT: the scikit-learn-compatible library ‘imbalanced-learn’.

Installation

pip install -U imbalanced-learn

Conda install:

conda install -c conda-forge imbalanced-learn

def makeOverSamplesSMOTE(X, y):
    # X → independent variables (pandas DataFrame)
    # y → dependent variable (pandas Series)
    from imblearn.over_sampling import SMOTE
    sm = SMOTE()
    # fit_sample was renamed to fit_resample in recent imbalanced-learn versions
    X, y = sm.fit_resample(X, y)
    return X, y


def makeOverSamplesADASYN(X, y):
    # X → independent variables (pandas DataFrame)
    # y → dependent variable (pandas Series)
    from imblearn.over_sampling import ADASYN
    sm = ADASYN()
    # fit_sample was renamed to fit_resample in recent imbalanced-learn versions
    X, y = sm.fit_resample(X, y)
    return X, y

