Handling Imbalanced Dataset

akhil anand · Published in Analytics Vidhya · 4 min read · Dec 9, 2020

What is an Imbalanced Dataset?

An imbalanced dataset is one in which the target classes are not represented equally. It is most commonly found in medical datasets, fraud detection datasets, and the like. Suppose Apollo hospital has built a dataset of people who came for a diabetes checkup; the dataset has a binary output, i.e. the person is either diabetic or not.

Let's say that out of 1000 records, 100 people are diabetic and the rest are normal, so according to the output our dataset is divided into two parts.

Diabetic = 100 and non-diabetic = 900. Here a large share of the data leans towards one particular class (the negative class), which leads to the formation of an imbalanced dataset.
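The 900/100 split above can be checked quickly in code. A minimal sketch with made-up labels (not the hospital's actual data):

```python
from collections import Counter

# Hypothetical labels mirroring the example: 1 = diabetic, 0 = non-diabetic
labels = [1] * 100 + [0] * 900

counts = Counter(labels)
imbalance_ratio = counts[1] / counts[0]  # minority / majority

print(counts)           # Counter({0: 900, 1: 100})
print(imbalance_ratio)  # 0.111..., i.e. a 1:9 split
```

A ratio this far from 1 is a signal to apply one of the balancing techniques discussed in this blog.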

In this blog we will discuss various techniques to balance an imbalanced dataset. Let's get started…

1. Under-sampling:

In under-sampling we reduce the majority class so that its size becomes equal to that of the minority class.

figure 1

According to figure 1, we previously had an imbalanced dataset with 900 datapoints in the majority class and 100 datapoints in the minority class. By down-sampling, we have reduced the majority class to the same number of datapoints as the minority class.

Disadvantage:

Removing datapoints from the majority class may discard useful information, which can hurt the model's results.
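Before reaching for a dedicated library, the idea can be sketched with plain pandas: randomly keep only as many majority rows as there are minority rows. The column names and numbers here are illustrative, not taken from the author's dataset:

```python
import pandas as pd

# Toy frame standing in for the 900/100 split described earlier
df_toy = pd.DataFrame({
    "feature": range(1000),
    "Class": [0] * 900 + [1] * 100,
})

minority = df_toy[df_toy["Class"] == 1]
majority = df_toy[df_toy["Class"] == 0]

# Randomly drop majority rows until both classes are the same size
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])

print(balanced["Class"].value_counts())  # 100 rows of each class
```

Note the downside mentioned above: the 800 discarded majority rows, and whatever signal they carried, are simply gone.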

For a programmatic understanding, I have taken the credit card fraud detection dataset.

Step 1 : Checking whether the dataset is balanced or imbalanced

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("creditcard.csv")  # file name assumed for the credit card fraud dataset

sns.countplot(x=df["Class"])
plt.show()
--------------------------------------------------------------------
fraud = df[df["Class"] == 1]
normal = df[df["Class"] == 0]
print(fraud.shape)   # printing shape of class 1
print(normal.shape)  # printing shape of class 0
[out]>> (492, 31)
(284315, 31)

Step 2 : Performing the operation

As we can see, there is an abundant amount of data in class 0 compared to class 1, hence we can say the data is imbalanced. Now we will balance the dataset using the imblearn library. It might not come pre-installed in your Jupyter notebook, so you may need to install it first with pip install imbalanced-learn (the package is imported as imblearn).

We have class 1 = 492 (minority class) and class 0 = 284315 (majority class). Using under-sampling, we will reduce the count of the majority class down to the count of the minority class.

After under-sampling we will have 492 values in class 1 and 492 values in class 0.

x = df.iloc[:, :-1]  # creating independent variables
y = df.iloc[:, -1]   # creating target variable

# code for under-sampling
from imblearn.under_sampling import NearMiss
nm = NearMiss()
# resampling the independent variables x and target variable y
x_res, y_res = nm.fit_resample(x, y)
print(x_res.shape)
print(y_res.shape)
[out]>> (984, 30)
(984,)

Plotting a countplot to check whether the dataset has been balanced or not:

df1 = pd.DataFrame(x_res)
df1["Class"] = y_res
sns.countplot(x=df1["Class"])
plt.show()
Output countplot of balanced dataset

2. Over-Sampling:

In over-sampling we add datapoints to the minority class until its count equals that of the majority class. This is the most commonly used balancing technique in machine learning, since it does not lose information while balancing the dataset. There are various over-sampling techniques we can use to balance the majority and minority classes. However, whenever we up-sample there is always a risk of overfitting.

figure 2
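The simplest form of over-sampling is to duplicate minority rows at random until the classes match. A small sketch (toy data, illustrative names) that also shows where the overfitting risk comes from:

```python
import pandas as pd

# Toy frame: 900 majority rows, 100 minority rows
df_toy = pd.DataFrame({
    "feature": range(1000),
    "Class": [0] * 900 + [1] * 100,
})

minority = df_toy[df_toy["Class"] == 1]
majority = df_toy[df_toy["Class"] == 0]

# Sample minority rows with replacement until they match the majority count
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["Class"].value_counts())
print(minority_up.duplicated().sum(), "exact copies among the up-sampled rows")
```

Because the new rows are exact copies of existing ones, a model can memorize them; that is the overfitting risk mentioned above.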

Method 1 : Class weight

Suppose you have 900 datapoints in class 1 and 100 datapoints in class 0.

Step 1 : Take the ratio of the datapoints present in the two classes

ratio = 100÷ 900 = 1÷ 9 =>1:9

The weight of the majority class is multiplied with each datapoint of the minority class, and the weight of the minority class with each datapoint of the majority class, which results in a balanced dataset: the minority class contributes 100 × 9 and the majority class 900 × 1, an equal total.

Step 2 :

cross multiplication

Now the minority class carries the same total weight as the majority class, so the data is effectively balanced.
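In practice you rarely compute these weights by hand; scikit-learn can derive them with its "balanced" heuristic (weight = n_samples / (n_classes × class count)), which reproduces the 1:9 ratio above. A sketch, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# 900 datapoints of class 1 and 100 of class 0, as in the example above
y = np.array([1] * 900 + [0] * 100)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class 0 gets 5.0, majority class 1 gets ~0.56

# Most estimators accept the same idea directly:
clf = LogisticRegression(class_weight="balanced")
```

Weighting leaves the dataset itself untouched; only the loss contribution of each class is rescaled.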

Method 2 : Artificial or synthetic points method

By using an interpolation technique (as in SMOTE), we create more and more synthetic points in the minority class until it has as many points as the majority class.

Python implementation :

-------------------------------------------------------------------
from imblearn.combine import SMOTETomek
st = SMOTETomek()
x_res, y_res = st.fit_resample(x, y)
print(x_res.shape)
print(y_res.shape)
[out]>> (567542, 30)
(567542,)
--------------------------------------------------------------------
from collections import Counter
print("Counts in y before over-sampling", Counter(y))
print("Counts in y after over-sampling", Counter(y_res))
[out]>> Counts in y before over-sampling Counter({0: 284315, 1: 492})
Counts in y after over-sampling Counter({0: 283771, 1: 283771})
--------------------------------------------------------------------

We can also implement the over-sampling technique by another method:

from imblearn.over_sampling import RandomOverSampler
rsp = RandomOverSampler()
x_rsp, y_rsp = rsp.fit_resample(x, y)
print(x_rsp.shape)
print(y_rsp.shape)
[out] >> (568630, 30)
(568630,)

Visualizing whether we were able to balance the dataset or not:

df2 = pd.DataFrame(x_rsp)
df2["Class"] = y_rsp
sns.countplot(x=df2["Class"])
plt.show()
Visualization of balanced dataset

Conclusion :

That's all from my side. If you found this blog interesting, hang tight: I will be back with more interesting topics. Please leave your valuable suggestions in the comment box. Keep learning, keep exploring…
