Bank Data: SMOTE

Zaki Jefferson
Published in Analytics Vidhya
Aug 31, 2020

This will be a short post before we dive deep into classification in the next few blog posts.

If we look back at the banking data, we will see that the dependent variable is heavily imbalanced. We can check the class proportions with the code below, and we can also get a visual representation using Seaborn’s count plot.

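import seaborn as sns

# Dependent variable is imbalanced: share of each class in the training target
y_train.value_counts(normalize=True)
# Passing the Series as x= keeps the call valid across Seaborn versions
sns.countplot(x=y_train)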

The output and plot above show that one class has far more observations than the other. This can cause trouble for the machine learning models we will use later: a model trained on imbalanced data tends to be biased toward the majority class, which leads to more Type 1 and/or Type 2 errors.
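To see why this matters, here is a minimal sketch using hypothetical labels (a 90/10 split chosen purely for illustration, not the actual ratio in the bank data). A "model" that always predicts the majority class looks accurate while missing every minority case.

```python
from collections import Counter

# Hypothetical, illustrative labels: 90 "no" and 10 "yes" (not the real bank data)
y_true = ["no"] * 90 + ["yes"] * 10

# A naive baseline that always predicts the most common class
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy looks good because most labels are the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Treating "yes" as the positive class, every "yes" becomes a false negative (Type 2 error)
false_negatives = sum(t == "yes" and p == "no" for t, p in zip(y_true, y_pred))
recall_yes = (y_true.count("yes") - false_negatives) / y_true.count("yes")

print(f"accuracy={accuracy}, recall for 'yes'={recall_yes}, false negatives={false_negatives}")
```

High accuracy paired with zero recall on the minority class is the kind of result that balancing the training data is meant to help avoid.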

SMOTE

SMOTE, the Synthetic Minority Oversampling Technique, balances our data by creating synthetic observations of the minority class, which will help the machine learning algorithms we use later.
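The core idea behind SMOTE can be sketched in a few lines: pick a minority observation, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below is a simplified illustration only, not the imbalanced-learn implementation; the function name `synthesize_minority`, the toy points, and the choice of `k=3` are all assumptions made for this example.

```python
import math
import random


def synthesize_minority(points, n_new, k=3, seed=19):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # The k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in points if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class observations with two numeric features
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = synthesize_minority(minority, n_new=6)
print(new_points)
```

Because each synthetic point sits between two real minority observations, the new data stays inside the region the minority class already occupies rather than duplicating existing rows.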

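# Imbalanced data
from imblearn.over_sampling import SMOTE
# Instantiating Synthetic Minority Oversampling Technique to balance target variable
sm = SMOTE(random_state=19)
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases removed
X_train_new, y_train_new = sm.fit_resample(X_train, y_train)
# Plot the class counts after resampling
sns.countplot(x=y_train_new)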

The code above resamples the training data with the SMOTE object, sm. Seaborn’s count plot then shows the new class counts of our dependent variable.

And it looks like we now have an equal number of observations in both classes.

