SMOTE AND NEAR MISS IN PYTHON: MACHINE LEARNING IN IMBALANCED DATASETS

Saeed Abdul Rahim
Jun 4, 2018


What is an Imbalanced Dataset?

Imagine you have two categories to predict in your dataset, Category-A and Category-B. When the number of records in Category-A is much higher than in Category-B, or vice versa, you have an imbalanced dataset.

So how is this a problem?

Imagine a dataset of 100 rows in which Category-A contains 90 records and Category-B contains 10. You run a machine learning model and end up with 90% accuracy. You were excited until you checked the confusion matrix.

Confusion Matrix

Here, every Category-B record is classified as Category-A, yet the model gets away with an accuracy of 90%.

How do we fix this problem?

We will discuss two common and simple ways to deal with this problem.

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE is an over-sampling method: it creates synthetic (not duplicate) samples of the minority class until it matches the size of the majority class. SMOTE does this by picking a minority record, finding its nearest minority-class neighbours, and generating a new record whose feature values are moved a random fraction of the way towards a neighbour.
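
For intuition, here is a tiny numeric sketch of that interpolation step (made-up points, not the imbalanced-learn implementation):

import numpy as np

# Two nearby minority-class records (made-up values)
x = np.array([2.0, 5.0])
neighbour = np.array([4.0, 9.0])

# Move a random fraction of the way from x towards its neighbour;
# the result is a new, synthetic minority-class record
gap = np.random.rand()                 # random number in [0, 1)
synthetic = x + gap * (neighbour - x)
print(synthetic)                       # e.g. [2.7, 6.4] when gap is about 0.35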

We will dive into Python to see how this works. If you want to read more on SMOTE, the original research paper, “SMOTE: Synthetic Minority Over-sampling Technique”, was published in 2002.

NearMiss

NearMiss is an under-sampling technique. Instead of over-sampling the minority class, it shrinks the majority class: using a distance criterion, it keeps only the majority records closest to the minority class, so that the majority class ends up the same size as the minority class.
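
As a rough illustration of that selection rule, here is a toy sketch of the NearMiss-1 idea (not the library code):

import numpy as np
from sklearn.metrics import pairwise_distances

# Toy data: six majority-class points and two minority-class points
majority = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.], [9., 9.], [10., 9.]])
minority = np.array([[5., 4.], [6., 4.]])

# NearMiss-1 keeps the majority points whose average distance to their
# closest minority neighbours is smallest
dist = pairwise_distances(majority, minority)        # shape (6, 2)
avg_to_closest = np.sort(dist, axis=1)[:, :2].mean(axis=1)

keep = np.argsort(avg_to_closest)[:len(minority)]    # keep as many as the minority class
print(majority[keep])                                # the majority points nearest the minority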

Let’s see these practically:

Dataset

I’ll be using the Bank Marketing dataset from UCI. The column to predict is “y”, which says either yes or no for whether the client subscribed to a term deposit. The full code is available on GitHub.

I have put the data in a variable called “bank”. For the sake of simplicity, I’ve removed the “poutcome” and “contact” columns and dropped the NAs.

So, from 45,211 records, we are left with 43,193. I’ve also mapped “yes” and “no” to 1 and 0 respectively.
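
A minimal sketch of that preprocessing, assuming the semicolon-separated bank-full.csv file from UCI (the exact steps are in the GitHub repo):

import pandas as pd

# Load the UCI Bank Marketing data (assumed file name and separator)
bank = pd.read_csv('bank-full.csv', sep=';')

# Drop the "poutcome" and "contact" columns and any rows with missing values
bank = bank.drop(columns=['poutcome', 'contact']).dropna()

# Map the target: yes -> 1, no -> 0
bank['y'] = bank['y'].map({'yes': 1, 'no': 0})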

Now to the main part:

If we check:

bank.y.value_counts()

We get:

0    38172
1     5021
Name: y, dtype: int64

The dataset contains 38172 records of clients without term deposit subscription and only 5021 records of clients with term deposit subscription. Clearly an imbalanced dataset.

Next, we split the dataset, fit a Logistic Regression, and check the accuracy score.
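
Before fitting, the categorical columns need to be encoded. A minimal sketch of the imports, encoding, and split, assuming one-hot encoding with pd.get_dummies and the default 25% test split (the exact code is in the GitHub repo):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# One-hot encode the categorical feature columns (assumed approach)
X = pd.get_dummies(bank.drop(columns=['y']))
y = bank['y']

# Hold out a test set (default 25% split)
X_train, X_test, y_train, y_test = train_test_split(X, y)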

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
accuracy_score(y_test, y_pred)
Out[1]: 0.8930456523752199

Great, 89%! Now let us check the confusion matrix:

confusion_matrix(y_test, y_pred)
array([[9371,  173],
       [ 982,  273]], dtype=int64)

We can clearly see that this is a bad model: it misses most of the clients who actually subscribed to a term deposit. Let’s check the recall score:

recall_score(y_test, y_pred)
Out[1]: 0.21752988047808766

Clearly bad. This is expected, since one category has far more records than the other, so the classifier simply favours the majority class.

Applying SMOTE:

You might need to install the imbalanced-learn package (imported as imblearn) from your prompt / terminal:

pip install imbalanced-learn

Now:

from imblearn.over_sampling import SMOTE

Before fitting SMOTE, let us check the y_train values:

y_train.value_counts()
0    28628
1     3766
Name: y, dtype: int64

Let us fit SMOTE: (You can check out all the parameters from here)

smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn versions

Now let us check the number of records in each category:

np.bincount(y_train)
Out[48]: array([28628, 28628], dtype=int64)

Both categories now have an equal number of records; the minority class has been over-sampled up to the size of the majority class.

Now we fit the classifier again on the resampled training data and test it.
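
The refit itself is unchanged; only the training data is new (a minimal sketch):

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)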

accuracy_score(y_test, y_pred)
Out[1]: 0.8025743124363367

Whoa! The accuracy has dropped. But let us check the confusion matrix anyway:

confusion_matrix(y_test, y_pred)
array([[7652, 1892],
       [ 240, 1015]], dtype=int64)

Compared to the previous one, this is a good model, and the recall is great:

recall_score(y_test, y_pred)
Out[1]: 0.8087649402390438

I would go ahead with this model rather than the previous one. Now let us check what happens if we use NearMiss.

Applying NearMiss:

Import NearMiss:

from imblearn.under_sampling import NearMiss

Fit NearMiss (note that it should be applied to the original training split, not the SMOTE-resampled one): (You can check all the parameters from here)

nr = NearMiss()
X_train, y_train = nr.fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn versions

Now let us check the number of records in each category:

np.bincount(y_train)
array([3766, 3766], dtype=int64)

Here, the majority class has been reduced to the size of the minority class, so both classes have an equal number of records.

Now let us fit the classifier again, exactly as in the SMOTE refit above, and test the model:

confusion_matrix(y_test, y_pred)
array([[5102, 4442],
       [ 162, 1093]], dtype=int64)

accuracy_score(y_test, y_pred)
Out[1]: 0.573664228169275

recall_score(y_test, y_pred)
Out[1]: 0.8709163346613545

This model is better than the first one because it actually identifies the clients who subscribed (much higher recall), even though its accuracy is lower. But since, in this case, SMOTE gives me both good accuracy and good recall, I’ll go ahead and use that model! :)
