How to Deal with Imbalanced Data using SMOTE

With a Case Study in Python

Khyati Mahendru
Analytics Vidhya
3 min read · Jun 25, 2019


With libraries like scikit-learn at our disposal, building classification models is just a matter of minutes. However, building models without properly examining the structure of your data can lead to disastrous results.

Imagine that you are a doctor. Sadly, you have discovered a tumor in one of your patients, and you have to investigate whether it is cancerous. To give a primary diagnosis, you note down some features of the tumor and feed them to your trusted classification model for prediction.

What if your model predicts that the tumor is benign while in reality it is cancerous? Tragic! The cost of this wrong prediction is huge — it might delay treatment and even lead to the death of the patient.

The Problem with Imbalanced Data

In classification problems, the balance of your classes matters a great deal. Data is said to be imbalanced when instances of one class outnumber those of the other class(es) by a large proportion.

Feeding imbalanced data to your classifier can make it biased in favor of the majority class, simply because it did not have enough examples to learn about the minority class.
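To see why this bias is so easy to miss, here is a minimal sketch with a hypothetical 1,000-sample dataset where 95% of labels belong to the majority class. A degenerate "classifier" that always predicts the majority class still scores 95% accuracy while missing every minority case:

```python
# Hypothetical imbalanced dataset: 950 majority (0) and 50 minority (1) labels
y_true = [0] * 950 + [1] * 50

# A "classifier" that blindly predicts the majority class for every sample
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 — looks great, yet every minority instance is missed
```

This is why metrics like recall on the minority class matter more than plain accuracy when the data is imbalanced.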

There are several sampling methods to deal with this. You can (read should) check out the articles below to learn about all of them in detail:

In this article, I will discuss one of the sampling techniques — Synthetic Minority Over-Sampling Technique, abbreviated as SMOTE.

What is SMOTE?

Just like the name suggests, the technique generates synthetic data for the minority class.

SMOTE works by joining minority-class points with line segments and then placing synthetic points along those segments.

Under the hood, the SMOTE algorithm works in four simple steps:

  1. Choose a minority class input vector
  2. Find its k nearest neighbors (k is set by the k_neighbors parameter of the SMOTE() function)
  3. Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor
  4. Repeat steps 1–3 until the data is balanced
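The four steps above can be sketched directly in NumPy. This is an illustrative toy implementation, not the library's code; the function name `smote_sketch` and the toy data are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_synthetic, k=5):
    """Generate n_synthetic samples from minority-class points X_min."""
    synthetic = []
    for _ in range(n_synthetic):
        # Step 1: choose a minority-class input vector at random
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Step 2: find its k nearest neighbors (excluding the point itself)
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        # Step 3: pick one neighbor and place a synthetic point somewhere
        # on the line segment joining x to that neighbor
        j = rng.choice(neighbors)
        synthetic.append(x + rng.random() * (X_min[j] - x))
    # Step 4: repeating until the data is balanced amounts to choosing
    # n_synthetic = (majority count) - (minority count)
    return np.array(synthetic)

X_min = rng.normal(size=(20, 2))            # toy minority-class points
X_new = smote_sketch(X_min, n_synthetic=30)
print(X_new.shape)                          # (30, 2)
```

Because each synthetic point is a convex combination of two existing minority points, every generated sample lies inside the region already occupied by the minority class.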

In Python, SMOTE is implemented in the imbalanced-learn library (imported as imblearn).

I would recommend reading the documentation for SMOTE to get acquainted with its various parameters.

Credit Card Fraud Detection: SMOTE in Python

We will work with the Credit Card Fraud Detection dataset, available on Kaggle.

The Python notebook may take time to render. You can also view it here.

End Notes

There are many sampling techniques for balancing data. SMOTE is just one of them. But, there’s no single best technique. Generally, you need to experiment with a few of them before deciding on one. Make sure to check out the resources I attached above to learn about all the sampling techniques.

In the end, I leave you with a challenge. Try improving the performance of the classifier by using different classification algorithms in combination with SMOTE, and evaluate with metrics suited to imbalanced data, such as precision, recall, or the F1-score. Let me know your results down in the comments.
