Oversampling to remove class imbalance using SMOTE

Ashesh Das
2 min read · Nov 13, 2019


The class imbalance problem is very common whenever we are solving a classification problem. And this is expected as well, because ideally the number of positive reviews is far greater than the number of negative reviews (provided the product is decent, of course!).

But this class imbalance can lead to a far more serious situation in classification problems. Imagine we have a data set of 10,000 entries, of which 9,900 are positive responses and only 100 are negative. If we fit any standard classification algorithm on this data, it is likely to show a very high accuracy. This is because even a model that tags every entry as positive achieves 99% accuracy, while failing to predict a single negative review in the data set.
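The accuracy paradox described above is easy to verify directly. The snippet below (a minimal sketch using NumPy, with the same hypothetical 9,900/100 split as in the example) scores a "classifier" that always predicts the majority class:

```python
import numpy as np

# Hypothetical imbalanced labels: 9,900 positives (1) and 100 negatives (0)
y_true = np.array([1] * 9900 + [0] * 100)

# A degenerate "classifier" that predicts the majority class for every entry
y_pred = np.ones_like(y_true)

# Accuracy looks excellent, yet every single negative is misclassified
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99
```

This is why accuracy alone is a misleading metric on imbalanced data.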

To handle this situation, oversampling or undersampling is used. In this article we will discuss only one oversampling technique, called SMOTE.

First, what is oversampling? Oversampling is a technique in which we take the minority class and create new samples for it, so that its size matches that of the majority class. A diagram representing oversampling is given below:

Now let’s talk about SMOTE.

SMOTE stands for Synthetic Minority Oversampling Technique. What SMOTE essentially does is create synthetic data points for the minority class until it matches the size of the majority class.

How does SMOTE create these synthetic data points? SMOTE picks a data point in the minority class and finds its ‘k’ nearest neighbors within that class. Once the neighbors have been identified, SMOTE creates synthetic points along the line segments joining the primary point and its neighbors, so that the new points share the features and characteristics of the surrounding minority data.
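The interpolation step described above can be sketched in a few lines. This is a simplified illustration (not the full algorithm, which also performs the k-nearest-neighbor search): a synthetic point is placed at a random position on the segment between a minority point and one of its neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_interpolate(x, neighbor):
    """Create one synthetic sample on the line joining x and a neighbor.

    A minimal sketch of SMOTE's interpolation step; `x` and `neighbor`
    are assumed to be feature vectors from the minority class.
    """
    lam = rng.uniform(0, 1)          # random position along the segment
    return x + lam * (neighbor - x)  # lies between the two minority points

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_interpolate(x, neighbor)
```

Because the synthetic point lies between two real minority samples, each of its feature values stays within the range spanned by those samples.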

How SMOTE works and creates data points is explained in the diagram given below:

Synthetic Sample generation using SMOTE

Thus, by using SMOTE we can get rid of the class imbalance problem and ensure that our model learns from the patterns of the data set and not from the noise.

SMOTE can be implemented in Python, and I have uploaded a practical example of implementing SMOTE on my GitHub.

Please have a look.

Happy Learning :)
