Properly Using SMOTE

Tales of a Learning Experience

Andrew Ozbun
Nerd For Tech
3 min read · Jul 5, 2021


Learning anything new is just that, a learning process. You don’t know until you know. It’s always insightful to look back at work from when you were first learning something, laughing at yourself with that internal dialogue of “What the heck was I thinking?? This is terrible!!”

I was recently having coffee with a data scientist who had a lot more experience than I do. He looked through my notebooks and was overall very complimentary. At one point, however, he wrinkled his face and asked, “Why did you use SMOTE for both the training and testing sets on your models?” After a second I replied, “Because… uh… I don’t know.”

I took his advice and used the SMOTE data set for my training and the original data set for my testing. Doing this improved the accuracy of my models and left me with some newfound wisdom about this technique. In this article I want to go over what SMOTE is, when it is used, and how to use it properly.

What is SMOTE?

SMOTE stands for Synthetic Minority Over-sampling Technique. It is loosely similar in spirit to bootstrapping in statistics: rather than simply duplicating existing observations, it generates new synthetic minority-class data points by interpolating between a minority-class sample and its nearest minority-class neighbors, bringing the minority class up to parity with the majority class.

When is SMOTE used?

Real-world data frequently suffers from class imbalance: one category of a categorical target variable vastly outnumbers the other. This comes up most often in binary classification, where we are classifying a variable as 1 or 0 and the 1s (fraudulent transactions, rare diseases, churned customers) make up only a small fraction of the data. A naive model can then score high accuracy by always predicting the majority class while learning nothing about the minority class we actually care about.
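To make the imbalance concrete, here is a quick illustration on synthetic stand-in data (the 95/5 split is an assumption for the example, not from the original notebooks):

```python
from collections import Counter

from sklearn.datasets import make_classification

# Synthetic stand-in for a real-world problem like fraud detection:
# roughly 95% of rows are class 0 and only 5% are class 1.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

# A "model" that always predicts the majority class is ~95% accurate
# while never identifying a single minority-class case.
majority_accuracy = Counter(y)[0] / len(y)
print(f"always-predict-0 accuracy: {majority_accuracy:.2f}")
```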

How is it Used Properly?

When using SMOTE you CANNOT use it for both the training and testing set. This, for me, was the magical and elusive secret to using this method. The reasoning is that it is appropriate to inflate your training set, because the extra minority-class samples help the model learn the minority class. Once you are testing, however, you have to use real-world data. If the model is tested on augmented data, the test results no longer represent what will happen in the real world. Therefore, you train on SMOTE data and test on reserved original data. You can see in the code below where I used the training sets as X_train_sm for the model and the testing sets as X_test for the classification report. In this specific model I was working with a time-series-based data set, so you will notice I set shuffle to False so that the split respects the sequential order of the data.

Conclusion

Messing up and getting messy is a part of learning. I am at a point in my life where I am no longer ashamed of my mistakes; I embrace them. I spent an entire day going through my Jupyter Notebooks correcting SMOTE mistakes. In some cases it even brought down the accuracy of my models, but the new numbers were honest. Making my work reflect real life was humbling and insightful. Learning through our mistakes is the only way to grow.
