How SMOTE Oversampling Improved the Accuracy on Imbalanced Classification with Python (Part I)

Xue Susan Chen
2 min read · Mar 23, 2020


One of the challenges in modeling real-world data is imbalance among classes; in some cases the data is so imbalanced that modeling is almost impossible without some manipulation. Sampling can be used to deal with this issue, and the Synthetic Minority Over-sampling Technique (SMOTE) is one of the most commonly used methods. It balances the class distribution by synthesizing new minority-class samples, interpolating between existing minority samples and their nearest neighbors. In this blog I will demonstrate how I used SMOTE and a Random Forest to predict whether income exceeds $50K/yr based on census data.

The data was obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult). It is also known as the “Census Income” dataset. During preprocessing, the column ‘native-country’ was dropped because 90% of the rows share the same value. Outliers were identified in two columns, ‘capital-gain’ and ‘hours-per-week’, where some values are unreasonably high, likely due to rare cases or data-entry mistakes.
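A minimal sketch of this preprocessing step is shown below. The column names follow the UCI Adult dataset description; the 99th-percentile outlier cutoff is an illustrative assumption, not the exact threshold used here.

```python
import pandas as pd

# Column names from the UCI Adult ("Census Income") dataset description
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week',
        'native-country', 'income']
df = pd.read_csv('adult.data', names=cols, skipinitialspace=True)

# Drop 'native-country': ~90% of rows share the same value
df = df.drop(columns=['native-country'])

# Remove extreme values in 'capital-gain' and 'hours-per-week'
# (the 99th percentile is an assumed, illustrative cutoff)
for col in ['capital-gain', 'hours-per-week']:
    cap = df[col].quantile(0.99)
    df = df[df[col] <= cap]
```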

Then one-hot encoding was used to convert the categorical variables into numeric features the model can work with, and the data was split into training and test sets with test_size = 0.33. The baseline accuracy (always predicting the majority class) is 76.3% on the training set and 80.5% on the test set. A Random Forest classifier was used to predict the income class; hyperparameter tuning had been completed beforehand.
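A sketch of this step, continuing from the DataFrame above. The random state and the Random Forest hyperparameters shown are placeholders; the post does not list the tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; encode the target as 0/1
X = pd.get_dummies(df.drop(columns=['income']))
y = (df['income'] == '>50K').astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Baseline accuracy: always predict the majority class (<=50K)
print('Train baseline:', 1 - y_train.mean())
print('Test baseline:', 1 - y_test.mean())

# Random Forest with previously tuned hyperparameters
# (n_estimators and max_depth here are assumed placeholders)
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            random_state=42)
rf.fit(X_train, y_train)
print('Train accuracy:', rf.score(X_train, y_train))
print('Test accuracy:', rf.score(X_test, y_test))
```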

Compared to the baseline, the accuracy increased significantly. In order to improve the accuracy further, I introduced SMOTE to balance the dataset. The code is listed below:
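(A sketch using the imbalanced-learn library; the SMOTE parameters shown are the library defaults, not necessarily the exact settings used originally. Oversampling is applied to the training set only, so the test set stays untouched.)

```python
from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples until the training set is balanced
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Retrain the same Random Forest on the balanced training data
rf_sm = RandomForestClassifier(n_estimators=100, max_depth=10,
                               random_state=42)
rf_sm.fit(X_train_sm, y_train_sm)
print('Train accuracy (SMOTE):', rf_sm.score(X_train_sm, y_train_sm))
print('Test accuracy (SMOTE):', rf_sm.score(X_test, y_test))
```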

Conclusion:

The results show that after balancing the dataset with SMOTE, the training accuracy increases from 88.25% to 91.61%, while the testing accuracy did not change much. Will SMOTE work well on an even more imbalanced dataset? I will discuss that in the next blog: How SMOTE Oversampling Improved the Accuracy on Imbalanced Classification with Python (Part II)
