Relative effects of up-sampling and down-sampling on an imbalanced dataset
A case study of online marketing purchase prediction
Imbalanced datasets are not uncommon in classification problems. In this article we will examine a real-world problem of classifying the visitors of an online shopping platform as purchasers and non-purchasers, and discuss which solution works best across various aspects of performance.
An insight into the problem
For online shopping platforms, distinguishing real purchasers from the crowd of onlookers is highly desirable for succeeding in marketing efforts and meeting annual balance-sheet goals. Success in these predictions can increase profits by enabling dynamic pricing with occasional discounts to encourage purchases, and by letting the platform engage likely purchasers personally rather than wasting time on idle onlookers.
As we will see in a few minutes, the portion of the dataset[1] we study has a characteristic class imbalance problem, since the vast majority of the entries (84.5%) are non-purchasers.
Getting started for the model training through preprocessing
For the sake of simplicity, and to keep the focus on the class imbalance issue and on resampling the dataset, we will stick to basic preprocessing and feature engineering. That is a topic for another article, and there are already many valuable sources on TDS regarding it.
As you can see, the positive samples in the data frame are quite scarce.
Although the features are mainly numeric, there are still a couple of non-numeric columns.
Let’s first convert the Month and VisitorType columns from object type (string) into float via the OneHotEncoder class, and then scale all the numeric columns into the range [0, 1] for simplicity and computational advantages. The columns of bool type are translated into 0 or 1 automatically during scaling with MinMaxScaler.
Model Training via Random Forest Classifier
As the classifier for our predictive model we will employ the Random Forest Classifier (RFC)[2]. Apart from it being one of my favourites, there is no special reason for this choice. Hopefully, we will investigate how to pick the most appropriate and efficient classifier from a collection of candidates in a future article.
For now, let’s train our model on the still-imbalanced dataset using the Random Forest Classifier and examine the results:
Just to illustrate the parameters' functionality, we will add n_jobs=-1 so that training uses all the cores of our CPU and completes faster.
The other parameters, such as max_depth, max_features, min_samples_leaf and min_samples_split, limit how deeply the trees branch in order to preclude overfitting. Lastly, the n_estimators parameter sets the number of trees in the RFC, since it consists of a set of Decision Trees.
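The training step could look like the sketch below. The hyperparameter values are illustrative placeholders rather than the article's actual settings, and the data is synthetic, generated with roughly the same 84.5% majority share:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: ~84.5% negative class
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.845], random_state=42)

rfc = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=10,          # cap tree depth to curb overfitting
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=5,    # minimum samples allowed in a leaf
    min_samples_split=10,  # minimum samples required to split a node
    n_jobs=-1,             # use all CPU cores
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
acc = rfc.score(X_test, y_test)
```

Note that `stratify=y` keeps the class ratio identical in the train and test splits, which matters when the positive class is this rare.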
Scores and Performance Metrics Output:
In this study, precision is more important and critical than recall, to avoid being diverted by false positives, so we set the beta (β) parameter to 0.5.[4]
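In scikit-learn this is the fbeta_score function; a beta below 1 weights precision more heavily than recall. A small made-up example with 3 true positives, 2 false positives and 1 false negative:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels for illustration: TP=3, FP=2, FN=1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.60
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
# beta=0.5 pulls the score toward precision
f_half = fbeta_score(y_true, y_pred, beta=0.5)  # 0.625
```

Here the F0.5 score (0.625) sits closer to the weaker precision than the plain F1 score (~0.667) would, which is exactly the penalty we want for false positives.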
It is fair to say that the performance metrics are not very encouraging, except for the accuracy score, which can be extremely misleading at times, especially when dealing with class imbalance.
Solution via Down-sampling the Data Set
Now we can apply the down-sampling method and compare the results. Down-sampling simply eliminates the surplus of majority-class samples to balance the sizes of the positive and negative entries.
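A minimal down-sampling sketch with sklearn.utils.resample, on a toy frame whose Revenue column plays the role of the dataset's target (the article's exact resampling code may differ):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 3 purchasers vs. 17 non-purchasers
df = pd.DataFrame({"feature": range(20),
                   "Revenue": [1] * 3 + [0] * 17})

majority = df[df["Revenue"] == 0]
minority = df[df["Revenue"] == 1]

# Draw majority-class rows without replacement, down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
```

Down-sampling is applied only to the training split, never the test split, so that evaluation still reflects the real class distribution.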
When we train the very same Random Forest Classifier, the performance metrics come out as follows:
The improvement across all the success criteria is striking. It goes without saying that down-sampling worked, generating promising results.
So why not try up-sampling this time?
Solution via Up-sampling the Data Set
In contrast to down-sampling, up-sampling populates the minority class with new, similar and relevant entries to compensate for the shortfall and balance the sizes of the positive and negative entries.
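The simplest form of up-sampling is drawing minority-class rows with replacement (random over-sampling); more elaborate schemes such as SMOTE synthesize genuinely new entries instead. A sketch of the simple form, on the same kind of toy frame as before:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 3 purchasers vs. 17 non-purchasers
df = pd.DataFrame({"feature": range(20),
                   "Revenue": [1] * 3 + [0] * 17})

majority = df[df["Revenue"] == 0]
minority = df[df["Revenue"] == 1]

# Draw minority-class rows WITH replacement, up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
```

Because the minority rows are duplicated rather than discarded, no information is lost, at the cost of a larger training set.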
The performance metrics based on the up-sampled data come out as follows:
This model evidently outperforms the first RFC model and is even slightly better than the previous one in all aspects, especially in terms of precision and recall scores.
Conclusion
Up/down-sampling worked like a charm and significantly increased the prediction success on the rare class instances. The model trained on the up-sampled dataset outperformed the one trained on the down-sampled data.
Based on experience, up-sampling is usually better to a certain extent, with one exception: if we are dealing with a huge dataset and facing memory scarcity[5], down-sampling is preferable since it shrinks the overall sample size.
The performance metrics obtained in this specific case study are not the upper limit. It would have been possible to achieve scores between 0.95 and 0.98 with further effort in feature engineering, model selection and parameter refinement. Nonetheless, those are beyond the scope of this article and hopefully the subject of another one.
In conclusion, up- and down-sampling is a good old trick which, alongside many other tools and methods, boosts a model's performance with one touch in many cases of class imbalance.
Sources:
- https://unsplash.com/photos/_3Q3tsJ01nc
- [1] https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
- [2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- [4] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score
- [5] https://stats.stackexchange.com/questions/122409/why-downsample