Relative effects of up-sampling and down-sampling on an imbalanced dataset

--

A case study of online marketing purchase prediction

(Photo: unsplash.com)[1]

Imbalanced data sets are not a rare sight in classification problems. In this article we will examine a real-world problem, classifying the visitors of an online shopping platform as purchasers and non-purchasers, and discuss which solution works best across various aspects of performance.

An insight into the problem

For online shopping platforms, it is highly desirable to distinguish real purchasers from the crowd of onlookers in order to succeed in their marketing efforts and meet the annual goals in their balance sheets. Success in these predictions can increase profits by enabling dynamic pricing with occasional discounts to encourage purchases, and by letting sales teams deal with likely purchasers personally rather than wasting time on idle onlookers.

As we will see within a few minutes, the portion of the dataset[2] we study has a characteristic class imbalance problem, since the vast majority of the entries (84.5%) are non-purchasers.

Getting started with preprocessing before model training

For the sake of simplicity, and to keep the focus on the class imbalance issue and on resampling the dataset, we will stick to a basic preprocessing and feature engineering practice. That is a topic for another article, and there are already many valuable sources on TDS regarding this concept.

import libraries, retrieve dataset and inspect class distribution
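The original code embed is not reproduced here, but a minimal sketch of this step could look like the following, assuming the dataset has been saved locally under the file name published on the UCI page, online_shoppers_intention.csv:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the UCI Online Shoppers Purchasing Intention dataset;
# adjust the path to wherever your copy lives.
df = pd.read_csv("online_shoppers_intention.csv")

# 'Revenue' is the target column: True for purchasers, False for non-purchasers.
print(df["Revenue"].value_counts(normalize=True))
```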

As you can see, the positive samples in the data frame are quite scarce.

Although the features are mainly numeric, there are still a couple of non-numeric columns.

Let’s first convert the Month and VisitorType columns from object type (string) into numeric columns via the OneHotEncoder class, and then scale all the numeric features to the range [0, 1] for simplicity and computational advantages.

The columns with bool type are automatically translated into 0 or 1 during scaling with MinMaxScaler.

converts raw data to numerical columns and does train-test split
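A sketch of that step might look as follows. The column names Month, VisitorType and Revenue come from the UCI dataset, while the 80/20 split ratio, the random seed and the sparse_output argument (scikit-learn ≥ 1.2) are assumptions of this sketch:

```python
# One-hot encode the two categorical columns
categorical_cols = ["Month", "VisitorType"]
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = pd.DataFrame(
    encoder.fit_transform(df[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=df.index,
)

# Assemble the feature matrix and the 0/1 target
X = pd.concat([df.drop(columns=categorical_cols + ["Revenue"]), encoded], axis=1)
y = df["Revenue"].astype(int)

# Scale every feature (bool columns included) into [0, 1]
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Stratified split keeps the class ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```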

Model Training via Random Forest Classifier

As the classifier to train our predictive model, we will employ the Random Forest Classifier (RFC)[3]. Other than it being one of my favourites, there is no special reason for this selection. Hopefully, we will investigate how to determine the most appropriate and efficient choice among a collection of possible classifiers in a future article.

For now, let’s train our model on our still-imbalanced dataset using the Random Forest Classifier and examine the results:

Just to illustrate the functionality of the parameters, we will add n_jobs=-1 so that training makes use of all the cores of our CPU and completes faster.

The other parameters, such as max_depth, max_features, min_samples_leaf and min_samples_split, are used to limit the depth and branching of the trees in order to prevent overfitting. Lastly, the n_estimators parameter sets the number of trees in the RFC, since it consists of an ensemble of Decision Trees.

trains the first Random Forest Classifier model on the original (not resampled) dataset entries
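The exact hyperparameter values are not listed in the text, so the ones below are illustrative placeholders; only n_jobs=-1 and the parameter names themselves come from the article:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score

rfc = RandomForestClassifier(
    n_estimators=100,       # number of trees in the ensemble
    max_depth=10,           # cap tree depth to limit overfitting
    max_features="sqrt",    # features considered at each split
    min_samples_leaf=2,     # minimum samples in a leaf node
    min_samples_split=5,    # minimum samples needed to split a node
    n_jobs=-1,              # use all CPU cores
    random_state=42,
)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F0.5     :", fbeta_score(y_test, y_pred, beta=0.5))
```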

Scores and Performance Metrics Output:

In this study, precision is somewhat more important and critical than the recall score, in order to avoid being misled by false positives. So we set the beta (β) parameter to 0.5.[4]
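For reference, the F-beta score combines the two metrics as

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

so with β = 0.5 precision is weighted more heavily than recall, while β = 1 recovers the familiar F1 score.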

It is fair to say that the performance metrics are not very encouraging, except for the accuracy score, which can be extremely misleading at times, especially when dealing with class imbalance.

output for performance metrics of first model

Solution via Down-sampling the Data Set

Now we can apply the down-sampling method and compare the results. Down-sampling simply eliminates the excess part of the majority class samples to balance the sizes of positive and negative entries.

Down-sampling method: eliminates the excess part of the majority class, down to the size of the minority class
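The article does not reproduce the exact resampling code, so here is a minimal sketch using scikit-learn's resample utility; it reuses the X_train/y_train split from above, and random_state=42 is just an arbitrary seed:

```python
from sklearn.utils import resample

# Recombine training features and labels so rows stay aligned during resampling;
# 'Revenue' is the target column from the UCI dataset.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["Revenue"] == 0]   # non-purchasers
minority = train[train["Revenue"] == 1]   # purchasers

# Down-sample: keep only as many majority rows as there are minority rows
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)
train_down = pd.concat([majority_down, minority])

X_train_down = train_down.drop(columns="Revenue")
y_train_down = train_down["Revenue"]

# Retrain the very same classifier on the balanced, smaller training set
rfc_down = RandomForestClassifier(
    n_estimators=100, max_depth=10, max_features="sqrt",
    min_samples_leaf=2, min_samples_split=5, n_jobs=-1, random_state=42,
)
rfc_down.fit(X_train_down, y_train_down)
```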

When we train the very same Random Forest Classifier on this balanced set, the performance metrics come out as follows:

output for performance metrics of second model (trained with down-sampled data)

The improvement in all the success criteria draws attention. It goes without saying that down-sampling worked, generating promising results.

So, why not try up-sampling this time?

Solution via Up-sampling the Data Set

In contrast to down-sampling, up-sampling populates the minority class with new, similar and relevant entries to compensate for the shortfall and balance the sizes of positive and negative entries.

Up-sampling method: populates the minority class with new entries, up to the size of the majority class
trains an RFC model with the up-sampled data and prints its performance metrics
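Again as an unofficial sketch under the same assumptions, reusing the majority/minority frames from the down-sampling step. Note that plain resampling with replacement duplicates existing minority rows; synthetic methods such as SMOTE from the imbalanced-learn package would generate similar-but-new entries, which is closer to the description above:

```python
# Up-sample: draw minority rows with replacement until they match the majority size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_up = pd.concat([majority, minority_up])

X_train_up = train_up.drop(columns="Revenue")
y_train_up = train_up["Revenue"]

# Same classifier once more, trained on the up-sampled set,
# scored on the untouched (imbalanced) test set
rfc_up = RandomForestClassifier(
    n_estimators=100, max_depth=10, max_features="sqrt",
    min_samples_leaf=2, min_samples_split=5, n_jobs=-1, random_state=42,
)
rfc_up.fit(X_train_up, y_train_up)
y_pred_up = rfc_up.predict(X_test)
print("F0.5:", fbeta_score(y_test, y_pred_up, beta=0.5))
```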

The performance metrics based on the up-sampled data come out as follows:

output for performance metrics of third model (trained with up-sampled data)

This model evidently outperforms the first RFC model and is even slightly better than the previous one in all aspects, especially in terms of precision and recall scores.

Performance metrics according to Data Sampling

Conclusion

Up/down-sampling worked like a charm and significantly increased the prediction success for the rare class instances. The model trained on the up-sampled dataset outperformed the one trained on the down-sampled data.

In my experience, up-sampling is mostly better to a certain extent, with one exception: if we are dealing with a huge dataset and facing memory scarcity[5], down-sampling is preferable since it shrinks the overall sample size.

The performance metrics obtained in this specific case study are not the upper limit. It would have been possible to achieve scores ranging between 0.95 and 0.98 with further effort in feature engineering, model selection and parameter refinement. Nonetheless, those are outside the scope of this article and will hopefully be the subject of another one.

In conclusion, up- and down-sampling are good old tricks which, alongside many other tools and methods, can increase a model's performance with one touch in many cases of class imbalance.

Sources:

  1. https://unsplash.com/photos/_3Q3tsJ01nc
  2. https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
  3. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
  4. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score
  5. https://stats.stackexchange.com/questions/122409/why-downsample

--

Caner Burç BAŞKAYA
Economics, Society and Data Science

Software engineering student and an enthusiastic data science practitioner