SMOTE-NC in ML Categorization Models for Imbalanced Datasets

Fernando Aguilar
Published in Analytics Vidhya
6 min read · Oct 9, 2019

Introduction

For this project I used the Online Shoppers Purchasing Intention dataset, obtained from the UCI Machine Learning Repository. The goal was to build a predictive machine learning model that categorizes users as either revenue generating or non-revenue generating based on their behavior while navigating a website.

This blog post focuses on SMOTE-NC and its effect on the scores of the machine learning models used to categorize the data. The exploratory data analysis and data transformation techniques used in this project are not covered in detail here. If you would like to see the whole notebook for more detail, follow this link:

https://github.com/feraguilari/dsc-mod-5-project-online-ds-pt-021119/blob/master/student.ipynb

The Dataset

The dataset contains 18 columns: 17 are features and 1 is the target variable, in this instance ‘Revenue’. Below is a description of what each column in the dataset means:

  • Administrative: Number of ‘administrative’ pages viewed
  • Administrative_Duration: Time spent looking at ‘administrative’ pages
  • Informational: Number of ‘informational’ pages viewed
  • Informational_Duration: Time spent looking at ‘informational’ pages
  • ProductRelated: Number of ‘product related’ pages viewed
  • ProductRelated_Duration: Time spent looking at ‘product related’ pages
  • BounceRates: The percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session
  • ExitRates: For all page views to the page, the percentage that were the last in the session
  • PageValues: Represents the average value for a web page that a user visited before completing an e-commerce transaction
  • SpecialDay: Indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day) in which the sessions are more likely to be finalized with transaction
  • Month: Month of the year for the session
  • OperatingSystems: Operating system used for the session
  • Browser: Browser used for the session
  • Region: Region of the user
  • TrafficType: Traffic Type
  • VisitorType: Types of Visitor
  • Weekend: Session occurred on a weekend or not
  • Revenue: Represents whether the user generated revenue or not

[Figure: Dataset head, first inspection]

The dataset can be downloaded from the following site: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

Imbalanced Dataset

One concern with this dataset is that the target incidence suggests it is imbalanced. Target incidence, defined as the number of cases of each individual target value in a dataset, shows whether the dataset is balanced or imbalanced. This matters because the aim of the project is to predict whether a user session generated revenue. The model is a binary classifier, meaning that there are only 2 possible outcomes:

  1. False (value: 0) — The session did not generate revenue.
  2. True (value: 1) — The session generated revenue.

[Figure: Bar plot of the normalized target variable incidence]
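Checking the incidence is a one-line pandas call. A minimal sketch, using a hypothetical 85/15 series as a stand-in for the real ‘Revenue’ column:

```python
import pandas as pd

# Hypothetical stand-in for the dataset's 'Revenue' target column.
y = pd.Series([False] * 85 + [True] * 15, name="Revenue")

# Target incidence: the share of each class in the dataset.
incidence = y.value_counts(normalize=True)
print(incidence)
# False    0.85
# True     0.15
```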

A 15% target incidence could work given the amount of data in the dataset; however, the proportion is still very small and I consider it imbalanced. Hence, I will implement data augmentation techniques in order to boost the target incidence with synthetic data. To tackle the issue of class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was introduced by Chawla et al. [3] in 2002.

Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE is a technique based on nearest neighbors, judged by Euclidean distance between data points in feature space. For this project I used the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTE-NC) from the imbalanced-learn library, which creates synthetic data for categorical as well as quantitative features in the dataset. SMOTE-NC slightly changes the way a new sample is generated by treating the categorical features specially: the categories of a newly generated sample are decided by picking the most frequent category among the nearest neighbors present during generation.

[Figure: Generating a new synthetic data point using SMOTE based on k-nearest neighbors. © imbalanced-learn]

By this point the original dataset has been one-hot encoded and scaled, and split into a training and a testing dataset. It is very important to apply SMOTE only to the training set and not the testing set, to avoid contaminating the evaluation and introducing bias into the models.

Testing

I trained four plain-vanilla machine learning algorithms before applying SMOTE-NC to the training set: decision tree, logistic regression, random forest, and gradient boosting. Given the imbalanced nature of the data, the most informative classification scores are the F1 score and the area under the ROC curve (roc_auc).
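For reference, both scores can be computed with scikit-learn for any fitted classifier. A sketch on synthetic data (not the project’s dataset), using gradient boosting as the example model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features (~85/15).
X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

f1 = f1_score(y_te, clf.predict(X_te))
# roc_auc is computed from the positive-class probability, not the hard label
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("f1:", round(f1, 3), "roc_auc:", round(auc, 3))
```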

In order to gauge the effectiveness of performing SMOTE-NC on the dataset, here are the initial scores for each of the models after the dataset has been cleaned, one-hot encoded, scaled, and split.

It is important to note that the split, performed with the train_test_split function in the sklearn library, was done setting the argument ‘stratify=target’ to keep the same target incidence in both the training and testing datasets.
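A minimal illustration of what stratification buys you, on toy labels with the same 15% incidence:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 85 + [1] * 15)  # 15% positive class

# stratify=y preserves the 15% incidence in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.mean(), y_te.mean())  # both 0.15
```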

#Display scores for each of the models
results3 = baseline_models(data=[X_train, X_test, y_train, y_test])
results3.sort_values('f1',ascending=False)
[Figure: Scores for each of the models before SMOTE-NC]

So far, the best performing model is gradient boosting, with an F1 score just shy of 67% and a roc_auc score of about 78%. Now let’s perform SMOTE-NC on the training set.

0    0.625
1    0.375
Name: outcome, dtype: float64

After applying SMOTE-NC on the training dataset with a sampling strategy of 0.6, the new target incidence has gone up to 37.5% from 15.47%. The factor by which the oversampling gets generated is specified in the sampling_strategy hyperparameter, which can be set to any ratio of your choosing: it is the desired ratio of minority to majority samples after resampling. If it is set to 1.0, the minority class is oversampled until both classes have the same number of instances.

Another hyperparameter that can be tuned is k_neighbors, which indicates the number of nearest neighbors used to construct synthetic samples. The default is 5, which is what worked best for my dataset.

As for the increased performance of the models on my dataset, the biggest improvement was in the roc_auc score, which went up from about 78% to a little over 84%.

Conclusion

SMOTE-NC is a great tool for generating synthetic data to oversample a minority target class in an imbalanced dataset. The parameters that can be tuned are k_neighbors, which determines the number of nearest neighbors used to create each new sample, and sampling_strategy, which indicates how many new samples to create. It is important to remember to apply it only to the training dataset, in order to avoid introducing bias into the model.

References and Further Reading

[3] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.


Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at PennState, and Flatiron Data Science Bootcamp Graduate.