Using Crucio SMOTENC for unbalanced datasets

Published in

softplus-publication

3 min readJun 16, 2021

Unbalanced data

In simple terms, an unbalanced dataset is one in which the target variable has more observations in one specific class than the others. An example can be a dataset for transactions: usually, 99% of this set consists of simple, not fraudulent transactions, and only 1% of them are fraud.

How does SMOTENC work?

SMOTENC (Synthetic Minority Over-sampling Technique for Nominal and Continuous) is actually very similar to classic SMOTE but, unlike SMOTE, SMOTE-NC is able to work with both numerical and categorical features.

SMOTE-NC slightly changes the way a new sample is generated by performing something specific for the categorical features. In fact, the categories of a new generated sample are decided by picking the most frequent category of the nearest neighbors present during the generation, and while calculating Euclidian distance when searching for k-nearest-neighbors, the algorithm just adds the square of the median of standard deviations of every column that represent numerical values.

Using Crucio SMOTENC

In case you didn’t install Crucio yet, then use in the terminal the following:

pip install crucio

Now we can import the algorithm and create the SMOTENC balancer.

from crucio import SMOTENC
smotenc = SMOTENC()
balanced_df = smotenc.balance(df, 'target')

The SMOTENC() initialization constructor can contain the following arguments:

k (int > 0, default = 5) : The number of nearest neighbors from which SMOTE will sample data points.
seed (int, default = 45) : The number used to initialize the random number generator.
binary_columns (list, default = None) : The list of binary columns from data set, so sampled data be approximated to nearest binary value.

The balance() method takes as parameters the panda’s data frame and the name of the target column.

Example:

So I chose a data set where we have to predict the type of a Pokemon (Legendary or not), the Legendary class constitute 8% out of all dataset, so it is definitely an imbalanced dataset.

We recommend you, before applying any module from Crucio, to first split your data into a train and test data sets and balance only the train set. In such a way you will test the performance of the model only on natural data.

The basic Random Forest algorithm gives an accuracy of approximately 97% by training on imbalanced data, so now it’s time to test out SMOTE algorithm.

smotenc = SMOTENC()
new_df = smotenc.balance(df,'Legendary')

After balancing data, we increased accuracy from 97% to 100%.

Conclusion:

SMOTENC is a very good technique to balance datasets when you have categorical columns and you don’t wanna lose their predicting power. I encourage you to test it and many other balancing methods from Crucio such as SMOTETOMEK, SCUT, SLS, ADASYN, ICOTE.

Made with ❤ by Sigmoid.