Using Crucio SMOTE for balancing data
This article opens a new series about our brand new Python library, made with love by Sigmoid. It is named Crucio, after the second Unforgivable Curse in the Harry Potter series. The library was created specifically for unbalanced data sets and offers many different methods that can be useful in different situations.
If you missed our first library, Kydavra, which was created for feature selection, I suggest you check it out after reading this article.
What is SMOTE?
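SMOTE (Synthetic Minority Oversampling Technique) is a very popular and simple technique for balancing data. It is based on the k-nearest neighbors (KNN) algorithm: new minority-class examples are created between existing minority points and their nearest neighbors.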
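To make the idea concrete, here is a minimal sketch of the interpolation step behind SMOTE in plain Python. It is a simplified variant and not Crucio's implementation: the function name smote_sketch, its parameters, and the toy points are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=45):
    """Generate n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration only; not part of Crucio.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority point as the starting sample.
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to it.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority-class points with two features each (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=3, k=2)
```

Because every synthetic point lies on a line segment between two real minority points, the new examples stay inside the region the minority class already occupies.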
Using SMOTE from Crucio
If you haven’t installed Crucio yet, just type the following in the command line.
pip install crucio
Now we can import and use our algorithm:
from crucio import SMOTE
smote = SMOTE()
new_df = smote.balance(df, 'target')
The SMOTE() constructor accepts the following arguments:
- k (int > 0, default = 5) : The number of nearest neighbors from which SMOTE will sample data points.
- seed (int, default = 45) : The number used to initialize the random number generator.
- binary_columns (list, default = None) : The list of binary columns in the data set, so that sampled values in these columns are rounded to the nearest binary value.
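The reason for a binary_columns option is that interpolating between two rows can produce a value such as 0.7 in a column that should only hold 0 or 1. The sketch below shows one way such values could be snapped back. It is an assumption about the general idea, not Crucio's actual code, and the helper snap_to_binary and the column names are hypothetical.

```python
def snap_to_binary(row, binary_columns):
    """Round the listed columns of a synthetic row to 0 or 1.

    Hypothetical helper for illustration only; not part of Crucio.
    """
    snapped = dict(row)
    for col in binary_columns:
        # Pick whichever of 0 and 1 is nearer (ties at 0.5 go to 1 in this sketch).
        snapped[col] = 1 if snapped[col] >= 0.5 else 0
    return snapped


# A made-up synthetic row: "Attack" is numeric, "is_flying" should be a 0/1 flag.
synthetic_row = {"Attack": 101.3, "is_flying": 0.7}
fixed_row = snap_to_binary(synthetic_row, ["is_flying"])
```

Columns not listed in binary_columns keep their interpolated values unchanged.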
The balance() method takes the pandas DataFrame and the name of the target column as parameters.
I chose a data set where we have to predict whether a Pokemon is Legendary or not. The Legendary class makes up only 8% of the data set, so it is definitely imbalanced.
A basic Random Forest gives an accuracy of approximately 88% when trained on the imbalanced data, so now it’s time to try out the SMOTE algorithm.
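A quick way to check how imbalanced a target column is would be to count the share of each class. The snippet below is a small sketch with invented toy labels that mirror the 8% share mentioned above. It does not load the real Pokemon data, and the helper class_shares is hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels invented for illustration: 8 Legendary rows out of 100.
labels = [True] * 8 + [False] * 92
shares = class_shares(labels)
```

With a pandas DataFrame, the same check is usually done with df['Legendary'].value_counts(normalize=True).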
smote = SMOTE()
new_df = smote.balance(df,'Legendary')
new_df is now a balanced training set. We train Random Forest on it and test on the same data we used before balancing, and now it gives us 100% accuracy.
And here is a little plot demonstrating how new examples were sampled.
SMOTE is a very good technique to use when you have an unbalanced data set, so I encourage you to try it alongside other balancing methods from Crucio, such as SMOTETOMEK, SMOTEENN, ADASYN, and ICOTE.
Made with ❤ by Sigmoid.