Image by Mark König on Unsplash

Imbalanced Classification: A Complete Road Map

with related Python functions

Hanan Ahmed
Published in The Startup
5 min read · Sep 9, 2020


The problem of imbalanced classification is well covered in articles and academic papers, but most of this work focuses on one part of the bigger picture: it addresses a specific data set and discusses possible solutions. As a result, you end up opening more than ten browser tabs to learn about the problem and its possible solutions. Here I have collected a complete road map, so you can see the whole picture of the steps you have to go through, from preparing your data to reaching an informative conclusion about your question of interest.

Before we start this journey together, let's first discuss why imbalanced classification matters, to industry practitioners and not just academics: which applications suffer from this problem by nature, and which applications happen to develop imbalanced classes due to customer behavior?

Imbalanced classification refers to a classification problem with an unequal distribution of classes. In business terms, imagine you have released two products in the market and found that 90% of your customers prefer one product over the other. At some point, you will go back to your data team and ask them to explain this behavior based on customer characteristics, so you can understand what change might push customers toward the less popular product, or how to adjust that product to match customer preferences. Many well-known applications are imbalanced by nature, such as fraud detection, large claim losses in insurance, spam email, and hardware failure. Other applications simply become imbalanced due to unexpected customer behavior, which you cannot anticipate but have to deal with when it happens.

In this article, I will go through the three general steps of imbalanced classification analysis, as previewed in the image below.

Image by the author

I will explain the details of the available options in each step. This is in addition to highlighting some pitfalls and tricks you need to be aware of when dealing with imbalanced data.

Data cleaning and preparation

This part of the process requires clear knowledge of your features and the interpretations you are targeting. Generally, you have to study the features of your data very well using some preliminary tools, such as descriptive statistics and a correlation matrix, to make sure you are not adding overlapping information to your model. In the case of highly correlated features, you can use principal component analysis, for example, to solve this problem.
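As a minimal sketch of this check (assuming scikit-learn and NumPy are available; the data below is synthetic, with two deliberately redundant features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: the first two columns are nearly identical (overlapping
# information), the third is independent.
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=200), rng.normal(size=200)])

# Inspect the correlation matrix to spot overlapping features.
corr = np.corrcoef(X, rowvar=False)
print(corr.round(2))

# PCA on standardized features collapses the redundant pair.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)  # the three original features reduce to two components
```

Here the near-perfect correlation between the first two columns shows up immediately in the matrix, and PCA removes the redundancy.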

It is important to invest a good amount of time in data cleaning and preparation; it eventually saves you a lot of effort in later steps

An extra step in the data preparation for a machine learning model is to use a one-hot encoder if you have categorical feature(s). Basically, this creates dummy variables; in other words, it turns the categories into features so you can observe their effect on the classification process. The final step here is to divide your data into training and testing sets using the train_test_split function. A purely random split is not always appropriate: with imbalanced classes, a stratified split preserves the class ratio in both sets.
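These two steps can be sketched as follows (assuming pandas and scikit-learn; the data set and column names here are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data set: 90 majority vs 10 minority labels.
df = pd.DataFrame({
    "amount": range(100),
    "channel": ["web", "store", "app", "phone"] * 25,  # categorical feature
    "label": [0] * 90 + [1] * 10,
})

# One-hot encode the categorical feature into dummy variables.
X = pd.get_dummies(df[["amount", "channel"]], columns=["channel"])
y = df["label"]

# stratify=y keeps the 90/10 class ratio identical in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both are 0.1
```

Without `stratify=y`, a random split of a rare class can leave the test set with almost no minority samples, which distorts every metric computed later.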

Modeling

The main two schools of modeling are:

  • Engineer your data using preprocessing techniques, then use the models for balanced classification.
  • Use a model that has a specially constructed cost function which gives more penalty for misclassifying the minority class.

Preprocessing the data and then using traditional models for balanced data is the more common approach in articles and research papers. The preprocessing techniques are mainly divided into three types: oversampling, undersampling, or a mixture of the two.

Image by the author

Almost all techniques work with continuous feature(s), but not all of them are applicable when you have categorical feature(s). If you have only categorical features, you can use random oversampling or random undersampling. For a mixture of continuous and categorical features, the options are random oversampling, random undersampling, or SMOTE-NC.
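Libraries such as imbalanced-learn provide ready implementations of these techniques (RandomOverSampler, RandomUnderSampler, SMOTE, SMOTENC). To show the idea itself, here is a from-scratch sketch of random oversampling for a binary target, assuming only NumPy:

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate random minority-class rows until both classes are equal in size.
    A minimal sketch of random oversampling for a binary target."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))  # both classes now have 8 samples
```

Because it only duplicates existing rows, random oversampling works equally well for categorical and continuous features, which is why it appears in every column of the options above.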

The adapted models mentioned above, in contrast, can be tailored through your chosen cost function.

But you need to be careful: the effectiveness of these models heavily depends on the quality of the cost function

Some models have built-in weighting, such as the AdaBoost classifier, while in other models you can add weights using the class_weight argument, such as logistic regression and the ridge classifier.
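A minimal sketch of the class_weight approach with scikit-learn's logistic regression (the data is synthetic, with roughly 95% of samples in the majority class):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced problem: ~95% majority, ~5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=0)

# class_weight="balanced" re-weights each class inversely to its frequency,
# so misclassifying the minority class is penalized more heavily.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
plain = LogisticRegression(max_iter=1000).fit(X, y)

# The weighted model flags many more samples as the minority class.
print((weighted.predict(X) == 1).sum(), (plain.predict(X) == 1).sum())
```

The unweighted model can reach high accuracy by rarely predicting the minority class at all; the weighted cost function removes that shortcut.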

If the target of your model is to estimate classification probabilities, then you need one more step: calibrating the resulting probabilities. Probability calibration is an important step when using models that don't have a probability-based structure, such as SVM, random forest, and gradient boosting. These types of models produce a probability-like score that needs to be calibrated. Other models, like logistic regression, don't need the extra calibration step, so you need to be aware of the structure of the model you are using.
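In scikit-learn, CalibratedClassifierCV wraps a score-based model and maps its scores to probabilities. A sketch using a linear SVM (the data is synthetic; the choice of sigmoid calibration is one of the available options, the other being isotonic):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic imbalanced data: ~90% majority class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# LinearSVC has no predict_proba: it only outputs a decision score.
# CalibratedClassifierCV fits a mapping from scores to probabilities
# using internal cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X)  # rows are proper probabilities summing to 1
print(proba[:3].round(3))
```

After calibration, a predicted 0.8 should mean roughly "8 out of 10 such cases belong to the positive class", which is what downstream decisions usually need.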

Evaluation metric

Before going through this part, you should have a clear, concrete answer to the following question

What is the target of the imbalanced-class analysis: label prediction or probability prediction?

The answer to this question decides precisely which metric you need to use. If your answer is label prediction, then you can use the F0.5-score when a false positive is more costly, the F1-score when false negatives and false positives are equally costly, and the F2-score when a false negative is more costly. If you are targeting probability prediction, you have two options: the log loss score and the Brier score.
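All five metrics are available in scikit-learn; a small sketch on made-up predictions:

```python
from sklearn.metrics import brier_score_loss, fbeta_score, log_loss

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                       # label predictions
y_prob = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.9, 0.8, 0.4, 0.7]   # probabilities

# Label prediction: choose beta according to the cost of each error type.
print(fbeta_score(y_true, y_pred, beta=0.5))  # false positives costlier
print(fbeta_score(y_true, y_pred, beta=1))    # equal costs (the F1-score)
print(fbeta_score(y_true, y_pred, beta=2))    # false negatives costlier

# Probability prediction:
print(log_loss(y_true, y_prob))
print(brier_score_loss(y_true, y_prob))
```

Note that plain accuracy on `y_pred` would be 0.8 even though the classifier misses a quarter of the minority class, which is exactly why the F-scores are preferred here.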

Finally, to conclude with the important remarks of this article. First, spend more time with your raw data and try to understand as much as you can. Second, there is no perfect model for all data; try different models and choose the best one for your data. Parameter tuning and cross-validation are also important. Third, be clear about your target so you can choose a suitable evaluation metric, and don't be tricked by metrics such as accuracy, which doesn't represent true performance in the case of imbalanced classes.

Thanks for reading. Please feel free to share and open discussions in the comments.


Hanan Ahmed
The Startup

Connect with me on LinkedIn @Hanan Ahmed https://www.linkedin.com/in/hananahmed00/. A highly motivated learner, trying to simplify learning and information accessibility.