How to deal with Unbalanced Dataset in Binary Classification — Part 1
Re-Sampling procedures with Python
Whenever we set up a task for a Machine Learning model, the very first thing to do is analyze and reason about the data we are provided with and will be using for training and testing. Indeed, it is often the case that, even before thinking about which model to use, we need to re-architect the dataset, or at least incorporate into the training procedure some features that deal with the initial data conditions.
One of those conditions is unbalanced data, and in this article I'm going to focus on unbalanced datasets in binary classification tasks.
The curse of Unbalanced Dataset
We face an imbalance in data whenever the dependent variable (either continuous, in regression tasks, or categorical, in classification tasks) has a very skewed distribution. To see what this means, consider the following example.
Imagine our task is to build a model able to identify, from credit card transactional data, which transactions are fraudulent. To do so, we need a dataset of past transactions that have already been assessed as either fraudulent or legitimate: in other words, those data are labeled (so we are in a supervised learning setting). As a matter of…
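To make the imbalance concrete, here is a minimal sketch of how one might inspect the class distribution of such a label column. The data below is synthetic and purely illustrative (real fraud datasets typically contain well under 1% positive cases); the 0.5% fraud rate is an assumption, not a figure from any particular dataset.

```python
import random

random.seed(0)

# Synthetic labels: 1 = fraudulent, 0 = legitimate, with an assumed ~0.5% fraud rate
labels = [1 if random.random() < 0.005 else 0 for _ in range(10_000)]

n_fraud = sum(labels)
n_legit = len(labels) - n_fraud

# Report the class counts and the fraction of the minority (fraud) class
print(f"legitimate: {n_legit}, fraudulent: {n_fraud}")
print(f"fraud rate: {n_fraud / len(labels):.3%}")
```

Even a quick check like this makes the skew obvious: a classifier that always predicts "legitimate" would already reach around 99.5% accuracy on such data, which is exactly why plain accuracy is misleading here and why re-sampling procedures become relevant.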