How to deal with Unbalanced Dataset in Binary Classification — Part 1

Re-Sampling procedures with Python

Valentina Alto
Published in DataSeries
6 min read · Jan 24, 2021

Whenever we start a Machine Learning task, the very first thing to do is to analyze and reason about the data we are given and will use for training and testing. Indeed, even before choosing a model, we often need to restructure the dataset, or at least build into the training procedure some mechanism that deals with the initial conditions of the data.

One such condition is unbalanced data. In this article, I’m going to focus on unbalanced datasets in binary classification tasks.

The curse of Unbalanced Datasets

Data are imbalanced whenever the dependent variable (continuous in regression tasks, categorical in classification tasks) has a heavily skewed distribution. In classification, this means that one class is far more frequent than the others. Consider the following example.

Imagine our task is to build a model that identifies, from credit card transaction data, which transactions are fraudulent. To do so, we need a dataset of past transactions that have already been assessed as fraudulent or not: in other words, the data are labeled (so we are in the supervised learning domain). As a matter of…
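To make the idea of a skewed target concrete, here is a minimal sketch that measures how unbalanced a set of binary labels is. The labels are synthetic, and the 1% fraud rate, the sample size and the helper names `class_shares` and `imbalance_ratio` are assumptions made purely for illustration; they are not taken from a real dataset or from any library.

```python
import random
from collections import Counter

# Synthetic labels for illustration only: 1 = fraudulent, 0 = legitimate.
# The 1% fraud rate is an assumed value, not a figure from real data.
random.seed(42)
N_TRANSACTIONS = 100_000
ASSUMED_FRAUD_RATE = 0.01

labels = [
    1 if random.random() < ASSUMED_FRAUD_RATE else 0
    for _ in range(N_TRANSACTIONS)
]


def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Return the majority class count divided by the minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


shares = class_shares(labels)
ratio = imbalance_ratio(labels)

print(f"Share of legitimate transactions: {shares[0]:.2%}")
print(f"Share of fraudulent transactions: {shares[1]:.2%}")
print(f"Imbalance ratio (majority / minority): {ratio:.1f}")
```

Running a check like this on the real target column before any modeling shows at a glance whether the minority class is rare enough to need special treatment.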

Valentina Alto
DataSeries

Data&AI Specialist at @Microsoft | MSc in Data Science | AI, Machine Learning and Running enthusiast