Confusion Matrix and Data Imbalances (2/3)

The V Notebook
12 min read · Sep 12, 2023

Previous << Confusion Matrix and Data Imbalances (1/3)

When our data labels have more of one category than another, we say that we have a data imbalance. For example, recall that in our scenario, we’re trying to identify objects found by drone sensors. Our data is imbalanced because there are vastly different numbers of hikers, animals, trees, and rocks in our training data. We can see this by tabulating the data:

Label     Hiker     Animal     Tree     Rock
Count     400       200        800      800
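
As a minimal sketch of how such a tabulation could be produced, the snippet below counts labels with Python's standard `collections.Counter`. The `labels` list is a hypothetical stand-in rebuilt from the counts in the table above, and `imbalance_ratio` is just one simple way to summarize the skew, not a metric the article itself defines.

```python
from collections import Counter

# Hypothetical label list, rebuilt from the counts in the table above.
labels = ["hiker"] * 400 + ["animal"] * 200 + ["tree"] * 800 + ["rock"] * 800

# Tabulate how many samples carry each label.
counts = Counter(labels)
total = sum(counts.values())

# Print each label with its count and its share of the dataset.
for label, count in counts.most_common():
    print(f"{label:<8}{count:>6}{count / total:>8.1%}")

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

With real data, `labels` would come from the label column of the training set instead of being constructed by hand.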

Note how most of the data are trees or rocks (1,600 of the 2,200 samples). A balanced dataset doesn’t have this problem. For example, if we were trying to predict whether an object is a hiker, animal, tree, or rock, we’d ideally want an equal number of samples in each category, like so:

Label     Hiker     Animal     Tree     Rock
Count     550       550        550      550

If we were simply trying to predict whether an object was a hiker, we’d ideally want an equal number of hiker and not-hiker objects:

Label     Hiker     Non-Hiker
Count     1100      1100
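
To see how far the original training data is from this ideal in the hiker versus non-hiker framing, the four labels can be collapsed into two. This is a rough sketch: the `labels` list is again a hypothetical reconstruction from the first table, and the name `binary_labels` is chosen here for illustration.

```python
from collections import Counter

# Hypothetical label list, rebuilt from the original (imbalanced) counts.
labels = ["hiker"] * 400 + ["animal"] * 200 + ["tree"] * 800 + ["rock"] * 800

# Collapse the four categories into a binary hiker / non-hiker label.
binary_labels = ["hiker" if label == "hiker" else "non-hiker" for label in labels]
binary_counts = Counter(binary_labels)

# Share of samples that are hikers; 0.5 would be a balanced binary dataset.
hiker_share = binary_counts["hiker"] / len(binary_labels)

print(binary_counts)
print(f"hiker share: {hiker_share:.1%}")
```

Under these assumed counts, the binary view holds 400 hikers against 1,800 non-hikers, so the imbalance becomes more pronounced once the other three categories are merged.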

Why Do Data Imbalances Matter?

Data imbalances matter because models can learn to mimic these imbalances when it isn’t desirable. For example…

