Confusion Matrix and Data Imbalances (2/3)

12 min read · Sep 12, 2023

Previous << Confusion Matrix and Data Imbalances (1/3)

When our data labels contain more of one category than another, we say that we have a data imbalance. For example, recall that in our scenario, we’re trying to identify objects found by drone sensors. Our data is imbalanced because there are vastly different numbers of hikers, animals, trees, and rocks in our training data. We can see this by tabulating the data:

Label     Hiker     Animal     Tree     Rock
Count     400       200        800      800
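
As a minimal sketch of how such a tabulation could be produced, the snippet below counts labels with Python's standard `collections.Counter` and prints each class's share of the data. The `labels` list is a hypothetical stand-in rebuilt from the counts in the table; a real dataset would supply one label per detected object.

```python
from collections import Counter

# Hypothetical label list rebuilt from the counts in the table;
# in practice this would come from the training data itself.
labels = ["hiker"] * 400 + ["animal"] * 200 + ["tree"] * 800 + ["rock"] * 800

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset taken up by each class.
proportions = {label: count / total for label, count in counts.items()}

# Print classes from most to least frequent.
for label, count in counts.most_common():
    print(f"{label:<8}{count:>6}{proportions[label]:>8.1%}")
```

Looking at proportions rather than raw counts makes the skew easier to compare across datasets of different sizes.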

Note how most of the data are trees or rocks. A balanced dataset doesn’t have this problem. For example, if we were trying to predict whether an object is a hiker, animal, tree, or rock, we’d ideally want an equal number of all categories, like so:

Label     Hiker     Animal     Tree     Rock
Count     550       550        550      550

If we were simply trying to predict whether an object was a hiker, we’d ideally want an equal number of hiker and not-hiker objects:

Label     Hiker     Non-Hiker
Count     1100      1100
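
For contrast, here is a rough sketch of what the hiker versus non-hiker split looks like on the original, imbalanced counts from the first table. The `labels` list is again a hypothetical reconstruction of those counts, and `imbalance_ratio` is just an illustrative name, not a standard metric.

```python
from collections import Counter

# Hypothetical label list rebuilt from the imbalanced counts in the first table.
labels = ["hiker"] * 400 + ["animal"] * 200 + ["tree"] * 800 + ["rock"] * 800

# Collapse the four classes into a binary hiker / non-hiker problem.
binary_labels = ["hiker" if label == "hiker" else "non-hiker" for label in labels]
binary_counts = Counter(binary_labels)

# How many non-hiker examples there are for every hiker example.
imbalance_ratio = binary_counts["non-hiker"] / binary_counts["hiker"]

print(binary_counts)
print(f"non-hiker : hiker = {imbalance_ratio:.1f} : 1")
```

With these counts, the binary framing is even more lopsided than the four-class one: 1800 non-hiker examples against 400 hikers, far from the ideal 1100 and 1100 split.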

Why Do Data Imbalances Matter?

Data imbalances matter because models can learn to mimic these imbalances when it isn’t desirable. For example…

Written by The V Notebook