Confusion Matrix and Data Imbalances (1/3)

The V Notebook
10 min read · Sep 12, 2023

Let’s think of data as continuous, categorical, or ordinal (categorical, but with an order). Confusion matrices are a way to assess how well a categorical model performs. To see how they work, we will first refresh what we know about continuous data. From there, we can see how confusion matrices are simply an extension of the histograms we already know.

Continuous Data Distributions

When we want to understand continuous data, the first step is often to see how it’s distributed. Consider the following histogram.

We can see that the label is, on average, about zero, and most datapoints fall between -1 and 1. The distribution appears symmetrical: there are approximately equal counts of values smaller and larger than the mean. If we wanted, we could use a table rather than a histogram, but it could be unwieldy.
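To make the histogram-versus-table idea concrete, here is a minimal sketch using only the Python standard library. The data is hypothetical (1,000 draws from a standard normal distribution, chosen as an assumption to mimic a label centred on zero), and the bin width of 0.5 is an arbitrary choice.

```python
import math
import random
from collections import Counter

# Hypothetical data: 1,000 draws from a standard normal distribution.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Assign each value to a bin of width 0.5; the key is the bin's left edge.
bin_width = 0.5
bin_counts = Counter(math.floor(x / bin_width) * bin_width for x in samples)

mean = sum(samples) / len(samples)
share_within_one = sum(-1 <= x <= 1 for x in samples) / len(samples)

# Print the histogram as a table: one row per bin, one '#' per 10 datapoints.
for left_edge in sorted(bin_counts):
    count = bin_counts[left_edge]
    bar = "#" * (count // 10)
    print(f"{left_edge:5.1f} to {left_edge + bin_width:5.1f} | {count:4d} {bar}")

print(f"mean = {mean:.2f}, share within [-1, 1] = {share_within_one:.2f}")
```

Even with fairly coarse bins, the table runs to well over a dozen rows, which hints at why a plotted histogram is usually easier to read for continuous data.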

Categorical Data Distributions

In some respects, categorical data aren’t so different from continuous data. We can still produce histograms to assess how commonly each label value appears. For example, a binary label (true/false) might appear with frequencies like so:
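As a rough sketch of such a frequency count, the snippet below tallies a binary label with `collections.Counter`. The ten labels are made up for illustration and are not taken from any real dataset.

```python
from collections import Counter

# Hypothetical binary labels; the values are made up for illustration.
labels = [False, False, True, False, True, False, False, True, False, False]

# Count how often each label value appears.
label_counts = Counter(labels)

# Print a text histogram: one row per category, one '#' per datapoint.
for label in (False, True):
    count = label_counts[label]
    print(f"{str(label):5} | {count:2d} {'#' * count}")
```

Unlike the continuous case, no binning is needed: each category is its own bar, so a two-row table carries the same information as the plot.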

