Data Science for Software Engineers — Joint Probability Matrices

Daniel Olshansky
Published in Analytics Vidhya
May 25, 2020

Joint Probability Matrices — An “Extension of the Confusion Matrix” for Continuous Variables

If you’re just interested in the code, the Jupyter notebook is available here. Big thanks to Albert Altarovici for explaining fundamental Data Science concepts and reviewing this article.

When measuring the performance of a prediction model (e.g. a Machine Learning classification algorithm), there are four metrics you are likely to measure and reference: Precision, Recall, Accuracy and F-score. A common visual method for interpreting these results is a Confusion Matrix. A lot of great articles have already been written about these topics, so I won’t delve into too many details.

For a binary classification problem, the ground truth data (also referred to as target values) are binary (0 or 1), and the predicted value is some value between 0 and 1. There are a few things to keep in mind here:

  • A threshold needs to be selected to determine whether a predicted value should be rounded up or down. This threshold should be chosen to balance the tradeoffs (e.g. precision vs. recall) in your model; a short sketch of this step follows the list.
  • For categorical data (more than two options), we need to use one-hot encoding and perform multiple pairwise analyses.
  • Classification metrics (accuracy, precision, recall, f1-score) are only defined for discrete values. When predicting a continuous value, we are generally more interested in regression metrics (e.g. RMSE, MAE), as described well in this Stack Overflow answer.
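A minimal sketch of the thresholding step, assuming scikit-learn’s metric functions (the 0.5 cutoff and the sample values below are arbitrary):

import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                   # binary ground truth
y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.55])  # model outputs in [0, 1]

# Round each prediction up or down based on the chosen cutoff.
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))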

I was recently introduced to joint probability matrices, which can be thought of as an extension of confusion matrices for cases where your data is non-categorical but you still want to visualize its patterns in a similar fashion. Most importantly, they make it easy to identify outliers in your data: specific examples where there is a large discrepancy between the target values and the predicted values, which can in turn help you fine-tune your model. This is easiest to illustrate with an example.

Consider a class of 30 students who took a test and received a grade between 0 and 100 that is normally distributed. Let’s assume that the mean and standard deviation are 80% and 20% respectively; this will be our ground truth data. Our model will predict a grade for each student that is also between 0 and 100. For simplicity, we’ll model our predictions using a random set of numbers that has the same distribution as the actual data.

The following snippet of code will generate our mocked data:
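A minimal sketch of that snippet (the random seed and the clipping to the 0–100 range are assumptions on my part; grades_GT and grades_P match the variable names used later):

import numpy as np

np.random.seed(0)  # assumed seed, just for reproducibility

num_students = 30
mean, std_dev = 80, 20

# Ground truth grades: normally distributed, clipped to the valid 0-100 range.
grades_GT = np.clip(np.random.normal(mean, std_dev, num_students), 0, 100)

# "Predicted" grades: random values drawn from the same distribution.
grades_P = np.clip(np.random.normal(mean, std_dev, num_students), 0, 100)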

Next, we illustrate a few methods of comparing our ground truth data to our predicted data using regression analysis. The following snippet of code will output several regression metric values and plot both the Probability Density Function and a histogram showing the frequency distribution of the grades. Fritz AI has a fantastic article about the different regression analysis methods and their tradeoffs, so we won’t delve into the details.
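Roughly along these lines (a sketch assuming scikit-learn for the metrics and scipy/matplotlib for the plots; the figure layout is my own):

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error, explained_variance_score

# Regression metrics comparing the predictions against the ground truth.
print("mean_squared_error: ", mean_squared_error(grades_GT, grades_P))
print("mean_absolute_error:", mean_absolute_error(grades_GT, grades_P))
print("explained_variance_score:", explained_variance_score(grades_GT, grades_P))

# Fitted Probability Density Functions and grade frequency histograms, side by side.
fig, (ax_pdf, ax_hist) = plt.subplots(1, 2, figsize=(12, 4))
for grades, label in [(grades_GT, "actual"), (grades_P, "predicted")]:
    xs = np.sort(grades)
    ax_pdf.plot(xs, stats.norm.pdf(xs, grades.mean(), grades.std()), label=label)
    ax_hist.hist(grades, bins=10, alpha=0.5, label=label)
ax_pdf.set_title("Probability Density Function")
ax_hist.set_title("Grade Frequency")
ax_pdf.legend()
ax_hist.legend()
plt.show()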

The code snippet above will generate the following output:

mean_squared_error:  733.4
mean_absolute_error: 21.54
explained_variance_score: -1.15

And the following graph:

From the graphs above, we see that the two data sets follow a similar pattern, which is expected because they’re both normally distributed with the same mean and standard deviation. However, the regression metrics (e.g. Mean Squared Error) show that our predictions are essentially meaningless, which is also expected because we generated them randomly. In other words, while the distributions are similar in aggregate, individual values (e.g. the predicted grade of Student A vs the actual grade of Student A) are very different.

In order to determine where the biggest variations lie, we can use a joint probability matrix.

The joint probability matrix for the data above will look as follows:

How do we interpret this data? For example, the red cell in the second-to-last row, with a value of 0.07, means that 7% of the students (2 in our case) who actually received a grade between 80%-90% were predicted to receive a grade between 50%-60% by our model. Next, one would manually inspect this small set of hand-picked outliers and start drawing conclusions about changes that need to be made to the prediction model. This sort of analysis can lead to interesting follow-up investigations, but it was obscured by the regression analysis done above. Perhaps the model is heavily biased towards under-predicting the grades of high performers, perhaps it is just a bug in the code, etc.

The table above can be generated using the following code snippet:
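A sketch of that function, reconstructed from the step-by-step breakdown at the end of this article (the column names and the color map are my own choices):

import numpy as np
import pandas as pd

def create_joint_probability_matrix(actual, predicted, bins):
    # Assign each value to a grade bucket; the DataFrame index doubles as the student ID.
    actual_df = pd.DataFrame({"actual_bucket": pd.cut(actual, bins, include_lowest=True)})
    predicted_df = pd.DataFrame({"predicted_bucket": pd.cut(predicted, bins, include_lowest=True)})
    actual_df["student_id"] = actual_df.index
    predicted_df["student_id"] = predicted_df.index

    # Merge the two tables on the student ID.
    merged = actual_df.merge(predicted_df, on="student_id")

    # Count the number of students in each (actual bucket, predicted bucket) pair.
    # observed=False keeps empty buckets so the full matrix is shown.
    counts = merged.groupby(["actual_bucket", "predicted_bucket"], observed=False).size().unstack(fill_value=0)

    # Convert counts to joint probabilities, then round and shade the cells.
    joint_probabilities = counts / len(actual)
    return joint_probabilities.round(2).style.background_gradient(cmap="Reds")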

By making the following function call:

bins = np.linspace(0, 100, 11)
create_joint_probability_matrix(grades_GT, grades_P, bins)

All you need to provide is a list of the target values, a list of the predicted values, and the bin edges describing how you want to discretize the data.

In the ideal scenario, where our model predicts our ground truth data with 100% accuracy, all of the values in the joint probability matrix will lie on the diagonal. For example, building a joint probability matrix between the same set of values will produce the following output:

create_joint_probability_matrix(grades_GT, grades_GT, bins)

For example, the table above shows that 27% of students who had an actual grade between 60–70% were also predicted to have a grade between 60–70%. More importantly, all cells outside of the diagonal above are 0, meaning our model made no mistakes (because we’re comparing the exact same set of values).

That’s all there is to it. This doesn’t replace anything in your toolbox, but simply provides an additional method to inspect discrepancies in your predicted data for continuous variables, which can hopefully lead you down the right path in tuning your model.

Feel free to stop reading now if you’re not interested in a breakdown of the create_joint_probability_matrix function above.

For those who are less familiar with Pandas, like myself, this section will describe some of the operations we performed above.

Once we have our data, we need to split it into 10 equally sized buckets. This decision is kind of arbitrary and up to the discretion of the analyst.
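This is done with pandas’ cut function, roughly as follows (the DataFrame and column names are my own):

import pandas as pd

# pd.cut assigns each grade to the bucket (interval) it falls into;
# the DataFrame index doubles as the student ID.
actual_df = pd.DataFrame({"actual_bucket": pd.cut(grades_GT, bins, include_lowest=True)})
predicted_df = pd.DataFrame({"predicted_bucket": pd.cut(grades_P, bins, include_lowest=True)})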

The cut function simply assigns a range (i.e. a bucket) to each value, so the code snippet above will output:

Next, we need to transform the DataFrames so they can be merged together:
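Sketched here, mirroring the helper function above (the student_id column is an explicit copy of the index):

# The DataFrame index acts as the student ID; copy it into an explicit column.
actual_df["student_id"] = actual_df.index
predicted_df["student_id"] = predicted_df.index

# Merge the two tables on the student ID.
merged = actual_df.merge(predicted_df, on="student_id")

# Multi-level count of the number of students in each (actual bucket, predicted bucket) pair.
counts = merged.groupby(["actual_bucket", "predicted_bucket"], observed=False).size().unstack(fill_value=0)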

I personally found it easiest to understand what the commands above do by inspecting small portions of the DataFrame at each step.

  1. Assign a student ID (i.e. index) to each student and associate it with the bucket that their grade is in; both in the predicted and ground truth tables.
  2. Merge the two tables based on the student ID.
  3. Create a multi-leveled pandas DataFrame that provides a count of the number of students in each (actual bucket, predicted bucket) pair.

Next, we take the 2D array and convert the count values to percentages based on the total number of students in our dataset:
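Continuing the sketch above:

# Divide each count by the total number of students to get joint probabilities.
joint_probabilities = counts / len(grades_GT)
joint_probabilities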

This will produce the following output:

A quick check to make sure this step is correct is to verify that the sum of all values in your DataFrame adds up to 1.
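For example (using the joint_probabilities variable from the sketch above):

# Sanity check: the joint probabilities should sum to 1.
assert np.isclose(joint_probabilities.values.sum(), 1.0)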

Lastly, we simply want to format the output table so it’s more readable, by rounding our values and applying a background gradient that makes the outlier cells easy to spot. The following code snippet (sketched here; the color map is my own choice):
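# Round to two decimal places and shade cells by value.
joint_probabilities.round(2).style.background_gradient(cmap="Reds")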

Will format the table above into this one:
