Intuition Behind Probabilities From Supervised Learning

Understanding the underlying principles, going beyond just ‘softmax’

James Koh, PhD
MITB For All
8 min read · Aug 9, 2023


Why is an image 70% dog, 20% cat, and 10% rabbit? How did the model training converge to this? Is there any theoretical basis, or does it happen by chance? Read on to find out.

1. Introduction

I believe in writing evergreen articles. One that would remain relevant years later. One that explains the fundamentals and helps strengthen the foundation of one’s ML knowledge.

Today, I will answer a question which I was asked during the Applied Machine Learning course: “How can an image be, say, 70% dog, 20% cat, and 10% rabbit? How did the model training converge to this? Is there any theoretical basis, or does it happen by chance?”

Although this example corresponds to computer vision, the same concept can be expanded to other non-visual classification tasks as well.

Simply saying “it is the softmax of the logits” is an extremely poor answer. Sadly, it is also what many people would say, and hence students do not get the full picture of what is really going on. Today, I will explain the following:

  1. What really is the rationale/physical intuition, and theoretical basis, leading to a model predicting these percentages?
  2. Is predicting such percentages the best a model could do?

The short answer is “It corresponds to what is learnable from the training dataset. Yes, it is the best.”

The long answer is in this post. (Note that this post assumes you have the prerequisite knowledge of how CNNs work and the underlying principles behind them.)

2. About Features

First, we need to understand the concept of ‘features’. Features are, casually speaking, the presence/strength or absence of certain attributes. (Of course, values in a feature vector or feature map are not confined to [0,1], hence I said ‘casually speaking’.) In the case of computer vision, the lower-level feature maps could refer to the presence or absence of a shape/texture at the respective regions of the images. The higher-level feature maps, meanwhile, could correspond to the presence or absence of ‘a round eye-like structure’ or a ‘sharp teeth-like structure’ or a ‘long yellow ear-like structure’ at the respective regions of the image. For visualization, it may be helpful to consider the following: suppose at the m^th layer we have a 7x7 feature map with 512 channels.

Visualization of feature map. Disclaimer: The different channels are not represented with subscript or superscript; instead, colors are used so that it is reader-friendly. Image by author.

Moving on, the feature map is condensed into a feature vector, which in turn is passed through multiple fully-connected layers. The final layer (right before the output logits/probabilities), could be a vector of length 2048. For example, the first value could correspond to the presence of a “(round eye at the top left OR middle left) AND (NO sharp teeth at the left) AND (NO long ears)”. In reality, the computer vision model learns much more complex features, though it is fair to understand the concept using the English words given in the example.
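As a rough sketch of this condensation step (the library choice, pooling operation, and layer sizes here are my own illustration, not the article’s actual model), one common pattern is to pool the 7x7x512 feature map spatially and project it through a fully-connected layer to obtain a length-2048 vector:

```python
import torch
import torch.nn as nn

# Illustrative only: pool a 7x7 feature map with 512 channels down to a
# 512-dim vector, then project to the 2048-dim feature vector mentioned
# in the text. A real model's exact architecture will differ.
fmap = torch.randn(1, 512, 7, 7)          # (batch, channels, height, width)
pooled = nn.AdaptiveAvgPool2d(1)(fmap)    # (1, 512, 1, 1): condense spatially
vec = pooled.flatten(1)                   # (1, 512)
fc = nn.Sequential(nn.Linear(512, 2048), nn.ReLU())
feature_vector = fc(vec)                  # (1, 2048): one value per learned attribute
```

Each of the 2048 values then plays the role of one learned attribute, like the “(round eye at the top left OR middle left) AND …” combination described above.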

Illustration of what a feature vector represents

Let’s check back on the original question we wanted to answer, which is the intuition for a model predicting something like 70% dog, 20% cat, and 10% rabbit.

As humans, we judge what an image is based on the features that we observe visually. We have an idea of what combinations of features would correspond to, say, a cat. For computers, this is done via the features which I talked about earlier.

My pet.

To prevent overfitting, and due to memory constraints, the feature vector is finite. In our example above, the final feature vector is represented by 2048 values. Each value is a complex combination of the earlier layers of feature maps or feature vectors. Even a vector with a billion values cannot represent all possible combinations, let alone just 2048 values. Therefore, we have to resort to generalization. And because of this generalization, different objects could end up sharing similar features.

For the ease of explanation and readability, let’s make the following simplifications.

  1. Our universe has just three classes: dog, cat, and rabbit.
  2. Condense the feature vector to just a single value. Further, suppose that this single value represents the feature ‘large black nose’.
  3. Assume this value takes either 0 or 1.

Just stay with me. I assure you that once you understand the above, the concept can be extended to the ‘real’ case without these simplifications.

3. A Worked Example

[3.1] Scenario

Consider a dataset of 200 images, which comprises 75 dogs, 59 cats and 66 rabbits. For simplicity, let’s just treat everything as a train set, and not talk about class imbalance. If we are forced to use exactly one attribute, either ‘No large black nose’ or ‘Has large black nose’, to describe each image, one scenario we might end up with is as follows:

Distribution of 200 images

This concept of ‘being forced to use one attribute to describe the image’ is analogous to what happens when we have a feature vector of length 1. Of course, a real feature vector encodes more than just one ‘word’, as alluded to by the multi-colored a1 to a2048, but the constraint remains: we are forced to conclude whether the image has, or does not have, something.

Now, each time the model encounters a 0 for the feature (we could roughly say an animal is observed without a large black nose), the model predicts whether that is a dog, cat, or rabbit. Let’s represent these probabilities as (a,b,c). The probabilities which the model outputs will remain fixed across all 100 ‘no-large-black-nose’ images, because that is the only feature learnt (under the assumptions stated above). Meanwhile, when the model encounters a 1 for the feature (a ‘large-black-nose’ animal is observed), it outputs the predicted probabilities (d,e,f) for all 100 such images.
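To make the ‘fixed outputs’ point concrete, here is a toy sketch (the function and the probability values are purely illustrative, not from any trained model):

```python
# Toy sketch: when the only input is a single binary feature, the model's
# prediction is necessarily a function of that feature alone.
def predict(has_large_black_nose, probs_if_absent, probs_if_present):
    """Return (p_dog, p_cat, p_rabbit) for one image."""
    return probs_if_present if has_large_black_nose else probs_if_absent

p_absent = (0.05, 0.39, 0.56)   # (a, b, c), illustrative values
p_present = (0.70, 0.20, 0.10)  # (d, e, f), illustrative values

# Every feature==0 image gets the same (a, b, c); every feature==1 image
# gets the same (d, e, f).
print(predict(0, p_absent, p_present))  # (0.05, 0.39, 0.56)
```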

Given that a large black nose might correspond to a dog, a cat, or a rabbit, the model inevitably cannot be perfectly accurate in all predictions. To give a very simple analogy, it is akin to asking a model to predict ‘big’ or ‘small’ in a casino dice game after merely looking at the dice before it is thrown; there simply isn’t any basis to make such a distinction.

[3.2] Loss

Going a step further on the technical aspects, let’s consider how ‘correct’ or ‘wrong’ a model’s prediction can be. For this multi-class classification case, let’s use the standard cross entropy as a measure of how the model is performing. If so, we will end up with the following table.

Prediction, and corresponding loss, for each image in the dataset

Now, what is the lowest cross entropy loss that can be achieved? In other words, what’s the optimal output? On this particular dataset, the total loss across one epoch will be

−[5 log(a) + 39 log(b) + 56 log(c) + 70 log(d) + 20 log(e) + 10 log(f)]

The probabilities are coupled. This is because when no large black nose is observed, we have a + b + c = 1. On the other hand, when an animal with a large black nose is observed, we have d + e + f = 1.
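This total loss can be written down directly in code. A minimal sketch, with the counts taken from the table above:

```python
import math

# Counts from the table: of the 100 "no large black nose" images, 5 are
# dogs, 39 cats and 56 rabbits; of the 100 "large black nose" images,
# 70 are dogs, 20 cats and 10 rabbits.
def total_loss(a, b, c, d, e, f):
    """-[5 log(a) + 39 log(b) + 56 log(c) + 70 log(d) + 20 log(e) + 10 log(f)]"""
    return -(5 * math.log(a) + 39 * math.log(b) + 56 * math.log(c)
             + 70 * math.log(d) + 20 * math.log(e) + 10 * math.log(f))

# Matching the observed proportions gives a lower loss than, say, a
# uniform guess of 1/3 for every class:
assert total_loss(0.05, 0.39, 0.56, 0.70, 0.20, 0.10) < total_loss(*[1 / 3] * 6)
```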

[3.3] Converged model

We can obtain the above total loss simply by summing up all the losses from the 200 rows in the table, and it matches what would be obtained by the formula. From information theory, we know that the lowest cross entropy is achieved when the predicted distribution is the same as the observed distribution (i.e., the ratio a:b:c should be 5:39:56, and the ratio d:e:f should be 70:20:10). If you are not convinced, you may run a grid search or random search to try other combinations; no better solution exists. In summary, we have
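If you would like to run that random search, here is one possible sketch (the sampling scheme is my own choice; any method of drawing valid probability triples would do):

```python
import math
import random

# Counts per group, from the worked example.
COUNTS_NO_NOSE, COUNTS_NOSE = (5, 39, 56), (70, 20, 10)

def total_loss(p, q):
    # p = (a, b, c) for the "no nose" images, q = (d, e, f) for the rest
    return -(sum(n * math.log(v) for n, v in zip(COUNTS_NO_NOSE, p))
             + sum(n * math.log(v) for n, v in zip(COUNTS_NOSE, q)))

def random_dist():
    # A random point on the 3-class probability simplex.
    w = [random.expovariate(1.0) + 1e-12 for _ in range(3)]
    s = sum(w)
    return tuple(v / s for v in w)

random.seed(42)
proportional = total_loss((0.05, 0.39, 0.56), (0.70, 0.20, 0.10))
best_random = min(total_loss(random_dist(), random_dist()) for _ in range(50_000))
# proportional <= best_random: no sampled pair of distributions beats
# matching the observed proportions.
```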

Aggregated information

Furthermore, training by gradient descent would give you results that are very close to the theoretical numbers stated. If you want to, you can build a simple neural network on 200 samples, represented by a tensor x of shape (200,1) where x[:100] = 0 and x[100:] = 1, with the target y corresponding to what is given in the table above. You do not need images for this experiment; we can prescribe that the feature vector is the output of some hypothetical feature extractor.
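A minimal version of that experiment might look like the following (I am assuming PyTorch here; the layer sizes, learning rate, and step count are arbitrary choices, not a prescribed recipe):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 200 samples of a single binary feature: the first 100 images lack the
# "large black nose" feature (x=0), the remaining 100 have it (x=1).
x = torch.zeros(200, 1)
x[100:] = 1.0
# Class ids 0=dog, 1=cat, 2=rabbit, following the counts in the table.
y = torch.cat([
    torch.tensor([0] * 5 + [1] * 39 + [2] * 56),   # no large black nose
    torch.tensor([0] * 70 + [1] * 20 + [2] * 10),  # large black nose
])

# A deliberately small network; the architecture is an arbitrary choice.
model = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 3))
opt = torch.optim.Adam(model.parameters(), lr=0.03)
loss_fn = nn.CrossEntropyLoss()

for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

probs = torch.softmax(model(x), dim=1)
# probs[0] should end up close to (0.05, 0.39, 0.56) and probs[100] close
# to (0.70, 0.20, 0.10), matching the observed proportions.
```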

4. Conclusion

This answers the questions raised in the beginning.

  1. What is the rationale/physical intuition, and theoretical basis, leading to a model predicting these percentages? → Images of different classes exhibit similar features and are not perfectly distinguishable. The output corresponds to the proportion observed in the training dataset, so as to minimise the cross entropy loss.
  2. Is predicting such percentages the best a model could do? → Yes. (Of course, subject to the usual assumptions of supervised learning, such as having the train and test distributions to be similar).

This article is getting long, and I will wrap up now. In my next article, I will follow up on this and include some code for you to replicate the results, as well as extend the scenario to remove the above assumptions and explain how the concepts above still hold. The code will be generalizable enough for you to try out the scenario where the feature vector is larger and the values are not confined to {0,1}.

If you have any comments or questions, please feel free to post in the comments, and I will be happy to discuss further! If you have requests for discussing certain topics, please do share your suggestions as well!

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
