Probabilistic Generative Models — A worked example

James Koh, PhD
Published in MITB For All
6 min read · Apr 14, 2024

Introduction

In this article, we will see what makes probabilistic generative models tick. Through the walkthrough below, we will see (1) how data could come to be distributed in the first place, under the assumption that each sample is generated by some class-dependent source with its own mean and standard deviation, and (2) how we can apply such a model to make predictions on an observed dataset.

Note that this article is meant to supplement the materials of CS610 lecture 1, and hence I will exclude some of the formulas.

Suppose that we have a dataset of 100 points, coming from a total of 4 distinct classes. For example, these could represent ‘HDB_flat’, ‘condominium’, ‘terrace’ and ‘bungalow’.

Each point is represented as a vector of three floating-point values. We can think of these as feature vectors representing some attributes of the points. For example, a point could be [price_in_millions, number_of_square_meters_in_hundreds, electricity_consumption_in_100kWh]. In fact, the concept of feature vectors goes well beyond structured data, and can even be applied to images (and more) as I had explained here.

The class is what we commonly refer to as the label y, while the feature vector is the observation x.

The idea of probabilistic generative models is that each sample possesses properties which follow a particular distribution depending on its class. For simplicity, we assume a normal distribution here, although this is in no way a requirement for using generative models.

Let’s focus solely on the first feature, which in the given example is price_in_millions. Under our premise, we can say that all samples corresponding to HDB_flat cost $1 million on average, although any individual flat could cost more or less, according to some standard deviation.
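To make this concrete, here is a minimal sketch of what that assumption means for this single feature, using the mean of $1 million from the example and an assumed standard deviation of $0.2 million.

import numpy as np

# Five hypothetical HDB_flat prices (in millions), drawn from a normal
# distribution with mean 1.0 and an assumed standard deviation of 0.2
hdb_prices = np.random.normal(loc=1.0, scale=0.2, size=5)
print(hdb_prices)  # values scattered around 1.0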

By extension, we can say that each sample is represented by a feature vector drawn from a class-dependent normal distribution (or some other distribution), and consequently, given a feature vector, we will be able to estimate the probability that it corresponds to a given class.
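For reference (the detailed treatment is in the lecture notes), the quantity we will eventually compute is just the posterior given by Bayes' rule, where C_k denotes a class and x the feature vector:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{\sum_{j} p(\mathbf{x} \mid C_j)\,p(C_j)}$$

Here p(x | C_k) is the class-conditional (normal) likelihood and p(C_k) is the prior proportion of that class.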

Objective #1 — Generating the data

With this, let’s generate the dataset. We will then use this generated dataset to achieve objective #2, which is to run through a worked solution.

import numpy as np

# Ground-truth parameters of each class: per-feature mean, per-feature
# standard deviation, and the prior probability p(C) of the class.
class_params = {
    'Class 1': {'mean': [1, 2, 3], 'std': [0.2, 0.4, 0.6], 'pC': 0.1},
    'Class 2': {'mean': [4, 5, 6], 'std': [1.0, 1.0, 1.0], 'pC': 0.2},
    'Class 3': {'mean': [7, 8, 9], 'std': [1.5, 1.6, 1.7], 'pC': 0.3},
    'Class 4': {'mean': [10, 11, 12], 'std': [2.0, 2.5, 3.0], 'pC': 0.4}
}
n_points = 100

x_holder = {}
y_holder = []
for class_name, params in class_params.items():
    n_class = int(n_points * params['pC'])  # number of samples for this class
    # Draw each of the 3 features independently from a normal distribution
    x_holder[class_name] = np.random.normal(
        loc=params['mean'], scale=params['std'], size=(n_class, 3)
    )
    y_label = class_name[-1]  # e.g. 'Class 1' -> '1'
    y_holder.extend([y_label] * n_class)

x_train = np.concatenate(
    [x_holder['Class 1'], x_holder['Class 2'], x_holder['Class 3'], x_holder['Class 4']],
    axis=0
)
y_train = np.array(y_holder)
print(x_train.shape, y_train.shape)  # (100, 3) and (100,)

Suppose that 10% of the housing in Singapore is HDB_flat, 20% is condominium, 30% is terrace and 40% is bungalow. In addition, suppose that on average, an HDB_flat costs $1 million, has a size of 200 sqm, and consumes 300 kWh of electricity. Meanwhile, a condominium unit costs $4 million, has a size of 500 sqm, and consumes 600 kWh of electricity. You get the idea.

Let’s not argue about how realistic the numbers are. These are just some reader-friendly numbers.

Running the above code, we get a dataset that looks like the following:

Subset of the generated dataset, for illustration purposes.
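Since the screenshot is not reproduced here, you can inspect a small subset of the generated arrays yourself; the snippet below (which simply prints the first two rows of each class, assuming the variables defined above) stands in for that illustration.

# Print the first two generated samples of each class, plus their labels
start = 0
for class_name, params in class_params.items():
    n_class = int(n_points * params['pC'])
    print(class_name)
    print(x_train[start:start + 2])   # first two feature vectors of this class
    print(y_train[start:start + 2])   # their labels, e.g. ['1' '1']
    start += n_class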

If we plot just the first two dimensions (not the first two principal components, although you could do that if you want), the data can be visualized nicely as follows.

import matplotlib.pyplot as plt

# Scatter plot of the first two features, one colour per class
for class_name in x_holder:
    plt.scatter(
        x_holder[class_name][:, 0], x_holder[class_name][:, 1],
        alpha=0.6
    )
plt.xlabel("price_in_millions")
plt.ylabel("number_of_square_meters_in_hundreds")
plt.show()

Distribution of the first two features for the 100 samples. Colors are obvious.

Objective #2 — Solution given a dataset

We now have a dataset of 100 hypothetical houses. Let’s pretend that these are samples which we observed and recorded. Our objective now is to predict, given the features of a house ([price_in_millions, number_of_square_meters_in_hundreds, electricity_consumption_in_100kWh]), what housing type it actually is.

This brings us to one important learning point. A probabilistic generative model (or in fact any model) is only as good as the validity of its underlying assumptions.

The reason our example works nicely is that the data were indeed generated from normal distributions. That may or may not be true for the particular dataset you are working on.

Back to the worked example: given x_train and y_train, let’s start by computing the prior, mean and covariance of each class.

class_labels = np.unique(y_train)
n_classes = len(class_labels)
n_samples = len(y_train)
priors = []       # estimated p(C) for each class
means = []        # estimated mean vector for each class
covariances = []  # estimated covariance matrix for each class

for label in class_labels:
    x_class = x_train[y_train == label]          # samples belonging to this class
    priors.append(len(x_class) / n_samples)      # class proportion as the prior
    means.append(np.mean(x_class, axis=0))       # per-feature mean
    covariances.append(np.cov(x_class, rowvar=False))  # 3x3 covariance matrix

Let’s evaluate the findings.

Mean and covariance of the last class, as well as the standard deviation for each of its three features.
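The screenshot itself is not reproduced here, but you can obtain the same numbers with a few lines (assuming the variables computed above); the per-feature standard deviations are just the square roots of the diagonal of the covariance matrix.

# Inspect the estimated parameters of the last class ('4', i.e. bungalow)
print(means[-1])                           # estimated mean vector
print(covariances[-1])                     # estimated 3x3 covariance matrix
print(np.sqrt(np.diag(covariances[-1])))   # per-feature standard deviations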

Based on the given dataset, the estimated mean for the last class (bungalow) is [10.03, 11.03, 11.99]. Notice that this is incredibly close to the true mean of [10, 11, 12], which in reality we would not know and hence can only estimate.

Meanwhile, from the covariance matrix, we infer that the standard deviations of the three features are 2.12, 2.49 and 3.74 respectively. These are reasonably close to the standard deviations used to generate the samples in the first place, though the small sample size introduces some estimation error. Furthermore, even though the three features were generated independently of each other, the covariance matrix suggests a negative correlation between the first two features, and a positive correlation between the second and third features. Again, this is not surprising, as we are just computing an estimate from a small sample.
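If you want to see those spurious correlations explicitly, one quick check (assuming the variables above) is to compute the correlation matrix of the last class directly:

# Correlation matrix of the last class; the off-diagonal entries are the
# sample correlations between the three features
x_last = x_train[y_train == class_labels[-1]]
print(np.corrcoef(x_last, rowvar=False))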

Let’s now generate the predictions using the probabilistic generative model (refer to lecture 1 for the formulas).

Note that we do not need to write a loop and make a prediction one sample at a time. Instead, we will make use of vectorized computation and evaluate all samples at once.

import scipy.stats

def posterior_probabilities(x, means, covariances, priors):
    posteriors = []
    for i in range(n_classes):
        # Class-conditional likelihood p(x|C_i), evaluated for all samples at once
        likelihood = scipy.stats.multivariate_normal.pdf(
            x, mean=means[i], cov=covariances[i]
        )
        posteriors.append(likelihood * priors[i])  # unnormalized posterior p(x|C_i)*p(C_i)
    probabilities = np.array(posteriors).T  # list of n_classes arrays of (n,) -> (n, n_classes)
    normalized_p = probabilities / np.sum(probabilities, axis=1, keepdims=True)  # normalize over classes
    return normalized_p

y_prob = posterior_probabilities(x_train, means, covariances, priors)
y_hat = class_labels[np.argmax(y_prob, axis=1)]  # predicted class = highest posterior

accuracy = np.mean(y_hat == y_train)
print(accuracy)

Doing so, we get an accuracy of 95% on the training data (I’m not going to do a train/val/test split for the purpose of this tutorial). Let’s further investigate the outputs to gain a better understanding of the predictions.

# Print the posterior probabilities for every 5th sample, alongside the true label
for y_true, (p1, p2, p3, p4) in zip(y_train[::5], y_prob[::5]):
    print("y_true: %s, p1: %.2f, p2: %.2f, p3: %.2f, p4: %.2f" % (y_true, p1, p2, p3, p4))

Predictions are close to the true labels.

Nice, right? Now, it’s time to try it out on your own!
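For instance, here is a minimal sketch of how you might classify a single new house using the same function; the feature values below are made up purely for illustration.

# A hypothetical new house: $6.5m, 750 sqm, 820 kWh
x_new = np.array([[6.5, 7.5, 8.2]])

p_new = posterior_probabilities(x_new, means, covariances, priors)
print(p_new)                                   # posterior over the 4 classes
print(class_labels[np.argmax(p_new, axis=1)])  # most probable class label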

In our particular example, the clusters are separated far apart, relative to their standard deviation. You can try changing the class parameters on your own and observe what happens.
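As one possible experiment (a sketch only, with arbitrarily chosen numbers), you could move the class means closer together, regenerate the data with the code from Objective #1, re-estimate the parameters, and watch the accuracy drop as the clusters start to overlap.

# Example modification: squeeze the class means closer together
class_params['Class 1']['mean'] = [1, 2, 3]
class_params['Class 2']['mean'] = [2, 3, 4]
class_params['Class 3']['mean'] = [3, 4, 5]
class_params['Class 4']['mean'] = [4, 5, 6]
# Then rerun the data generation, parameter estimation and prediction
# steps above, and compare the resulting accuracy against 95%.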

Conclusion

With this, you’ve now gained an understanding of the intuition behind probabilistic generative models, as well as learnt how to make predictions using this approach!

If you have any questions, feel free to approach me during the consultation sessions and/or after class.

Additional side note: If you are keen to publish your technical articles on this publication, please approach me directly!

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
