Naive Bayes as a Generative Model

Setting the tone for GANs

Mehul Gupta
Data Science in your pocket
9 min read · Jul 20, 2021


After covering the basics of generative modeling in the first part, this time I will explore how our very own Naive Bayes acts as a generative model. We have all used it for some classification task. But data generation? Probably a big no.

Do read about Naive Bayes here if you are new to it.

But, to your surprise, Naive Bayes has been among the earliest models used for data generation, though not for data as complex as images. So, in this post, I will walk through how one can use Naive Bayes as a generative model, and why using it to generate complex data like images or text is a very bad idea.

Imagine yourself as an Instagram model!!

You got yourself photographed in the attires below:

Image source: https://www.oreilly.com/library/view/generative-deep-learning/9781492041931/

For your next photoshoots, you need a few new combinations, and you definitely don't want to repeat any old attire.

But you just can't think of a new, apt combination to wear. Can generative modeling, specifically Naive Bayes, generate apt, unique combinations for you to choose from?

YESSS !!

Before moving on, let us note two things:

  • You don't have any items other than those observed in the above images. So, you can have just the given values for each of the 'features':

Hence, the total number of possibilities is

7 (hairstyles) x 6 (hair colors) x 3 (specs) x 4 (tops) x 8 (colors) = 4032 unique combinations. So the entire sample space has 4032 points. Also, our training dataset does not take images directly as input, but tabular data in the form of the above-mentioned features.
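As a quick sanity check, here is a minimal Python sketch that enumerates this sample space (the feature names and integer-coded values are my own labels for illustration):

```python
from itertools import product

# Number of possible values per feature, as counted above
feature_values = {
    "hairstyle":  7,
    "hair_color": 6,
    "glasses":    3,
    "top_type":   4,
    "top_color":  8,
}

# Each point of the sample space is one unique attire combination
sample_space = list(product(*(range(n) for n in feature_values.values())))
print(len(sample_space))  # 4032 = 7 * 6 * 3 * 4 * 8
```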

  • Your previous photos give a very strong hint: the probability of choosing a few things in a particular category is higher than for other items in the same category. For example, you prefer silver-grey curly hair over other hairstyles, and white tops over green ones. You want your model to reflect the same preferences.

So what does the second point mean mathematically?

It means the probability across attires is not constant; it follows some distribution p_data (which is definitely non-uniform; this is what we called a pdf in my last post). You prefer a few things in your wardrobe over others.

Now, if you remember my previous post, this terminology looks familiar and smells of a generative model. The idea is to:

  • Estimate this pdf p_data using some pdf p_model, given the tabular data from the sample Insta images above (dataset X).
  • Generate new, apt combinations to choose from for newer Insta images, using the estimated pdf p_model.

To estimate the pdf p_data using parametric modeling, we need to estimate the parameters that can represent this distribution. Now, the most naive method is to treat the probability of every possible point in the sample space as a parameter. Hence, we have 4032 (total possible combinations) - 1 = 4031 parameters, one for the probability of each point in the sample space.

Why -1?

As the probabilities of all 4032 points should add up to 1, we just need to know the probabilities of 4031 combos. The last one can be calculated as 1 minus the sum of the other 4031 probabilities.

That's fine, but how do we calculate the probability of these 4031 points (attires, in our case)? Again, following the easier way:

P(any attire) = frequency of the attire observed in the sample data / total samples

So if you wore a red round top with white hair and no glasses 4 times in, say, 100 Instagram images, then

P(red round top & white hair & no glasses) = 4/100

What about unseen combinations? Would a probability of 0 be correct? No.

We can go for additive smoothing for combinations we have never observed (adding a small constant to every count, and adjusting the denominator accordingly) so that the probability of unseen combinations is not 0.
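As a sketch, this is what the full-joint estimator with additive smoothing could look like (the function name and `alpha` are my own; the combinations in `attires` would come from the tabular data above):

```python
from collections import Counter

def fit_joint(attires, sample_space, alpha=1.0):
    """Estimate P(attire) as a smoothed relative frequency over the whole sample space."""
    counts = Counter(attires)          # how often each exact combination was worn
    n, k = len(attires), len(sample_space)
    # Additive (Laplace) smoothing: every combination, seen or unseen,
    # gets `alpha` added to its count, so no probability is exactly 0.
    return {combo: (counts[combo] + alpha) / (n + alpha * k)
            for combo in sample_space}

# e.g. attires = [("curly", "silver", "none", "round", "white"), ...]
```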

But a deep, hidden problem still exists.

If you remember, you have a strong craving for silver hair and white tops. Hence, by genuine logic, any unseen combination with either silver hair or white tops, or both, should have a higher probability.

What if you get a color very close to white in your wardrobe, say eggshell white? As it is very close to being called white (but not exactly white), should any combination including eggshell white have a slightly higher probability than one with colors such as green or yellow?

We will focus on the first problem for now.

In our scenario above, we calculated the probability of any combination as its total occurrences / total images, or else some constant 'k'.

This approach of assigning probabilities to all possible attires (samples) in the sample space assigns the same probability 'k' to every unseen combination, which does not depict our pdf p_data: the model should assign a higher probability to unseen combos that contain the preferred elements than to those that don't.

If observed closely, this sort of probability assignment assumes the features are almost 100% dependent on each other, i.e. they occur together or they just don't. A big drawback follows:

It may be the case that you loved silver curly hair and white round tops with glasses, but you never thought of going without glasses in this combination. Now, if we wish to calculate the preference for (silver curly hair, white round top, no glasses), the previous approach assigns it the same probability as some random hair with a random-colored top and random glasses/no glasses that has never been observed. That is wrong, as 4 out of 5 items are ones you usually prefer. Why does the absence of just one feature (no glasses here) make this combination look like any other random combination? It should score higher than such random combos!!

The problem arises because we assume that

P(A, B, C, D, E) = P(A ∩ B ∩ C ∩ D ∩ E) (Formula 1)

Hence, all or nothing: if all the features don't occur together at least once, the probability is 'k' (the constant for unseen combinations).

What about

P(A, B, C, D, E) = P(A) x P(B) x P(C) x P(D) x P(E) (Formula 2)

This solves a major problem we were facing in the example above: it avoids the all-or-nothing situation. Revisiting the same scenario, if a combination has 4 out of 5 attributes you love, it now has a good chance of getting a higher probability than a random combination, unlike earlier.

This is what Naive Bayes does: it assumes all the features are independent, i.e. the choice of hairstyle has nothing to do with your top color, your top's style is independent of the accessories you wear, etc. Hence, our probability distribution function changes from Formula 1 to Formula 2. To calculate

P(black hair, long curly hair, pink top, round top, no glasses) =

P(black hair) x P(long curly hair) x P(pink top) x P(round top) x P(no glasses)

where P(any element x) = occurrences of x / total samples. So

P(black hair) = frequency(black hair) / total Insta images
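Under this factorization, the joint probability is just a product of per-feature relative frequencies. A minimal sketch, assuming the same tabular rows as before (function names are my own; smoothing omitted for brevity):

```python
from collections import Counter

def fit_marginals(rows, feature_names):
    """Estimate one categorical distribution per feature from tabular data."""
    n = len(rows)
    marginals = {}
    for i, name in enumerate(feature_names):
        counts = Counter(row[i] for row in rows)
        marginals[name] = {value: c / n for value, c in counts.items()}
    return marginals

def naive_prob(attire, marginals, feature_names):
    """P(attire) = product of P(feature value): the Naive Bayes factorization."""
    p = 1.0
    for name, value in zip(feature_names, attire):
        p *= marginals[name].get(value, 0.0)  # smoothing omitted for brevity
    return p
```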

Now, recollecting our naive probability distribution function which had 4031 parameters: Naive Bayes reduces this number to just 23!!

How?

As Naive Bayes assumes all features are independent, the total number of parameters for the pdf is

(7 hairstyles - 1) + (6 hair colors - 1) + (3 glasses - 1) + (4 clothing types - 1) + (8 clothing colors - 1) = 23

Why a -1 for every feature? The same reason as stated earlier: for hair color, for example, if the probabilities of 5 colors are known, the 6th can be calculated.
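The two parameter counts, checked side by side:

```python
from math import prod

values = [7, 6, 3, 4, 8]                  # possible values per feature

full_joint = prod(values) - 1             # 4031: one free parameter per sample-space point
naive = sum(v - 1 for v in values)        # 23: one categorical distribution per feature
print(full_joint, naive)                  # -> 4031 23
```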

Why is the total parameter count calculated using A + B + C + ... and not A x B x C x ... (as earlier)?

As the features are independent, the joint probability is calculated as:

P(black hair ∩ pink top) = P(black hair) x P(pink top)

& P(black hair ∩ round top) = P(black hair) x P(round top)

P(black hair) remains the same in both, so a single estimate of P(black hair) can be reused to calculate both probabilities. That wasn't the case earlier, when we assumed the features were dependent: in our first pdf, P(black hair ∩ pink top) and P(black hair ∩ round top) could not be derived from just P(black hair), so the probability of every combination that includes black hair had to be estimated individually.

So far, we have seen how Naive Bayes can help generate new data similar to a given, less complex, tabular dataset.
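Putting the pieces together, here is a rough sketch of Naive Bayes as a generator: fit the per-feature marginals (as in the earlier sketch), then sample each feature independently. The rejection of already-worn attires is my own addition, to match the no-repeats requirement:

```python
import random

def sample_attire(marginals, feature_names):
    """Draw one combination by sampling every feature independently."""
    attire = []
    for name in feature_names:
        values, probs = zip(*marginals[name].items())
        attire.append(random.choices(values, weights=probs, k=1)[0])
    return tuple(attire)

def suggest_new(marginals, feature_names, already_worn, k=5):
    """Keep sampling until we collect k combinations that were never worn."""
    suggestions = set()
    while len(suggestions) < k:
        attire = sample_attire(marginals, feature_names)
        if attire not in already_worn:
            suggestions.add(attire)
    return suggestions
```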

But can we use it to generate images? Let's see.

The game has changed a bit: now you want your model to generate mock pictures, rather than just suggest combinations, so that you can see how you would look in a new attire. Also, we won't feed tabular data this time. We will feed the original, historic Instagram images from your account and let the model generate new poses and dress combinations for you.

This is exciting!!

You fed your Naive Bayes model:

Image source: https://www.oreilly.com/library/view/generative-deep-learning/9781492041931/

But the generated sample images looked like the ones below:

Image source: https://www.oreilly.com/library/view/generative-deep-learning/9781492041931/

Awful results

So, what actually went wrong?

In our previous scenario, we fed an [n x 5] table to train our Naive Bayes, where n is the number of samples/images (say 50) and 5 is the number of features. The feature values were provided directly, as what we call tabular data.

In this scenario, the data has become complex! We are now providing

[n x 64 x 64] entries to train our Naive Bayes, where 64 x 64 is, say, the image dimension. If you observe, these raw pixels don't provide any direct clue about the hairstyle, top style, colors, etc., so the model has to extract these features itself, which is a tough ask for Naive Bayes.

What went wrong when we used Naive Bayes in this case:

  • Naive Bayes assumes all features are independent. This assumption helped us in the first scenario, where we wanted fresh combinations. Now it has become a bottleneck. How? The model has no way to know that adjacent pixels near the 'top' region belong to one entity and should look similar. It portrays the facial color and lips with decent accuracy, since those stayed roughly constant across the training samples, but not the tops, which changed rapidly between images. Having seen a wide variety of top colors, it assigned all possible values at the pixels in that region more-or-less equal probabilities, so the final output looks like a complete mixup. A similar pattern shows up for hair and eyes/glasses: wherever values changed across samples, and with no way for a pixel to know its neighbors' values, the color assigned to each pixel looks independent, a complete mess. (The sketch after this list makes this concrete.)
  • Also, in our first case, the sample space was small (4032 points) and every possible point generated a valid (though maybe not preferred) combination. With images, we have a humongous sample space: each pixel can take 256 unique values and we have 64 x 64 = 4096 pixels. Moreover, not all combinations lead to a valid image. Hence, 1) determining the part of the sample space that leads to a valid image is tough, and 2) it becomes even tougher when the dependence between pixels is ignored, as Naive Bayes does. Pixels in any image are usually correlated in small patches (like the top, face, and lips in our case). Such correlation can't be ignored!!
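To make the failure concrete, here is a toy sketch (the 64 x 64 shape and function names are my own assumptions) that fits one independent categorical distribution per pixel, which is exactly what the Naive Bayes assumption amounts to here, and then samples each pixel on its own:

```python
import numpy as np

def fit_pixelwise(images):
    """images: (n, 64, 64) uint8 array with values 0..255.
    Returns a (64, 64, 256) array of per-pixel value probabilities."""
    n, h, w = images.shape
    probs = np.zeros((h, w, 256))
    for v in range(256):
        probs[:, :, v] = (images == v).sum(axis=0) / n
    return probs

def sample_image(probs):
    """Sample every pixel independently -- the Naive Bayes assumption."""
    h, w, _ = probs.shape
    out = np.empty((h, w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.random.choice(256, p=probs[i, j])
    return out

# Pixels that stayed constant across training images (face, lips) reproduce
# fine; pixels that varied (tops, hair) come out as independent noise.
```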

Can we get over these limitations? Let's find out in my next post, where we will start with Autoencoders and a big milestone in generative modeling, VAEs.

And here we will wrap this up. Also, a big thanks to Generative Deep Learning by O'Reilly for helping me out with this post.
