Why is Naive Bayes’ theorem so Naive?

Chayan Kathuria

Naive Bayes is a classification algorithm based on the famous Bayes' theorem. So let's first understand what Bayes' theorem says, build the intuition for how Naive Bayes works, and see what is so naive about it.

Bayes' Theorem

Before diving into Bayes’ theorem, we need to understand a few terms —

  1. Independent and Dependent events
  2. Marginal Probability
  3. Joint Probability
  4. Conditional Probability

Independent and Dependent events

Consider 2 events, A and B. When the probability of occurrence of event A doesn't depend on the occurrence of event B, then A and B are independent events. For example, if you toss 2 fair coins, the probability of getting heads on each coin is 0.5 regardless of what the other coin shows. Hence the events are independent.

Now consider a box containing 5 balls — 2 black and 3 red. The probability of drawing a black ball first will be 2/5. Now the probability of drawing a black ball again from the remaining 4 balls will be 1/4. In this case, the two events are dependent as the probability of drawing a black ball for the second time depends on what ball was drawn on the first go.
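As a quick sanity check, here is a minimal Python simulation of this box (purely illustrative, under the setup described above); it estimates the probability that the second ball is black given that the first one was black:

```python
import random

# Box with 2 black and 3 red balls; draw two balls without replacement
# and estimate P(second is black | first is black) by simulation.
trials = 100_000
first_black = 0
both_black = 0

for _ in range(trials):
    box = ["black", "black", "red", "red", "red"]
    random.shuffle(box)
    first, second = box[0], box[1]
    if first == "black":
        first_black += 1
        if second == "black":
            both_black += 1

# Should print roughly 0.25, i.e. 1/4, not the unconditional 2/5.
print("P(second black | first black) ~", both_black / first_black)
```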

Marginal Probability

Marginal probability is nothing but the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).

Joint Probability

Joint Probability is the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

Conditional Probability

Conditional probability is the probability of one (or more) events given the occurrence of another event. In other words, it is the probability of an event A occurring when a secondary event B is true, e.g. P(A given B) or P(A | B).

Intuition

So consider the previous example of a box with 3 red and 2 black balls. The marginal probability of picking a black ball on the first go will be 2/5. Let this be P(A). Now, from the remaining 3 red and 1 black balls, the probability of drawing another black ball will be 1/4, which is P(B|A). This is the conditional probability of drawing a black ball given that a black ball has already been drawn on the first go, which was event A.

Now if we multiply both these probabilities, we will get 1/10 which is the joint probability P(A,B).

Multiplication Rule: P(A, B) = P(A) * P(B|A), which rearranges to P(B|A) = P(A, B) / P(A)

Plugging in the values of P(A, B) and P(A) in the above equation, we get P(B|A) = (1/10) / (2/5) = 1/4. So 1/4 is the conditional probability of event B when A has already occurred (dependent events).
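The same arithmetic can be reproduced with exact fractions; a small sketch purely for illustration:

```python
from fractions import Fraction

p_a = Fraction(2, 5)          # P(A): first ball drawn is black
p_b_given_a = Fraction(1, 4)  # P(B|A): second ball is black given the first was black

# Multiplication rule: P(A, B) = P(A) * P(B|A)
p_a_and_b = p_a * p_b_given_a
print(p_a_and_b)              # 1/10

# Rearranging recovers the conditional probability: P(B|A) = P(A, B) / P(A)
print(p_a_and_b / p_a)        # 1/4
```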

Similarly, we can also define P(A|B) using the multiplication rule the other way round:

P(A, B) = P(B) * P(A|B), i.e., P(A|B) = P(A, B) / P(B)

Now we very well know that P(A,B) = P(B,A). Therefore, equating both expressions for the joint probability, we get Bayes' theorem:

P(B|A) = P(A|B) * P(B) / P(A)

Here, P(B) is the prior probability, P(A|B) is the likelihood, P(A) is the marginal probability and P(B|A) is the posterior probability.
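As a minimal sketch, the theorem translates almost directly into a small function. The numbers below come from the ball example: by symmetry, the second draw is black with probability 2/5, so P(B) = 2/5, and P(A|B) = P(A,B)/P(B) = (1/10)/(2/5) = 1/4 (these values are filled in here just for illustration):

```python
from fractions import Fraction

def posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)."""
    return likelihood * prior / marginal

# Ball example: P(A|B) = 1/4, P(B) = 2/5, P(A) = 2/5.
print(posterior(Fraction(1, 4), Fraction(2, 5), Fraction(2, 5)))  # 1/4 = P(B|A)
```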

Naive Bayes’ theorem

Intuition

Bayes' theorem, which we just discussed above, is ridiculously simple and can be used in classification tasks, be it binary or multi-class classification.

Consider we have a classification Machine Learning problem at hand. Suppose we have 5 features X1, X2, X3, X4 and X5, and the target variable is Y. Now we need to fit our data into Bayes' theorem so that it can predict the probability of class Y given the values of these 5 features. Applying Bayes' rule here gives:

P(Y|X1, X2, X3, X4, X5) = P(X1, X2, X3, X4, X5|Y) * P(Y) / P(X1, X2, X3, X4, X5)

Assuming all the features are independent, we get:

P(Y|X1, X2, X3, X4, X5) = P(X1|Y) * P(X2|Y) * P(X3|Y) * P(X4|Y) * P(X5|Y) * P(Y) / P(X1, X2, X3, X4, X5)

This can be presented more compactly as:

P(Y|X1, …, X5) = P(Y) * Π P(Xi|Y) / P(X1, …, X5)

Here the Π (Pi) symbol denotes the product of the individual likelihood probabilities. Now if you look at it closely, the denominator, i.e. the marginal probability, is constant for all instances. Hence we can write the above equation as a proportionality:

P(Y|X1, …, X5) ∝ P(Y) * Π P(Xi|Y)

Now this will give us the probabilities for both the classes. One will be higher (which we will consider as the predicted class) and the other will be lower. Hence we take the argmax() over the classes to obtain the prediction.
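To make the argmax step concrete, here is a rough, hand-rolled sketch. The priors and likelihood tables are made-up numbers purely for illustration, not estimates from any real dataset:

```python
# Hypothetical priors P(Y) and per-feature likelihoods P(Xi = value | Y = class);
# the numbers are illustrative only.
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": {"outlook": {"sunny": 0.2, "overcast": 0.5, "rainy": 0.3},
            "temperature": {"hot": 0.3, "mild": 0.4, "cool": 0.3}},
    "no":  {"outlook": {"sunny": 0.6, "overcast": 0.1, "rainy": 0.3},
            "temperature": {"hot": 0.5, "mild": 0.3, "cool": 0.2}},
}

def predict(instance):
    # Score each class as P(Y) * product of P(Xi | Y); the constant
    # denominator P(X1, ..., Xn) is dropped.
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in instance.items():
            score *= likelihoods[cls][feature][value]
        scores[cls] = score
    # argmax over the classes gives the prediction.
    return max(scores, key=scores.get), scores

print(predict({"outlook": "sunny", "temperature": "hot"}))
```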

Example

Now consider the following simple dataset where we have 2 tables, Outlook and Temperature. Here Outlook has 3 possibilities, viz. Sunny, Overcast and Rainy, and the result is Yes/No as to whether the man will play tennis or not.

Outlook

Similarly the second table consists of data regarding temperature and its effect on the outcome Yes/No.

Temperature

And from the counts in these tables we also get the total (prior) probabilities of Yes and No.

Now the problem at hand is that we need to find whether the man will play or not if the weather is Sunny and the temperature is Hot. Or, in terms of probability, we need to find P(Yes|Today), where Today is (Sunny, Hot). Here weather and temperature are nothing but the 2 features of our dataset. So the equation becomes:

P(Yes|Today) ∝ P(Sunny|Yes) * P(Hot|Yes) * P(Yes)
P(No|Today) ∝ P(Sunny|No) * P(Hot|No) * P(No)

We solve the above for both P(Yes|Today) and P(No|Today) by plugging in the values from the tables above. P(Sunny) and P(Hot) won't be considered, as the denominator P(Today) is constant for both classes.

So looking at the probabilities, it is evident that P(No|Today) is higher, so the prediction for this instance will be 'No'. But these are not the class probabilities, and you might have noticed that they don't add up to 1. So we need to normalize the above conditional probabilities to get the class probabilities; normalizing gives P(Yes|Today) ≈ 0.27.

And to get the class probability for No, we can simply subtract this from 1. Hence the probability of No will be 1 − 0.27 = 0.73.

Therefore the algorithm will predict the class as ‘No’.
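The normalization step itself is just dividing each unnormalized score by their sum. A tiny sketch, with placeholder scores chosen only so the result matches the 0.27/0.73 split above:

```python
# Unnormalized posteriors (placeholders; the real values would come from the
# tables above, since the constant denominator was dropped).
score_yes = 0.027
score_no = 0.073

total = score_yes + score_no
p_yes = score_yes / total  # class probability for Yes
p_no = score_no / total    # class probability for No, equals 1 - p_yes

print(round(p_yes, 2), round(p_no, 2))  # 0.27 0.73, and they sum to 1
```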

Why so Naive?

Now, coming to the most important question (and also the title of this article :p), what is so ‘naive’ about this Naive Bayes’ Classifier?

If you paid attention when we derived the equation for the dataset with 5 features, you'll recall that we simply multiplied all the individual conditional probabilities of the individual features, like P(X1|Y)*P(X2|Y)*…*P(X5|Y). And we can only write the total conditional probability as the product of the individual conditional probabilities of the features when we assume the features to be independent of each other. This is the 'naive' assumption that we made here to make Bayes' theorem work for us.

But in a real-life scenario, features are almost never independent of each other. There is always some sort of dependency within the features. For example, if one feature is a person's age and another is their annual salary, there is a clear dependency in most cases.

However, we still go ahead and apply this algorithm to classification problems, even to text classification, and it works surprisingly well!
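For instance, a typical scikit-learn pipeline for a toy text-classification task might look roughly like this (the tiny dataset is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data for illustration only.
texts = ["win a free prize now", "meeting rescheduled to noon",
         "claim your free reward", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize inside"]))  # likely ['spam']
```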
