Intuition behind Naive Bayes algorithm & Laplace(Additive) smoothing

Sai Karthik Chedella
Published in Analytics Vidhya · 7 min read · Jan 1, 2020

In this blog, I'll describe the intuition behind the "Naive Bayes algorithm". Let's get familiar with some probability basics before moving on to the algorithm.

Table Of Contents

1) Prerequisites
2) Naive Bayes Algorithm
3) Derivation
4) Example
5) Laplace (or) Additive Smoothing

If you're comfortable with these prerequisites, start from the "Naive Bayes Algorithm" part.

1) Prerequisites:

I'll give a quick overview of these topics.

a) Independent Events: If the occurrence of one event doesn't affect the occurrence of another event, then those events are considered independent.

Ex: A = observing 6 on die 1, B = observing 5 on die 2. These are independent, since the outcome of die 1 doesn't affect die 2.

P(A ∩ B) = P(A) * P(B)

Similar to the above instance, what if we have only one die?

A = rolling a 6 on die 1 and B = rolling a 5 on die 1 (on the same roll), then

P(A ∩ B) = 0

Since both can never occur at the same time on the same die, these events are called "mutually exclusive events".
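To make this concrete, here is a tiny Python sketch (my own illustration, not from the original post) that enumerates the outcomes of two fair dice to check the independence rule, and a single die to see mutual exclusivity:

```python
from fractions import Fraction
from itertools import product

# All (die1, die2) pairs, 36 equally likely outcomes
outcomes = list(product(range(1, 7), repeat=2))

p_a = Fraction(sum(1 for d1, d2 in outcomes if d1 == 6), len(outcomes))   # P(A): die 1 shows 6
p_b = Fraction(sum(1 for d1, d2 in outcomes if d2 == 5), len(outcomes))   # P(B): die 2 shows 5
p_ab = Fraction(sum(1 for d1, d2 in outcomes if d1 == 6 and d2 == 5), len(outcomes))

print(p_ab == p_a * p_b)   # True -> independent: P(A ∩ B) = P(A) * P(B) = 1/36

# With a single die, "rolling a 6" and "rolling a 5" can never happen on the same roll
p_both = Fraction(sum(1 for d in range(1, 7) if d == 6 and d == 5), 6)
print(p_both)              # 0 -> mutually exclusive events
```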

b) Conditional Probability: This is a measure of the "probability of occurrence of one event assuming that another event has already occurred". The probability of occurrence of A, when B has already occurred, is denoted as P(A|B):

P(A|B) = P(A ∩ B) / P(B)

From the simple definition of independence, if A and B are any two independent events, then

P(A ∩ B) = P(A) * P(B) (i.e. the joint probability)

But in general we don't know whether the two events are dependent or independent, so the joint probability of the two events is written using the conditional probability:

P(A ∩ B) = P(A|B) * P(B)
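Here is a similar sketch with two dependent events that I made up for illustration (A = "die 1 shows 6", B = "the two dice sum to at least 10"): the joint probability always equals P(A|B) * P(B), while the plain independence rule fails.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] == 6}        # die 1 shows 6
B = {o for o in outcomes if sum(o) >= 10}     # the two dice sum to at least 10

p_a = Fraction(len(A), len(outcomes))
p_b = Fraction(len(B), len(outcomes))
p_ab = Fraction(len(A & B), len(outcomes))
p_a_given_b = p_ab / p_b                      # P(A|B) = P(A ∩ B) / P(B)

print(p_ab == p_a_given_b * p_b)              # True: P(A ∩ B) = P(A|B) * P(B) always holds
print(p_ab == p_a * p_b)                      # False here: A and B are not independent
```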

c) Bayes Theorem: It is used to calculate the probability of an event based on its association with another event. Starting from the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B)

After substituting the joint probability P(A ∩ B) = P(B|A) * P(A) into the above equation, we get

P(A|B) = P(B|A) * P(A) / P(B)
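Continuing the same made-up two-dice events, this sketch recovers P(A|B) from the "reversed" conditional P(B|A) using Bayes theorem, and it matches what direct counting gives:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if o[0] == 6}        # die 1 shows 6
B = {o for o in outcomes if sum(o) >= 10}     # the two dice sum to at least 10

p_a = Fraction(len(A), len(outcomes))
p_b = Fraction(len(B), len(outcomes))
p_b_given_a = Fraction(len(A & B), len(A))    # P(B|A), computed by counting inside A

p_a_given_b = p_b_given_a * p_a / p_b         # Bayes: P(A|B) = P(B|A) * P(A) / P(B)
print(p_a_given_b)                            # 1/2, same as counting A ∩ B inside B directly
```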

d) Conditional Independence: A and B are said to be conditionally independent given C, if and only if

P(A ∩ B | C) = P(A|C) * P(B|C)

i.e., once C is known, A and B behave as independent events.

Ex: A = person p1 goes home late,

B = person p2 goes home late,

C = there is a storm in the city.

First, consider the probability that both go home late, assuming C has occurred. Since the storm affects each person independently,

P(both go home late | it's stormy) = P(A|C) * P(B|C)

Therefore, A and B are independent given the condition C (i.e., conditionally independent).

Now suppose the event C = "there is a storm in the city, and p1, p2 live in the same locality and use the same transport".

Then A and B are no longer conditionally independent given C, because A and B are now dependent: knowing that p1 is late tells us something about p2, since they share the same locality and transport.
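As a quick numeric illustration (the 0.7 and 0.6 values below are made-up placeholders, not from this post), conditional independence means the joint conditional probability is just a product:

```python
p_a_given_c = 0.7   # P(p1 goes home late | storm) -- hypothetical value
p_b_given_c = 0.6   # P(p2 goes home late | storm) -- hypothetical value

# Conditional independence given C: multiply the individual conditionals
p_both_given_c = p_a_given_c * p_b_given_c
print(round(p_both_given_c, 2))   # 0.42

# If p1 and p2 shared a locality and transport, P(A ∩ B | C) would have to be
# estimated jointly, because this product rule no longer applies.
```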

e) Conditional independence on multiple events:

This result is going to be crucial when we work through the derivation of the Naive Bayes algorithm.

Now, let's see how to extend conditional independence to multiple events A, B, C, D, E, F, … given some condition.

Assume some independent events A, B, C, D, E, F, and suppose the event 'Z' has already occurred. As per the definition of conditional independence, the joint conditional probability factorizes as:

P(A ∩ B ∩ … ∩ F | Z) = P(A|Z) * P(B|Z) * … * P(F|Z)
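The same factorization in code, with arbitrary placeholder probabilities for each event given Z (again my own illustrative numbers):

```python
from math import prod

# Hypothetical conditional probabilities P(event | Z)
p_given_z = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6, "E": 0.5, "F": 0.4}

# P(A, B, ..., F | Z) = product of the individual conditionals
p_joint_given_z = prod(p_given_z.values())
print(round(p_joint_given_z, 4))   # 0.0605
```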

Finally! Using these concepts, we can work out the derivation of the Naive Bayes algorithm. Let's apply this to our training dataset.

You are ready to jump now….. ;)

2) Naive Bayes Algorithm:

In Machine learning “Naive Bayes classifiers” are a family of simple probabilistic classifiers based on applying Bayes theorem with strong (naive) independence assumptions between the features.

All Naive Bayes classifiers assume that the value of one feature is independent of the value of any other feature, given a condition (i.e. the class label).
For instance, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

Since it assumes independence between the features, the Naive Bayes model serves as a benchmark model in most cases.
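As a small taste of how this looks in practice, here is a hypothetical sklearn sketch of the fruit idea above, using CategoricalNB on a tiny invented dataset (the rows and integer encodings are my own illustration, not real data):

```python
from sklearn.naive_bayes import CategoricalNB

# Features: [color, shape, size], integer-encoded
# color: 0=red, 1=green; shape: 0=round, 1=oblong; size: 0=small, 1=medium
X = [[0, 0, 1],   # red, round, medium    -> apple
     [1, 0, 1],   # green, round, medium  -> apple
     [1, 1, 0],   # green, oblong, small  -> other
     [0, 1, 0]]   # red, oblong, small    -> other
y = ["apple", "apple", "other", "other"]

clf = CategoricalNB(alpha=1.0)   # alpha is the Laplace smoothing parameter (section 5)
clf.fit(X, y)
print(clf.predict([[0, 0, 1]]))  # -> ['apple']
```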

If you're not interested in the derivation part, move directly to the 'Example' section just below the Derivation part.

3) Derivation:

Assume that we have training data of 'n' observations with 4 features f1–f4, which are discrete variables, and one class label which has 'k' classes (here yes/no, so k = 2).

structure of dataset

If a new observation comes from the test data, let that query point be Xq = <xq1, xq2, xq3, xq4>, with its class label Yq unknown. The task is to predict the class label Yq (yes/no) for the given Xq. Each component of Xq is the value of the corresponding feature.

Calculate the probability of the class label Yk (yes/no) given all the features:

P(Yk | xq1, xq2, xq3, xq4) = P(xq1, xq2, xq3, xq4 | Yk) * P(Yk) / P(xq1, xq2, xq3, xq4)

For both class labels (yes and no), the denominator is constant. The numerator is nothing but the joint probability:

P(xq1, xq2, xq3, xq4, Yk) = P(xq1, xq2, xq3, xq4 | Yk) * P(Yk)

Now, solve the probability of all features given the class label Yk using their conditional independence:

P(xq1, xq2, xq3, xq4 | Yk) = P(xq1|Yk) * P(xq2|Yk) * P(xq3|Yk) * P(xq4|Yk)

The class label Yq is the one of the 'k' classes (yes/no) whose product P(Yk) * P(xq1|Yk) * P(xq2|Yk) * P(xq3|Yk) * P(xq4|Yk) has the maximum value.

If the product for "yes" is higher than the product for "no", then Yq is classified as "yes", and vice versa.
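Structurally, the decision rule is just "prior times product of likelihoods, take the argmax". A sketch with placeholder probabilities (the numbers are illustrative, not taken from any real dataset):

```python
from math import prod

prior = {"yes": 0.6, "no": 0.4}          # P(Yk), hypothetical values
likelihood = {                           # P(xqi | Yk), one value per feature of Xq, hypothetical
    "yes": [0.2, 0.3, 0.7, 0.3],
    "no":  [0.5, 0.4, 0.2, 0.6],
}

# Score each class by P(Yk) * product of the feature likelihoods, then take the argmax
score = {k: prior[k] * prod(likelihood[k]) for k in prior}
y_q = max(score, key=score.get)
print(score, "->", y_q)
```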

Note: If any of the features is continuous, then its class-conditional likelihood is obtained from a Gaussian PDF instead of counts.
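For a continuous feature, a minimal sketch of that Gaussian likelihood looks like this (the mean and standard deviation below are made-up values for illustration):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

# e.g. likelihood of a continuous 'age' value of 25 under one class,
# if that class's ages have mean 30 and standard deviation 8 (assumed numbers)
print(gaussian_pdf(25, mean=30, std=8))
```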

Let's put a big full stop to the derivation here and start with an example.

4) Example:

The 4 features <f1, f2, f3, f4> are <'age', 'income', 'student', 'credit_rating'>.

If we have a query point from the test data, Xq = <youth, high, yes, excellent>, predict whether this person buys a computer or not.

Step 1: Calculate the posterior probability for buys_computer = yes, given Xq = <age=youth, income=high, student=yes, credit_rating=excellent>.

P(buys_computer = yes | Xq) ∝ P(yes) * P(youth | yes) * P(high | yes) * P(student=yes | yes) * P(excellent | yes)

From the counts in the training data, the prior and likelihoods are: P(yes) = 9/14, P(youth | yes) = 2/9, P(high | yes) = 2/9, P(student=yes | yes) = 6/9, P(excellent | yes) = 3/9.

Multiplying all these likelihoods, the posterior of buys_computer = yes is (9/14) * (2/9) * (2/9) * (6/9) * (3/9) ≈ 0.0071.

Step 2: Similarly, find the posterior for buys_computer = no. Here P(no) = 5/14, P(youth | no) = 3/5, P(high | no) = 2/5, P(student=yes | no) = 1/5, P(excellent | no) = 3/5.

Multiplying all these likelihoods, the posterior of buys_computer = no is (5/14) * (3/5) * (2/5) * (1/5) * (3/5) ≈ 0.0102.

Therefore, max(0.0071, 0.0102) = 0.0102, so the predicted class label is "no": this person does not buy a computer.

Note: The probability of the evidence (i.e., the denominator P(Xq)) is neglected, since it is constant throughout the calculations for both class labels.
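As a quick sanity check, the same arithmetic as a few lines of Python, reusing the class-conditional counts shown above:

```python
from math import prod

# P(xi | yes) for <youth, high, student=yes, excellent>, times the prior P(yes)
posterior_yes = (9/14) * prod([2/9, 2/9, 6/9, 3/9])
# P(xi | no) for the same query, times the prior P(no)
posterior_no = (5/14) * prod([3/5, 2/5, 1/5, 3/5])

print(posterior_yes, posterior_no)   # ≈ 0.00705 and ≈ 0.01029
print("buys_computer =", "yes" if posterior_yes > posterior_no else "no")   # no
```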

5) Laplace (or) Additive Smoothing:

This is introduced to solve the problem of zero probability: if the query point contains a value that has never been seen in the training data, its likelihood becomes 0 and wipes out the whole product.

Let's deal with this using the same dataset, but with a different query point Xq:

Xq = <kid, high, yes, excellent>

Here age = 'kid' never appears in the training data, so the plain count-based estimate of P(kid | yes) would be 0. With additive smoothing, each likelihood is instead estimated as

P(xi | y) = (count(xi, y) + α) / (count(y) + α * d)

where d is the number of distinct values the feature can take and α is the smoothing (noise) hyperparameter.
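A small sketch of the smoothed estimate, assuming (for illustration only) that 'age' has d = 4 possible values once 'kid' is included:

```python
def smoothed_likelihood(count_xi_y, count_y, d, alpha=1.0):
    """Additive (Laplace) smoothing: (count + alpha) / (class count + alpha * d)."""
    return (count_xi_y + alpha) / (count_y + alpha * d)

# P(age = kid | yes): 'kid' never occurs among the 9 'yes' rows, d = 4 age values
print(smoothed_likelihood(0, 9, d=4, alpha=1.0))   # 1/13 ≈ 0.077 instead of 0
print(smoothed_likelihood(0, 9, d=4, alpha=10.0))  # ≈ 0.204, pushed toward the uniform 1/4
```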

Precautions:

'α' should not be too high or too low; it should be chosen properly by taking the "bias-variance" trade-off into consideration.

'α' should not disturb the (roughly uniform) probabilities that are assigned to unknown data/new observations.

Finding Optimal ‘α’:

Using an elbow plot: plot a performance metric against the 'α' hyperparameter and pick the value where the curve flattens out.
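A sketch of that idea with sklearn (the tiny random dataset below is only a placeholder so the snippet runs on its own; in practice you'd use your real training data): score the model for several α values and look for the point where the metric stops improving.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))        # 4 categorical features with 3 levels each
y = (X[:, 0] + X[:, 1] > 2).astype(int)      # a simple rule so y has some signal

for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    score = cross_val_score(CategoricalNB(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:<7} mean CV accuracy={score:.3f}")
```

Plotting these scores against α gives the elbow plot described above.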

Disadvantage:

One of the biggest drawbacks is that, due to its assumption of conditional independence, it neglects the correlation between features.

*********************** THE END *************************

There are different variants of Naive Bayes in sklearn, such as BernoulliNB and MultinomialNB (for discrete variables) and GaussianNB (for continuous variables), with small differences in usage that are explained in sklearn's documentation.
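For example, a minimal GaussianNB run on a built-in continuous dataset looks like this (a quick sketch, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))   # test accuracy
```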

