# Conditional Probability | Bayes Theorem | Naïve Bayes Classifier

Have you ever wondered:

How suspicious or marketing emails are automatically sent to the junk/spam folder instead of your primary inbox?

How we can predict whether it will rain today?

If you have had these questions, read on: this article gives a fair idea of how such predictions are made, which is useful in day-to-day life.

In real life, we often want to predict whether an event will happen based on certain conditions and prior knowledge. For example, I am a cricket fan: given today's weather conditions, such as temperature and humidity, and prior knowledge of how often it rains in such conditions, I would like to calculate the probability that the match will go ahead, and place my bets accordingly :D

In this article I will try to demystify how a Naive Bayes classifier works. We will proceed as follows:

• Basic terminologies, events and probabilities
• Bayes theorem, basics and formula derivation
• Naive Bayes Classification with examples
• Merits, Demerits and assumptions of using Naive Bayes method

## Basic Definitions and terminology:

• Independent events: Events in a sequence are independent if the outcome of the first event does not affect the probability of the second.
• For example: if we roll a die 3 times, the probability of getting three 6's in a row is 1/6 * 1/6 * 1/6 = 1/216. The first roll does not affect the probability of getting a 6 in later rolls.
• Dependent events: If the outcome of one event changes the probability of a second event, we call them dependent events.
• For example: if we draw four cards at random without replacement from a deck of 52 cards, the probability of getting four queens in a row is 4/52 * 3/51 * 2/50 * 1/49. The probability of drawing a queen drops from 4/52 to 3/51 after the first draw, because one card, a queen, has already been removed. By the fourth draw it is down to 1/49.
• Conditional probability: The probability of an event A given that another event B has already taken place.
• For example: if we draw 2 cards one after the other without replacement from a deck, the probability that the second card is a queen, given that the first card was a queen, is 3/51.
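The three examples above can be checked with exact fractions. This is a minimal sketch using Python's standard `fractions` module; the variable names are illustrative only.

```python
from fractions import Fraction

# Independent events: three 6's in a row with a fair six-sided die.
p_three_sixes = Fraction(1, 6) ** 3

# Dependent events: four queens in a row, drawing without replacement.
# After each queen is drawn, both the queens left and the deck size shrink by one.
p_four_queens = Fraction(1)
for drawn in range(4):
    p_four_queens *= Fraction(4 - drawn, 52 - drawn)

# Conditional probability: queen on the second draw given a queen on the first.
# One queen is gone, so 3 queens remain among 51 cards.
p_queen_given_queen = Fraction(3, 51)

print(p_three_sixes)        # 1/216
print(p_four_queens)        # 1/270725
print(p_queen_given_queen)  # 1/17
```

Using `Fraction` instead of floats keeps the results exact, which makes it easier to compare them with hand calculations.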

# Equation of Conditional Probability:
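For two events A and B with P(B) > 0, the standard definition of conditional probability can be written as follows (A and B are generic placeholder events, not taken from the example that follows):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Rearranged, this gives the multiplication rule $P(A \cap B) = P(B)\,P(A \mid B)$, which is the form used when the probabilities of successive draws are multiplied together.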

Let’s go through an example to have a clear picture.

Problem:

A purse contains 4, 5 and 3 coins of denominations Rs. 2, Rs. 5 and Rs. 10 respectively. Two coins are drawn one after the other, without replacing the first coin. Find P(drawing Rs. 2, then Rs. 2).

Solution:

There are four Rs. 2 coins and 12 coins in total, so

P(Rs. 2 coin in first draw) = 4/12, i.e. 1/3

The result of the first draw affects the probability of the second draw: after a Rs. 2 coin is removed, 3 Rs. 2 coins remain out of 11 coins in total.

P(Rs. 2 coin in second draw, given Rs. 2 coin in first draw) = 3/11

Finally, P(drawing Rs. 2, then Rs. 2) = (4/12) * (3/11) = 1/11
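The same calculation can be sketched in code. The coin counts come from the problem statement; the helper name `p_two_draws` is hypothetical and only meant to illustrate drawing without replacement.

```python
from fractions import Fraction

# Coin counts taken from the problem statement: denomination (Rs.) -> count.
purse = {2: 4, 5: 5, 10: 3}

def p_two_draws(purse, first, second):
    """P(draw `first`, then `second`) when the first coin is not replaced."""
    total = sum(purse.values())
    p_first = Fraction(purse[first], total)
    # Remove the first coin before computing the second draw.
    remaining = dict(purse)
    remaining[first] -= 1
    p_second_given_first = Fraction(remaining[second], total - 1)
    return p_first * p_second_given_first

print(p_two_draws(purse, 2, 2))  # 1/11
```

The function multiplies the probability of the first draw by the conditional probability of the second, so it also works for mixed pairs such as Rs. 2 followed by Rs. 5.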

# Bayes Theorem Basics and Derivation:

• In probability theory, Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event.
• Derivation: Bayes' theorem follows from the conditional probability equation. Write P(A and B) once in terms of P(A | B) (equation 1) and once in terms of P(B | A) (equation 2), then equate the two expressions.
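A sketch of that derivation, using generic events A and B with P(A) > 0 and P(B) > 0. The numbering (1) and (2) is assumed to correspond to the two equations mentioned in the bullet above:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \;\Rightarrow\; P(A \cap B) = P(A \mid B)\,P(B) \tag{1}$$

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \;\Rightarrow\; P(A \cap B) = P(B \mid A)\,P(A) \tag{2}$$

Equating the right-hand sides of (1) and (2) and dividing by P(B) gives Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$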

# Naïve Bayes Classifier:

In a classification problem, we need to predict the class y given a feature vector X (X = [x1, x2, x3, x4, …]). Mathematically, we want the probability of y given that X has been observed, P(y | X).

We assume here that the features x1, x2, x3, … are all independent of each other. This is the "naive" assumption.

As discussed above, independent events are events that do not affect one another. Mathematically, the joint probability of independent events is the product of their individual probabilities: P(A and B) = P(A) * P(B).

Now we will derive the Naive Bayes classifier equation:
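One common way to write this derivation, sketched here under the independence assumption stated above. The hat notation $\hat{y}$ for the predicted class is an assumption of this sketch, not notation from the article.

Start from Bayes' theorem with the class y and the features $x_1, \dots, x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid y)\,P(y)}{P(x_1, \dots, x_n)}$$

Treating the features as independent lets both joint probabilities be written as products:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1)\,P(x_2) \cdots P(x_n)}$$

The denominator does not depend on y, so it is the same constant for every class and can be ignored when comparing classes:

$$\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$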

We calculate this probability for every class of Y, and the class with the highest probability is returned as the final class.

Result = argmax over Yi of P(Yi | x1, x2, x3, …, xn). For example, if Y has 2 classes, 0 and 1, we calculate P[Y=1 | x1, x2, x3, …] and P[Y=0 | x1, x2, x3, …].

If P[Y=1 | x1, x2, x3, …] > P[Y=0 | x1, x2, x3, …], class 1 is returned; otherwise class 0 is returned.
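The decision rule above amounts to picking the key with the largest value. A minimal sketch, where the function name `predict_class` and the example scores are hypothetical:

```python
def predict_class(scores):
    """Return the class label with the highest score.

    `scores` maps each class label to its (possibly unnormalized) posterior.
    """
    return max(scores, key=scores.get)

# Hypothetical scores for a two-class problem, for illustration only.
scores = {1: 0.7, 0: 0.3}
print(predict_class(scores))  # 1
```

Because only the ordering of the scores matters, they do not need to be normalized to sum to 1 before the comparison.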

## Implementation of Naïve Bayes through example:

Suppose we are given historical data for the variables outlook, temperature, humidity and wind, and we need to find out whether a cricket match will be played. Today's scenario is (Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=False). Will we have a match today or not?

Given data:

Step 1: Create the feature matrix and separate out the response variable.

Step 2: Create frequency tables for all features to find P(Feature | Yes) and P(Feature | No). These give the probability of each feature value given that the match was played or not played. For example:

P(Sunny | Played=Yes) will be

(number of days with Played=Yes and Outlook=Sunny) / (total number of days with Played=Yes), i.e. 2/9
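Step 2 can be sketched as a small counting routine. The rows below are hypothetical toy data made up for illustration, NOT the article's data set, so the resulting fractions differ from the article's numbers. The function name `likelihoods` and the column name `Played` are also assumptions.

```python
from collections import Counter
from fractions import Fraction

def likelihoods(rows, feature, label="Played"):
    """Estimate P(feature value | label value) from a list of dict rows."""
    label_counts = Counter(row[label] for row in rows)
    pair_counts = Counter((row[feature], row[label]) for row in rows)
    return {
        (value, cls): Fraction(count, label_counts[cls])
        for (value, cls), count in pair_counts.items()
    }

# Hypothetical toy rows for illustration; NOT the article's data set.
toy_rows = [
    {"Outlook": "Sunny", "Played": "Yes"},
    {"Outlook": "Sunny", "Played": "No"},
    {"Outlook": "Rainy", "Played": "Yes"},
    {"Outlook": "Overcast", "Played": "Yes"},
    {"Outlook": "Rainy", "Played": "No"},
]

table = likelihoods(toy_rows, "Outlook")
print(table[("Sunny", "Yes")])  # 1/3
```

Combinations that never occur in the rows are simply absent from the returned dictionary, which corresponds to an estimated probability of zero.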

Step 3: Find P(Y), the prior probability of each class. P(Played=Yes) = (number of days the match was played) / (total number of days) = 9/14, and P(Played=No) = 5/14.

From the Naïve Bayes formula, calculate the probability of playing the game:

P(Yes | Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=False)

= [P(Sunny | Yes) * P(Hot | Yes) * P(Normal | Yes) * P(False | Yes) * P(Yes)] / [P(Sunny) * P(Hot) * P(Normal) * P(False)]

= [(2/9 * 2/9 * 6/9 * 6/9) * (9/14)] / [P(Sunny) * P(Hot) * P(Normal) * P(False)]

= 144 / (10206 * Constant), where Constant = P(Sunny) * P(Hot) * P(Normal) * P(False)

P[Yes] ≈ 0.0141 / Constant

Calculate the probability of not playing the game:

P(No | Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=False)

= [(2/5 * 2/5 * 1/5 * 2/5) * (5/14)] / [P(Sunny) * P(Hot) * P(Normal) * P(False)]

= 40 / (8750 * Constant)

P[No] ≈ 0.0046 / Constant

The Constant is the same in both expressions, so it does not affect the comparison. Here P[Yes] > P[No], so the final predicted class is Y=Yes.
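The comparison above can be reproduced in a few lines. This is a sketch that copies the likelihoods and priors from the worked calculation; the variable names are illustrative, and the final normalization step is an addition of this sketch, not part of the article's calculation.

```python
from fractions import Fraction as F

# Likelihoods and priors copied from the worked calculation above.
# Order of factors: Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=False.
likelihood = {
    "Yes": [F(2, 9), F(2, 9), F(6, 9), F(6, 9)],
    "No": [F(2, 5), F(2, 5), F(1, 5), F(2, 5)],
}
prior = {"Yes": F(9, 14), "No": F(5, 14)}

def score(cls):
    """Numerator of the Naive Bayes formula: prior times product of likelihoods."""
    result = prior[cls]
    for p in likelihood[cls]:
        result *= p
    return result

scores = {cls: score(cls) for cls in prior}
prediction = max(scores, key=scores.get)

# Assumption of this sketch: dividing each score by the sum of all scores
# rescales them so they add up to 1 without changing which class wins.
total = sum(scores.values())
normalized = {cls: float(s / total) for cls, s in scores.items()}

print({cls: round(float(s), 4) for cls, s in scores.items()})  # {'Yes': 0.0141, 'No': 0.0046}
print(prediction)  # Yes
```

The shared denominator never needs to be computed, because it is identical for both classes and drops out of the comparison.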

Assumptions of Naïve Bayes:

• All features carry equal weight
• All the features are independent of each other

Merits of Naïve Bayes:

• Easy to implement
• Fast, and needs relatively little data to train
• If the assumptions hold, performance is good

Demerits of Naïve Bayes:

• The main limitation of Naive Bayes is the assumption of independent predictors.
• If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a probability of 0 (zero). That zero wipes out the whole product, so the model cannot make a meaningful prediction.

Written by

## Hitesh Kumar

#### Data Scientist @ OLX Group || Work experience of 6 yrs || Responsible for producing insights from data to impact business KPIs 