What Everybody Ought To Know About the Naive Bayes Theorem

JEEVANSINGH CHAUHAN
10 min read · Aug 10, 2020


Naive Bayes Algorithm For Machine Learning

Naive Bayes is a classification algorithm in the supervised learning family that is based on probabilistic logic. It is one of the simplest machine learning algorithms of all. Logistic regression, by contrast, is a classification algorithm that models the posterior probability directly by learning an input-to-output mapping, which makes it a discriminative model. In this post, we will cover:

  • Conditional Probability
  • Independent events Vs. Mutually exclusive events
  • Bayes theorem with example
  • Naive Bayes Algorithm
  • Laplace smoothing
  • Implementation of Naive Bayes Algorithm with Sci-kit learn
  • Pros & Cons
  • Summary

Conditional Probability

Conditional Probability is a measure of the probability of an event occurring given that another event has occurred.

Suppose Ramesh wants to play cricket, but he also has work to do. There are two events: the first is that Ramesh plays cricket, P(A), and the second is that he has work to do, P(B). The conditional probability that he plays cricket given that he already has work is written P(A|B).

As another example, consider drawing two cards from a deck.

Event A is drawing a Queen first, and Event B is drawing a Queen second.

For the first card the chance of drawing a Queen is 4 out of 52 (there are 4 Queens in a deck of 52 cards):

P(A) = 4/52

But after removing a Queen from the deck, the second card drawn is less likely to be a Queen (only 3 of the 51 remaining cards are Queens):

P(B|A) = 3/51

And so:

P(A⋂B) = P(A) x P(B|A) = (4/52) x (3/51) = 12/2652 = 1/221

So the chance of getting 2 Queens is 1 in 221, or about 0.45%.
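
As a quick sanity check, here is the same arithmetic in Python (a minimal sketch using only the standard library):

```python
from fractions import Fraction

# P(A): the first card drawn is a Queen
p_a = Fraction(4, 52)

# P(B|A): the second card is a Queen, given the first was a Queen
p_b_given_a = Fraction(3, 51)

# P(A and B) = P(A) * P(B|A)
p_both = p_a * p_b_given_a
print(p_both, float(p_both))  # 1/221, roughly 0.0045
```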

Independent events Vs. Mutually exclusive events

Events A and B are said to be independent if and only if they occur without affecting each other. Both events can occur simultaneously, and conditioning on one does not change the probability of the other:

P(A|B) = P(A)

P(B|A)=P(B)

Suppose there are two dice, D1 and D2.

P(D1 = 6 | D2 = 3) = P(D1 =6)

Then there is no relationship between the occurrences of the two events; they are independent of each other.

Getting a 3 on D2 has no impact on getting a 6 on D1.
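
A quick simulation makes this concrete. The sketch below estimates both probabilities from random rolls; with enough rolls, both come out close to 1/6:

```python
import random

random.seed(0)
rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

# P(D1 = 6), estimated over all rolls
p_d1_six = sum(d1 == 6 for d1, _ in rolls) / len(rolls)

# P(D1 = 6 | D2 = 3), estimated over only the rolls where D2 = 3
given_d2_three = [d1 for d1, d2 in rolls if d2 == 3]
p_d1_six_given = sum(d1 == 6 for d1 in given_d2_three) / len(given_d2_three)

print(p_d1_six, p_d1_six_given)  # both approximately 0.167
```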

Two events are mutually exclusive or disjoint if they cannot both occur at the same time.

Suppose the event is getting a 3 on D1. On the same roll, D1 cannot also show a 5, because we already got a 3 on D1.

P(A|B)=P(B|A)=0

P(A ∩ B) = P(B ∩ A) = 0

It means that the two events cannot occur simultaneously.

Bayes Theorem with example

Bayes' theorem states:

P(A|B) = P(B|A) P(A) / P(B)

It follows directly from the definition of conditional probability: P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A), and dividing both sides by P(B) gives the equation above.

Suppose we have 3 machines A1, A2, A3 that produce an item, each with a given production probability and defective rate.

P(A1) = 0.2

P(A2) = 0.3

P(A3) = 0.5

Let B represent the event that an item is defective.

P(B|A1) = 0.05

P(B|A2) = 0.03

P(B|A3) = 0.01

If an item is chosen at random, what is the probability that it was produced by machine A3, given that the item is defective, P(A3|B)?

P(B) = P (B|A1) P(A1) + P (B|A2) P(A2) +P (B|A3) P(A3)

P(B) = (0.05) (0.2) + (0.03) (0.3) + (0.01) (0.5)

P(B) = 0.024

So 2.4% of the factory's total output is defective.

P(A3|B) = P(B|A3) P(A3)/P(B)

= (0.01) (0.50) / 0.024

= 5/24

= 0.21
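
The same calculation in Python (a minimal sketch of the law of total probability followed by Bayes' theorem):

```python
priors = {"A1": 0.2, "A2": 0.3, "A3": 0.5}           # P(Ai): production share of each machine
defect_rates = {"A1": 0.05, "A2": 0.03, "A3": 0.01}  # P(B|Ai): defective rate of each machine

# Law of total probability: P(B) = sum over i of P(B|Ai) * P(Ai)
p_b = sum(defect_rates[m] * priors[m] for m in priors)

# Bayes' theorem: P(A3|B) = P(B|A3) * P(A3) / P(B)
p_a3_given_b = defect_rates["A3"] * priors["A3"] / p_b

print(round(p_b, 3))           # 0.024
print(round(p_a3_given_b, 2))  # 0.21
```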

Naive Bayes Algorithm

Given a feature vector x = (x1, x2, …, xn) of n features (the independent variables) and a set of classes Ck, we want the posterior probability of each class:

P(Ck | x1, x2, …, xn)

By Bayes' theorem, this can be written as

P(Ck | x) = P(x | Ck) · P(Ck) / P(x)

Suppose we are solving a binary classification problem. Then we have to find the two conditional probabilities P(C1|x) and P(C2|x), and whichever of the two is higher we take as the maximum posterior, i.e. our prediction. Since the denominator P(x) does not depend on the class, we only need the numerator, which is the joint probability:

P(Ck, x1, x2, …, xn) = P(x1, x2, …, xn, Ck)

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

P(x1, x2, …, xn, Ck) = P(x1 | x2, …, xn, Ck) · P(x2, …, xn, Ck)

= P(x1 | x2, …, xn, Ck) · P(x2 | x3, …, xn, Ck) · P(x3, …, xn, Ck)

= …

= P(x1 | x2, …, xn, Ck) · P(x2 | x3, …, xn, Ck) ⋯ P(xn-1 | xn, Ck) · P(xn | Ck) · P(Ck)

Assume that all features in x are mutually independent, conditional on the class Ck:

P(xi | xi+1, …, xn, Ck) = P(xi | Ck)

Thus, the joint model can be expressed as

P(Ck | x1, …, xn) = (1/Z) · P(Ck) · ∏i P(xi | Ck)

where Z = P(x) is a scaling factor dependent only on x1, …, xn, that is, a constant if the values of the feature variables are known.

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori, or MAP, decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns the class label with the highest posterior probability.
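
A minimal sketch of the MAP decision rule in Python. The function and argument names here are my own illustration, not a standard API: each class has a prior and, for each feature, a function returning P(xi | Ck). Log-probabilities are used so the product of many small numbers does not underflow.

```python
import math

def naive_bayes_predict(x, priors, likelihoods):
    """Return the class Ck maximising P(Ck) * prod_i P(xi | Ck).

    x           -- sequence of feature values (x1, ..., xn)
    priors      -- dict mapping class -> P(Ck)
    likelihoods -- dict mapping class -> list of functions, one per feature,
                   each taking xi and returning P(xi | Ck)
    """
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for xi, p_xi_given_c in zip(x, likelihoods[c]):
            log_score += math.log(p_xi_given_c(xi))
        scores[c] = log_score
    # The normalising constant Z is the same for every class, so it can be ignored.
    return max(scores, key=scores.get)
```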

Example

Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These are the 3 possible classes of the Y variable.

We have data for the following X variables, all of which are binary (1 or 0).

  • Long
  • Sweet
  • Yellow

The first few rows of the training dataset contain one row per fruit, with the three binary feature columns (Long, Sweet, Yellow) and the fruit label as the Y variable.

For the sake of computing the probabilities, let's aggregate the training data into a counts table: how many fruits of each class there are in total, and how many of them are long, sweet and yellow. The counts used in the rest of this example are 500 Bananas (400 long, 350 sweet, 450 yellow), 300 Oranges and 200 Others, for 1000 fruits in total, of which 500 are long, 650 are sweet and 800 are yellow.

So the objective of the classifier is to predict if a given fruit is a ‘Banana’ or ‘Orange’ or ‘Other’ when only the 3 features (long, sweet and yellow) are known.

Let's say you are given a fruit that is long, sweet and yellow. Can you predict what fruit it is?

This is the same as predicting Y when only the X variables in the test data are known. Let's solve it by hand using Naive Bayes.

The idea is to compute the 3 probabilities, that is the probability of the fruit being a banana, orange or other. Whichever fruit type gets the highest probability wins.

All the information needed to calculate these probabilities is present in the counts above.

Step 1: Compute the ‘Prior’ probabilities for each of the class of fruits.

That is, the proportion of each fruit class out of all the fruits from the population. You can provide the ‘Priors’ from prior information about the population. Otherwise, it can be computed from the training data.

For this case, let’s compute from the training data. Out of 1000 records in training data, you have 500 Bananas, 300 Oranges and 200 Others. So the respective priors are 0.5, 0.3 and 0.2.

P(Y=Banana) = 500 / 1000 = 0.50

P(Y=Orange) = 300 / 1000 = 0.30

P(Y=Other) = 200 / 1000 = 0.20

Step 2: Compute the probability of evidence that goes in the denominator.

This is nothing but the product of the probabilities of the individual X features. This is an optional step, because the denominator is the same for all the classes and so does not affect which class wins.

P(x1=Long) = 500 / 1000 = 0.50

P(x2=Sweet) = 650 / 1000 = 0.65

P(x3=Yellow) = 800 / 1000 = 0.80

Step 3: Compute the likelihood of the evidence that goes in the numerator.

It is the product of the conditional probabilities of the 3 features. If you refer back to the formula, it says P(X1 | Y=k). Here X1 is 'Long' and k is 'Banana', i.e. the probability that the fruit is long given that it is a Banana. In the counts above, you have 500 Bananas, and 400 of them are long. So P(Long | Banana) = 400/500 = 0.8.

Here, I have done it for Banana alone.

Probability of Likelihood for Banana

P(x1=Long | Y=Banana) = 400 / 500 = 0.80

P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70

P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

So, the overall probability of Likelihood of evidence for Banana = 0.8 * 0.7 * 0.9 = 0.504

Step 4: Substitute all 3 pieces into the Naive Bayes formula to get the probability that the fruit is a banana:

P(Banana | Long, Sweet, Yellow) = P(Long | Banana) · P(Sweet | Banana) · P(Yellow | Banana) · P(Banana) / [P(Long) · P(Sweet) · P(Yellow)] = 0.504 · 0.5 / (0.5 · 0.65 · 0.8) ≈ 0.97

Similarly, you can compute the probabilities for ‘Orange’ and ‘Other fruit’. The denominator is the same for all 3 cases, so it’s optional to compute.

Clearly, Banana gets the highest probability, so that will be our predicted class.
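
As a quick check, here is the Banana computation in plain Python, using only the counts stated above (the Orange and Other posteriors are computed the same way from their rows of the counts table):

```python
# Prior: 500 of the 1000 fruits are bananas
prior_banana = 500 / 1000

# Likelihood of the evidence given Banana: P(Long|B) * P(Sweet|B) * P(Yellow|B)
likelihood_banana = (400 / 500) * (350 / 500) * (450 / 500)   # 0.8 * 0.7 * 0.9 = 0.504

# Probability of the evidence: P(Long) * P(Sweet) * P(Yellow)
evidence = (500 / 1000) * (650 / 1000) * (800 / 1000)         # 0.5 * 0.65 * 0.8 = 0.26

p_banana = likelihood_banana * prior_banana / evidence
print(round(p_banana, 3))  # approximately 0.969
```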

Laplace Smoothing

In statistics, Laplace smoothing is a technique for smoothing categorical data. It is introduced to solve the zero-probability problem, and it also helps deal with overfitting. Suppose a word in the test query Textq does not appear anywhere in the training data; then we find that P(C=”Yes”|Textq) = 0 or P(C=”No”|Textq) = 0.

Textq = {w1, w2, w3, w4, W}

In the given training data we have w1, w2, w3 and w4, but we do not have W. So if we evaluate P(C=”Yes”|Textq) or P(C=”No”|Textq), we get

P(C=”Yes”|Textq) =P(C=”No”|Textq)=0…………..condition (i)

Because we have no probability value for the new word W, i.e. for P(W|C=”Yes”) and P(W|C=”No”), the whole product becomes zero. The model then effectively overfits the training data, because it can only identify and classify text made of words that appear in the training data.

If the given dataset is imbalanced, the model underfits and is biased towards the majority class. To overcome this situation in binary classification, we can use two different smoothing parameters and give more weight to the minority class to balance the dataset:

For P(C=”Yes”) we have λ1

For P(C=”No”) we have λ2 …………………… condition (ii)

To deal with condition (i) and condition (ii), we use Laplace smoothing.

By applying this method, the prior probability and the conditional probability can be written as:

P(y = ck) = (N(ck) + λ) / (N + Kλ)

P(aj = a | y = ck) = (N(ck, a) + λ) / (N(ck) + Aλ)

Here N is the number of training examples, N(ck) is the number of examples with class ck, and N(ck, a) is the number of examples of class ck whose feature aj takes the value a. K denotes the number of different values of y and A denotes the number of different values of aj. Usually λ in the formula equals 1. By applying Laplace smoothing in the same way, the prior and conditional probabilities in the previous example can be recomputed so that no estimate is exactly zero.

Here λ is a hyper-parameter used to deal with overfitting and underfitting of models.
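
A minimal sketch of the smoothed conditional probability as a Python function (the function and argument names are my own, for illustration):

```python
def smoothed_conditional(count_value_and_class, count_class, n_values, lam=1.0):
    """Laplace-smoothed estimate of P(aj = a | y = ck).

    count_value_and_class -- training rows with this class and this feature value
    count_class           -- training rows with this class
    n_values              -- number of distinct values the feature can take (A)
    lam                   -- smoothing strength lambda (1.0 is classic Laplace smoothing)
    """
    return (count_value_and_class + lam) / (count_class + lam * n_values)

# A word W never seen with class "Yes" still gets a small non-zero probability
# instead of collapsing the whole product to zero:
print(smoothed_conditional(0, 500, 2))    # 1 / 502 instead of 0
print(smoothed_conditional(350, 500, 2))  # roughly 0.699 instead of 0.70
```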

When the value of λ decreases, the model tends to overfit, because it assigns very little probability to new words or to the minority class in imbalanced data.

When the value of λ increases, the model tends to underfit.

So λ acts as a tug-of-war between overfitting and underfitting of the model.

Implementation of Naive Bayes in Sci-kit learn
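
The post's own dataset is not shown, so here is a minimal sketch using scikit-learn's GaussianNB on the built-in Iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small example dataset and hold out 30% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# GaussianNB assumes continuous features follow a normal distribution per class.
# For word-count features, MultinomialNB(alpha=1.0) applies Laplace smoothing
# with the lambda (alpha) hyper-parameter discussed above.
model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Posterior probabilities for the first test row:", model.predict_proba(X_test[:1]))
```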

Applications of Naive Bayes Algorithm

  1. Real-time Prediction: As Naive Bayes is super fast, it can be used for making predictions in real time.
  2. Multi-class Prediction: This algorithm can predict the posterior probability of multiple classes of the target variable.
  3. Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification and, thanks to their good results in multi-class problems and the independence assumption, have a higher success rate compared to many other algorithms. As a result, they are widely used in spam filtering (identifying spam email) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
  4. Recommendation System: Naive Bayes Classifier along with algorithms like Collaborative Filtering makes a Recommendation System that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Pros & Cons

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better than other models like logistic regression, and it needs less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a Gaussian (normal) distribution is assumed, which is a strong assumption.

Cons:

  • If a categorical variable has a category (in test data set), which was not observed in training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
  • On the other hand, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.

Summary

In this blog, you learned what conditional probability is and the different types of events used in Bayes' theorem;

how Bayes' theorem is applied in the Naive Bayes algorithm;

how the Naive Bayes algorithm deals with overfitting and underfitting;

how to implement the algorithm with scikit-learn;

what the applications of the Naive Bayes algorithm are;

and the pros and cons of the algorithm.
