Bite-size Machine Learning — Naive Bayes Classifier

Lujing Chen
Bite-sized Machine Learning
5 min read · Nov 19, 2018

In this blog, I will cover:

1. What is Naive Bayes?
2. How does it work?
3. How is it calculated?
4. Why is it Naive?
5. How is it calculated if we don’t know P(B|A)?
6. How to implement Naive Bayes classifiers in Python

1. What’s Naive Bayes?

At the core of the Naive Bayes algorithm is Bayes’ Theorem, which predicts the probability of an event happening given some known probabilities and some observations.
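In its standard form, Bayes’ Theorem reads:

P(A|B) = P(B|A) * P(A) / P(B)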

2. How does it work?

Let’s use some examples to illustrate the formula in different scenarios. For example:

  1. You are a fireman, and you want to predict the chance of a fire occurring when you get a call reporting smoke
  2. Or you are a botanist, and you want to predict the chance that a flower is a Jasmine when you measure its petal width/length
  3. Or you are a cybersecurity professional, and you want to predict the chance of a cyber attack when you observe high network utilization

The smoke, the petal width/length, and the high network utilization are all the ‘B’ in our formula: the known evidence/features.

The fire occurrence, the flower type, and the cyber attack (or not) are all the ‘A’ in our formula: the event we are interested in predicting.

Those three ‘when’s are the ‘|’ in the formula: the conditional term, meaning given what we have observed so far, how big is the chance that the event will happen?

3. How does it get calculated?

After gaining some intuition about Bayes’ probability theorem, let’s dive into how we can calculate the probability we are interested in — P(A|B) — given perfect or imperfect information. We will use the fire and smoke example to demonstrate the calculation. Recall: you are a fireman, and you want to predict the chance of a fire occurring when you get a call reporting smoke — P(fire|smoke).

Before diving in, here are four formulas we will reuse heavily:
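In plain notation (writing ¬A for “not A”), they are:

1. Bayes’ Theorem: P(A|B) = P(B|A) * P(A) / P(B)
2. Conditional probability: P(A,B) = P(B|A) * P(A)
3. Law of total probability: P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)
4. Conditional independence (the “naive” one): P(B,C|A) = P(B|A) * P(C|A)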

Perfect Scenario: assume you know every individual component in Bayes’ Theorem.
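For instance, with made-up numbers purely for illustration — say P(fire) = 0.01, P(smoke) = 0.10, and P(smoke|fire) = 0.90 — plugging everything into Bayes’ Theorem gives:

P(fire|smoke) = P(smoke|fire) * P(fire) / P(smoke) = 0.90 * 0.01 / 0.10 = 0.09

so a smoke report implies roughly a 9% chance of an actual fire.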

Not-Perfect Scenario A: assume you do not know P(smoke).
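With the same made-up numbers, plus an assumed P(smoke|no fire) = 0.09, the law of total probability recovers the missing denominator:

P(smoke) = P(smoke|fire) * P(fire) + P(smoke|no fire) * P(no fire) = 0.90 * 0.01 + 0.09 * 0.99 ≈ 0.098

and then P(fire|smoke) = 0.009 / 0.098 ≈ 0.092 — essentially the same 9% as before.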

Not-Perfect Scenario B: assume we collect two features — smoke and high temperature. Oh, and you don’t know P(smoke, high temp|fire) or P(smoke, high temp).
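This is where the independence assumption (formula 4) earns its keep. With two more illustrative numbers, P(high temp|fire) = 0.80 and P(high temp|no fire) = 0.20, the joint likelihood factorizes feature by feature:

P(smoke, high temp|fire) = P(smoke|fire) * P(high temp|fire) = 0.90 * 0.80 = 0.72
P(smoke, high temp|no fire) = 0.09 * 0.20 = 0.018

The denominator again comes from total probability:

P(smoke, high temp) = 0.72 * 0.01 + 0.018 * 0.99 ≈ 0.025

so P(fire|smoke, high temp) = 0.72 * 0.01 / 0.025 ≈ 0.29 — two pieces of evidence push the fire probability well above the single-feature 9%.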

4. Why is Naive Bayes Naive?

Okay, okay, enough with the MATH! 🙄

I hope you got a taste of leveraging known probabilities to derive the probability you are interested in. Now you may say — okay, this method seems reasonable, so why is it ‘Naive’?

Remember the last of those four formulas above, which is leveraged when we have more than one feature: P(B,C|A) = P(B|A) * P(C|A)

This formula holds only when feature B and feature C are independent given A, which is almost never true in practice.

In other words, as soon as the features increase to two or more, Naive Bayes naively assumes the features are independent of each other.

5. How does it get calculated if we just don’t know P(B|A)?

Furthermore, you may have realized that the bulk of the calculation usually falls on P(B|A). In reality, we don’t always get a direct answer for the value of P(B|A). What do we do? WE MAKE ASSUMPTIONS!

Three common assumptions about the distribution of P(B|A):

For continuous features (e.g. 1.234, 2.345, 3.672): Gaussian Naive Bayes assumes the features follow a Gaussian (normal) distribution.

For discrete count features (e.g. 1, 2, 3): Multinomial Naive Bayes assumes the features follow a multinomial distribution, like rolling a die.

For binary features (e.g. present or not): Bernoulli Naive Bayes assumes the features follow a Bernoulli distribution, like tossing a coin.
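For example, under the Gaussian assumption, the likelihood of a continuous feature value x given class A is just the normal density, with the mean and variance estimated from the training samples of that class:

P(x|A) = 1 / sqrt(2πσ²) * exp(−(x − μ)² / (2σ²))

where μ and σ² are the per-class mean and variance of that feature.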

6. Naive Bayes Implementation using Sklearn

Gaussian Naive Bayes Example
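Here is a minimal sketch of how this looks in scikit-learn — the petal measurements and labels below are made up purely for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features: [petal width, petal length] per flower
X = np.array([[0.2, 1.4], [0.2, 1.3], [1.4, 4.7], [1.5, 4.5]])
y = np.array([0, 0, 1, 1])  # made-up labels: 1 = Jasmine, 0 = not Jasmine

model = GaussianNB()
model.fit(X, y)
print(model.predict([[0.3, 1.5]]))        # -> [0]
print(model.predict_proba([[0.3, 1.5]]))  # probability for each class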

Multinomial Naive Bayes Example

Multinomial Naive Bayes is commonly used in text classification; the features here are word frequency counts. For example, we want to use the frequency of three word patterns — [“Free”, “Great offer”, “Uber”] — to predict whether an email is spam or not.

When converting a message to counts, for instance, Email1 converts to [2 (Free), 2 (Great offer), 0 (Uber)], which becomes the first row in the Text_Freq matrix below.
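Here is a minimal sketch of that pipeline — the Text_Freq counts and spam labels below are invented, with Email1’s row matching the example above:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix; columns = ["Free", "Great offer", "Uber"]
Text_Freq = np.array([
    [2, 2, 0],   # Email1 (spam)
    [3, 1, 0],   # Email2 (spam)
    [0, 0, 2],   # Email3 (not spam)
    [0, 1, 3],   # Email4 (not spam)
])
is_spam = np.array([1, 1, 0, 0])  # made-up labels: 1 = spam, 0 = not spam

model = MultinomialNB()
model.fit(Text_Freq, is_spam)
print(model.predict([[2, 1, 0]]))  # word counts of a new email -> [1]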

Bernoulli Naive Bayes Example

Bernoulli Naive Bayes is commonly used in text classification too; the features here indicate each word’s presence, regardless of its frequency.
Let’s use the same email list.
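And a minimal sketch with the same invented emails — the counts are binarized into presence flags first (BernoulliNB’s default binarize=0.0 threshold would do this for us, but making it explicit shows the idea):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same hypothetical emails, but features become 0/1 presence flags
Text_Freq = np.array([
    [2, 2, 0],   # Email1 (spam)
    [3, 1, 0],   # Email2 (spam)
    [0, 0, 2],   # Email3 (not spam)
    [0, 1, 3],   # Email4 (not spam)
])
Text_Presence = (Text_Freq > 0).astype(int)  # e.g. Email1 -> [1, 1, 0]
is_spam = np.array([1, 1, 0, 0])

model = BernoulliNB()
model.fit(Text_Presence, is_spam)
print(model.predict([[1, 1, 0]]))  # presence flags of a new email -> [1]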

Further Links:

  • If you are interested in more detail on the Bernoulli and Multinoulli distributions and the difference between them, check out the following link.
  • If you are interested in learning more about using Naive Bayes for text classification, refer to the following link.
  • Library used:
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

Please feel free to leave a comment, question, or suggestion if you have any thoughts related to this post.

Please clap if you feel the content is useful! :) Thank you!
