Geek Culture
Published in

Geek Culture

ML Series7: Bernoulli Naive Bayes

A Probabilistic Approach to ML & Naive Bayes is not Bayesian

Naive Bayes is a simple and efficient algorithm for solving a variety of classification problems. It is easy to build and particularly useful on large datasets. More importantly, this model introduces a probabilistic approach to understand machine learning.

Father of Bayes’ Theorem

Per probabilistic approach, suppose we have N class labels y = {c1, c2,…cn}, ƛ the loss of mis-classifying cj to ci, and x our sample. We have conditional risk

Risk is expected loss from classifying x to c.

and try to find the best h: X -> Y to minimize overall risk. h is our classifier.

From above definition, we can see the key is to find P(c|x). To find P(c|x), there are two strategies: Discriminative models and Generative Models. Given x, Discriminative models prediction of c directly.(examples include Logistic regresison, Decision tree, and SVM) On the other hand, Generative models transform P(c|x) using Bayes Theorem

P(c) is prior probability. P(x|c) is likelihood.

Derive Naive Bayes From MLE

Since P(x) is irrelevant to classification, but only a normalizer, we will ignore for now. Therefore, our goal here is to find most likely c ∈ {1…k} that maximize P(c)P(x|c). With d = the number of attributes, and under the Attribute Conditional Independence Assumption, the likelihood function is

In order to differentiate, I use q as probability sign for the right side. And use y in replacement of c.

To estimate the values of two q_s, Naive Bayes uses MLE(Maximum Likelihood Estimation), which tries to find the points in parameter space that maximize the likelihood function.

Maximize this function L to get optimal parameters

It’s important to know how to use MLE. Now the estimation splits into two parts. Since they are not dependent on each other, we can simply maximize them one-by-one.

1st Part, *important note we define [[y^(i) = y]] to be 1 if they are equal, y^(i) is ith class of y.
2nd Part

Proof of getting final estimators mentioned above. The last step of both parts.

The final estimators are

Prior Probability. D_c is number of class c in dataset D
Posterior Probability. D_c,x is number of attribute x in D_c

There are two schools of thoughts in statistical inference: Frequentist and Bayesian. You may think since we use Bayes Theorem, Naive Bayes is a Bayesian inference. It’s no the case, because it does not make assumption of any prior distribution, it uses MLE to derive its parameters. So, it is still a Frequentist inference.

As you can see, the final parameters are easy to calculate and straightforward. Because of the algorithm’s strong assumption “Attribute Conditional Independence” and simplicity. It is named Naive. Below are some resources I found useful:

  1. Use of Naive Bayes
  2. Difference between Bayesian and Frequentist with example

Interview Questions

  1. How to optimize using Lagrange Multiplier?
  2. How to use MLE?
  3. How to use Bayes conditional probability theorem?
  4. Difference between Frequentist and Bayesian, what makes naive bayes frequentist?

Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.



A new tech publication by Start it up (

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Junlin Liu

Data Scientist in Finance. Take care of the memories, polish knowledge.