# ML Series 7: Bernoulli Naive Bayes

A Probabilistic Approach to ML, and Why Naive Bayes Is Not Bayesian

Naive Bayes is a simple and efficient algorithm for solving a variety of classification problems. It is easy to build and particularly useful on large datasets. **More importantly, this model introduces a probabilistic approach to understanding machine learning.**

In the probabilistic approach, suppose we have N class labels 𝒴 = {c₁, c₂, …, c_N}, let λ_ij be the loss of misclassifying a sample whose true class is c_j as c_i, and let x be our sample. The conditional risk of labeling x as c_i is

$$R(c_i \mid x) = \sum_{j=1}^{N} \lambda_{ij} \, P(c_j \mid x)$$

and we try to find the best classifier h: 𝒳 → 𝒴 that minimizes the overall risk

$$R(h) = \mathbb{E}_x \left[ R(h(x) \mid x) \right]$$
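To see why the posterior is what matters, consider the standard special case of 0-1 loss (λ_ij = 1 if i ≠ j, else 0). The conditional risk then reduces to

$$R(c \mid x) = 1 - P(c \mid x)$$

so the risk-minimizing classifier, the Bayes optimal classifier, simply picks the class with the largest posterior:

$$h^{*}(x) = \arg\max_{c \in \mathcal{Y}} P(c \mid x)$$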

As the derivation above shows, the key is to find **P(c|x)**. To find P(c|x), there are two strategies: **discriminative models** and **generative models**. Given x, discriminative models predict c directly (examples include logistic regression, decision trees, and SVM). Generative models, on the other hand, transform **P(c|x) using Bayes' Theorem**:

$$P(c \mid x) = \frac{P(c) \, P(x \mid c)}{P(x)}$$

# Deriving Naive Bayes from MLE

Since P(x) is the same for every class, it is only a normalizer and irrelevant to classification, so we will ignore it for now. Our goal is therefore to find the most likely class c ∈ {c₁, …, c_N} that maximizes P(c)P(x|c). With d attributes, and under the **attribute conditional independence assumption**, the likelihood factorizes across attributes:

$$P(c) \, P(x \mid c) = P(c) \prod_{i=1}^{d} P(x_i \mid c)$$

To estimate the two groups of parameters, the class priors P(c) and the class-conditional probabilities P(x_i | c), Naive Bayes uses **MLE (Maximum Likelihood Estimation)**, which finds the point in parameter space that maximizes the likelihood function.

It’s important to know how to use MLE here. Over the whole training set D, the log-likelihood splits into **two parts**, as shown below. Since they share no parameters, we can simply maximize them one by one.
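A sketch of that split, writing the training set as D = {(x⁽ⁿ⁾, y⁽ⁿ⁾)} with m samples (this notation is assumed here, not from the original):

$$\log L = \underbrace{\sum_{n=1}^{m} \log P\big(y^{(n)}\big)}_{\text{priors only}} + \underbrace{\sum_{n=1}^{m} \sum_{i=1}^{d} \log P\big(x_i^{(n)} \mid y^{(n)}\big)}_{\text{conditionals only}}$$

The first term involves only the prior parameters and the second only the conditional parameters, so each can be maximized independently.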

Maximizing each part subject to its normalization constraint (a short Lagrange multiplier step, since the probabilities in each part must sum to one) gives the final estimators:

$$\hat{P}(c) = \frac{|D_c|}{|D|}, \qquad \hat{P}(x_i = 1 \mid c) = \frac{|D_{c,\, x_i = 1}|}{|D_c|}$$

where D_c is the set of training samples of class c, and D_{c, x_i=1} is the subset of D_c whose i-th attribute equals 1.
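These counting formulas translate directly into code. Below is a minimal from-scratch sketch (the function names are my own; it omits the Laplace smoothing a production implementation would need, so an estimated probability of exactly 0 or 1 can make the log-scores degenerate on unseen patterns):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """Estimate P(c) and P(x_i = 1 | c) by counting.

    X: (m, d) binary 0/1 matrix; y: (m,) integer class labels.
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])        # |D_c| / |D|
    conds = np.array([X[y == c].mean(axis=0) for c in classes])  # |D_{c, x_i=1}| / |D_c|
    return classes, priors, conds

def predict_bernoulli_nb(X, classes, priors, conds):
    """Return argmax_c P(c) * prod_i P(x_i | c) for each row of X, in log space."""
    # P(x_i | c) equals conds[c, i] when x_i = 1 and 1 - conds[c, i] when x_i = 0.
    scores = np.log(priors) + X @ np.log(conds).T + (1 - X) @ np.log(1 - conds).T
    return classes[np.argmax(scores, axis=1)]
```

Working in log space turns the product over d attributes into a sum, which avoids floating-point underflow when d is large.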

There are two schools of thought in statistical inference: **Frequentist** and **Bayesian**. You may think that, since we use Bayes' Theorem, Naive Bayes is Bayesian inference. **It's not the case**: it does not assume any prior distribution over its parameters; it derives them with MLE. So it is still Frequentist inference.

As you can see, the final parameters are easy to calculate and straightforward. The algorithm is called "Naive" because of its strong **attribute conditional independence** assumption and its simplicity.
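If you want to try this out quickly, scikit-learn ships a ready-made Bernoulli Naive Bayes; here is a minimal usage sketch with made-up toy data:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data: 6 samples with 3 binary attributes, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

clf = BernoulliNB()  # uses alpha=1.0 Laplace smoothing by default
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))        # predicted class
print(clf.predict_proba([[1, 0, 0]]))  # posterior P(c | x)
```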

# Interview Questions

- How do you optimize with a Lagrange multiplier? (a sketch follows this list)
- How do you use MLE?
- How do you use Bayes' conditional probability theorem?
- What is the difference between Frequentist and Bayesian inference, and what makes Naive Bayes Frequentist?
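For the first question, here is a minimal sketch of the Lagrange multiplier step behind the prior estimator (writing π_c for P(c) and m_c = |D_c|; this notation is assumed here):

$$\mathcal{L} = \sum_{c} m_c \log \pi_c + \alpha \Big( 1 - \sum_{c} \pi_c \Big), \qquad \frac{\partial \mathcal{L}}{\partial \pi_c} = \frac{m_c}{\pi_c} - \alpha = 0 \;\Rightarrow\; \pi_c = \frac{m_c}{\alpha}$$

The constraint Σ_c π_c = 1 forces α = Σ_c m_c = |D|, recovering P̂(c) = |D_c| / |D|.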

**Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.**