ML Series 7: Bernoulli Naive Bayes
A Probabilistic Approach to ML & Why Naive Bayes Is Not Bayesian
Naive Bayes is a simple and efficient algorithm for solving a variety of classification problems. It is easy to build and particularly useful on large datasets. More importantly, this model introduces a probabilistic approach to understanding machine learning.
In the probabilistic approach, suppose we have N class labels Y = {c_1, c_2, …, c_N}, let λ_ij be the loss of misclassifying a sample whose true class is c_j as c_i, and let x be our sample. The conditional risk of classifying x as c_i is then
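R(c_i \mid x) = \sum_{j=1}^{N} \lambda_{ij} \, P(c_j \mid x)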
and we try to find the best h: X → Y that minimizes the overall risk; h is our classifier.
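Minimizing the conditional risk at every single x minimizes the overall risk, so the Bayes optimal classifier simply picks the class with the smallest conditional risk:

R(h) = \mathbb{E}_x \big[ R(h(x) \mid x) \big], \qquad h^*(x) = \arg\min_{c \in Y} R(c \mid x)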
From the definition above, we can see the key is to find P(c|x), and there are two strategies for doing so: discriminative models and generative models. Given x, discriminative models predict c directly (examples include logistic regression, decision trees, and SVM). Generative models, on the other hand, rewrite P(c|x) using Bayes' Theorem:
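P(c \mid x) = \frac{P(c) \, P(x \mid c)}{P(x)}

where P(c) is the class prior, P(x|c) is the class-conditional likelihood, and P(x) is the same for every class.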
Derive Naive Bayes From MLE
Since P(x) is irrelevant to classification and acts only as a normalizer, we will ignore it for now. Our goal is therefore to find the most likely class c ∈ {c_1, …, c_N} that maximizes P(c)P(x|c). With d attributes, and under the Attribute Conditional Independence Assumption, the likelihood function is
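P(c) \, P(x \mid c) = P(c) \prod_{i=1}^{d} P(x_i \mid c)

Writing π_c = P(c) for the class priors and θ_{ci} = P(x_i = 1 \mid c) for the Bernoulli attribute probabilities (my notation for the two parameter sets), the likelihood over a training set D = {(x^{(1)}, y^{(1)}), …, (x^{(m)}, y^{(m)})} is

L(\pi, \theta) = \prod_{j=1}^{m} \pi_{y^{(j)}} \prod_{i=1}^{d} P\big(x_i^{(j)} \mid y^{(j)}\big), \quad \text{where } P(x_i \mid c) = \theta_{ci}^{\,x_i} (1 - \theta_{ci})^{1 - x_i}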
To estimate these two parameter sets, Naive Bayes uses MLE (Maximum Likelihood Estimation), which finds the point in parameter space that maximizes the likelihood function.
It’s important to know how to use MLE here. The estimation splits into two parts, and since they do not depend on each other, we can simply maximize them one by one.
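Taking logs makes the split explicit:

\log L(\pi, \theta) = \sum_{j=1}^{m} \log \pi_{y^{(j)}} + \sum_{j=1}^{m} \sum_{i=1}^{d} \Big[ x_i^{(j)} \log \theta_{y^{(j)} i} + \big(1 - x_i^{(j)}\big) \log\big(1 - \theta_{y^{(j)} i}\big) \Big]

The first sum involves only the priors π and the second only the conditionals θ, so each can be maximized on its own.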
The proof of the final estimators ends with the same kind of step in both parts. For the priors, which must sum to one, that step uses a Lagrange multiplier.
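Writing m_c for the number of training samples in class c, the prior part is \sum_c m_c \log \pi_c subject to \sum_c \pi_c = 1, so we form the Lagrangian (reusing λ here as the multiplier, not the loss):

\sum_{c} m_c \log \pi_c + \lambda \Big(1 - \sum_{c} \pi_c\Big)

Setting the derivative with respect to π_c to zero gives m_c / \pi_c = \lambda, i.e. \pi_c = m_c / \lambda; summing over all classes and applying the constraint yields \lambda = m, hence \hat{\pi}_c = m_c / m. The θ part needs no such constraint: with m_{ci} the number of class-c samples whose i-th attribute is 1, setting m_{ci}/\theta_{ci} - (m_c - m_{ci})/(1 - \theta_{ci}) = 0 gives \hat{\theta}_{ci} = m_{ci}/m_c.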
The final estimators are
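\hat{\pi}_c = \frac{m_c}{m}, \qquad \hat{\theta}_{ci} = \frac{m_{ci}}{m_c}

In words: the prior of a class is its frequency in the training set, and each Bernoulli parameter is the within-class frequency of the attribute being 1.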
There are two schools of thought in statistical inference: Frequentist and Bayesian. You may think that, since we use Bayes' Theorem, Naive Bayes is Bayesian inference. That is not the case: it does not assume any prior distribution over its parameters; it uses MLE to derive them. So it is still Frequentist inference.
As you can see, the final parameters are straightforward and easy to calculate. Because of the algorithm's strong Attribute Conditional Independence assumption and its simplicity, it is named Naive.
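To make this concrete, here is a minimal NumPy sketch of Bernoulli Naive Bayes built directly from these closed-form estimates. The class and variable names are my own, and I add Laplace smoothing (a standard tweak not covered in the derivation above) so that log(0) never appears for attribute values unseen in a class:

```python
import numpy as np

class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes using the closed-form MLE estimates derived above."""

    def fit(self, X, y):
        # X: (m, d) binary array; y: (m,) array of class labels.
        self.classes_ = np.unique(y)
        m, d = X.shape
        k = len(self.classes_)
        self.log_prior_ = np.empty(k)
        self.log_theta_ = np.empty((k, d))
        self.log_not_theta_ = np.empty((k, d))
        for idx, c in enumerate(self.classes_):
            Xc = X[y == c]
            m_c = len(Xc)
            # MLE of the prior: pi_c = m_c / m
            self.log_prior_[idx] = np.log(m_c / m)
            # MLE of theta_ci = m_ci / m_c; the +1/+2 is Laplace smoothing,
            # an addition of mine (not part of the derivation above) that
            # keeps log(0) from appearing for unseen attribute values.
            theta = (Xc.sum(axis=0) + 1) / (m_c + 2)
            self.log_theta_[idx] = np.log(theta)
            self.log_not_theta_[idx] = np.log(1.0 - theta)
        return self

    def predict(self, X):
        # Score each class by log P(c) + sum_i log P(x_i | c), pick the max.
        scores = (self.log_prior_
                  + X @ self.log_theta_.T
                  + (1 - X) @ self.log_not_theta_.T)
        return self.classes_[np.argmax(scores, axis=1)]


# Toy usage: two binary attributes, two classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([0, 0, 1, 1])
model = BernoulliNaiveBayes().fit(X, y)
print(model.predict(np.array([[1, 0]])))  # -> [0]
```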
Interview Questions
- How to optimize using Lagrange Multiplier?
- How to use MLE?
- How to use Bayes' Theorem for conditional probabilities?
- What is the difference between Frequentist and Bayesian inference, and what makes Naive Bayes Frequentist?
Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.