# Bayesian Statistics and Naive Bayes Classifier Refresher

The ability to clearly explain different Machine Learning approaches to someone without a technical background is extremely important for a Data Scientist and for those interviewing for Data Science roles. Over my next couple of posts I will try to highlight the pertinent information to know for various Machine Learning algorithms. I am going to focus on the intuition behind these algorithms rather than the math, in order to provide an easy-to-understand explanation. For this post I will be focusing on the Naive Bayes Classifier.

# Bayes Theorem

Before getting into Naive Bayes I will review some of the key concepts behind Bayesian statistics. Bayes’ Theorem is a formula that tells us how to update the probability of a hypothesis when a related event occurs. In other words, it gives the probability of a hypothesis given the evidence. The image shown above summarizes Bayes’ formula and each of its components. I am not going to dive deep into Bayes’ theorem since I want to focus on Naive Bayes, but Data Skeptic did a great mini-podcast that explains the intuition behind Bayesian updating. I recommend checking it out before moving on to get a better grasp of the concept.
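As a quick refresher, the theorem reads P(H|E) = P(E|H) · P(H) / P(E). Here is a minimal numeric sketch of an update; the diagnostic-test numbers are made up purely for illustration:

```python
def bayes_update(prior, likelihood, evidence):
    """Posterior P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# Illustrative (assumed) numbers: a condition with 1% prevalence,
# a test with 95% sensitivity and a 10% false positive rate.
prior = 0.01        # P(H): probability of the condition before testing
likelihood = 0.95   # P(E|H): probability of a positive test given the condition
# P(E) via the law of total probability (positive test, with or without condition):
evidence = likelihood * prior + 0.10 * (1 - prior)

posterior = bayes_update(prior, likelihood, evidence)
print(round(posterior, 3))  # a positive test raises P(H) from 1% to ~8.8%
```

Note how a single piece of evidence moves the probability from the prior (1%) to the posterior (~8.8%), which is exactly the "updating" Bayes' theorem describes.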

# Bayesian Statistics vs Frequentist Statistics

A common question I have come across in my research about Bayesian Statistics is the difference between the Bayesian and frequentist approaches. Both are philosophies in inferential statistics. Inferential statistics is one of the two main branches of statistics and is used to make inferences about a population based on a sample of observations from that population. The other main branch of statistics is descriptive statistics, which simply provide a summary of data (mean, median, standard deviation, etc).

The frequentist believes that probability represents the long-term frequency of repeatable events (such as flipping a coin). Frequentists do not attach probabilities to hypotheses or unknown values. The Bayesian approach, on the other hand, uses probabilities to represent the uncertainty in any event or hypothesis. The below article has a great explanation of the difference between these two mindsets.

In that article the author cites two quotes addressing the same problem, one from a frequentist perspective and one from a Bayesian perspective. In the frequentist response, the final sentence mentions the “mean as the value which is most consistent with the data”. This is referring to the Maximum Likelihood Estimate, or MLE. For normally distributed data the MLE is simply the sample mean.

In the Bayesian response, the statement that the sample data will be used to update the distribution is referring to Bayesian updating: through Bayes’ theorem, new data narrows the probability distribution around the parameter’s true value. Instead of MLE, the Bayesian approach uses MAP, or Maximum A Posteriori estimation. MAP works on the posterior distribution rather than the likelihood alone: MAP maximizes the posterior probability (likelihood times prior probability), while MLE maximizes only the likelihood. If the prior is constant (uniform, like the faces of a fair die), then MAP equals MLE, so MLE can be seen as a special case of MAP with a constant prior. The below link has an in-depth look at the MLE and MAP formulas if you would like to get familiar with the math behind these concepts. Overall, the main difference between the frequentist and Bayesian approaches is the willingness to assign a prior probability to an unknown value.
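A concrete sketch of the MLE/MAP relationship, with assumed numbers: estimating a coin's heads probability from 6 heads in 10 flips. The MLE is the observed frequency; with a Beta(a, b) prior, the MAP estimate is the mode of the Beta posterior, and with a uniform Beta(1, 1) prior the two coincide, as stated above:

```python
def mle(heads, flips):
    # Maximum likelihood estimate: the observed frequency of heads.
    return heads / flips

def map_estimate(heads, flips, a, b):
    # MAP estimate under a Beta(a, b) prior: the mode of the
    # Beta(a + heads, b + tails) posterior distribution.
    return (heads + a - 1) / (flips + a + b - 2)

heads, flips = 6, 10
print(mle(heads, flips))                 # 0.6
print(map_estimate(heads, flips, 1, 1))  # 0.6 -- uniform prior: MAP == MLE
print(map_estimate(heads, flips, 5, 5))  # prior pulls the estimate toward 0.5
```

With the informative Beta(5, 5) prior the estimate lands at 10/18 ≈ 0.556, between the prior belief (0.5) and the data (0.6), which is the "updating" behavior described above.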

# Naive Bayes Classifier

Now that we have an understanding of the Bayesian framework we can move on to Naive Bayes. Naive Bayes is a classification algorithm used for binary or multi-class classification. The classification is carried out by calculating the posterior probability of each class and choosing the hypothesis with the highest probability using MAP. In other words, it finds the probability of a given set of features being associated with each label and assigns the label with the highest probability. It is referred to as naive because it assumes all features are independent of each other, which is rarely the case in real life.
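A minimal from-scratch sketch of that idea on a toy weather dataset (the data is made up for illustration): estimate class priors and per-class feature frequencies from counts, multiply them under the independence assumption, and pick the label with the highest unnormalized posterior:

```python
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (features_tuple, label).
    Returns class priors and per-(position, label) value counts."""
    priors = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return priors, feature_counts

def predict(priors, feature_counts, features):
    total = sum(priors.values())
    scores = {}
    for label, count in priors.items():
        score = count / total  # prior P(label)
        for i, value in enumerate(features):
            # naive assumption: multiply independent per-feature likelihoods
            score *= feature_counts[(i, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get)  # MAP decision rule

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("overcast", "hot"), "yes"),
        (("rainy", "cool"), "yes"), (("sunny", "cool"), "yes")]
priors, counts = train(data)
print(predict(priors, counts, ("rainy", "mild")))  # yes
```

Note that this bare version assigns zero probability to any feature value unseen for a class, which is exactly the "zero frequency" problem discussed below.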

**Pros:**

1. Easy to understand and fast to implement
2. Needs less training data than logistic regression
3. Performs well with categorical input values

**Cons:**

1. “Zero frequency”: if a categorical variable has a category in the test set that was not present in the training set, the model will assign it a probability of zero and be unable to make a prediction. This can be fixed with a smoothing method such as Laplace estimation, which assigns a small non-zero probability to values not seen in the training set. This is especially relevant for text classification: if a single word does not appear in the training set, you do not want it to drive the probability of the entire document to zero.
2. Assumes all predictors are independent
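A short sketch of how Laplace (add-one) smoothing avoids the zero-frequency problem for unseen words; the word counts and vocabulary size below are assumed for illustration:

```python
def word_prob(word, counts, vocab_size, alpha=1):
    # Add-one (Laplace) smoothing: every word in the vocabulary
    # receives alpha pseudo-counts, so no probability is ever zero.
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

# Word counts observed in "spam" training documents (illustrative).
spam_counts = {"offer": 3, "free": 5, "win": 2}
vocab_size = 1000  # assumed vocabulary size

print(word_prob("free", spam_counts, vocab_size))     # seen word: data-driven
print(word_prob("meeting", spam_counts, vocab_size))  # unseen word: small but non-zero
```

Without smoothing, the unseen word "meeting" would get probability 0 and zero out the score of any document containing it; with smoothing it simply contributes a small probability.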

# Naive Bayes vs Logistic Regression

Naive Bayes is often compared to another classification algorithm, Logistic Regression. Logistic Regression is a linear classification model that learns the probability of a sample belonging to a class and tries to find the optimal decision boundary separating the classes.

The main difference between the two is that Naive Bayes is a generative model and Logistic Regression is a discriminative model. A generative model tries to recreate the process that generated the data by estimating its assumptions and distributions, and then uses that model to predict unseen data. For example, Naive Bayes models the joint probability of the features and the label and derives the posterior probability from that model. A discriminative model is built only on the observed data and makes fewer assumptions about the data’s distribution; however, it is heavily reliant on the quality of the data. For example, Logistic Regression directly models the posterior probability by learning the input-to-output mapping that minimizes error.

1. Naive Bayes assumes all features are independent, so if variables are correlated its predictions will be poor. Logistic Regression handles correlated features better.
2. Naive Bayes works well on small training samples with high dimensionality (given features are independent) as it makes assumptions on prior probabilities. This is why it is commonly used for text classification.
3. Logistic Regression works much better than Naive Bayes on large data sets.

# Common Interview Questions on Bayes’ Rule

1. You call three friends in Seattle and ask each if it is raining. Each friend has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you “Yes”, it is raining. What is the probability that it’s actually raining in Seattle?
(Solution here)
2. Why is Naive Bayes naive? (Solution: It assumes each feature is independent from one another)
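For the first question, the answer depends on the prior probability of rain, which the question leaves open; here is a sketch of the Bayes’-rule arithmetic with an assumed prior of 1/4 (any prior could be substituted):

```python
from fractions import Fraction

prior = Fraction(1, 4)  # assumed prior P(rain); the question doesn't specify one
p_yyy_rain = Fraction(2, 3) ** 3  # all three friends telling the truth
p_yyy_dry = Fraction(1, 3) ** 3   # all three friends lying

# Bayes' rule: P(rain | YYY) = P(YYY | rain) P(rain) / P(YYY)
posterior = (p_yyy_rain * prior) / (p_yyy_rain * prior + p_yyy_dry * (1 - prior))
print(posterior)  # 8/11
```

Three agreeing answers lift an assumed 25% prior to 8/11 ≈ 73%, because the likelihood of three simultaneous lies, (1/3)³, is eight times smaller than the likelihood of three truths, (2/3)³.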