Understanding Machine Learning Algorithms — Naive Bayes

Srujan Tadagoppula
Published in Analytics Vidhya · 5 min read · Feb 21, 2020

Naive Bayes can be used for classification. In this blog, we derive Naive Bayes from first principles.

What you will learn:

  1. Conditional Probability
  2. Bayes Theorem
  3. How Naive Bayes works
  4. Log-probabilities and Numerical Stability
  5. Overfitting and Underfitting
  6. How it works when we have Outliers and Missing values
  7. Use cases and Limitations

1. Conditional Probability

Naive Bayes works on the fundamentals of probability. It is built on Bayes' theorem, and Bayes' theorem is itself derived from conditional probability.

Intuition:- Event B has occurred, with or without event A. We are then asking how often A occurred among the cases where B occurred.

Now let's understand the formula: how often does A occur, given that B has already occurred? We count the occurrences of A and divide by the total number of possibilities. Once we know B has occurred, the only occurrences of A that count are the ones where both A and B occurred, and since we are assuming B occurred, the total number of possibilities is restricted to B.
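This is the standard conditional probability formula that the paragraph above is describing:

P(A|B) = P(A ∩ B) / P(B), assuming P(B) > 0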

2. Bayes Theorem

We derive Bayes' theorem from conditional probability.
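From the conditional probability formula, P(A ∩ B) = P(A|B)·P(B) and also P(A ∩ B) = P(B|A)·P(A). Equating the two and dividing by P(B) gives Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)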

Now let's understand the formula. Every term in this formula has a specific name:

P(A|B) is the Posterior

P(B|A) is the Likelihood

P(B) is the Evidence

P(A) is the Prior

3. How Naive Bayes works

Naive Bayes is used a lot on text data, for example to decide whether a mail is spam or not, or whether a review is positive or negative. We will see why. We take text data as our example to explain the algorithm.

Here we take product reviews as an example project, where y=1 means a positive review and y=0 means a negative review. After all the preprocessing of the text, for each review we have to calculate its prior and its likelihood.

We compute the probability of y=1 given any review.

We compute the probability of y=0 given any review.

Once the prior probabilities are done, we are left with the likelihoods, so let's calculate the likelihood probabilities.

At the end of training on the data, we get the priors and the likelihoods.
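As a minimal sketch of what this training step could look like (not the author's code; the toy reviews and variable names are my own assumptions), this counts the class priors and the per-word counts that the likelihoods are built from:

```python
from collections import Counter

# Toy labelled reviews: 1 = positive, 0 = negative (illustrative data only)
reviews = [
    ("great product loved it", 1),
    ("worst purchase ever", 0),
    ("loved the quality great value", 1),
    ("terrible quality worst ever", 0),
]

# Priors: P(Y=1) and P(Y=0) come straight from the class counts
n_pos = sum(1 for _, y in reviews if y == 1)
n_neg = len(reviews) - n_pos
prior_pos = n_pos / len(reviews)   # P(Y=1)
prior_neg = n_neg / len(reviews)   # P(Y=0)

# Per-class word counts, from which the likelihoods P(Wj|Y) are estimated
pos_counts, neg_counts = Counter(), Counter()
for text, y in reviews:
    (pos_counts if y == 1 else neg_counts).update(text.split())

print(prior_pos, prior_neg)                      # 0.5 0.5
print(pos_counts["great"], neg_counts["worst"])  # 2 2
```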

But is everything fine? There is a problem with this. From the training data, we have probabilities for every word, like:

P(Y=1) ; P(Y=0)

P(W1|Y=1) ; P(W1|Y=0) and so on

What if, at testing time, we get a word that is not present in the training data? We don't have a probability for that word in our training data, and here comes the problem.

The probability of that new word (which is not present in our training data) will be zero, so the entire product of probabilities will be zero, which is a huge problem.
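Concretely, suppose a test review contains the words W1, W2 and an unseen word Wnew. Then

P(Y=1 | review) ∝ P(Y=1) · P(W1|Y=1) · P(W2|Y=1) · P(Wnew|Y=1) = P(Y=1) · P(W1|Y=1) · P(W2|Y=1) · 0 = 0

and the same happens for Y=0, so the model can no longer tell the two classes apart.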

Laplace or Additive Smoothing:-

Initially, when we have a word which is not present in the training data, its probability will be 0/n. Now we add a value alpha to the numerator and alpha·K to the denominator.
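With smoothing, the likelihood for any word Wj becomes (assuming K is the number of distinct words in the vocabulary):

P(Wj | Y=1) = (number of times Wj appears in positive reviews + alpha) / (n + alpha·K)

The numerator can no longer be zero, and as alpha grows very large every word's probability approaches 1/K, i.e. a uniform distribution.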

Alpha can be any positive value; usually we take alpha=1.

When alpha is large, the probability of a given word for Y=1 and for Y=0 become roughly equal; as alpha increases, our likelihood probabilities move toward a uniform distribution.

When alpha is small but non-zero, we still get rid of the multiplication-by-zero problem.

4. Log-probabilities and Numerical Stability

All these probability values lie between 0 and 1. When we have high-dimensional data, we multiply many small numbers with each other, which leads to numerical instability (underflow).

To avoid this problem, we take the log of each probability and add the logs instead of multiplying the probabilities.
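A minimal sketch of the idea, with made-up numbers: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays stable.

```python
import math

# 1000 word-likelihoods, each a small probability (toy values)
probs = [1e-4] * 1000

product = 1.0
for p in probs:
    product *= p          # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)  # stays finite

print(product)   # 0.0
print(log_sum)   # about -9210.34
```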

5. Overfitting and Underfitting

There is one hyperparameter in Naive Bayes: alpha.

When alpha is too small it leads to overfitting.

When alpha is too large, it leads to underfitting, because the likelihood probabilities become close to a uniform distribution and we can no longer say which class a new data point belongs to.
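As a hedged sketch of tuning this hyperparameter (the toy reviews and the alpha grid are my own assumptions), scikit-learn's MultinomialNB exposes exactly this additive smoothing value as its alpha parameter, so it can be searched over with cross-validation:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Toy labelled reviews (illustrative only): 1 = positive, 0 = negative
texts = ["great product loved it", "worst purchase ever",
         "loved the quality", "terrible quality", "great value", "worst ever"]
labels = [1, 0, 1, 0, 1, 0]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Small alpha risks overfitting, large alpha pushes the likelihoods toward uniform (underfitting)
grid = GridSearchCV(pipe,
                    {"multinomialnb__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=3)
grid.fit(texts, labels)
print(grid.best_params_)
```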

6. How it works when we have Outliers and Missing values

Outliers:-

  1. At training time, if a word (Wj) occurs fewer than 10 times, we just ignore it (see the sketch after this list).
  2. At testing time, Laplace smoothing takes care of outliers (rare or unseen words).
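A quick, hedged illustration of the first point: scikit-learn's CountVectorizer has a min_df parameter that drops rare words at vectorization time. Note that min_df counts the number of documents a word appears in rather than its raw occurrence count, so it is a close analogue of the rule above rather than an exact match.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Ignore any word that appears in fewer than 10 documents of the training corpus,
# a document-frequency analogue of "drop words seen fewer than 10 times"
vectorizer = CountVectorizer(min_df=10)
```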

Missing-Values:-

If we have text data, there is no case of missing values. If we have categorical features, we treat NaN as one more category. For numerical features, we use model-based imputation.

7. Use cases and Limitations

  1. Naive Bayes' fundamental assumption is conditional independence of the features; when that roughly holds, it works well.
  2. It is used extensively for text classification and for categorical features.
  3. It is not commonly used with real-valued features.
  4. The Naive Bayes model is interpretable.
  5. Its run-time and training-time complexity are low, so we can use it in low-latency applications.
  6. It easily overfits, so we should train it properly with Laplace smoothing.

Thanks for reading!
