The Good And Bad Of Naive-Bayes Classifier

Vishal Singh
4 min read · Aug 28, 2021


Hey folks, before we get started with our main topic, I would like to list a prerequisite concept that will make it easier for us to understand the main topic better.

  • A little bit of an idea about what the Naive Bayes classifier is in ML, and about Bayes' theorem, will be great

A quick introduction to the Naive Bayes Classifier

Photo by Lucas Santos on Unsplash

Naive Bayes falls under the category of classification algorithms in supervised machine learning. It is a simplistic (as the name suggests) algorithm which uses Bayes' theorem to solve classification problems, and it is probabilistic in nature, which means its results come in the form of probabilities. It makes the fundamental assumption that the features in the dataset are conditionally independent of each other given the class, i.e. there is no correlation between the features, which might not be the case in the real world. This is the reason why it is called “Naive” Bayes.

Bayes Theorem

It states that “the probability that an event A occurs, given that another event B has already occurred, is equal to the probability that event B occurs given that A has already occurred, multiplied by the probability of occurrence of event A and divided by the probability of occurrence of event B”.

image credits — freecodecamp
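In symbols this is P(A|B) = P(B|A) * P(A) / P(B). Here is a tiny numeric illustration in Python; the event names and the probability values below are made up purely for illustration:

```python
# A minimal numeric illustration of Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# All numbers are made up for illustration.
p_a = 0.01          # prior probability of event A (e.g. "email is spam")
p_b_given_a = 0.9   # P(B | A), e.g. "contains the word 'offer' given spam"
p_b_given_not_a = 0.05

# Total probability of B, marginalising over A and not-A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior probability of A given that B was observed.
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")   # ≈ 0.154
```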

Advantages

  • It is computationally efficient and fast, as its time complexity and space complexity are low

The time complexity of the Naive Bayes classifier during the training phase, in the case of a brute-force approach, is O(n*d*c), where “n” is the number of data points, “d” is the number of features and “c” is the number of classes. With some optimization it can be reduced to O(n*d), and if d is small it effectively becomes O(n).

The space complexity of the trained model is O(d*c), because we have to store roughly c*d likelihood probabilities. We can store them in a dictionary keyed by (feature value, class) pairs, with the likelihoods as the values.

The time complexity during the test phase is O(d*c), since we have to retrieve the “d” feature likelihoods for each of the “c” classes from the dictionary, which means performing d*c lookups. This makes it super useful when dealing with very high-dimensional data, such as a high-dimensional corpus of text data.

The additional space complexity during the testing stage is just O(1), as nothing extra has to be stored beyond the already-trained model.
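To make the bookkeeping above concrete, here is a minimal sketch (all function and variable names are invented for illustration) of storing the likelihoods in a dictionary during training and performing roughly d*c lookups per prediction:

```python
from collections import Counter, defaultdict

def train(X, y):
    """X: rows of d categorical features, y: class labels."""
    class_counts = Counter(y)
    # likelihoods[(feature_index, feature_value, class_label)] = P(value | class)
    value_counts = defaultdict(int)
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            value_counts[(j, value, label)] += 1
    likelihoods = {key: count / class_counts[key[2]] for key, count in value_counts.items()}
    priors = {c: n / len(y) for c, n in class_counts.items()}
    return priors, likelihoods

def predict(row, priors, likelihoods):
    scores = {}
    for c, prior in priors.items():            # c classes ...
        score = prior
        for j, value in enumerate(row):        # ... times d features = d*c dictionary lookups
            score *= likelihoods.get((j, value, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data, made up for illustration.
X = [["sunny", "hot"], ["rainy", "cool"], ["sunny", "cool"], ["rainy", "hot"]]
y = ["play", "stay", "play", "stay"]
priors, likelihoods = train(X, y)
print(predict(["sunny", "cool"], priors, likelihoods))   # -> "play"
```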

  • It is one of the best algorithms for text classification and needs relatively little data. It is even used as a baseline model for discriminative models like logistic regression in text classification, when the assumption of feature independence roughly holds (a minimal baseline sketch appears after the reference list below)
  • It can handle both continuous and discrete data.

For continuous data we either discretize the continuous range of each feature xi into bins {b1, …, bm} and proceed as in the discrete case, or we assume a distribution such as a Gaussian (or any other suitable one), as in the sketch below.
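A hedged sketch of both options using scikit-learn, assuming a reasonably recent version is installed (GaussianNB for the Gaussian assumption, KBinsDiscretizer plus CategoricalNB for the binning route):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: assume each continuous feature is Gaussian within each class.
gnb = GaussianNB().fit(X_train, y_train)
print("Gaussian NB accuracy:", gnb.score(X_test, y_test))

# Option 2: discretize each continuous feature into bins, then treat the bins as categories.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit(X_train)
cnb = CategoricalNB().fit(binner.transform(X_train).astype(int), y_train)
print("Binned categorical NB accuracy:",
      cnb.score(binner.transform(X_test).astype(int), y_test))
```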

  • Its interpretability is very high
  • Extensively used for categorical data
  • Surprisingly, even if our assumption of feature independence isn’t true, Naive Bayes works great in practice

See the seminal paper by Domingos and Pazzani [1], or the later work by Zhang [2].

[1] Domingos, P. and Pazzani, M., 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29(2–3), pp.103–130.

[2] Zhang, H., 2005. Exploring conditions for the optimality of naive Bayes. International Journal of Pattern Recognition and Artificial Intelligence, 19(02), pp.183–198.
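As promised above, here is a minimal text-classification baseline sketch with scikit-learn; the toy sentences and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: spam vs. ham.
texts = [
    "win a free prize now",
    "limited offer, claim your free gift",
    "meeting rescheduled to friday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts + multinomial Naive Bayes is a classic text baseline.
baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(texts, labels)
print(baseline.predict(["free prize offer"]))   # likely ['spam']
```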

Disadvantages

  • It assumes that features are independent

Naive Bayes makes the fundamental assumption that the features are mutually independent of each other. But in real life it is almost impossible to get a dataset in which the features or attributes are independent of each other; in most cases the features do show some kind of dependency. This is why it is called “Naive” Bayes: it makes this naive assumption.

  • Numerical instabilities

The likelihood probabilities in Naive Bayes lie in the range (0, 1). If we have d features (say d = 100), we have to multiply d likelihoods, i.e. multiply 100 decimal numbers that are all in the range (0, 1). This can lead to numerical underflow, which is not a good thing for any model. It can, however, be solved by using log probabilities in place of the raw likelihood probabilities, as in the sketch below.
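A small illustration of the underflow problem and of the log-probability fix; the likelihood values below are made up, with d = 100 features:

```python
import math

# Made-up likelihoods: 100 features, each with likelihood 0.0001.
likelihoods = [1e-4] * 100

product = 1.0
for p in likelihoods:
    product *= p
print(product)        # 0.0 -- the true value 1e-400 underflows below float precision

# Summing log-probabilities instead keeps the score in a safe numeric range.
log_score = sum(math.log(p) for p in likelihoods)
print(log_score)      # ≈ -921.03; classes can be compared by this sum instead of the raw product
```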

  • It is very prone to overfitting, i.e. high variance, if we don’t use Laplace smoothing
  • Zero Frequency

If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model will assign a 0 (zero) probability to it, because it wasn’t seen during the training phase, and the overall output will become zero since the rest of the probabilities get multiplied by this 0 (zero) probability. Laplace smoothing, mentioned above, avoids exactly this, as the sketch below shows.
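A toy sketch of the zero-frequency problem and how Laplace (add-one) smoothing avoids it; the word counts below are made up for illustration:

```python
# Suppose the word "discount" never appeared in "ham" training mails.
ham_word_counts = {"meeting": 10, "report": 8, "discount": 0}
total_ham_words = sum(ham_word_counts.values())
vocabulary_size = len(ham_word_counts)

# Without smoothing: P("discount" | ham) = 0, which zeroes out the whole product.
p_unsmoothed = ham_word_counts["discount"] / total_ham_words
print(p_unsmoothed)   # 0.0

# With Laplace smoothing (alpha = 1): add 1 to every count so no likelihood is exactly zero.
alpha = 1
p_smoothed = (ham_word_counts["discount"] + alpha) / (total_ham_words + alpha * vocabulary_size)
print(p_smoothed)     # ≈ 0.048
```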

Thank you

I hope it was useful 😃 I would love to connect with you on LinkedIn. Click here to connect on LinkedIn.
