Things You Need To Check Before Applying the Naive Bayes Algorithm on Any Dataset

Janibasha Shaik · Published in Analytics Vidhya · 3 min read · Sep 22, 2020

All the conditions you need to know before applying Naive Bayes

The main agenda of this article is to discuss the boundary cases of Naive Bayes.

I assume you are already familiar with the theoretical concepts behind Naive Bayes.

Boundary Case 1: Problem Statement

First, we need to check the problem statement of the dataset, because Naive Bayes was originally intended for classification tasks.

Note: We can use Naive Bayes for regression problem statements as well, but the algorithm needs some modification.
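For illustration, here is a minimal classification sketch using scikit-learn's GaussianNB (the toy data is invented):

```python
# A minimal sketch: Naive Bayes on a tiny, invented classification task
from sklearn.naive_bayes import GaussianNB

X = [[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]]  # feature vectors
y = [0, 0, 1, 1]                                      # class labels

model = GaussianNB()
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))  # -> [0]
```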

Boundary Case 2: Assumption

Our aim is to find P(Class | Data point).

X = {x₁, x₂, x₃, …, xₙ} (n features of a data point)

y = {C₁, C₂, C₃, …, Cₖ} (k classes)

Bayes' theorem for multi-class classification, with the likelihood expanded by the chain rule:

P(Cₖ | x₁, x₂, x₃, …, xₙ) ∝ P(x₁ | x₂, x₃, …, xₙ, Cₖ) · P(x₂ | x₃, …, xₙ, Cₖ) · … · P(xₙ | Cₖ) · P(Cₖ)

It is very hard to compute the probabilities in the equation above directly, so we make one simplifying assumption.

Assumption: Conditional Independence

P(A | B) = P(A) ⟹ A is independent of B

P(A | B ∩ C) = P(A | C) ⟹ A is conditionally independent of B given C

Using this assumption, we can easily compute the multi-class classification probability:

P(Cₖ | X) ∝ P(x₁ | Cₖ) · P(x₂ | Cₖ) · … · P(xₙ | Cₖ) · P(Cₖ)

Feature x₁ is conditionally independent of x₂, x₃, x₄, …, xₙ given Cₖ.

Similarly, every feature is conditionally independent of all the others given Cₖ.
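To make the factorized formula concrete, here is a hand-rolled sketch with invented probabilities (not a library implementation):

```python
import numpy as np

# Invented per-class likelihoods P(xᵢ | Cₖ) for n = 3 features, k = 2 classes
likelihoods = np.array([[0.2, 0.5, 0.3],   # P(xᵢ | C₁)
                        [0.6, 0.1, 0.4]])  # P(xᵢ | C₂)
priors = np.array([0.5, 0.5])              # class priors P(Cₖ)

# Unnormalized posterior: product of the likelihoods times the prior
scores = likelihoods.prod(axis=1) * priors
posterior = scores / scores.sum()          # normalize over the classes
print(posterior)                           # P(Cₖ | X) for each class
```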

If the features of the given dataset are only weakly correlated with each other, then Naive Bayes works well.
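A quick sanity check for this assumption is to look at pairwise feature correlations; a sketch on an invented DataFrame:

```python
import pandas as pd

# Toy stand-in for your feature matrix (values are invented; f1 and f2
# are deliberately perfectly correlated)
df = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8], "f3": [5, 1, 4, 2]})

corr = df.corr().abs()
# Flag strongly correlated pairs (0.8 is an arbitrary threshold)
pairs = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.8]
print(pairs)  # many pairs here => the independence assumption is shaky
```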

Boundary Case 3: Data Type

If your dataset is text data, Naive Bayes works very well, because it uses Laplace smoothing to assign a probability to words that do not appear in the training corpus.

So for text data, Naive Bayes is a strong default choice among classical machine learning algorithms.
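In scikit-learn, Laplace smoothing is controlled by the alpha parameter of MultinomialNB (alpha=1.0 is classic add-one smoothing); a minimal sketch on an invented corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "bad movie", "good plot"]  # toy corpus
labels = [1, 0, 1]                               # invented sentiment labels

X = CountVectorizer().fit_transform(docs)
# alpha=1.0 applies add-one (Laplace) smoothing, so a word that never
# occurs with a class still gets a small non-zero probability
clf = MultinomialNB(alpha=1.0).fit(X, labels)
```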

Boundary Case 4: Feature Importance

Do we need to apply any feature importance techniques for Naive Bayes?

The answer is no, because the model learns a probability value for every feature; if we sort these values, we can directly read off the important features.

Important features have high probability values.
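With MultinomialNB, for example, the learned per-class log-probabilities are exposed as feature_log_prob_, so sorting them surfaces the most informative features; a sketch on an invented corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "bad movie", "good plot"]  # invented toy corpus
labels = [1, 0, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

words = np.array(vec.get_feature_names_out())
# Sort by P(word | class 1), highest first: the "important" features
order = np.argsort(clf.feature_log_prob_[1])[::-1]
print(words[order])
```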

Boundary Case 5: Interpretability

Naive Bayes is very good for interpretability: it outputs probability values, which are easy to interpret.
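For instance, predict_proba returns the per-class posterior for every input, which can be read off directly (toy data invented for illustration):

```python
from sklearn.naive_bayes import GaussianNB

X = [[1.0], [1.2], [3.9], [4.1]]  # invented 1-D feature
y = [0, 0, 1, 1]

clf = GaussianNB().fit(X, y)
# Each row is [P(class 0 | x), P(class 1 | x)]: directly interpretable
print(clf.predict_proba([[1.1], [4.0]]))
```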

Boundary Case 6: Imbalance Data

If we have an imbalanced dataset, Naive Bayes is biased towards the majority class because of the class prior and the likelihood estimates, so we need to balance the dataset.
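Besides resampling, one simple knob in scikit-learn is the prior itself: the class_prior parameter lets you override the skewed empirical prior (a sketch with invented counts):

```python
from sklearn.naive_bayes import MultinomialNB

# Imbalanced toy data: five examples of class 0, one of class 1 (invented)
X = [[2, 0], [3, 1], [1, 0], [2, 1], [4, 0], [0, 3]]
y = [0, 0, 0, 0, 0, 1]

# A uniform class_prior removes the prior's bias towards the majority
# class; the likelihoods are still estimated from the imbalanced data
clf = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, y)
```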

Boundary Case 7: Outliers

Naive Bayes is not much affected by outliers, because Laplace smoothing takes care of rare or unseen feature values.

Boundary Case 8: Missing Values

Naive Bayes handles missing values well for categorical and binary features (e.g., text data). For numerical features, one workaround is to split the dataset so that the training set contains only the rows without missing values, and the rows with missing values are held out as the test set.
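One way to realize that split with pandas (a sketch; imputing the missing values is another common option):

```python
import numpy as np
import pandas as pd

# Toy frame with a numerical feature containing missing values (invented)
df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31],
                   "label": [0, 1, 1, 0, 1]})

# Rows with complete numerical features go to training; rows with
# missing values are held out, to be handled at prediction time
train = df[df["age"].notna()]
test = df[df["age"].isna()]
```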

Boundary Case 9: High Dimensionality

Text classification involves high-dimensional data, and Naive Bayes is used extensively for text classification, so it handles high dimensionality well.
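A typical high-dimensional text setup stays fast with a vectorizer feeding MultinomialNB; a minimal pipeline sketch on an invented spam/ham corpus:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["spam offer now", "meeting at noon",
        "cheap offer click", "lunch at noon"]
labels = [1, 0, 1, 0]  # invented spam (1) / ham (0) labels

# The vectorizer can emit tens of thousands of sparse features;
# MultinomialNB copes with that dimensionality in a single counting pass
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(docs, labels)
print(pipe.predict(["free offer"]))
```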

Conclusion:

Before applying Naive Bayes to a dataset, check these conditions against the given dataset; if all the boundary cases are satisfied, we can build an accurate and scalable model.
