Things You Need to Check Before Applying the Naive Bayes Algorithm on Any Dataset
All the conditions you need to know before using Naive Bayes
This article's main agenda is to discuss the boundary cases of Naive Bayes.
I assume you already know the theoretical concepts of Naive Bayes.
Boundary Case 1: Problem Statement
First, we need to check the problem statement of the dataset, because Naive Bayes was originally intended for classification tasks.
Note: We can also use Naive Bayes for regression problems, but that requires some modification of the algorithm.
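As a minimal sketch of the intended use (the iris data here is just a stand-in dataset), scikit-learn's GaussianNB handles a plain classification task out of the box:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy classification task: predict the iris species from numeric features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```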
Boundary Case 2: Assumption
Our aim is to find P(Class | given data point).
X = {x₁, x₂, x₃, …, xₙ} — a data point with n feature values
y ∈ {C₁, C₂, C₃, …, Cₖ} — one of k classes
The Naive Bayes formula for multi-class classification (Bayes' theorem combined with the chain rule):
P(Cₖ | x₁, x₂, …, xₙ) ∝ P(x₁ | x₂, x₃, …, xₙ, Cₖ) · P(x₂ | x₃, …, xₙ, Cₖ) · … · P(xₙ | Cₖ) · P(Cₖ)
The conditional probabilities in the above equation are very hard to estimate directly, so we make one assumption.
Assumption : Conditional Independence
P(A|B) = P(A) ⇒ A is independent of B
P(A|B∩C) = P(A|C) ⇒ A and B are conditionally independent given C
Using this assumption, the multi-class posterior becomes easy to compute:
P(Cₖ|X) ∝ P(x₁|Cₖ) · P(x₂|Cₖ) · … · P(xₙ|Cₖ) · P(Cₖ)
Feature x₁ is conditionally independent of x₂, x₃, …, xₙ given Cₖ.
Similarly, every feature is conditionally independent of the others given Cₖ; a small numerical sketch of the resulting posterior follows.
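Below is a minimal sketch of that posterior computation, assuming two classes, three binary features, and hand-picked (hypothetical) likelihood values:

```python
# Minimal sketch of P(C_k|X) ∝ P(x1|C_k)·...·P(xn|C_k)·P(C_k),
# with made-up numbers purely for illustration.
import numpy as np

prior = np.array([0.6, 0.4])                  # P(C_k) for the two classes
# P(x_i = 1 | C_k): rows are classes, columns are the three binary features
likelihood = np.array([[0.8, 0.1, 0.5],
                       [0.3, 0.7, 0.4]])

x = np.array([1, 0, 1])                       # the query data point
# Likelihood of each observed feature value under each class
per_feature = np.where(x == 1, likelihood, 1 - likelihood)

unnormalized = prior * per_feature.prod(axis=1)   # P(x|C_k) * P(C_k)
posterior = unnormalized / unnormalized.sum()     # normalize over the classes
print(posterior)                                  # P(C_k | x)
```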
If the features in the given dataset are only weakly correlated with each other, Naive Bayes works well; a quick correlation check is sketched below.
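A quick way to eyeball this, assuming the features live in a pandas DataFrame (the tiny df below is hypothetical), is a pairwise correlation matrix:

```python
import pandas as pd

# Toy data: x2 is deliberately an almost-linear copy of x1
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [2.1, 3.9, 6.2, 8.1],
    "x3": [0.5, 0.1, 0.9, 0.3],
})

# Values near +1 or -1 flag feature pairs that violate the independence assumption
print(df.corr())
```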
Boundary Case 3: Data Type
If your dataset is text data, Naive Bayes works very well, because
it uses the Laplace smoothing concept to assign a probability to words that are not present in the training corpus.
So for text data, Naive Bayes is the optimum choice among classical machine learning algorithms; a small example follows.
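Here is a minimal sketch on a tiny toy corpus (the documents and labels are made up): CountVectorizer turns the text into word counts and MultinomialNB applies Laplace (additive) smoothing through its alpha parameter, i.e. P(word|class) = (count + α) / (total count + α · vocabulary size).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free offer click now", "meeting schedule attached",
        "free prize winner", "project status update"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = not spam (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # sparse word-count matrix

clf = MultinomialNB(alpha=1.0)              # alpha=1.0 -> classic Laplace smoothing
clf.fit(X, labels)
print(clf.predict(vec.transform(["free click prize"])))
```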
Boundary Case 4: Feature Importance
Do we need to apply any feature-importance techniques for Naive Bayes?
The answer is no, because the model already learns per-feature probability values; if we sort those values, we directly see the important features.
Important features have high probability values, as the sketch below shows.
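A short sketch, reusing the same kind of toy spam corpus as above (assumed data): in scikit-learn the learned per-class word log-probabilities are exposed as feature_log_prob_, and sorting them ranks the most important words for each class.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free offer click now", "meeting schedule attached",
        "free prize winner", "project status update"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
clf = MultinomialNB(alpha=1.0).fit(vec.fit_transform(docs), labels)

words = vec.get_feature_names_out()
for class_idx, class_label in enumerate(clf.classes_):
    # Highest log-probability words = most important features for this class
    top = np.argsort(clf.feature_log_prob_[class_idx])[::-1][:3]
    print(class_label, "->", [words[i] for i in top])
```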
Boundary Case 5: Interpretability
Naive Bayes is very good for interpretability, because it outputs probability values, so its predictions are easy to interpret.
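For example (a tiny sketch on the iris data, used here only as a stand-in dataset), predict_proba exposes the per-class probabilities behind every prediction:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)

# One row per sample, one column per class; each row sums to 1
print(clf.predict_proba(X[:2]))
```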
Boundary Case 6: Imbalanced Data
If we have an imbalanced dataset, Naive Bayes is biased towards the majority class because of the class prior and the resulting likelihood ratio, so we need to balance the dataset (one option is sketched below).
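A minimal sketch of two options, assuming a toy count-feature dataset where class 1 is the minority: upsample the minority class before fitting, or simply force a uniform class prior.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 0], [1, 1], [3, 0], [0, 2], [1, 3]])   # toy count features
y = np.array([0, 0, 0, 1, 1])                             # class 1 is the minority

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Option 1: upsample the minority class to the majority-class size
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=42)
balanced = MultinomialNB().fit(np.vstack([X_maj, X_up]),
                               np.concatenate([y_maj, y_up]))

# Option 2: keep the data as-is but use a uniform class prior
uniform_prior = MultinomialNB(fit_prior=False).fit(X, y)
```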
Boundary Case 7: Outliers
Naive Bayes is not impacted much by outliers, because Laplace smoothing takes care of rarely occurring values.
Boundary Case 8: Missing Values
Naive Bayes handles missing values well for categorical and binary features (e.g., text data). For numerical features, one approach is to split the dataset so that the training set contains only the rows without missing values and the rows with missing values are held out, as sketched below.
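A minimal sketch of that split, assuming the numeric features sit in a hypothetical pandas DataFrame with a label column. Note that scikit-learn's GaussianNB cannot accept NaNs at predict time, so the held-out rows are filled with the training means here purely as an illustrative fallback, not as part of the method described above.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

df = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "f2": [0.5, np.nan, 1.5, 2.0, 2.5],
    "label": [0, 1, 0, 1, 1],
})

complete = df.dropna()                        # training rows (no missing values)
incomplete = df[df.isna().any(axis=1)]        # rows with missing values, held out

clf = GaussianNB().fit(complete[["f1", "f2"]], complete["label"])

# Fill the held-out rows with training means before scoring (one possible fallback)
filled = incomplete[["f1", "f2"]].fillna(complete[["f1", "f2"]].mean())
print(clf.predict(filled))
```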
Boundary Case 9: High Dimensionality
Text classification involves high-dimensional data, and Naive Bayes is used extensively in text classification, so it handles high dimensionality well (a short sketch follows).
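A short sketch, assuming a handful of made-up documents: vectorizing text produces a wide, sparse matrix, and MultinomialNB trains on that sparse input directly without densifying it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["naive bayes handles text well",
        "high dimensional sparse features",
        "text classification with bag of words",
        "sparse matrices keep memory small"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)     # scipy sparse matrix
print(X.shape)                                 # (n_documents, vocabulary_size)

MultinomialNB().fit(X, labels)                 # no densification needed
```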
Conclusion:
Before applying Naive Bayes to a dataset, check these conditions against the given dataset. If all the boundary cases are satisfied, we can build a highly accurate and scalable model.