Understanding Naive Bayes Algorithm: A Comprehensive Guide

Rayyan Physicist
7 min read · Jul 14, 2024


Naive Bayes is a powerful and intuitive algorithm widely used in supervised learning for classification tasks. Despite its simplicity, it often performs surprisingly well and is particularly useful for text classification and other high-dimensional problems.

Why is it Called “Naive”?

The “naive” in Naive Bayes comes from the algorithm’s assumption of independence between every pair of features: each feature (or characteristic) of the data is treated as independent of the others given the class. In reality, this assumption is rarely true; features often exhibit some level of dependency. However, this simplification makes the computation much more manageable and still provides good results in practice.
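In symbols, for a feature vector (x₁, x₂, …, xₙ) and a class c, the naive assumption lets the class-conditional probability factor into a simple product:

P(x₁, x₂, …, xₙ∣c) ≈ P(x₁∣c) · P(x₂∣c) · … · P(xₙ∣c)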

What is Supervised Learning?

Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset. This means that each training example is paired with an output label. The goal is to learn a mapping from inputs to outputs that can be used to predict labels for new, unseen data.

Naive Bayes for Classification

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes’ theorem. They are particularly effective for classification problems, where the task is to assign a label to an input sample.

Bayes’ Theorem

Bayes’ theorem describes the probability of an event based on prior knowledge of conditions related to the event. It is mathematically expressed as:

P(A∣B) = P(B∣A) · P(A) / P(B)

Where:

  • P(A∣B) is the probability of event A given event B (posterior probability).
  • P(B∣A) is the probability of event B given event A (likelihood).
  • P(A) is the probability of event A (prior probability).
  • P(B) is the probability of event B (evidence).
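As a quick numeric illustration (the numbers here are made up purely for the example): suppose 30% of all emails are spam, so P(spam) = 0.3; the word “free” appears in 60% of spam emails, so P(free∣spam) = 0.6; and “free” appears in 40% of all emails, so P(free) = 0.4. Then the posterior is P(spam∣free) = P(free∣spam) · P(spam) / P(free) = 0.6 × 0.3 / 0.4 = 0.45.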

Naive Bayes Classification Process:

  1. Calculate Prior Probabilities: Determine the prior probability for each class, which is the frequency of each class in the training data.
  2. Calculate Likelihood: For each feature, calculate the likelihood of the feature given the class. This is where the “naive” assumption of independence comes into play.
  3. Calculate Posterior Probabilities: Use Bayes’ theorem to combine the prior and likelihood to get the posterior probability for each class.
  4. Predict Class: The class with the highest posterior probability is chosen as the predicted class.
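Putting these four steps together, here is a minimal hand-rolled sketch in Python. The tiny dataset, feature names, and values are hypothetical, chosen only to make the steps concrete:

    # A minimal sketch of the four steps above on a tiny, made-up categorical dataset.
    from collections import Counter, defaultdict

    # Hypothetical training data: (features, class label)
    train = [
        ({"income": "high", "student": "no"},  "yes"),
        ({"income": "high", "student": "yes"}, "yes"),
        ({"income": "low",  "student": "yes"}, "yes"),
        ({"income": "low",  "student": "no"},  "no"),
        ({"income": "low",  "student": "no"},  "no"),
    ]

    labels = [y for _, y in train]
    class_counts = Counter(labels)

    # Step 1: prior probabilities P(c) = count(c) / N
    priors = {c: n / len(train) for c, n in class_counts.items()}

    # Step 2: likelihoods P(feature=value | c) = count(feature=value, c) / count(c)
    value_counts = defaultdict(Counter)            # value_counts[(feature, class)][value]
    for x, y in train:
        for feat, val in x.items():
            value_counts[(feat, y)][val] += 1

    def likelihood(feat, val, c):
        return value_counts[(feat, c)][val] / class_counts[c]

    # Steps 3 and 4: posterior (up to a constant) for each class, then pick the largest
    def predict(x):
        scores = {}
        for c in priors:
            score = priors[c]
            for feat, val in x.items():
                score *= likelihood(feat, val, c)  # naive independence assumption
            scores[c] = score
        return max(scores, key=scores.get)

    print(predict({"income": "high", "student": "yes"}))  # -> "yes" on this toy data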

Types of Naive Bayes Classifiers

  1. Multinomial Naive Bayes: Suitable for discrete data like text classification, where features represent the frequency of words.
  2. Gaussian Naive Bayes: Used when features are continuous and assumed to follow a Gaussian distribution.
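In scikit-learn (assuming it is installed), these two variants map directly onto two classes; a minimal sketch:

    # Sketch only: choosing the scikit-learn class that matches the feature type.
    from sklearn.naive_bayes import MultinomialNB, GaussianNB

    text_classifier    = MultinomialNB()  # discrete counts, e.g. word frequencies in documents
    numeric_classifier = GaussianNB()     # continuous features assumed roughly Gaussian per class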

Multinomial Naive Bayes Example:

Now, let’s work through an example of loan approval prediction:

Step 1: Calculate Prior Probabilities

The prior probability is the probability of each class in the training set.

  • Total records: 10
  • Yes (Loan Approval) records: 6
  • No (Loan Approval) records: 4

So P(Yes) = 6/10 = 0.6 and P(No) = 4/10 = 0.4.

Step 2: Calculate Likelihoods for Each Feature Value:

The likelihood P(w∣c) is the probability of each feature value occurring in each class, estimated from the training data as P(w∣c) = count(w, c) / count(c). These likelihoods are tabulated for each value of Age, Income, and Student within the Yes and No classes.

Step 3: Calculate Posterior Probabilities:

Calculate the posterior probability for Yes (Loan Approval):

P(Yes∣Youth, Medium, Student=Yes) ∝ P(Yes) · P(Youth∣Yes) · P(Medium∣Yes) · P(Student=Yes∣Yes)

Calculate the posterior probability for No (Loan Approval):

P(No∣Youth, Medium, Student=Yes) ∝ P(No) · P(Youth∣No) · P(Medium∣No) · P(Student=Yes∣No)

Step 4: Compare Posterior Probabilities:

Since P(No∣Youth, Medium, Student=Yes) is higher than P(Yes∣Youth, Medium, Student=Yes), the loan will not be approved for the new instance with Age = Youth, Income = Medium, and Student = Yes.

Gaussian Naive Bayes Example:

Consider an example of predicting loan approval with Gaussian Naive Bayes:

Step 1: Calculate Prior Probabilities:

The prior probability P(c) is the probability of each class in the training set.

Step 2: Calculate Mean (μ) and Variance (σ²) for Each Class and Feature:

The mean and variance of Income and of Credit Score are computed separately for the Yes and No classes.

Step 3: Calculate Probability Density Function (PDF):

For a feature value x with class mean μ and variance σ², the Gaussian density is:

P(x∣c) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Calculation for the new instance (Income = 55K$, Credit Score = 700):

For Loan Approved = Yes:

P(Yes∣Income=55K$, Credit Score=700) ∝ P(Yes) · P(Income=55K$∣Yes) · P(Credit Score=700∣Yes)

For Loan Approved = No:

P(No∣Income=55K$, Credit Score=700) ∝ P(No) · P(Income=55K$∣No) · P(Credit Score=700∣No)

Each conditional density is evaluated using that class’s own mean and variance.

Step 4: Comparison and Prediction:

Since P(Yes∣Income=55K$, Credit Score=700) is significantly higher than P(No∣Income=55K$, Credit Score=700), we predict that the loan will be approved for this applicant.
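As a sketch of how these pieces combine in code, the snippet below evaluates the two posteriors by hand. The priors, means, and variances are made-up placeholders, not the actual table values from this example:

    # Gaussian Naive Bayes by hand; all statistics below are hypothetical placeholders.
    import math

    def gaussian_pdf(x, mean, var):
        """Gaussian density f(x | mean, var)."""
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # Hypothetical per-class statistics: (mean, variance) for Income (in K$) and Credit Score
    stats = {
        "yes": {"prior": 0.6, "income": (60.0, 100.0), "credit": (720.0, 900.0)},
        "no":  {"prior": 0.4, "income": (40.0, 100.0), "credit": (620.0, 900.0)},
    }

    income, credit = 55.0, 700.0   # the new applicant from the example

    posteriors = {}
    for c, s in stats.items():
        mu_i, var_i = s["income"]
        mu_c, var_c = s["credit"]
        # posterior up to a normalizing constant: prior times the product of densities
        posteriors[c] = s["prior"] * gaussian_pdf(income, mu_i, var_i) * gaussian_pdf(credit, mu_c, var_c)

    print(max(posteriors, key=posteriors.get))   # "yes" with these placeholder numbers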

Zero Probability Issue:

Suppose that during training, a certain class (like “Loan Approved = No”) never contained a specific feature value (e.g., Income = 55K$ or Credit Score = 700). When the conditional probabilities are estimated from counts, P(Income=55K$∣No) or P(Credit Score=700∣No) could then be zero, because no instance in the training data had that value for “Loan Approved = No”.

When calculating the posterior probability P(No∣Income=55K$, Credit Score=700), if any of the individual conditional probabilities P(Income=55K$∣No) or P(Credit Score=700∣No) is zero, the whole product, and therefore the entire posterior, becomes zero, no matter how strongly the other features favor that class.

Solution:

The most common solution is Laplace smoothing.

Suppose we have a small dataset of emails labeled as either “Spam” or “Not Spam”, described by whether they contain the word “lottery” and by their length in words.

Now the zero-probability issue appears: no Spam email in this dataset has lottery = No, so the unsmoothed estimate P(lottery=No∣Spam) is 0.

To address this issue we apply Laplace smoothing, which adds a small pseudo-count α to every count:

P(xi∣y) = (count(xi, y) + α) / (count(y) + α·∣V∣)

where,

  • count(xi,y) is the number of times feature xi​ appears in instances of class y.
  • count(y) is the total number of instances of class y.
  • ∣V∣ is the number of possible values of feature xi.
  • α is the smoothing parameter (typically 1 for Laplace smoothing).

Now, applying Laplace smoothing with α = 1 and ∣V∣ = 2 (since “lottery” takes only the values Yes and No):

P(lottery=No∣Spam) = (count(lottery=No, Spam) + 1) / (count(Spam) + 2)

On this dataset the smoothed estimate works out to 0.250, a small non-zero probability instead of zero.
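A short sketch of the smoothed estimate in Python; the counts below are hypothetical, chosen only so that the result matches the 0.250 figure above:

    # Laplace-smoothed likelihood: (count(x_i, y) + alpha) / (count(y) + alpha * |V|)
    def smoothed_likelihood(feature_class_count, class_count, n_values, alpha=1.0):
        return (feature_class_count + alpha) / (class_count + alpha * n_values)

    # Hypothetical counts: "lottery = No" never occurs among the Spam emails
    unsmoothed = 0 / 2                                    # 0.0, which wipes out the whole posterior
    smoothed   = smoothed_likelihood(0, 2, n_values=2)    # (0 + 1) / (2 + 1*2) = 0.25
    print(unsmoothed, smoothed)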

Python Implementation:
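Below is a minimal sketch of the loan-approval example using scikit-learn’s GaussianNB; the training data is made up for illustration:

    # Minimal Gaussian Naive Bayes sketch with scikit-learn; all data here is made up.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Features: [income in K$, credit score]; labels: 1 = approved, 0 = rejected
    X = np.array([
        [65, 720], [80, 750], [55, 710], [70, 690], [60, 705], [75, 740],  # approved
        [30, 600], [35, 580], [45, 640], [40, 610],                        # rejected
    ])
    y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

    model = GaussianNB()
    model.fit(X, y)

    new_applicant = np.array([[55, 700]])        # Income = 55K$, Credit Score = 700
    print(model.predict(new_applicant))          # predicted class (1 = approved)
    print(model.predict_proba(new_applicant))    # posterior probability for each class

For discrete features, scikit-learn’s MultinomialNB and CategoricalNB classes expose an alpha parameter that applies the Laplace smoothing discussed above.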

Advantages of Naive Bayes:

  • Simplicity: It’s easy to understand and implement, making it ideal for quick deployment.
  • Efficiency: It performs well with large datasets and high-dimensional feature spaces because of its simplicity and independence assumption.
  • Works with Categorical Data: It handles categorical data well, making it suitable for tasks like text classification and document categorization.
  • Handles Missing Data: It can handle missing data by ignoring it during probability estimation, which is useful in real-world datasets.

Disadvantages of Naive Bayes:

  • Assumption of Independence: It assumes that features are independent, which may not hold true in many real-world scenarios.
  • Limited Expressiveness: It may not capture complex relationships in the data compared to more sophisticated algorithms.
  • Sensitive to Outliers: It can be sensitive to outliers because it doesn’t consider the distribution of data points or relationships between features.
  • Requires Well-Represented Data: It performs best when the training data is large and covers all possible outcomes adequately.

When to Use Naive Bayes:

  • Quick Implementation Needed: You need a fast and easy-to-implement algorithm for classification tasks.
  • Large Feature Space: The dataset has a large number of features or dimensions.
  • Text Classification: Tasks involve text data, such as spam detection or sentiment analysis.
  • Simple Models Preferred: You prefer simpler models that perform well with relatively straightforward data distributions.
  • Early Prototyping: It’s used in the initial stages of modeling or prototyping due to its simplicity and speed.

Hope it helps!
Feel free to reach out with any suggestions or queries: https://www.linkedin.com/in/md-rayyan/
