Understanding Naive Bayes Algorithm

Jayanta Parida
4 min read · Sep 6, 2022


Naive Bayes algorithms are supervised machine learning algorithms used mostly for classification problems; they are not suitable for regression. The algorithm is based on the famous Bayes’ theorem, named after the English statistician Thomas Bayes, who made significant contributions to probability theory. Naive Bayes models are fast enough to solve problems in real time and handle sparse data easily. They are a popular choice for text analysis tasks such as spam filtering, sentiment analysis, and categorizing articles, as well as for recommendation systems.

Bayes’ Theorem

Before going to Bayes’ theorem, let’s see some relevant terminologies.

Marginal Probability — The probability of an event occurring on its own, irrespective of the outcomes of other events. It is denoted by P(A).

Conditional Probability — The probability of an event occurring given that another event has taken place. It is denoted as P(A|B).

Joint Probability — The probability of two events A and B occurring simultaneously. It is denoted by P(A∩B).

Bayes’ theorem states P(A|B) = P(B|A)P(A) / P(B)

When applying the Naive Bayes machine learning algorithm, we are interested in which hypothesis has the largest conditional probability rather than in the exact value of each conditional probability. In the context of Bayesian statistics, event A is referred to as a hypothesis, denoted by H, and event B is referred to as evidence, denoted by E. With this notation, the same equation can be rewritten as

P(H|E) = P(E|H) P(H) / P(E)

So to find the probability of the hypothesis being true given some evidence, we need to estimate the probability of the evidence being true given that the hypothesis holds. Let’s define each term of the equation.

Likelihood Function is the probability of the evidence being true given the hypothesis holds. In Bayes’ theorem, it is denoted by P(E|H).

Prior Probability is the probability of the hypothesis being true before seeing the evidence. It is denoted by P(H).

Posterior Probability is the probability of the hypothesis being true after seeing the evidence. It is denoted by P(H|E).

And finally, the Normalization Constant P(E) makes sure that, after all hypotheses have been considered and their conditional probabilities calculated, the posterior probabilities sum to 1.
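To make these terms concrete, here is a minimal sketch in Python with made-up priors and likelihoods for two competing hypotheses; it computes the posteriors with Bayes’ theorem and shows the normalization constant forcing them to sum to 1.

```python
# Worked example of Bayes' theorem with two hypotheses (numbers are made up).
priors = {"spam": 0.5, "ham": 0.5}          # prior probabilities P(H)
likelihoods = {"spam": 0.08, "ham": 0.01}   # likelihoods P(E | H)

# Normalization constant: P(E) = sum over hypotheses of P(E | H) * P(H)
p_evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posterior: P(H | E) = P(E | H) * P(H) / P(E)
posteriors = {h: likelihoods[h] * priors[h] / p_evidence for h in priors}

print(posteriors)                 # {'spam': 0.888..., 'ham': 0.111...}
print(sum(posteriors.values()))   # 1.0 — the normalization constant at work
```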

Ham or Spam Example

The most common example of a classification problem is ham-spam classification of messages. Let’s say we have a training dataset of 50 ham and 50 spam messages, so the prior probabilities of a message being ham or spam are P(ham) = P(spam) = 1/2. Now suppose the word ‘winner’ does not appear in any of the spam messages in the training set. Then, for any incoming message containing ‘winner’, the conditional probability of that message being spam would be 0, no matter what other words it contains. To remedy this, a smoothing parameter (alpha) is introduced and set to 1 (Laplace smoothing). Its purpose is to increase the count of each word by alpha in both categories, so no word ever ends up with a zero probability.
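Here is a minimal sketch of that scenario using scikit-learn’s CountVectorizer and MultinomialNB; the four training messages are made up purely for illustration. The word ‘winner’ appears only in ham, so without smoothing the spam posterior for a ‘winner’ message collapses to essentially zero, while with alpha=1 the remaining words can still mark it as spam.

```python
# Toy ham/spam dataset (made-up messages) illustrating why Laplace smoothing matters.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "you are the winner of the quiz claim your prize now",   # ham
    "free coffee in the kitchen now",                        # ham
    "claim your free prize now",                             # spam
    "limited time offer claim your free prize now",          # spam
]
labels = ["ham", "ham", "spam", "spam"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

# 'winner' appears only in ham, never in spam.
test = vectorizer.transform(["winner claim your free prize now"])

# With (almost) no smoothing, P('winner' | spam) is essentially 0, so the spam
# posterior collapses even though every other word points toward spam.
unsmoothed = MultinomialNB(alpha=1e-10).fit(X_train, labels)
print(unsmoothed.predict(test), unsmoothed.predict_proba(test))  # ham, spam prob ~ 0

# With alpha=1 (Laplace smoothing), every word count is incremented by 1 in both
# classes, so the unseen word no longer wipes out the spam probability.
smoothed = MultinomialNB(alpha=1.0).fit(X_train, labels)
print(smoothed.predict(test), smoothed.predict_proba(test))      # spam
```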

Steps in Creating a Model

Let’s outline the important steps in creating a machine learning model; a minimal end-to-end sketch follows the list.

  1. Create Dataframe — The first step is to create a data frame in which all inputs and targets are organized.
  2. Data Cleansing — In this step, we remove outliers, if present; otherwise they could cause samples to be misclassified. We also need to handle null values: if fewer than 5% of the samples contain nulls, it is usually safe to drop them, otherwise we should fill them in using statistical methods.
  3. Split Data — Split the data into training and testing sets (a very common split is 80:20), so that the model can be evaluated on data it has never seen and overfitting can be detected.
  4. Data Wrangling — Prepare the data for the classifier by applying various data wrangling methods. For more details about data and wrangling methods, refer to Data Wrangling and EDA.
  5. Perform the Classification — Apply the appropriate classifier, fit the training data, and tune the hyperparameters.
  6. Evaluating the Performance of a Model — Once the model is created, we need to evaluate its performance. We can use various metrics such as accuracy, precision, recall, and F1 score for evaluation. To find out more details about metrics, go through Relevant Metrics for Classification Problems.
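As referenced above, here is a minimal end-to-end sketch of these steps with scikit-learn. The messages, labels, and parameter choices are hypothetical placeholders, not a prescription.

```python
# Minimal end-to-end sketch of the steps above (the messages and labels are made up).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# 1. Create the data frame with inputs and targets.
df = pd.DataFrame({
    "message": [
        "claim your free prize now", "limited time offer act now",
        "meeting moved to 3 pm", "lunch with the team today",
        "you won a free gift card", "see you at the standup tomorrow",
        "win cash now click here", "notes from yesterday's review",
    ],
    "label": ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"],
})

# 2. Data cleansing: drop rows with missing values (none here, shown for completeness).
df = df.dropna()

# 3. Split into training and testing sets (80:20).
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# 4. Data wrangling: turn raw text into word-count features.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 5. Fit the classifier; alpha is the smoothing hyperparameter to tune.
model = MultinomialNB(alpha=1.0)
model.fit(X_train_vec, y_train)

# 6. Evaluate with accuracy, precision, recall, and F1.
y_pred = model.predict(X_test_vec)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```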

That’s all about the idea behind the Naive Bayes machine learning algorithm. Python’s sklearn library provides a variety of Naive Bayes implementations depending on the input data: whether it is numerical or categorical, balanced or imbalanced, dense or sparse. Even though Naive Bayes works well on non-linear problems, dependencies between features are not considered (hence “naive”), so its probability estimates cannot be completely trusted.
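As a rough illustration (common defaults rather than hard rules), these are the variants sklearn provides and the kinds of input data each is typically used for.

```python
# Picking a Naive Bayes variant in scikit-learn (illustrative defaults, not rules).
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB

gaussian = GaussianNB()           # continuous numerical features
multinomial = MultinomialNB()     # discrete counts, e.g. word counts from text
bernoulli = BernoulliNB()         # binary (present/absent) features
complement = ComplementNB()       # count features with imbalanced classes
```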
