Understand Naive Bayes Algorithm with Python code — Part 1

Preethi Thakur
6 min read · Oct 13, 2022


Naive Bayes is a probabilistic machine learning algorithm, based on Bayes' Theorem, that is used for classification tasks. The theorem measures the probability of an event occurring given that another event has already occurred. Let’s dig further.

Bayes Theorem

Bayes' theorem is used to calculate conditional probabilities. Conditional probability is the likelihood of an event or outcome occurring based on the occurrence of a previous event or outcome. Mathematically, it is written as:

P(A|B) = P(B|A) · P(A) / P(B)

where:

  • P(A|B) is the posterior probability of class (A, target) given predictor (B, attributes).
  • P(A) is the prior probability of class.
  • P(B|A) is the likelihood which is the probability of predictor given class.
  • P(B) is the prior probability of predictor.

In simpler terms, Bayes’ Theorem is a way of finding a probability when we know certain other probabilities.
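To make the formula concrete, here is a minimal Python sketch of Bayes' theorem with made-up probabilities (the numbers are purely illustrative and not taken from any dataset):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values: P(B|A) = 0.8, P(A) = 0.3, P(B) = 0.5
print(bayes(0.8, 0.3, 0.5))  # 0.48
```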

Assumption

The Naive Bayes classifier works on the basic assumption that each feature (predictor) makes an independent and equal contribution to the outcome.

How the Naive Bayes algorithm works, with an example

Here is an example to understand how Naive Bayes works. Let’s consider this weather dataset where we are classifying whether a particular day is suitable for playing golf or not, given the features of the day.
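The dataset appears as a table image in the original post. As a stand-in, here is a commonly used version of the play-golf weather data built as a pandas DataFrame; the exact rows in the article's table may differ, so treat this as an assumed, illustrative copy:

```python
import pandas as pd

# A commonly used version of the play-golf weather dataset (assumed here;
# the article's table may differ in its details).
df = pd.DataFrame({
    "Outlook":  ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
                 "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})
print(df.head())
```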

The columns represent features and the rows represent individual entries for each day. If we take the first row, we can observe that when the outlook is rainy, the temperature is hot, the humidity is high and it is not windy, then that day is not suitable for playing golf. So, we can make the following two assumptions.

  1. The temperature being “Hot” has nothing to do with the humidity being “high” or “normal”, and the outlook being “Rainy” has no effect on the winds. Hence, the features are assumed to be independent.
  2. Knowing the temperature and humidity alone, we can’t predict the outcome accurately. None of the attributes is irrelevant; each is assumed to contribute equally to the outcome.

Don’t they look similar to Naive Bayes assumptions?

Let’s write the Bayes theorem equation for this weather dataset:

P(y|X) = P(X|y) · P(y) / P(X)

where

  • y represents the target/class variable (whether golf will be played or not)
  • X represents the features (outlook, temperature, humidity, windy), written x1, x2, …, xn. Here we have 4 features: x1, x2, x3, x4.

By substituting x1, x2, …, xn for X and expanding using the chain rule together with the independence assumption, we get the posterior probability as:

P(y|x1, …, xn) = P(x1|y) · P(x2|y) · … · P(xn|y) · P(y) / (P(x1) · P(x2) · … · P(xn))

Now, we can obtain the values for each class by looking at the dataset and substituting them into the above equation.

The posterior probability can be calculated by first constructing a frequency table for each attribute against the target, then transforming the frequency tables into likelihood tables, and finally using the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
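A minimal pandas sketch of this procedure for the “Outlook” feature, assuming the hypothetical DataFrame `df` from the earlier sketch, could look like this:

```python
import pandas as pd

# df is the (assumed) weather DataFrame from the earlier sketch
freq = pd.crosstab(df["Outlook"], df["Play"])     # frequency table: counts per (value, class)
likelihood = freq.div(freq.sum(axis=0), axis=1)   # likelihood table: P(value | class)

print(freq)
print(likelihood)
```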

Calculating probabilities and predicting the target variable

From our dataset we first construct a frequency table for the “Outlook” feature, then derive a likelihood table from it, and then calculate the posterior probability of the class being “Yes” given that the “Outlook” is “Sunny”:

P(Yes|Sunny) = P(Sunny|Yes) · P(Yes) / P(Sunny)

Next, we calculate the posterior probability of the class being “No” given that the “Outlook” feature is “Sunny”:

P(No|Sunny) = P(Sunny|No) · P(No) / P(Sunny)

The posterior probability of playing golf turns out to be higher than that of not playing golf. Hence, we can say that when it is “Sunny”, golf will be played. Similarly, we can calculate the probabilities for “Overcast” and “Rainy”.
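With the assumed DataFrame `df` from earlier, the two posteriors for “Sunny” can be computed directly from counts, as a sketch:

```python
# Priors, evidence and likelihoods estimated from the (assumed) dataset
p_yes = (df["Play"] == "Yes").mean()                          # P(Yes)
p_no = (df["Play"] == "No").mean()                            # P(No)
p_sunny = (df["Outlook"] == "Sunny").mean()                   # P(Sunny)
p_sunny_given_yes = ((df["Outlook"] == "Sunny") & (df["Play"] == "Yes")).mean() / p_yes
p_sunny_given_no = ((df["Outlook"] == "Sunny") & (df["Play"] == "No")).mean() / p_no

# Bayes' theorem
print("P(Yes | Sunny) =", p_sunny_given_yes * p_yes / p_sunny)
print("P(No  | Sunny) =", p_sunny_given_no * p_no / p_sunny)
```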

The frequency and likelihood tables for all four predictors are as follows.
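The same crosstab idea extends to all four predictors; a short loop over the feature columns (again using the assumed `df`) prints each likelihood table:

```python
for feature in ["Outlook", "Temp", "Humidity", "Windy"]:
    freq = pd.crosstab(df[feature], df["Play"])
    print(f"\nLikelihood table for {feature}:")
    print(freq.div(freq.sum(axis=0), axis=1))   # P(value | class) per feature
```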

From these tables we can calculate the individual probability of each class (Yes, No) given the predictors, and pick the class with the maximum probability as our predicted target variable.

For all entries in the dataset, the denominator P(x1) · P(x2) · … · P(xn) does not change; it remains constant. Therefore, the denominator can be removed and a proportionality introduced:

P(y|x1, …, xn) ∝ P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)

In our case, the class/target variable (y) has only two outcomes, “Yes” or “No”, but there can be cases where the classification is multi-class. Either way, we need to find the class y with the maximum probability. We use argmax, the operation that finds the argument giving the maximum value of a target function:

y = argmax over y of P(y) · P(x1|y) · P(x2|y) · … · P(xn|y)
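As a sketch of this argmax rule, assuming the hypothetical DataFrame `df` from earlier, a small function that scores each class with P(y) · P(x1|y) · … · P(xn|y) and returns the most probable class might look like this:

```python
def predict_play(df, new_day):
    """Score each class with P(y) * product of P(x_i | y) and return the argmax.

    new_day: dict mapping feature name -> observed value,
             e.g. {"Outlook": "Sunny", "Temp": "Hot", ...}
    """
    scores = {}
    for y in df["Play"].unique():
        subset = df[df["Play"] == y]
        score = len(subset) / len(df)                   # prior P(y)
        for feature, value in new_day.items():
            score *= (subset[feature] == value).mean()  # likelihood P(x_i | y)
        scores[y] = score
    best = max(scores, key=scores.get)                  # argmax over classes
    return best, scores
```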

Let’s take an example to apply the above equation. From it you will see how we use the individual probabilities computed above to get the class with the maximum probability. So, let’s take a new record of features and find out whether this particular day is good for playing golf or not.

Below, we calculate the individual posterior probabilities for the classes “Yes” and “No”, given the different predictors (features).

We can see that, for the given features/predictors, the probability of the class being “No” is greater than the probability of the class being “Yes”. Hence golf will not be played on this particular day.
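The record used in the article is shown as an image in the original post; as a purely hypothetical stand-in, here is how the `predict_play` sketch above could be applied to a made-up new day:

```python
# Hypothetical new day (not necessarily the record from the article)
new_day = {"Outlook": "Rainy", "Temp": "Cool", "Humidity": "High", "Windy": True}

label, scores = predict_play(df, new_day)
print(scores)   # unnormalised posterior score per class
print(label)    # class with the highest score
```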

Types of Naive Bayes Classifier

Multinomial

The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses the frequencies of the words in the document as the predictors/features.
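As an illustrative sketch with scikit-learn (the toy documents and labels below are assumptions, not taken from the article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents and categories for illustration
docs = ["the team won the match", "parliament passed the bill",
        "the striker scored a goal", "the senate debated the law"]
labels = ["Sports", "Politics", "Sports", "Politics"]

vec = CountVectorizer()        # word counts become the features
X = vec.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["the team scored"])))  # likely 'Sports'
```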

Bernoulli

The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not.
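A minimal sketch with scikit-learn's BernoulliNB, using made-up word-presence features:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: whether each of 3 words is present (1) or absent (0) in a document
X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])
y = ["spam", "ham", "spam", "ham"]

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))
```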

Gaussian

When the predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution.

Since the features now take continuous values rather than discrete ones, the formula for the conditional probability changes to the Gaussian density:

P(xi|y) = (1 / √(2π σy²)) · exp(−(xi − μy)² / (2σy²))

where μy is the mean and σy is the standard deviation of the feature xi for class y.
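A short sketch with scikit-learn's GaussianNB on a standard continuous-valued dataset (the Iris data, chosen here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)        # continuous flower measurements
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))          # accuracy on held-out data
```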

Applications

Some applications that use Naive Bayes classifiers are:

  • Spam Filtering
  • Text Analysis
  • Recommendation Systems

Naive Bayes advantages and disadvantages

Advantages

  • Naive Bayes is one of the fastest and simplest ML algorithms for predicting the class of a dataset.
  • It can be used for binary as well as multi-class classification.
  • It performs well in multi-class prediction compared to many other algorithms.
  • It is a popular choice for text classification problems.

Disadvantages

  • Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
  • If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (a small sketch follows below).
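As a minimal sketch of Laplace (add-one) smoothing for a categorical feature with k possible values:

```python
def laplace_likelihood(count_value_and_class, count_class, k, alpha=1.0):
    """Smoothed estimate of P(value | class).

    count_value_and_class: times the value co-occurs with the class
    count_class: total examples of the class
    k: number of distinct values the feature can take
    alpha: smoothing strength (1.0 = classic Laplace smoothing)
    """
    return (count_value_and_class + alpha) / (count_class + alpha * k)

# An unseen value no longer gets probability zero:
print(laplace_likelihood(0, 9, 3))   # 1/12 instead of 0
```

scikit-learn exposes this smoothing strength as the `alpha` parameter of MultinomialNB and BernoulliNB.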
