Naive Bayes Classifier

Sujay Rittikar
Published in Analytics Vidhya
Apr 3, 2021

A Little Background

As we know, Naive Bayes is a Machine Learning classifier model derived from Bayes’ Theorem, which states that:

P(A|B) = P(B|A) * P(A) / P(B)

As complex as it looks, it is actually quite simple. Bayes’ Theorem says that we can find the probability that event A occurs, given the evidence that B has already occurred.

Now, before turning to how to find it, let’s clarify which term is the evidence and which is the hypothesis:

A is the Hypothesis, because it is the proposed event whose probability we estimate on the basis of the Evidence.

B is the Evidence, because that event has already occurred, independently of whether A occurs; in other words, B is the predictor term.

Let’s consider the example of a student. A student can either be intelligent or not. If he/she takes an exam, it may or may not be the case that an intelligent student passes it. So it is still a hypothesis that, “Given a student is intelligent, he/she will pass the examination”, and the corresponding alternate hypothesis also needs to be considered.

So, to find out whether the hypothesis holds, we need the conditional probability that expresses the “probability that the student has passed the exam, given that he/she is intelligent”.

I hope you have a little background on Bayes’ Law, which states the above. Here, P(A) is the Prior probability, P(B|A) is the Likelihood, P(B) is the Evidence, and P(A|B) is the Posterior probability. The equation follows from the joint probability, which can be written in two equivalent ways, P(A and B) = P(A|B)*P(B) = P(B|A)*P(A); equating the two and dividing by P(B) gives Bayes’ Theorem.
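To make this concrete, here is a tiny worked calculation for the student scenario. The numbers are purely made up for illustration, not taken from any dataset:

# Illustrative (made-up) numbers for the student example.
p_intelligent = 0.30                # P(A): prior probability that a student is intelligent
p_pass_given_intelligent = 0.90     # P(B|A): likelihood of passing, given intelligence
p_pass = 0.60                       # P(B): overall probability of passing (the evidence)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_intelligent_given_pass = p_pass_given_intelligent * p_intelligent / p_pass
print(p_intelligent_given_pass)     # 0.45

So, under these made-up numbers, observing a pass raises the probability that the student is intelligent from 0.30 to 0.45.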

Why Naive?

This is a question we get a lot whenever we speak about the Naive Bayes Theorem! The reason is very obvious really.

The theorem is based on the assumption that the features are independent of each other. The reason is that if one feature already implies a certain value of another feature, then using both of them to predict the outcome adds nothing new.

It’s like eating excess sugar: it will surely increase your weight, so weight here depends on sugar intake. Knowing the sugar intake already implies that you’re overweight, so you don’t need both features! (And you surely need to control your sugar intake.)

This means we don’t have to worry about effectively duplicated information in our training dataset when making a classification. Comparing the outcome against each field individually is appropriate, provided those fields are not related to each other.

This is why it’s sometimes good to be Naive! It makes it possible for us to perform classification without double-counting evidence. Honestly, the Naive Bayes classifier is smarter than its name suggests, a total opposite!

Jumping back to the Naive Bayes Classifier

  • Let’s consider the Dataset having
    Features: x1, x2, x3, x4, x5
    Output: y
  • So here, we need the probability of each individual feature given the output y, i.e., P(x1|y), P(x2|y), and so on.
  • Now, we need to find the Probability of y, given a certain set of values of the independent features x1, x2, x3, x4, and x5.

As we’ve seen above, it’ll be P(y|X) where X represents the set or a tuple of these 5 features.

  • So, we can go on multiplying the Probabilities: P(xi|y) for all 5 features which will represent the term in our equation: P(B|A)!

Thus here, P(B|A) = P(x1|y)*P(x2|y)*P(x3|y)*P(x4|y)*P(x5|y)

We can write this as a single term using pi notation:

P(X|y) = Π P(xi|y)   (product over i = 1, …, 5)

We are still missing 2 important terms: the denominator P(B), which here is P(X), and the prior P(A), which here is P(y).

  • P(B) here represents P(x1)*P(x2)*P(x3)*P(x4)*P(x5), so we don’t need to consider it. For a given input it is a constant that does not change across the candidate classes. Most importantly, what we need here is the outcome y, not the exact value of the probability P(y|X).

So, we’ll simply choose y based on the term that P(y|X) is proportional to, i.e.,

P(y|X) ∝ P(y) * Π P(xi|y)   (product over i = 1, …, 5)

Now, let’s focus on our goal: finding the term y!

  • As we know, y is a categorical feature. In this case, let’s consider a binary classification problem, so y has only 2 possible outcomes. Thus, we pick the outcome for which the probability term above is highest.

Thus, we simply use the argmax of the term we’ve obtained above:

y = argmax over y of  P(y) * Π P(xi|y)

And, we’re done with our predictive model of Naive Bayes Theorem, right here!

  • If you’re wondering how to find P(xi|y), think of it this way: it is estimated by counting. For a single feature x, P(x=0|y=1) is the number of training examples with x=0 and y=1 divided by the number of examples with y=1, and similarly for x=1. (The same applies to every other independent feature; see the small illustration below.)
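As a tiny illustration of that counting idea (with made-up toy data, not the Iris dataset used below):

import numpy as np

# Toy binary data, purely illustrative.
x = np.array([0, 1, 1, 0, 1, 1])
y = np.array([1, 1, 0, 1, 1, 0])

# P(x = 1 | y = 1): rows where both hold, divided by rows where y = 1
p_x1_given_y1 = np.sum((x == 1) & (y == 1)) / np.sum(y == 1)
print(p_x1_given_y1)  # 2 of the 4 rows with y = 1 have x = 1, so 0.5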

Let’s jump into the code of an example to make it more clear.

Predicting the class of Iris Dataset

As we know, the Iris dataset is a very famous dataset for classification, so we’ll use it here. You may find it here: Iris Dataset

Let’s get started:

1] Importing the required Libraries and Dataset

import math
import random
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
df = pd.read_csv('iris.csv')

2] Encoding the flower class

label_encoder = LabelEncoder()
df['class'] = label_encoder.fit_transform(df['class'])

3] Forming bins

bins = [4.3, 5.5, 6.7, 7.9]
labels = [0, 1, 2]
df['sepal_l'] = pd.cut(df['sepal length in cm'], bins=bins, labels=labels)
bins = [2.0, 2.8, 3.6, 4.4]
df['sepal_w'] = pd.cut(df['sepal width in cm'], bins=bins, labels=labels)
bins = [1.0, 3.0, 5.0, 7.0]
df['petal_l'] = pd.cut(df['petal length in cm'], bins=bins, labels=labels)
bins = [0.1, 0.9, 1.7, 2.5]
df['petal_w'] = pd.cut(df['petal width in cm'], bins=bins, labels=labels)

4] Dropping the previous columns

df = df.drop(['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm'], axis=1)

5] Replacing the missing values

df['sepal_l'].fillna(df['sepal_l'].mode().values[0], inplace=True)
df['sepal_w'].fillna(df['sepal_w'].mode().values[0], inplace=True)
df['petal_l'].fillna(df['petal_l'].mode().values[0], inplace=True)
df['petal_w'].fillna(df['petal_w'].mode().values[0], inplace=True)

6] Train Test Split of the Data

X = df.drop(['class'], axis=1)
y = df[['class']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

7] Checking the Prior Probabilities

# Check the prior probabilities
prob_0 = len(y_train[y_train['class']==0])/len(y_train)
prob_1 = len(y_train[y_train['class']==1])/len(y_train)
prob_2 = len(y_train[y_train['class']==2])/len(y_train)

8] Creating a combined Training Dataset and storing the conditional probabilities P(xi|y) in a dictionary pxy

X_train_trail = X_train.copy()
X_train_trail['y'] = y_train
pxy = {}
# pxy['sepal_l_v'][c] = P(sepal_l = v | y = c): joint count divided by the class count
pxy['sepal_l_0'] = [len(X_train_trail[(X_train_trail['sepal_l']==0) & (X_train_trail['y']==0)])/len(X_train_trail[X_train_trail['y']==0]), len(X_train_trail[(X_train_trail['sepal_l']==0) & (X_train_trail['y']==1)])/len(X_train_trail[X_train_trail['y']==1]), len(X_train_trail[(X_train_trail['sepal_l']==0) & (X_train_trail['y']==2)])/len(X_train_trail[X_train_trail['y']==2])]
pxy['sepal_l_1'] = [len(X_train_trail[(X_train_trail['sepal_l']==1) & (X_train_trail['y']==0)])/len(X_train_trail[X_train_trail['y']==0]), len(X_train_trail[(X_train_trail['sepal_l']==1) & (X_train_trail['y']==1)])/len(X_train_trail[X_train_trail['y']==1]), len(X_train_trail[(X_train_trail['sepal_l']==1) & (X_train_trail['y']==2)])/len(X_train_trail[X_train_trail['y']==2])]
pxy['sepal_l_2'] = [len(X_train_trail[(X_train_trail['sepal_l']==2) & (X_train_trail['y']==0)])/len(X_train_trail[X_train_trail['y']==0]), len(X_train_trail[(X_train_trail['sepal_l']==2) & (X_train_trail['y']==1)])/len(X_train_trail[X_train_trail['y']==1]), len(X_train_trail[(X_train_trail['sepal_l']==2) & (X_train_trail['y']==2)])/len(X_train_trail[X_train_trail['y']==2])]

This was just one example feature; you would need to repeat the same for each of the remaining features, or use a for loop over all of them (a sketch follows below). Together these entries form the dictionary pxy of conditional probabilities.

The result works like a lookup table: for each feature value there is a list of probabilities, one per class.
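For reference, here is a sketch of the for-loop version mentioned above. It assumes the X_train_trail DataFrame built in step 8, with the four binned feature columns and the 'y' column:

features = ['sepal_l', 'sepal_w', 'petal_l', 'petal_w']
classes = [0, 1, 2]
values = [0, 1, 2]
pxy = {}
for feature in features:
    for v in values:
        # P(feature = v | y = c) for each class c: joint count / class count
        pxy[feature + '_' + str(v)] = [
            len(X_train_trail[(X_train_trail[feature] == v) & (X_train_trail['y'] == c)])
            / len(X_train_trail[X_train_trail['y'] == c])
            for c in classes
        ]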

9] Finally, creating the predict function:

def pred(X):
    sepal_l = X[0]
    sepal_w = X[1]
    petal_l = X[2]
    petal_w = X[3]
    sepal_l_col = 'sepal_l_'+str(sepal_l)
    sepal_w_col = 'sepal_w_'+str(sepal_w)
    petal_l_col = 'petal_l_'+str(petal_l)
    petal_w_col = 'petal_w_'+str(petal_w)
    prob0 = pxy[sepal_l_col][0]*pxy[sepal_w_col][0]*pxy[petal_l_col][0]*pxy[petal_w_col][0]*prob_0
    prob1 = pxy[sepal_l_col][1]*pxy[sepal_w_col][1]*pxy[petal_l_col][1]*pxy[petal_w_col][1]*prob_1
    prob2 = pxy[sepal_l_col][2]*pxy[sepal_w_col][2]*pxy[petal_l_col][2]*pxy[petal_w_col][2]*prob_2
    x = max([prob0, prob1, prob2])
    if x==prob0:
        return 0
    elif x==prob1:
        return 1
    else:
        return 2

10] Then, check your results! Try it on your end and see how your model performs.
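As a rough sketch of one way to do that check, assuming the pred() function above and a pxy dictionary filled in for all four features:

# Predict every row of the test set and compare with the true labels.
predictions = [pred(row) for row in X_test.values.tolist()]
accuracy = np.mean(np.array(predictions) == y_test['class'].values)
print('Test accuracy:', accuracy)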

A Brief End

Since we have bundled the continuous values into discrete bins here, this is a Multinomial Naive Bayes.

That said, this problem would very likely be solved more accurately with a Gaussian Naive Bayes model, since it is designed for continuous variables.

We won’t derive its formulation here, as the formula is a bit more involved, but it is well worth exploring.
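As a quick pointer, here is a minimal sketch of how scikit-learn’s GaussianNB could be applied to the original, unbinned continuous features. It assumes the same iris.csv file loaded in step 1:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Reload the raw, unbinned features and encode the target.
raw = pd.read_csv('iris.csv')
raw['class'] = LabelEncoder().fit_transform(raw['class'])
X_raw = raw.drop(['class'], axis=1)
y_raw = raw['class']
Xtr, Xte, ytr, yte = train_test_split(X_raw, y_raw, test_size=0.3, random_state=1)

gnb = GaussianNB()  # fits one Gaussian per feature per class
gnb.fit(Xtr, ytr)
print('Gaussian NB accuracy:', accuracy_score(yte, gnb.predict(Xte)))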

Conclusion

Naive Bayes is easier than it is often made out to be, yet it is a highly effective algorithm for prediction. Everything depends on how the evidence plays out in terms of probability. It is useful for classification problems as well as for sentiment analysis of data. Although the independence assumption often breaks down on real-life datasets, which frequently contain dependent and multicollinear variables, Naive Bayes is still used in several notable fields and performs well for small-scale predictions in industry.
