# Let’s build your first Naive Bayes Classifier with Python

Aug 18, 2019 · 5 min read

Naive Bayes Classifier is one of the most intuitive yet popular algorithms employed in supervised learning, whenever the task is a classification problem. I’ve been talking about the difference between supervised and unsupervised learning, as well as between classification and regression, in my previous article and, if you are not familiar with this terminology, I suggest you to have a look at it.

Here, I’m going to dwell on the (surprisingly easy) math behind the Naive Bayes Classifier and then I will implement it from scratch with Python, using the well-known Iris Dataset.

To understand this algorithm, we first need to refresh some concepts of probability theory. Indeed, Naive Bayes Classifier is based on the Bayes’ Theorem of conditional probability:

Where A and B are events and P(B)≠0. We talk about conditional probability of A with respect to B when we want to know the likelihood of A event, given that B event has occurred.

Let’s visualize it with a Venn Diagram:

Basically, once the event B has occurred, the probability space of event A is reduced to the intersection between A and B, since everything that is not B cannot occur (indeed, we already know that B has already occurred!). The situation is the following:

Now, all the grey area does not longer exist. Hence, we want to compute the probability of the intersection, but we have to divide it by the probability of the new domain, which is not 1 as before (the rectangular area) but the probability of B:

Now, since the same reasoning holds for the conditional probability of B:

And knowing that:

We can substitute the latter in the numerator of P(A|B), obtaining exactly the formula of Bayes Theorem:

Nice, but how do we apply this theorem to a classification task? In classification problems, the theorem above is used to predict the likelihood of a class, given a feature (or a vector of features). So, if I want to predict the class y of a new observation x, the algorithm will be like that:

Let’s consider the following example:

We are considering a binary classification problem with only one feature: the object of the e-mail. Then we receive a new e-mail with a given object and we want to predict whether or not it’s spam.

Imagine that the object of our new e-mail is ‘job’: how can we proceed? We want to know the likelihood of the class spam given our object, hence we have to compute some probabilities:

And finally compute the conditional probability:

So the probability of our new e-mail of being spam, given that its object is ‘job’, is 33.3%: we conclude that it is more likely that it’s not spam.

Python libraries offer three kinds of Naïve Bayes classifiers:

• Gaussian: it assumes that features follow a normal distribution.
• Multinomial: it is used for discrete counts
• Bernoulli: the binomial model is useful if your feature vectors are binary (i.e. zeros and ones)

Of course, here the example refers only to one feature and two classes. However, the procedure is the same even with a vector of features and multiple classes (with the only difference of applying an argmax function to the probability output, so that the class with the highest probability is returned).

For this purpose, I’m going to use the Gaussian Naive Bayes Classifier. Let’s implement it with Python:

Let’s have a look at our dataset:

As you can see, we have 4 features and one target, which is categorical since we have 3 classes ( Setosa, Versicolor and Virginica, encoded with 0, 1 and 2). Now we can build our classifier as follows:

As you can see, our algorithm correctly classified all the data of the test set. It is a pretty good result, actually the best we can aspire to. However, keep in mind the poor dimension of our dataset (only 150 observations).

It is normally hard to obtain a 100% accuracy on real data. Furthermore, it might also be counterproductive aspiring to an error equal to zero. Indeed, since the last goal of any ML algorithm is making reliable predictions on new, unknown data, we’d prefer an algorithm which is able to generalize rather than perfectly fit the test data. The risk of overfitting (using too many parameters in order to have a model which perfectly fit test data) might lead to a useless model, if it is not able to fit new data (you can read more about overfitting here).

Written by

Written by