Let’s build your first Naive Bayes Classifier with Python

Valentina Alto
Aug 18, 2019 · 5 min read

Naive Bayes Classifier is one of the most intuitive yet popular algorithms employed in supervised learning whenever the task is a classification problem. I've talked about the difference between supervised and unsupervised learning, as well as between classification and regression, in my previous article and, if you are not familiar with this terminology, I suggest you have a look at it.

Here, I’m going to dwell on the (surprisingly easy) math behind the Naive Bayes Classifier and then I will implement it from scratch with Python, using the well-known Iris Dataset.

To understand this algorithm, we first need to refresh some concepts of probability theory. Indeed, Naive Bayes Classifier is based on the Bayes’ Theorem of conditional probability:

P(A|B) = P(B|A) · P(A) / P(B)

Where A and B are events and P(B) ≠ 0. We talk about the conditional probability of A given B when we want to know the likelihood of event A, given that event B has occurred.

Let’s visualize it with a Venn Diagram:

[Venn diagram: events A and B with their intersection A ∩ B inside the sample space]

Basically, once event B has occurred, the probability space of event A is reduced to the intersection between A and B, since everything outside B can no longer occur (we already know that B has happened!). The situation is the following:

[Venn diagram: with B given, only the region inside B remains; the part of A that can still occur is A ∩ B]

Now, all the grey area no longer exists. Hence, we want to compute the probability of the intersection, but we have to divide it by the probability of the new domain, which is no longer 1 (the whole rectangular area) but the probability of B:

P(A|B) = P(A ∩ B) / P(B)

Now, the same reasoning holds for the conditional probability of B given A:

P(B|A) = P(A ∩ B) / P(A)

And knowing that:

P(A ∩ B) = P(B|A) · P(A)

We can substitute the latter into the numerator of P(A|B), obtaining exactly the formula of Bayes' Theorem:

P(A|B) = P(B|A) · P(A) / P(B)
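As a quick sanity check, the identity can be verified numerically. The probabilities below are hypothetical values (not from the article), chosen only to illustrate the algebra:

```python
from fractions import Fraction

# Hypothetical probabilities, chosen only to illustrate the identity:
# P(A) = 1/4, P(B) = 1/3, P(B|A) = 1/2
p_a = Fraction(1, 4)
p_b = Fraction(1, 3)
p_b_given_a = Fraction(1, 2)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Cross-check via the intersection: P(A ∩ B) = P(B|A)·P(A) must equal P(A|B)·P(B)
p_a_and_b = p_b_given_a * p_a
print(p_a_given_b)                      # 3/8
print(p_a_given_b * p_b == p_a_and_b)   # True
```

Using exact fractions makes the cross-check an exact equality rather than a floating-point approximation.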

Nice, but how do we apply this theorem to a classification task? In classification problems, the theorem above is used to predict the likelihood of a class, given a feature (or a vector of features). So, if I want to predict the class y of a new observation x, the algorithm looks like this:

P(y|x) = P(x|y) · P(y) / P(x)

Let’s consider the following example:

[Table: a set of training e-mails, each with its subject word and a spam/not-spam label]

We are considering a binary classification problem with only one feature: the subject of the e-mail. Then we receive a new e-mail with a given subject and we want to predict whether or not it's spam.

Imagine that the subject of our new e-mail is 'job': how can we proceed? We want to know the likelihood of the class spam given our subject, hence we have to compute some probabilities:

P(spam), P('job' | spam) and P('job')

And finally compute the conditional probability:

P(spam | 'job') = P('job' | spam) · P(spam) / P('job') ≈ 33.3%

So the probability that our new e-mail is spam, given that its subject is 'job', is 33.3%: we conclude that it is more likely not to be spam.
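The computation above can be sketched in code. Since the article's original table was shown as an image, the counts below are hypothetical, chosen only so that they reproduce the stated 33.3% result:

```python
from fractions import Fraction

# Hypothetical training counts (the article's actual table is not available):
# 10 e-mails total, 4 of them spam; the word 'job' appears in 1 spam e-mail
# and in 2 non-spam e-mails, so 3 e-mails contain 'job' overall.
n_total = 10
n_spam = 4
n_job_and_spam = 1
n_job = 3

p_spam = Fraction(n_spam, n_total)                    # P(spam)
p_job_given_spam = Fraction(n_job_and_spam, n_spam)   # P('job' | spam)
p_job = Fraction(n_job, n_total)                      # P('job')

# Bayes' theorem: P(spam | 'job') = P('job' | spam) * P(spam) / P('job')
p_spam_given_job = p_job_given_spam * p_spam / p_job
print(p_spam_given_job)  # 1/3, i.e. about 33.3%
```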

Python libraries offer three kinds of Naïve Bayes classifiers:

  • Gaussian: it assumes that continuous features follow a normal distribution.
  • Multinomial: it is used for discrete counts (e.g. word frequencies).
  • Bernoulli: the binomial model is useful if your feature vectors are binary (i.e. zeros and ones).
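In scikit-learn, for instance, all three variants share the same fit/predict interface; a minimal sketch with a tiny made-up dataset:

```python
import numpy as np
# scikit-learn exposes all three Naive Bayes variants with the same API.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gaussian = GaussianNB()        # continuous features, assumed normally distributed
multinomial = MultinomialNB()  # discrete counts
bernoulli = BernoulliNB()      # binary (0/1) feature vectors

# Toy example for the Gaussian variant: one continuous feature, two classes
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
gaussian.fit(X, y)
print(gaussian.predict([[0.05], [1.05]]))  # [0 1]
```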

Of course, here the example refers only to one feature and two classes. However, the procedure is the same even with a vector of features and multiple classes (with the only difference of applying an argmax function to the probability output, so that the class with the highest probability is returned).

For this purpose, I’m going to use the Gaussian Naive Bayes Classifier. Let’s implement it with Python:

Let’s have a look at our dataset:

[Code and output: loading the Iris dataset into a pandas DataFrame and printing its first rows]
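The inspection step can be reproduced with a sketch like the following, using scikit-learn's bundled copy of the Iris dataset (the original notebook's code was shown as an image):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris into a DataFrame: 150 observations, 4 features plus a target column
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target  # 0 = Setosa, 1 = Versicolor, 2 = Virginica

print(df.shape)   # (150, 5)
print(df.head())
```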

As you can see, we have 4 features and one target, which is categorical since we have 3 classes (Setosa, Versicolor and Virginica, encoded with 0, 1 and 2). Now we can build our classifier as follows:

[Code and output: fitting a Gaussian Naive Bayes classifier on a train/test split of the Iris data; the test-set predictions match the true labels]
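A sketch of this step, assuming a standard 70/30 train/test split (the article's exact split is unknown, so the accuracy you obtain may differ from the perfect score reported in the text):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split Iris into training and test sets (70/30; random_state fixed for reproducibility)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the Gaussian Naive Bayes model and evaluate on the held-out test set
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
```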

As you can see, our algorithm correctly classified all the data in the test set. It is a pretty good result, actually the best we can aspire to. However, keep in mind the small size of our dataset (only 150 observations).

It is normally hard to obtain 100% accuracy on real data. Furthermore, aspiring to an error of zero might even be counterproductive. Indeed, since the ultimate goal of any ML algorithm is making reliable predictions on new, unseen data, we'd prefer an algorithm that is able to generalize rather than one that merely fits the test data perfectly. Overfitting (using too many parameters in order to have a model that perfectly fits the training data) may lead to a useless model if it is not able to fit new data (you can read more about overfitting here).

DataSeries
