Mind Mapping the Naive Bayes Algorithm

Shubham Patel
Published in Analytics Vidhya · 6 min read · Apr 19, 2020

I stumbled upon the Mind Mapping technique, a visual thinking tool that helps to organise and structure information. Using it makes it easier for our brains to store information and retrieve it when needed.

I often felt I lacked clarity about the high-level picture of how sub-concepts fit together, and the Mind Mapping technique has helped me a lot. I highly recommend doing your own research and trying it if you feel the same.

If you want to try this technique, there are two ways: you can do it traditionally with pen and paper, or you can use an online tool, either web or desktop-based. Both approaches have pros and cons, but I won't list them here because this article is not about mind mapping itself but about how I mind mapped the Naive Bayes algorithm.

Let's get started. This post is for beginners exploring Machine Learning algorithms, and Naive Bayes will almost certainly be near the top of your list because of its simplicity and its ability to make real-time inferences.

If you are already familiar with Naive Bayes, you can spend a couple of minutes going through the summary in the mind map below; otherwise, I recommend reading the rest of the article first and then coming back here.

Naive Bayes — Mind Map

What is Naive Bayes Algorithm?

It is one of the simplest algorithms for predictive modelling and is used for binary and multi-class classification problems. It is called "Bayes" because Bayes' Theorem is the basis of the algorithm. Bayes' Theorem is one of the most popular theorems in probability and in machine learning, and I expect you to know it, but if not, don't worry: we will cover it briefly soon. We call the algorithm "Naive" because of its assumption that the features/variables are conditionally independent of each other. This is rarely true in realistic scenarios, yet the algorithm still works surprisingly well.

You can read more about this assumption by going through links in the “Further Reading” section.

Bayes Theorem

Let’s see a bit about probability basics first

  • Probability — the chance of an event occurring.
  • Joint probability — P(A ∩ B) = P(B|A) * P(A); simply put, the probability of events A and B occurring together.
  • Conditional probability — P(A|B) = P(A ∩ B)/P(B); the probability of event A happening given that event B occurs.

Now Bayes' Theorem is stated mathematically as P(A|B) = P(B|A) * P(A)/P(B). If you look closely, it follows directly from the two definitions above: substituting the joint probability P(A ∩ B) = P(B|A) * P(A) into the conditional probability P(A|B) = P(A ∩ B)/P(B) gives the theorem.

In probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event — Wikipedia
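
To make the formula concrete, here is a minimal Python sketch with made-up numbers; the prior, likelihood, and evidence values are purely illustrative and not taken from any dataset in this article:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
prior = 0.6        # P(A): e.g. 60% of mushrooms are edible (made-up number)
likelihood = 0.7   # P(B|A): 70% of edible mushrooms have a convex cap (made-up)
evidence = 0.5     # P(B): 50% of all mushrooms have a convex cap (made-up)

posterior = likelihood * prior / evidence   # P(A|B)
print(posterior)   # ~0.84
```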

Let's step back from the mathematics and return to the algorithm. The notation we will use for Bayes' Theorem is P(Ci|X) = P(X|Ci) * P(Ci)/P(X), where Ci denotes the classes/target and X denotes the features of a data point.

NOTE: When comparing the posterior probabilities of two classes, we can skip the denominator P(X), as it is the same for both classes.
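
A quick sketch of why dropping P(X) is safe when we only need to compare classes; all the numbers below are made up:

```python
# Unnormalised posteriors P(X|Ci) * P(Ci) for two hypothetical classes.
score_c1 = 0.7 * 0.6   # P(X|C1) * P(C1)
score_c2 = 0.2 * 0.4   # P(X|C2) * P(C2)

# Dividing both scores by the same P(X) cannot change which one is larger,
# so the predicted class is simply the one with the larger unnormalised score.
print("C1" if score_c1 > score_c2 else "C2")   # C1
```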

Common terminology:

  • P(Ci) — the prior probability: the probability of a class before we have seen any new data.
  • P(X|Ci) — the likelihood: how likely we are to observe the features X given that the data point belongs to class Ci.
  • P(Ci|X) — the posterior probability, which is calculated from the prior probability and the likelihood.

A more intuitive way to think about them:

  • The prior probability incorporates our ‘prior beliefs’ before we collect specific information.
  • The likelihood function updates our ‘prior beliefs’ with the new information.
  • The posterior probability is the final result we get after updating the prior with the likelihood.

Framing the problem

Suppose we have the dataset below and we want to answer a question: “What is the chance of a mushroom being edible when its cap shape is convex?”

Let's frame the problem in terms of probability: we want to find P(edible=yes|shape=convex), and by Bayes' theorem this will be P(shape=convex|edible=yes) * P(edible=yes) / P(shape=convex)

  • class probabilities — P(edible=yes), P(shape=convex)
  • conditional probabilities — P(shape=convex|edible=yes)

What will your model learn?

We need to know the class and conditional probabilities beforehand to answer the question “What is the chance of a mushroom being edible when its cap shape is convex?”. This is exactly what our model learns, and it is these probabilities that are written to disk.

For our dataset, these probabilities can be calculated simply by counting:

  • Class probabilities — P(edible=yes) = count(edible=yes) / total count
  • Conditional probabilities — P(s=convex|edible=yes) = count(s=convex & edible=yes) / count(edible=yes)

Here the class probability is our prior probability and the conditional probability is our likelihood function.
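
Here is a minimal Python sketch of this counting step. The tiny dataset below is made up for illustration (the article's actual table is not reproduced here), but the calculation mirrors the formulas above:

```python
# Hypothetical toy dataset: each row is (cap shape, edible?).
data = [
    ("convex", "yes"), ("convex", "yes"), ("bell", "yes"),
    ("convex", "no"),  ("flat", "no"),    ("convex", "no"),
]

total = len(data)
edible = [row for row in data if row[1] == "yes"]

# Class (prior) probability: P(edible=yes) = count(edible=yes) / total count
p_edible = len(edible) / total

# Conditional probability (likelihood):
# P(shape=convex | edible=yes) = count(convex & edible=yes) / count(edible=yes)
p_convex_given_edible = sum(1 for shape, _ in edible if shape == "convex") / len(edible)

# Evidence: P(shape=convex)
p_convex = sum(1 for shape, _ in data if shape == "convex") / total

# Posterior: P(edible=yes | shape=convex)
print(p_convex_given_edible * p_edible / p_convex)   # 0.5
```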

How does the naive assumption help? Why do we need it?

Let's ask a more complex question this time, one that involves two features:

“What is the probability of a mushroom being edible if its shape is convex and its surface is scaly?” That is, we need to find P(edible=yes|m=(convex, scaly)), which (skipping the denominator, as noted earlier) is proportional to

P(m=(convex, scaly)|edible=yes) * P(edible=yes)

Recall our naive assumption: the features are conditionally independent of each other given the class. It is this assumption that lets us rewrite the probability above as

P(m=convex|edible=yes)*P(m=scaly|edible=yes)*P(edible=yes)

Now the probabilities above can be calculated simply by counting data points. Imagine having more than 20 features, and think about how much this assumption simplifies the calculation: we only need one per-feature conditional probability per class instead of the joint probability of every combination of feature values.
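
A minimal sketch of how this factorisation looks in code; the per-feature conditional probabilities and the prior below are placeholders, not values from the article's dataset:

```python
import math

# Hypothetical per-feature conditional probabilities for the class edible=yes,
# e.g. P(shape=convex | edible=yes), P(surface=scaly | edible=yes).
conditionals = {"shape=convex": 0.7, "surface=scaly": 0.4}
prior = 0.6   # P(edible=yes), also a placeholder

# Naive assumption: P(x1, x2, ... | class) = P(x1|class) * P(x2|class) * ...
score = prior * math.prod(conditionals.values())
print(score)   # ~0.168 (unnormalised posterior for edible=yes)
```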

Refer to the links in the “Further Reading” section to learn more about Conditional Independence and the Naive Bayes assumption.

Laplace smoothing

Finally, let's ask one last question:

“What is the probability of a mushroom being edible if its shape is convex and its surface is fibrous?”

Go back to our dataset and notice that we don't have a single data point where the mushroom surface is fibrous. Take a minute to think about how this can cause a problem.

We want to know P(edible=yes|m=(convex, fibrous)), which will be proportional to

P(m=convex|edible=yes)*P(m=fibrous|edible=yes)*P(edible=yes)

If we don't have any data point with surface type “fibrous”, the term P(m=fibrous|edible=yes) will be 0, which makes the whole product 0. This is not good, because our model will give inconclusive results.

This problem can be solved with the Laplace Smoothing technique, and it is very simple and intuitive: we add 1 to every count so that no probability ends up being 0, and adjust the denominator accordingly when calculating the probabilities. Refer to the probability calculations below for more clarity.

Filtered on mushroom type = edible

This is how Laplace smoothing avoids the problem. Also note that it is not necessary to add exactly 1: we can add any value X to the numerator, as long as we add n*X to the denominator, where n is the number of distinct values the feature can take.
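
A minimal sketch of the smoothed estimate, assuming the usual add-X formulation; the counts and the number of surface types below are placeholders:

```python
def smoothed_prob(feature_count, class_count, n_values, x=1.0):
    """Laplace-smoothed estimate of P(feature=value | class).

    feature_count: count(feature=value & class), possibly 0
    class_count:   count(class)
    n_values:      number of distinct values the feature can take
    x:             smoothing constant (1 gives classic Laplace smoothing)
    """
    return (feature_count + x) / (class_count + n_values * x)

# Hypothetical counts: no edible mushroom with a fibrous surface was observed,
# yet the smoothed probability is still greater than zero.
print(smoothed_prob(feature_count=0, class_count=10, n_values=3))   # ~0.077
```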

Things to keep in mind.

  • Naive Bayes is a probabilistic classifier: it returns class probabilities rather than labels, and the user picks the class with the highest probability.
  • We have only been talking about discrete features. The algorithm as described won't work directly on continuous variables; we need other ways to handle continuous data (for example, Gaussian Naive Bayes models each continuous feature with a normal distribution). Refer to the links in the “Further Reading” section, and see the short sketch after this list.
  • Naive Bayes is fast because it only has to count and calculate probabilities; there is no need to optimise parameters/coefficients with an iterative optimisation algorithm, which is what makes many other models slow to train.
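
The continuous-data point deserves a small illustration. As one concrete option, which this article does not cover in detail, scikit-learn ships a Gaussian Naive Bayes implementation; the snippet below is a minimal sketch with made-up data, assuming scikit-learn and NumPy are installed:

```python
# A sketch only, not from the article: scikit-learn's GaussianNB handles continuous
# features by fitting a normal distribution to each feature within each class.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.2, 0.7], [0.9, 1.1], [3.4, 2.8], [3.1, 3.0]])  # made-up continuous features
y = np.array([0, 0, 1, 1])                                      # made-up class labels

clf = GaussianNB().fit(X, y)
print(clf.predict_proba([[1.0, 0.9]]))   # posterior probability of each class
```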

I hope you now have a good idea of the Naive Bayes algorithm. You can go back to the mind map I created, and I recommend using the Mind Mapping technique whenever you learn new concepts. Remember, it is not limited to this: you can use it for anything in your head that is complex enough to need organising. Your brain will thank you for it.
