How the Naive Bayes classifier works — Part 1
The Naive Bayes classifier is derived from a very old theorem by Thomas Bayes, called Bayes' theorem. Bayes' theorem describes how a probability changes when there is additional information to consider. This two-part blog series will cover the fundamental principles of probability and then move on to the workings of the Naive Bayes classifier in machine learning.
Part 1 will walk you through the basics of probability.
Dealing with Probability
Probability is a number between 0 and 1 that expresses how likely an event is to occur; the lower the probability, the less likely the event. While learning about probability you will often come across the terms event and trial. An event is an outcome, while a trial is the process involved in getting that outcome. For example (event/trial):
- Heads on a coin/Coin flip
- Spam message/Incoming email message
- Humid weather/A single day
Now, to calculate the probability of an event, we divide the number of trials in which the event occurred by the total number of trials. Say a coin is flipped 10 times and we get heads 3 times. The probability that heads will appear is then 3/10 = 0.3, or 30%.
The common notation for probability is P(A), read as the probability of event A. So P(Heads) = 0.3.
From the coin-flip example one can also infer that P(Tails) = 0.7. This is because the probabilities of all possible outcomes must add up to 1, so P(Heads) + P(Tails) = 1, as there are only two outcomes in a coin flip.
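To make this concrete, here is a minimal sketch in R; the flips vector is a made-up record chosen to match the 3-heads-out-of-10 example above.

```r
# Hypothetical record of 10 coin flips (3 heads, 7 tails)
flips <- c("H", "T", "T", "H", "T", "T", "H", "T", "T", "T")

# Probability of an event = trials in which it occurred / total trials
p_heads <- sum(flips == "H") / length(flips)
p_tails <- 1 - p_heads  # heads and tails are mutually exclusive and cover all outcomes

p_heads  # 0.3
p_tails  # 0.7
```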
One could also say that if an event has only two possible outcomes and they cannot occur simultaneously, then knowing the probability of one outcome reveals the probability of the other. Events that cannot occur at the same time and are the only possible outcomes are called mutually exclusive events. For example, an email message can either be spam or ham (a legitimate message); it cannot be both at the same time. Hence spam and ham are mutually exclusive events.
Understanding Joint Probability
Joint probability is a crucial concept for Bayes' theorem. There are cases where we are interested in the probabilities of non-mutually exclusive events in the same trial. Say, what is the probability that Bob will play cricket if it is raining outside? Here playing cricket and raining are two separate events, where the event of playing cricket depends on the outcome of whether it rains. Another example is whether an email message is spam given that it contains the word "Discount" in the subject. Such events are called dependent events, and their probabilities are calculated in a different manner, as described below.
Event dependence can be explained with the help of a Venn diagram, with one circle for spam messages and another for messages containing the word "Discount". The left region represents spam messages without the word "Discount", the right region represents ham messages that contain the word "Discount", and the intersection of the two circles represents spam messages that contain the word "Discount".
So the probability that a message is spam and also has the word "Discount" in it is represented as P(Spam ∩ Discount); the notation A ∩ B refers to the event in which both A and B occur.
There are also events that are independent, i.e. the outcome of one event does not affect the outcome of the other. For example, getting heads on a coin flip has nothing to do with a message being spam. If all events were independent, it would be impossible to predict one event from another; hence dependent events are the basis of predictive analysis. Just as the presence of clouds helps us guess whether it will rain, the appearance of the word "Discount" helps identify an email as spam.
Now, calculating the probability of the dependent events Spam ∩ Discount is a bit more involved. If the two events were independent, we could calculate it easily: P(Spam ∩ Discount) = P(Spam) * P(Discount).
Suppose 20% of all messages are spam and 5% of all messages contain the word "Discount". Under the independence assumption, P(Spam ∩ Discount) = 0.2 * 0.05 = 0.01.
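In R this naive independence calculation is straightforward; the sketch below simply uses the percentages assumed above.

```r
p_spam     <- 0.20  # 20% of all messages are spam
p_discount <- 0.05  # 5% of all messages contain the word "Discount"

# Joint probability *if* the two events were independent
p_spam_and_discount_indep <- p_spam * p_discount
p_spam_and_discount_indep  # 0.01
```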
But we know that Spam and Discount are not independent events, as the outcome of one affects the probability of the other. For dependent events, this problem can be solved with the help of Bayes' theorem.
The notation P(A|B) is read as the probability of event A given that event B has already occurred. This is called conditional probability, since the probability of A is dependent (that is, conditional) on what happened with event B.
By definition, P(A ∩ B) = P(A|B) * P(B). Since P(A ∩ B) = P(B ∩ A), we can rearrange once more to get P(A ∩ B) = P(B|A) * P(A). Combining the two expressions and dividing by P(B) gives Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
Going back to our spam problem, we need to find P(Spam|Discount), i.e. the probability that an email is spam given that it has the word "Discount" in it.
P(A|B) is called the posterior probability. To compute it we create a frequency table and a likelihood table from the message counts: out of 100 messages, 20 are spam and 80 are ham, and of the 5 messages containing the word "Discount", 4 are spam and 1 is ham.
The likelihood table reveals that P(Discount|Spam) = 4/20 = 0.2.
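The sketch below reconstructs those counts in R as a frequency table and converts it into a likelihood table; the numbers follow directly from the counts quoted above.

```r
# Frequency table: 100 messages broken down by class and presence of "Discount"
freq <- matrix(c(4, 16,    # spam: 4 with "Discount", 16 without
                 1, 79),   # ham:  1 with "Discount", 79 without
               nrow = 2, byrow = TRUE,
               dimnames = list(class = c("spam", "ham"),
                               discount = c("yes", "no")))

# Likelihood table: divide each row by its class total
likelihood <- prop.table(freq, margin = 1)
likelihood["spam", "yes"]  # P(Discount | Spam) = 4/20 = 0.2
```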
From the relation above, P(A ∩ B) = P(B|A) * P(A), so P(Spam ∩ Discount) = P(Discount|Spam) * P(Spam) = (4/20) * (20/100) = 0.04. Comparing this with the value calculated earlier under the false assumption of independence, 0.04 is 4 times 0.01. This shows the importance of accounting for event dependence, which Bayes' theorem allows us to do.
But we are not done yet; we have not applied Bayes' theorem completely. We still need to compute:
P(Spam|Discount) = P(Discount|Spam) * P(Spam) / P(Discount)
                 = (4/20) * (20/100) / (5/100)
                 = 0.80
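Continuing the R sketch, the same posterior probability can be computed directly from the three quantities above.

```r
p_spam                <- 20 / 100  # prior: P(Spam)
p_discount            <- 5 / 100   # evidence: P(Discount)
p_discount_given_spam <- 4 / 20    # likelihood: P(Discount | Spam)

# Bayes' theorem: posterior = likelihood * prior / evidence
p_spam_given_discount <- p_discount_given_spam * p_spam / p_discount
p_spam_given_discount  # 0.8
```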
This shows a high chance that a message is spam if it has the word "Discount" in it.
In Part 2, we will see the application of the Naive Bayes classifier using R.