Naive Bayes Classifier From Scratch | Part 1 (The Theory)

Harry Cao
7 min read · Jan 11, 2019


A classifier, in a nutshell, is a model used to categorize objects based on certain features. The Naive Bayes Classifier is a probabilistic classifier built on Bayes’ Theorem. Chances are, if you’re reading this, you are already familiar with Bayes’ Theorem. If not, don’t worry, I’ve got your back. This first part, The Theory, will get you up and ready.

The Naive Bayes Classifier is extremely useful for Sentiment Analysis, especially when the training dataset is small. In this tutorial, we go through how to build a Naive Bayes Classifier model. Our model will be used to label a comment as positive or negative.

Naive Bayes Classifier is a machine learning model commonly used for categorizing objects

In Part 1, we get ourselves ready with all the theory. In Part 2, we will build our classifier from scratch in Golang. Stay tuned!

Table of Contents:

  • Conditional Probability
  • Bayes’ Theorem
  • Laplace Smoothing
  • Apply The Theory
  • Summary

Considering:

  • A and B are boolean values (True or False)
  • P(A) is the probability of A being True
  • P(A|B) is the probability of A being True given B is True
  • P(A ∩ B) or P(A and B) is the probability of both A and B being True

Conditional Probability

Logically, the probability of both A and B being True equals the probability of one being True (B, for example) times the probability of the other (A) being True given the first one (B) is True:

P(A ∩ B) = P(B) x P(A|B)

For example, say P(B) = 50% and P(A|B) = 20%. B is True 50% of the time, and in 20% of those cases A is also True, which is 10% of the time (20% x 50%). Which means: P(A and B) = P(B) x P(A|B) = 50% x 20% = 10%

Therefore, the conditional probability of A being True given B is True is:

P(A|B) = P(A ∩ B) / P(B)
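
To make the arithmetic concrete, here is a tiny sketch of that example in Go (the language we will use in Part 2); the variable names are my own:

    package main

    import "fmt"

    func main() {
        pB := 0.5       // P(B): B is True 50% of the time
        pAGivenB := 0.2 // P(A|B): when B is True, A is also True 20% of the time

        pAandB := pB * pAGivenB // P(A and B) = P(B) x P(A|B)
        fmt.Printf("P(A and B) = %.0f%%\n", pAandB*100)    // 10%
        fmt.Printf("P(A|B)     = %.0f%%\n", pAandB/pB*100) // recovers 20% via P(A ∩ B) / P(B)
    }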

Bayes’ Theorem

Derived from the equations above, the conditional probability of A being True given B is True can also be written as:

P(A|B) = P(B|A) x P(A) / P(B)

This is Bayes’ Theorem. The probability of A being True given B is True equals the probability of B being True given A is True, times the probability of A being True (the prior probability), divided by the probability of B being True. That’s a mouthful!

The equation is very useful for sentiment analysis. For example, assume we are trying to find out the probability of a comment being positive if it contains the word excellent. If we know the probability of excellent appearing in a comment given that comment is positive P(excellent|positive), the probability of a comment being positive P(positive), and the probability of excellent appearing in any comment P(excellent), then:

P(positive|excellent) = P(excellent|positive) x P(positive) / P(excellent)
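
As a quick sanity check, here is that rearrangement as a Go function. The numbers in main are placeholders I made up to show the call, not values computed from a real dataset:

    package main

    import "fmt"

    // bayes returns P(A|B) computed from P(B|A), P(A) and P(B).
    func bayes(pBGivenA, pA, pB float64) float64 {
        return pBGivenA * pA / pB
    }

    func main() {
        // Placeholder values: P(excellent|positive) = 0.11,
        // P(positive) = 0.67, P(excellent) = 0.30
        fmt.Printf("P(positive|excellent) ≈ %.2f\n", bayes(0.11, 0.67, 0.30)) // ≈ 0.25
    }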

That’s good! But not good enough. We need to consider the whole sentence to know whether it’s positive, not just one single word.

If B contains multiple parts b1, b2, ..., bn, then the probability of B being True given A is the product of the probabilities of each bi being True given A (naively assuming the bi are independent given A, which is where the “Naive” in Naive Bayes comes from):

P(B|A) = P(b1|A) x P(b2|A) x ... x P(bn|A)

So, Bayes’ Theorem can be rewritten as:

P(A|B) = P(b1|A) x P(b2|A) x ... x P(bn|A) x P(A) / P(B)

For example, the probability of the comment “the food is excellent” being positive equals the product of the probabilities of each word appearing in a comment given that comment is positive, P(the|positive) x P(food|positive) x P(is|positive) x P(excellent|positive), times the probability of a comment being positive P(positive), divided by the probability that “the food is excellent” appears in any comment, P("the food is excellent") (a short Go sketch after the list below makes this concrete).

Again,

  • P(positive|"the food is excellent") is the probability of a comment being positive, given it is “the food is excellent”
  • P(the|positive) is the probability of the word the appearing in a comment given that comment is positive. Similarly for P(food|positive), P(is|positive) and P(excellent|positive)
  • P(positive) (the prior probability) is the probability of a comment being positive
  • P("the food is excellent") is the probability of “the food is excellent” appearing in any comment

Laplace Smoothing

The probability of each bi being True given A equals the number of times both bi and A are True divided by the number of times any b and A are True:

P(bi|A) = count(bi, A) / count(all b, A)

For example, the probability of the word excellent appearing in a comment given that comment is positive equals the number of times the word excellent appears in positive comments divided by the total number of times all words appear in positive comments.

Don’t get it? OK! (Probably I’m bad at explaining this.) Let’s consider this example, where the training dataset is these 3 sentences (all subsequent examples will use this dataset):

  • “The restaurant is excellent”: Positive
  • “I really love this restaurant”: Positive
  • “Their food is awful”: Negative

Then count(excellent, positive) = 1, which means excellent appears once in positive comments, and count(all words, positive) = 9, the total number of words in the 2 positive sentences. So P(excellent|positive) = 1/9 = 11%, or: given a positive comment, there is an 11% chance it contains the word excellent.
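
Here is one way those counts could be computed in Go from the two positive sentences; wordCounts is a helper name I made up for this sketch:

    package main

    import (
        "fmt"
        "strings"
    )

    // wordCounts lowercases and splits each sentence, then tallies how many
    // times each word appears and how many words there are in total.
    func wordCounts(sentences []string) (counts map[string]int, total int) {
        counts = make(map[string]int)
        for _, s := range sentences {
            for _, w := range strings.Fields(strings.ToLower(s)) {
                counts[w]++
                total++
            }
        }
        return counts, total
    }

    func main() {
        positive := []string{"The restaurant is excellent", "I really love this restaurant"}
        counts, total := wordCounts(positive)
        fmt.Println(counts["excellent"], total) // 1 9
        fmt.Printf("P(excellent|positive) = %.0f%%\n",
            float64(counts["excellent"])/float64(total)*100) // 11%
    }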

But there’s a bug. Yeah, there’s always a bug! What is the probability of a comment containing the word awesome, given it is positive? Yes, the answer is 0%. WHAT??? This is because the word awesome never appears in the dataset, so we don’t have any knowledge about it. When the numerator count(awesome, positive) = 0, no matter what the denominator is, the fraction P(awesome|positive) = count(awesome, positive) / count(all words, positive) = 0% (3rd-grade math comes in handy sometimes. Thanks, 3rd-grade teacher!)

I don’t know about you, but awesome sounds very positive to me. So a 0% chance that a positive comment contains that word doesn’t sound right. The problem becomes even bigger when calculating the probability of a comment containing awesome being positive. Applying the equation from the previous section: P(positive|"this restaurant is awesome") ~ P(awesome|positive) ~ count(awesome, positive) = 0. So any comment containing awesome, or any other word that doesn’t appear in the dataset, will have a 0% chance of being positive (and negative as well).

To take this into account, we apply Laplace Smoothing, which eliminates the problem of a probability being 0 just because the word is not in the dataset. We add a constant number α to the numerator and α x count(distinct_b) to the denominator, where count(distinct_b) is the number of distinct words in the dataset (11 for the dataset above):

P(bi|A) = (count(bi, A) + α) / (count(all b, A) + α x count(distinct_b))

Usually, we apply α = 1 for simplicity.

For the word awesome: count(awesome, positive) = 0, count(all words, positive) = 9 and count(distinct words) = 11. We have P(awesome|positive) = (0 + 1)/(9 + 11) = 1/20 = 5%. So there is a 5% chance the word awesome appears in a given positive comment.
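
In Go, the smoothed estimate could look like this (laplace is my own name for the helper, not something from a library):

    package main

    import "fmt"

    // laplace returns the smoothed P(word|class):
    // (count(word, class) + alpha) / (count(all words, class) + alpha*distinct).
    func laplace(wordCount, totalWords, distinctWords int, alpha float64) float64 {
        return (float64(wordCount) + alpha) /
            (float64(totalWords) + alpha*float64(distinctWords))
    }

    func main() {
        // count(awesome, positive) = 0, count(all words, positive) = 9,
        // count(distinct words) = 11, alpha = 1
        fmt.Printf("P(awesome|positive) = %.0f%%\n", laplace(0, 9, 11, 1)*100) // 5%
    }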

Apply The Theory

So now that we have the theory in our pocket, let’s see whether "this restaurant is awesome" is a positive or negative comment. If you still don’t understand the theory, don’t worry, this final example will piece everything together (I promise!)

The probability of a comment being positive, given it is “this restaurant is awesome”, is:

P(positive|"this restaurant is awesome") = P(this|positive) x P(restaurant|positive) x P(is|positive) x P(awesome|positive) x P(positive) / P("this restaurant is awesome")

The probability of a comment being negative, given it is “this restaurant is awesome”, is:

P(negative|"this restaurant is awesome") = P(this|negative) x P(restaurant|negative) x P(is|negative) x P(awesome|negative) x P(negative) / P("this restaurant is awesome")

As we can see, the denominators of the 2 equations are identical. So, to compare the 2 probabilities, we just need to compare the 2 numerators: whichever has the higher numerator has the higher probability.

First, the prior probabilities: the probability of a sentence being positive P(positive) or negative P(negative) is easy to calculate. There are 3 sentences in our dataset, 2 positive and 1 negative, so: P(positive) = 2/3 = 67% and P(negative) = 1/3 = 33%

Now, the probability of the word this appearing in a sentence given that sentence is positive is: P(this|positive) = (count(this, positive) + 1) / (count(all words, positive) + count(distinct words)) = (1 + 1)/(9 + 11) = 2/20 = 10%.

Similarly, P(restaurant|positive) = 3/20 = 15%, P(is|positive) = 2/20 = 10%, P(awesome|positive) = 1/20 = 5%

The probability of the word this appearing in a sentence given that sentence is negative is: P(this|negative) = (count(this, negative) + 1) / (count(all words, negative) + count(distinct words)) = (0 + 1)/(4 + 11) = 1/15 ≈ 7%.

Similarly, P(restaurant|negative) = 1/15 ≈ 7%, P(is|negative) = 2/15 ≈ 13%, P(awesome|negative) = 1/15 ≈ 7%

So numerator(P(positive|"this restaurant is awesome")) / numerator(P(negative|"this restaurant is awesome")) = (2/20 x 3/20 x 2/20 x 1/20 x 2/3) / (1/15 x 1/15 x 2/15 x 1/15 x 1/3) ≈ 3.8 (the rounded percentages above give roughly 3.4). This means the comment “this restaurant is awesome” is about 3.8 times more likely to be positive than negative.
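
Putting it all together in Go, using exact fractions instead of the rounded percentages (the counts are taken straight from the dataset above; all names are mine, and Part 2 builds a fuller version):

    package main

    import "fmt"

    // prob returns the Laplace-smoothed P(word|class) with alpha = 1.
    func prob(wordCount, totalWords, distinctWords int) float64 {
        return float64(wordCount+1) / float64(totalWords+distinctWords)
    }

    func main() {
        const distinct = 11 // distinct words in the training dataset

        // Counts of "this", "restaurant", "is", "awesome" in each class.
        posCounts := []int{1, 2, 1, 0}
        negCounts := []int{0, 0, 1, 0}
        posTotal, negTotal := 9, 4 // total words in positive / negative comments

        posNum := 2.0 / 3.0 // prior P(positive)
        negNum := 1.0 / 3.0 // prior P(negative)
        for i := range posCounts {
            posNum *= prob(posCounts[i], posTotal, distinct)
            negNum *= prob(negCounts[i], negTotal, distinct)
        }

        fmt.Printf("positive numerator: %g\n", posNum) // ≈ 5e-05
        fmt.Printf("negative numerator: %g\n", negNum) // ≈ 1.32e-05
        fmt.Printf("ratio: %.1f\n", posNum/negNum)     // ≈ 3.8
    }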

Summary

In this Part 1, we went through the theory behind the Naive Bayes Classifier. We derived Bayes’ Theorem from conditional probability, and then used Laplace Smoothing to fix the zero-probability problem so the theorem can be used for Sentiment Analysis.

What’s Next? In Part 2, we will build the classifier from scratch using Golang.

I hope you have enjoyed the tutorial and learned something new today! If you have, please consider giving it a clap 👏 and following me so you never miss future tutorials. Happy coding! 🖥
