Naive Bayes for Beginners…!

ronith raj
Nov 5 · 8 min read

This is my first attempt at writing a blog and I hope you like it.

Purpose of this Post

I have kept everything in plain English without using any jargon. The main purpose is to help you understand the text classification algorithm Naive Bayes with a simple example.

What you can expect

By reading this blog completely, you will learn what this algorithm is all about and how Naive Bayes is applied in Machine Learning to solve real-world problems.

Table Of Contents

  1. Quick Introduction to Naive Bayes
  2. Mathematics of Probability required for this algorithm
  3. Simple example: i) Training Phase ii) Testing Phase
  4. Applications of Naive Bayes
  5. Common Mistakes to Avoid
  6. Conclusion

Quick Introduction to Naive Bayes

Naive Bayes is a popular family of probabilistic algorithms based on Bayes' theorem, widely used in text classification. It makes a "naive" assumption, which will be discussed later in this post. Given a text (a sentence made up of multiple words), it uses probability theory to calculate the probability of each category, and the final output is the category with the highest probability.


Mathematics of Probability required

The Bayes theorem’s equation:

  • P(c|x) = P(x|c) x P(c) / P(x)
  • P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
  • P(c) is the prior probability of the class.
  • P(x|c) is the likelihood, which is the probability of the predictor given the class.
  • P(x) is the prior probability of the predictor (also called the evidence).

In Naive Bayes all we want to find is the posterior probability values. The posterior probability value for whichever class is highest will be the final result of the problem we solve.

Bayes’ Theorem is useful for dealing with conditional probabilities, since it provides a way for us to reverse them.

Here we can simply ignore the denominator, the predictor prior probability (some refer to it as the evidence probability), because whatever the problem, it is the same for all the classes and therefore does not change which class has the highest posterior.

After ignoring the denominator we are simply left with this equation,

P(c|x) ∝ P(x|c) x P(c)

Now we calculate the posteriors as mentioned in the example.

Note: If the posterior probabilities of the classes turn out to be equal, you can do some feature engineering by adding new features to the training data. This won't happen very often with real-world problems.
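To make the comparison concrete, here is a minimal Python sketch of the idea above. The class names and numbers are illustrative, not taken from the post; the point is that dropping the shared denominator P(x) does not change which class wins:

```python
def best_class(likelihoods, priors):
    # Posterior is proportional to likelihood x prior once the shared
    # denominator P(x) is dropped: P(c|x) ∝ P(x|c) * P(c).
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    return max(scores, key=scores.get)

# Illustrative numbers only.
likelihoods = {"positive": 0.02, "negative": 0.005}
priors = {"positive": 0.6, "negative": 0.4}
print(best_class(likelihoods, priors))  # -> positive
```

Dividing both scores by the same P(x) would scale them equally, so the `max` would pick the same class either way.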


Simple Example

In this example we solve a two-class classification problem using the Naive Bayes algorithm.

i) Training Phase

In this phase we build a model by training on the data and obtain all the probabilities. After training, we evaluate the model by feeding it some test data in the testing phase. In this way we check the performance of the model.

Our training data has 5 reviews: 3 labeled Positive (17 words in total) and 2 labeled Negative (7 words in total).

All the probabilities are calculated in this phase. To avoid confusing you with all the mathematics here, I will show you directly how they are calculated in the next phase with the test data.

Find a comfortable place and try to relax before we get into the math required for this algorithm.

ii) Testing/Evaluation Phase

We have to find out which category the sentence "Overall liked the movie" belongs to: Positive or Negative? This is done by calculating the probability of the given text belonging to each class. The sentence belongs to the class that yields the highest probability.

The given sentence is pre-processed by converting everything into lower case letters, and then it is tokenized as follows,

overall | liked | the | movie
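This pre-processing step can be sketched in a couple of lines of Python (a minimal version: real pipelines usually also strip punctuation):

```python
def preprocess(sentence):
    # Lower-case the sentence and split it into word tokens.
    return sentence.lower().split()

print(preprocess("Overall liked the movie"))
# -> ['overall', 'liked', 'the', 'movie']
```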

We have to find out the following probabilities,

P(positive| overall liked the movie) and P(negative|overall liked the movie)

Being Naive

So here comes the naive part: we assume that every word in a sentence is independent of the others. This means that we're no longer looking at entire sentences, but rather at individual words. So for our purposes, "this was a fun party" is the same as "this party was fun" and "party fun was this":

P(this was a fun party) = P(this party was fun) = P(party fun was this) = P(this) x P(was) x P(a) x P(fun) x P(party)

In our example we write this as:

P(overall liked the movie) = P(overall) x P(liked) x P(the) x P(movie)

Calculating Probabilities

Applying Bayes' theorem, we have to find out the following probabilities,

P(positive|overall liked the movie) = P(overall liked the movie|positive) x P(positive) / P(overall liked the movie)

and similarly for the negative class.

Since our classifier is just trying to find out which label has the higher probability, we can discard the divisor (the denominator), which is the same for both labels, and just compare the numerators.

The final step is just to calculate the probability for both labels and see which one turns out to be larger.

Calculating these probabilities is just a matter of counting how many times the words from the test data appear in our training data.

First, we calculate the prior probability of each label: for a given sentence in our training data, the probability that it has the Positive label, P(positive), is 3/5, as 3 out of the 5 sentences are positive. Then P(negative) is 2/5. That's easy enough.

Then, calculating P(overall|positive) means counting how many times the word "overall" appears in positive reviews (1) and dividing by the total number of words in positive reviews (17). Therefore, P(overall|positive) = 1/17.

We have a problem: since "overall" does not appear in our negative reviews, there is nothing in the training data to count, so we get a zero. The word "overall" appears in negative reviews 0 times, divided by the total number of words in negative reviews (7), gives 0/7 = 0. This is rather inconvenient, since we are going to multiply it with the other probabilities and we'll end up with zero. A zero product gives us no information at all, so we have to find a way around it.

We do this by using something called Laplace smoothing: we add 1 to every count so it's never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the possible words are: 'i', 'liked', 'the', 'movie', 'it's', 'a', 'good', 'nice', 'story', 'songs', 'but', 'boring', 'ending', 'hero's', 'acting', 'is', 'bad', 'heroine', 'looks', 'overall', 'sad'.

Since the number of possible words is 21, applying smoothing we get P(overall|positive) = (1+1)/(17+21) = 2/38 ≈ 0.0526. The full results are:

  • P(overall|positive) = (1+1)/(17+21) = 0.0526, P(overall|negative) = (0+1)/(7+21) = 0.0357
  • P(liked|positive) = (1+1)/(17+21) = 0.0526, P(liked|negative) = (0+1)/(7+21) = 0.0357
  • P(the|positive) = (1+1)/(17+21) = 0.0526, P(the|negative) = (0+1)/(7+21) = 0.0357
  • P(movie|positive) = (3+1)/(17+21) = 0.1053, P(movie|negative) = (1+1)/(7+21) = 0.0714
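The smoothing formula is simple enough to capture in one small Python function; the counts below (17 positive words, 7 negative words, 21 distinct words) are the ones used in this example:

```python
def smoothed_prob(word_count, total_words, vocab_size):
    # Laplace (add-one) smoothing: add 1 to the word count and the
    # vocabulary size to the denominator so no probability is ever zero.
    return (word_count + 1) / (total_words + vocab_size)

# "overall" appears once in 17 positive words, never in 7 negative words.
print(round(smoothed_prob(1, 17, 21), 4))  # P(overall|positive) -> 0.0526
print(round(smoothed_prob(0, 7, 21), 4))   # P(overall|negative) -> 0.0357
```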

Now we simply multiply,

P(positive|overall liked the movie) ∝ P(overall|positive) x P(liked|positive) x P(the|positive) x P(movie|positive) x P(positive) = 0.0526 x 0.0526 x 0.0526 x 0.1053 x 0.6 = 0.000009208

P(negative|overall liked the movie) ∝ P(overall|negative) x P(liked|negative) x P(the|negative) x P(movie|negative) x P(negative) = 0.0357 x 0.0357 x 0.0357 x 0.0714 x 0.4 = 0.000001302

Excellent! Our classifier gives "overall liked the movie" the Positive label, because P(positive|overall liked the movie) is higher than P(negative|overall liked the movie).
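The whole worked example can be put together as a small Python sketch. It only hard-codes the counts the example actually uses (the four test words; the other vocabulary words would need their counts too for a full model), so treat it as a minimal illustration rather than a complete classifier:

```python
VOCAB_SIZE = 21                            # distinct words in the training data

# (count in positive reviews, count in negative reviews) for the test words.
counts = {"overall": (1, 0), "liked": (1, 0), "the": (1, 0), "movie": (3, 1)}
totals = {"positive": 17, "negative": 7}   # total words per class
priors = {"positive": 3 / 5, "negative": 2 / 5}

def score(sentence, label):
    # Prior times the Laplace-smoothed probability of each word.
    idx = 0 if label == "positive" else 1
    p = priors[label]
    for word in sentence.lower().split():
        p *= (counts[word][idx] + 1) / (totals[label] + VOCAB_SIZE)
    return p

def classify(sentence):
    return max(("positive", "negative"), key=lambda c: score(sentence, c))

print(classify("Overall liked the movie"))  # -> positive
```

Running it reproduces the two posteriors above (about 9.2e-6 for positive versus 1.3e-6 for negative), so the positive label wins.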


Just have a cup of tea, as we are done with how to use the algorithm. Now we will look at some real-world applications and close this out.

Applications of Naive Bayes

  • Real time Prediction: Naive Bayes is an eager learning classifier and it is quite fast. Thus, it can be used for making predictions in real time.
  • Multi class Prediction: This algorithm is also well known for multi-class prediction. Here we can predict the probability of each of multiple classes of the target variable.
  • Text classification/Spam Filtering/Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to good results in multi-class problems and the independence assumption) and have a high success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
  • Recommendation System: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

Common Mistakes to Avoid: Underflow Error

If you noticed, the numerical values of the word probabilities (i.e. the probability of a test word in class c) were quite small. Multiplying all these tiny probabilities yields an even smaller value, which often results in numerical underflow: the product rounds down to zero, and the trained model can no longer compare the classes for that test sentence. To avoid this underflow error, we take the mathematical log as follows:

product(p(A), p(B), p(C)) → log(product(p(A), p(B), p(C)))

log(AB) = log(A) + log(B)

So now, instead of multiplying the tiny individual word probabilities, we simply add their logs. And why the log, and not some other function? Because log is monotonically increasing, it does not affect the order of the probabilities: probabilities that were smaller will still be smaller after the log has been applied, and vice versa.
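Here is a short sketch of the log trick, using the smoothed probabilities and priors from the example above: the product becomes a sum of logs, and the ordering of the two classes is preserved.

```python
import math

def log_score(probs):
    # Sum of logs instead of a product of tiny probabilities,
    # which avoids numerical underflow.
    return sum(math.log(p) for p in probs)

# Word probabilities and prior for each class, from the worked example.
pos = [0.0526, 0.0526, 0.0526, 0.1053, 0.6]
neg = [0.0357, 0.0357, 0.0357, 0.0714, 0.4]

# log is monotonic, so the winning class is unchanged.
print(log_score(pos) > log_score(neg))  # -> True
```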


Conclusion

  • In this post, we looked at one of the supervised machine learning algorithms, Naive Bayes, mainly used for classification. Congrats! If you've thoroughly understood this article, you've already taken your first step towards mastering this algorithm.
  • One of the best characteristics of the Naive Bayes model is that you can improve its accuracy by simply updating it with new vocabulary words instead of re-training it from scratch. You only need to add the words to the vocabulary and update the word counts accordingly. That's it!
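That incremental update amounts to bumping a few counters. A minimal sketch (the starting counts are illustrative, only the "movie" counts come from the example):

```python
from collections import Counter

# Per-class word counts and word totals, as built during training.
word_counts = {"positive": Counter({"movie": 3}), "negative": Counter({"movie": 1})}
class_totals = {"positive": 17, "negative": 7}

def update(review, label):
    # Add a newly labeled review to the model: no re-training,
    # just increment the word counts and the class word total.
    tokens = review.lower().split()
    word_counts[label].update(tokens)
    class_totals[label] += len(tokens)

update("great movie", "positive")
print(word_counts["positive"]["movie"], class_totals["positive"])  # -> 4 19
```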
