Naive Bayes Algorithm with Amazon Food Reviews Analysis
Naive Bayes is a statistical classification technique based on Bayes' theorem and one of the simplest supervised learning algorithms. It is a fast and reliable classifier that gives good accuracy and speed even on large datasets.
To understand the Naive Bayes algorithm, we first need to know some basic concepts of probability.
Contents
1. Probability
2. Conditional Probability
3. Independent Events
4. Mutually Exclusive Events
5. Bayes Theorem
6. Naive Bayes Algorithm
7. Toy Example using Naive Bayes
8. Naive Bayes Algorithm on Text data
9. Laplace (or) Additive Smoothing
10. Log-Probabilities and Numerical Stability
11. Bias-variance Trade off
12. Feature Importance and Interpretability
13. Types of Naive Bayes Classifier
14. Best and Worst cases of Naive Bayes Algorithm
15. Naive Bayes Algorithm with Amazon Food Reviews Analysis
Probability
“Probability” (or “Chance”) is how likely something is to happen. The probability of the union of Events A and B is denoted by P(A ∪ B). The probability of the intersection of Events A and B is denoted by P(A ∩ B).
The general addition rule for the probability of the union of two events states that P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Experiment (or) Trial: something done with the expectation of a result.
Event (or) Outcome: Result of the Experiment.
Conditional Probability
A conditional probability is the probability of an event, given some other event has already occurred.
This probability is written P(A|B), the notation for the probability of A given B. In the case where events A and B are independent (where event B has no effect on the probability of event A), the conditional probability of A given B is simply the probability of A, that is P(A).
Independent Events
The occurrence of one event does not affect the occurrence of the other; the probabilities of the two events are independent of each other. If A and B are independent events, then P(A ∩ B) = P(A) × P(B).
Example: You toss a coin and it comes up “heads” three times … what is the chance that the next toss will also be a “Head”?
The chance is simply ½ (or 0.5) just like ANY toss of the coin.
What it did in the past will not affect the current toss!
Mutually Exclusive Events
These are events that can't happen at the same time. If A and B are mutually exclusive events, then P(A ∩ B) = 0.
For example, you can't run backward and forward at the same time. The events "running forward" and "running backward" are mutually exclusive. Tossing a coin also gives this type of event: you can't toss a coin and get both heads and tails, so "tossing heads" and "tossing tails" are mutually exclusive.
Bayes Theorem
This is a fundamental, simple, elegant, and extremely useful theorem in probability theory. Bayes' Theorem is useful when working with conditional probabilities (as we are doing here) because it provides us with a way to reverse them.
Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis.
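In symbols, with A as the hypothesis and B as the evidence, the theorem reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

that is, posterior = (likelihood × prior) / evidence.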
To know more about Bayes theorem visit here.
Naive Bayes Algorithm
Naive Bayes is one of the most straightforward and fastest classification algorithms and is suitable for large chunks of data. It is used successfully in various applications such as spam filtering, text classification, sentiment analysis, and recommender systems, and it uses Bayes' theorem of probability to predict the class of unknown data.
In statistics, Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models.
Proof: Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x1, …, xn) of n features (independent variables), it assigns to this instance a probability for each of the K possible classes.
Given a data point x={x1,…,xn} of n features, naive Bayes predicts the class Ck for x according to the probability
p(Ck∣x)=p(Ck∣x1,…,xn) for k=1,…,K
Using Bayes' Theorem, this can be factored as p(Ck∣x) = p(Ck) p(x∣Ck) / p(x).
In plain English, using Bayesian probability terminology, the above equation can be written as posterior = (prior × likelihood) / evidence.
In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on Ck and the values of the features x are given, so the denominator is effectively constant.
Using the chain rule, the factor p(x1, …, xn∣Ck) in the numerator can be further decomposed as p(x1, …, xn∣Ck) = p(x1∣x2, …, xn, Ck) p(x2∣x3, …, xn, Ck) ⋯ p(xn∣Ck).
Computing each of these conditional probabilities directly is complex and would require a very large amount of data.
So, at this point, the "naive" conditional independence assumption is put into play.
Specifically, naive Bayes models assume that each feature xi is conditionally independent of every other feature xj (for i ≠ j), given the class Ck. Using the previous decomposition, this can be formulated as p(xi∣xi+1, …, xn, Ck) = p(xi∣Ck).
Thus, p(x1, …, xn∣Ck) = p(x1∣Ck) p(x2∣Ck) ⋯ p(xn∣Ck), and therefore p(Ck∣x) ∝ p(Ck) ∏ p(xi∣Ck).
Naive Bayes therefore gives the probability of a data point x belonging to class Ck as proportional to a simple product of n+1 factors: the class prior p(Ck) and the n conditional feature probabilities p(xi∣Ck).
Since classification involves assigning the class Ck to the data point for which the value p(Ck∣x) is greatest, this proportional product can be used to determine the most likely class assignment. Specifically,
Thus, the most likely class assignment for a data point x = (x1, …, xn) can be found by calculating p(Ck) ∏ p(xi∣Ck) for k = 1, …, K and assigning x the class Ck for which this value is largest.
The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule.
In mathematical notation, this is defined as C^ = argmax over k ∈ {1, …, K} of p(Ck) ∏ p(xi∣Ck),
where C^ is the estimated class for x given its features x1,…,xn.
For more details about Naive Bayes Proof visit here.
Toy Example using Naive Bayes
Let’s build a classifier that predicts whether I should play tennis given the forecast. It takes four attributes to describe the forecast; namely, the outlook, the temperature, the humidity, and the presence or absence of wind. Furthermore, the values of the four attributes are qualitative (also known as categorical). They take on the values shown below.
Outlook ∈ [Sunny, Overcast, Rainy]
Temperature ∈ [Hot, Mild, Cool]
Humidity ∈ [High, Normal]
Windy ∈ [Weak, Strong]
The class label is the variable, Play, and takes the values yes or no.
Play ∈ [Yes, No]
We read in the training data below, which has been collected over 14 days.
The Learning Phase
In the learning phase, we compute the table of likelihoods (probabilities) from the training data. They are:
P(Outlook=o|ClassPlay=b), where o ∈ [Sunny, Overcast, Rainy] and b ∈ [yes, no]
P(Temperature=t|ClassPlay=b), where t ∈ [Hot, Mild, Cool] and b ∈ [yes, no],
P(Humidity=h|ClassPlay=b), where h ∈ [High, Normal] and b ∈ [yes, no],
P(Wind=w|ClassPlay=b), where w ∈ [Weak, Strong] and b ∈ [yes, no].
We also calculate P(ClassPlay=Yes) and P(ClassPlay=No).
Classification Phase
Let’s say, we get a new instance of the weather condition, x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong) that will have to be classified (i.e., are we going to play tennis under the conditions specified by x’).
With the MAP rule, we compute the (unnormalized) posterior probabilities for both classes. This is easily done by looking up the tables we built in the learning phase.
P(ClassPlay=Yes|x’) = [P(Sunny|ClassPlay=Yes) × P(Cool|ClassPlay=Yes) ×
P(High|ClassPlay=Yes) × P(Strong|ClassPlay=Yes)] ×
P(ClassPlay=Yes)
= 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
P(ClassPlay=No|x’) = [P(Sunny|ClassPlay=No) ×P(Cool|ClassPlay=No) ×
P(High|ClassPlay=No) × P(Strong|ClassPlay=No)] ×
P(ClassPlay=No)
= 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0205
Since P(ClassPlay=Yes|x’) is less than P(ClassPlay=No|x’), we classify the new instance x’ as “No”.
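As a quick check, here is a minimal Python sketch that reproduces the calculation above; the priors and likelihoods are simply the hand-counted values from the 14-day training data.

```python
# Likelihood tables counted from the 14-day play-tennis training data
priors = {"Yes": 9/14, "No": 5/14}
likelihoods = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}

x_new = ["Sunny", "Cool", "High", "Strong"]

# Unnormalized posterior for each class: prior * product of likelihoods
scores = {}
for label in priors:
    score = priors[label]
    for value in x_new:
        score *= likelihoods[label][value]
    scores[label] = score

print(scores)                       # Yes ≈ 0.0053, No ≈ 0.021
print(max(scores, key=scores.get))  # 'No' -- the MAP decision
```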
Naive Bayes Algorithm on Text data
Naive Bayes is a learning algorithm commonly applied to text classification.
Using Naive Bayes text classifiers might be a really good idea, especially if there’s not much training data available and computational resources are scarce.
Consider a Spam/Ham classification task for a given piece of text; there are two classes: "Spam" and "Ham" (i.e. non-spam).
Depending on the given text, we first apply some preprocessing techniques such as removing stopwords and stemming, and then featurize it, for example with Bag of Words.
For a feature vector x containing the words of an email, we compute the probability that it is Spam (class Y=1) as p(Y=1∣x) ∝ p(Y=1) ∏ p(xi∣Y=1).
Similarly, we compute the same quantity for the Ham class, p(Y=0∣x) ∝ p(Y=0) ∏ p(xi∣Y=0).
The class probabilities p(Y=1) and p(Y=0) are estimated from the fraction of training emails in each class, and each likelihood p(xi∣Y) is estimated from how often the word xi appears in the emails of that class.
Finally, we need to find the class Y with the maximum probability: using the MAP rule above, we assign the email to the class for which this product is largest.
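As a concrete illustration, here is a minimal scikit-learn sketch of such a spam/ham pipeline; the four-message corpus below is made up purely for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, hypothetical corpus just for illustration
texts = [
    "win a free prize now",         # spam
    "limited offer win money",      # spam
    "meeting agenda for tomorrow",  # ham
    "lunch with the project team",  # ham
]
labels = [1, 1, 0, 0]  # 1 = Spam, 0 = Ham

# Bag of Words featurization with stopword removal
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Multinomial Naive Bayes (alpha is the Laplace smoothing parameter)
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

new_mail = vectorizer.transform(["free money offer"])
print(clf.predict(new_mail))        # predicted class (Spam/Ham)
print(clf.predict_proba(new_mail))  # P(Ham|x), P(Spam|x)
```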
Laplace (or) Additive Smoothing
This is introduced to solve the problem of zero probability: what happens if the query point contains a new observation that was never seen in the training data when we calculate the probabilities?
The problem with the Maximum Likelihood estimate is: what if we have seen no training documents containing a particular word for a given class in the dataset?
If a new word w' appears at test time that was not present in the training data, then p(w'∣y=1) and p(w'∣y=0) both become zero.
Zero probabilities cannot be conditioned away, no matter the other evidence! Simply ignoring these words is not good for the model either. To solve this problem we use Laplace smoothing.
Let's see how to use Laplace smoothing. Suppose a word w' was never seen in the training corpus for class Y=1, so the maximum-likelihood estimate gives P(w'∣y=1) = 0, and let n1 be the number of training points with Y=1.
If the word is not present in the corpus, then instead of assigning zero to P(w'∣y=1) we assign P(w'∣y=1) = (number of occurrences of w' in class Y=1 + α) / (n1 + αk),
where α is the smoothing parameter and k is the number of distinct values w' can take. Often we use α = 1, which is called Laplace add-one smoothing; if α < 1 it is called Lidstone smoothing.
‘α’ should not be too high or too small; it should be chosen properly, taking the bias-variance trade-off into consideration. ‘α’ should not disturb the roughly uniform probabilities that are assigned to unknown words/new observations.
Generalized Additive Smoothing: Laplace add-one smoothing can assign too much probability to unseen words, so it is more common to use a general α instead of fixing it to 1.
If the numerator and denominator counts are small, then the impact of α on the likelihood probabilities grows quickly as α increases. So as α becomes very large, the likelihood probabilities approach a uniform value (≈ ½ when k = 2).
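A tiny sketch of this formula on made-up counts shows both effects; the helper function and the numbers below are purely illustrative.

```python
def laplace_smoothed_prob(word_count, class_count, alpha=1.0, k=2):
    """Additive (Laplace/Lidstone) smoothed estimate of P(w | y).

    word_count  -- times the word was seen in this class (0 for unseen words)
    class_count -- n1, the number of training points in this class
    k           -- number of distinct values the feature can take
    """
    return (word_count + alpha) / (class_count + alpha * k)

# An unseen word gets a small but non-zero probability
print(laplace_smoothed_prob(word_count=0, class_count=100, alpha=1))      # ~0.0098

# A very large alpha pushes every estimate towards the uniform value 1/k
print(laplace_smoothed_prob(word_count=0, class_count=100, alpha=10000))  # ~0.5
```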
Finding Optimal ‘α’
Using an elbow plot, try plotting the performance metric versus ‘α’, or use grid search cross-validation. Here α is a hyper-parameter.
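A minimal sketch of this search: score several values of α with cross-validated AUC and pick the best one. The random count matrix below is only a stand-in for a real BOW training set.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Stand-in data: replace with the real BOW matrix and class labels
X_train = rng.integers(0, 5, size=(200, 50))
y_train = rng.integers(0, 2, size=200)

alphas = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
cv_auc = [cross_val_score(MultinomialNB(alpha=a), X_train, y_train,
                          cv=5, scoring="roc_auc").mean() for a in alphas]

best_alpha = alphas[int(np.argmax(cv_auc))]
print(best_alpha)
# Plotting cv_auc against log10(alpha) gives the elbow plot described above
```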
Log-Probabilities and Numerical Stability
A problem arises when m (the size of the vocabulary, i.e. the number of features) gets very big. Notice that because each P(wi | Cj) term is between 0 and 1, when you start multiplying them together the overall product starts approaching zero.
This is exacerbated by each new feature that is added: with 1,000 features, multiplying 1,000 probabilities together quickly results in an extremely tiny number, one which often cannot be represented in a double or long double.
A common workaround for this situation is to represent probabilities as log-probabilities instead. That is, instead of using P(wi | Cj) and P(Cj), we use ln P(wi | Cj) and ln P(Cj) (ln means the natural logarithm, though any logarithm works). This transformation works (and is rather elegant) for a number of reasons.
- Because a probability is always between 0 and 1, a log-probability is always between −∞ and 0 (a much wider range of numbers than between 0 and 1).
- If a < b, then ln a < ln b. This property of logarithms means that comparing probabilities to figure out which one is the biggest (as argmax does) is equivalent to comparing the log-probabilities.
- Finally, the math works out nicely because the logarithm of a product is the sum of the logarithms (that is, ln(a · b) = ln a + ln b).
Note that because we are taking the logarithm of numbers between 0 and 1, the log-probabilities will always be negative. This is OK. Just do your calculations and comparisons as you would normally: a “more negative” number (one greater in absolute value) corresponds to a smaller probability than a “less negative” number, so you don’t have to switch the comparison.
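A quick sketch of the underflow problem and the log-space fix:

```python
import math

# 1,000 per-feature probabilities of 0.1 each
probs = [0.1] * 1000

product = 1.0
for p in probs:
    product *= p
print(product)      # 0.0 -- the true value 1e-1000 underflows a float

log_score = sum(math.log(p) for p in probs)
print(log_score)    # about -2302.6, easily representable
# argmax over log-scores picks the same class as argmax over probabilities
```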
Bias-variance Trade off
Case 1: If α = 0 (High variance)
In this case, even a very rare word seen in training is assigned some probability, but a word that was never seen in the training phase suddenly gets a likelihood probability of exactly zero.
This means our model is overfitting, because a very small change in the training data produces a very large difference in the model.
So this is a case of high variance.
Case 2: If α is a very large value (High bias)
As we have seen earlier, if α is very large then the likelihood probabilities ≈ ½.
If we have balanced data then priors for each class also become the same. Then in such cases, the posterior probability for each class is approximately the same.
So this is a case of high bias because the posterior probabilities are the same.
When alpha is too small, it leads to overfitting.
When alpha is too large, it leads to underfitting, because the likelihood probabilities become a uniform distribution and we can no longer say which class a new data point belongs to.
Feature Importance and Interpretability
For the Naive Bayes algorithm, feature importance is obtained from the likelihood probabilities: sort the features by their likelihood probabilities in decreasing order, and the top words are the most important features.
In Naive Bayes, feature importance is therefore obtained directly from the model itself.
The Naive Bayes algorithm also gives interpretability easily, since each prediction can be traced back to the per-feature likelihoods that contributed to it.
How does Naive Bayes work when we have imbalanced data, outliers, missing values, and so on?
Imbalance data
Naive Bayes is also impacted by imbalanced data; the solution is to perform upsampling (or) downsampling.
With imbalanced data, α also has more impact on the minority class and less impact on the majority class.
Outliers
- At training time, if a word (Wi) occurs very few times, we can simply ignore it.
- At test time, Laplace smoothing takes care of such outlier (unseen or rare) words.
Missing-Values
For text data there is usually no case of missing values; if we have categorical features we can treat NaN as one more category, and for numerical features we can use model-based imputation techniques.
Multiclass Classification
Naive Bayes easily does multiclass classification, because we only need to compare the classes and pick the class Ck for which the posterior value is largest, using the maximum a posteriori (MAP) decision rule.
Similarity (or) Distance Matrix
Naive Bayes can’t handle a similarity or distance matrix, because it is not based on distances; it is a probability-based technique.
Large Dimensionality
It is very comfortable with large dimensionality d. When d is large, the product of many probabilities becomes extremely small, so we use log-probabilities to avoid numerical stability and underflow issues.
Types of Naive Bayes Classifier
Bernoulli Naive Bayes
This classifier is suitable for discrete data. Bernoulli NB is designed for binary (or) boolean features. The binary Bag of words (BOW) is a simple example of Bernoulli NB. It is used normally when we have Bernoulli distribution. The parameters that we use to predict the class variable take up only values yes or no, for example, if a word occurs in the text or not.
Multinomial Naive Bayes
This is mostly used for the document classification problems, i.e whether a document belongs to the category of sports, politics, technology, etc. The features (or) predictors used by the classifier are the frequency of the words present in the document.
Gaussian Naive Bayes
When we are dealing with the continuous data (or) Numerical Features, Gaussian Naive Bayes is useful. It is used when data has Gaussian distribution. When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution.
Since the way the values are represented in the dataset changes, the formula for the conditional probability changes to the Gaussian density p(xi∣Ck) = (1 / √(2π σk,i²)) · exp(−(xi − μk,i)² / (2 σk,i²)), where μk,i and σk,i² are the mean and variance of feature xi for class Ck.
Although these methods vary in form, the core idea behind them is the same: assume the features follow a certain distribution, estimate the parameters of that distribution, and then use the resulting probability density function.
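A minimal sketch of the three scikit-learn variants; the tiny arrays below are made up only to show which estimator matches which kind of feature.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])

# Binary (word present / absent) features -> Bernoulli NB
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
BernoulliNB().fit(X_bin, y)

# Word-count features -> Multinomial NB
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 2, 4], [0, 1, 3]])
MultinomialNB().fit(X_counts, y)

# Continuous features -> Gaussian NB (per-class mean and variance are estimated)
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [6.7, 3.1]])
print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.0]]))
```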
Best and Worst cases of Naive Bayes Algorithm
- Naive Bayes makes the assumption that the features are conditionally independent. Theoretically, if the assumption does not hold, the performance of NB degrades. However, research has shown that even when there is some feature dependency, Naive Bayes still gives good results.
- However, the strong violation of this assumption will lead to the poor performance of Naive Bayes.
- Naive Bayes is a benchmark (or) baseline algorithm for text classification problems.
- Naive Bayes is used extensively when the data contains categorical features, and less often with purely numerical features.
- Naive Bayes has very good interpretability and feature importance.
- Naive Bayes is very time- and space-efficient, so it can be used for low-latency applications.
- Naive Bayes easily overfits, so Laplace smoothing must always be applied.
Naive Bayes Algorithm with Amazon Food Reviews Analysis
Let's apply the Naive Bayes algorithm to the real-world Amazon Fine Food Reviews dataset from Kaggle.
First, we want to know: what is Amazon Fine Food Review Analysis?
This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews from all other Amazon categories.
Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.
Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
- Id
- ProductId — unique identifier for the product
- UserId — unique identifier for the user
- ProfileName
- HelpfulnessNumerator — number of users who found the review helpful
- HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
- Score — a rating between 1 and 5
- Time — timestamp for the review
- Summary — Brief summary of the review
- Text — Text of the review
Objective
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).
Data Preprocessing
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.
To Know the Complete overview of the Amazon Food review dataset and Featurization visit my previous blog link here.
Assign the preprocessed review text to the feature variable X and the target (Score) to Y.
X=data['preprocessed_reviews'].values
Y=data['Score'].values
Train-Test split
The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.
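A minimal sketch of that split, assuming X and Y were assigned as above (the 70/30 split, stratification, and random seed are just illustrative choices):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the reviews for testing; stratify keeps the class ratio intact
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=42)
```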
Text Featurization using Bag of Words
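A minimal sketch of the Bag of Words featurization, assuming X_train and X_test come from the split above (min_df and max_features are illustrative settings):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training reviews only to avoid data leakage
count_vect = CountVectorizer(min_df=10, max_features=5000)
X_train_bow = count_vect.fit_transform(X_train)
X_test_bow = count_vect.transform(X_test)

print(X_train_bow.shape, X_test_bow.shape)
```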
Hyper Parameter tuning
We want to choose the best alpha for better model performance, and we choose it using grid search cross-validation.
We already defined a grid-search function; when we call it, it gives the result.
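The exact helper function is not shown here, but a grid search over α might look roughly like this sketch (it assumes the X_train_bow and y_train from the featurization step above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

param_grid = {"alpha": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5,
                    scoring="roc_auc", return_train_score=True)
grid.fit(X_train_bow, y_train)

print(grid.best_params_)  # e.g. {'alpha': 0.1}
print(grid.best_score_)   # mean cross-validated AUC for the best alpha
```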
Testing with Test data
After we find the best alpha using grid search CV, we want to check the performance on the test data; in this case study we use AUC as the performance measure.
We already defined a function for testing on the test data; when we call it, it gives the result.
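That check might look roughly like this sketch, refitting with the best α and scoring with AUC (it assumes the variables from the earlier sketches and that the labels are encoded as 0 = negative, 1 = positive):

```python
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

best_nb = MultinomialNB(alpha=grid.best_params_["alpha"])
best_nb.fit(X_train_bow, y_train)

# AUC needs scores/probabilities, not hard class labels
y_prob = best_nb.predict_proba(X_test_bow)[:, 1]
print("Test AUC:", roc_auc_score(y_test, y_prob))
```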
Performance Metrics
Performance metrics are used to measure the behavior, activities, and performance of a business. This should be in the form of data that measures required data within a range, allowing a basis to be formed supporting the achievement of overall business goals.
To Know detailed information about performance metrics used in Machine Learning please visit my previous blog link here.
We already defined a function for the performance metrics; when we call it, it gives the result.
Feature importance
The top 10 features of the positive class and the top 10 features of the negative class are obtained for both feature sets using the values of the `feature_log_prob_` parameter.
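A sketch of how those lists can be pulled from the fitted model via `feature_log_prob_` (it assumes the `best_nb` and `count_vect` from the earlier sketches, a recent scikit-learn, and that row 0 / row 1 follow the 0 = negative, 1 = positive order of `classes_`):

```python
import numpy as np

feature_names = np.array(count_vect.get_feature_names_out())
log_probs = best_nb.feature_log_prob_   # shape: (n_classes, n_features)

# Indices of the 10 largest per-class log-likelihoods
top_negative = feature_names[np.argsort(log_probs[0])[::-1][:10]]
top_positive = feature_names[np.argsort(log_probs[1])[::-1][:10]]

print("Top 10 negative-class words:", top_negative)
print("Top 10 positive-class words:", top_positive)
```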
Similarly, we built a Naive Bayes model with TFIDF features also. To understand the full code please visit my GitHub link.
Conclusions
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, and this hinders the performance of the classifier.
To write the conclusions in a table we used the Python library PrettyTable.
PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.
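A small sketch of such a summary table; only the TFIDF score is quoted in the observations below, so the Bag of Words cell is left as a placeholder.

```python
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Featurization", "Model", "Best alpha", "Test AUC"]
table.add_row(["Bag of Words", "Multinomial NB", 0.1, "<test AUC>"])
table.add_row(["TFIDF", "Multinomial NB", 0.1, "95.51%"])
print(table)
```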
Observations
- Compared to the Bag of Words feature representation, the TFIDF features got the highest AUC score of 95.51% on the test data.
- Both models have 0.1 as the best alpha from hyperparameter tuning.
- Both models worked reasonably well for Amazon food reviews classification.
References
- Applied AI
- Wikipedia
- Coursera
- Data Camp
Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…
Happy Learning!!