Naïve Bayes Classifier for Text Classification of Customer Reviews

XY · Published in MITB For All · Jun 14, 2024

In this post, we will explore using the Naïve Bayes Classifier to differentiate 1-star and 5-star customer text reviews. In particular, we will focus on how feature importance can be utilized to evaluate our model's performance.

Classification algorithms are fundamental techniques in machine learning, employed to predict the category or label of a given input based on training data. These algorithms are essential to various applications, including spam detection, predicting success or failure outcomes, and image recognition. One such algorithm is the Naïve Bayes classifier.

The Naïve Bayes Classifier

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem and assumes independence among features. This means that, given a class label, the presence or absence of a particular feature is not affected by the presence or absence of any other feature.
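Concretely, for a review containing words w1, w2, …, wn, this assumption lets the posterior probability factorize into a product of simple per-word terms:

P(Class | w1, …, wn) ∝ P(Class) × P(w1 | Class) × P(w2 | Class) × … × P(wn | Class)

The classifier then simply predicts the class with the higher value.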

In this example, we use Naïve Bayes to classify 1-star and 5-star customer text reviews from Yelp’s dataset and look into how feature analysis can help us better understand our model.

Applying Naïve Bayes to Text Classification

Dataset

Yelp is an online platform that allows users to discover and review local businesses, such as restaurants, shops, and service providers. A subset of their crowd-sourced reviews about businesses has been made available as part of the Yelp Dataset Challenge (updated data can be downloaded here; we will only be using the 2004 dataset).

Here, we focus on two columns in the review file: stars and text.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('yelp_2004.csv')

## Alternatively, if the raw data is in JSON:
# import json
# data_file = open("yelp_academic_dataset_review.json")
# data = []
# for line in data_file:
#     data.append(json.loads(line))
# data_file.close()
# df = pd.DataFrame(data)
# df = df[['stars', 'text']]

# Create a new dataframe with just 1 and 5 stars
df_filtered = df.loc[(df.stars == 1) | (df.stars == 5),:].copy()
df_filtered.reset_index(drop=True,inplace=True)

# Stratified split based on stars (train 80%, test 20%)
df_train, df_test = train_test_split(df_filtered, stratify=df_filtered.stars, test_size=0.2, random_state=2024)
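As a quick sanity check (illustrative, not part of the original pipeline), we can verify that the stratified split preserved the 1-star/5-star proportions:

# Illustrative check: class proportions should match across the two splits
print(df_train.stars.value_counts(normalize=True))
print(df_test.stars.value_counts(normalize=True))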

Data Preprocessing

Before fitting our training data into the Naïve Bayes classifier, we will first need to convert the text data into numerical features. This transformation can be achieved through tokenization and vectorization functions in scikit-learn such as CountVectorizer (example here) or TfidfVectorizer (TF-IDF, Term Frequency-Inverse Document Frequency).

from sklearn.feature_extraction.text import CountVectorizer

# Convert text data into numerical matrices
vectorizer = CountVectorizer(min_df=10, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train.text)
y_train = df_train.stars

X_test = vectorizer.transform(df_test.text)
y_test = df_test.stars

# Get the list of words (features) from the vectorizer
feat_vocab = vectorizer.get_feature_names_out()
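If we wanted TF-IDF weights instead of raw counts, the TfidfVectorizer mentioned earlier is a drop-in replacement with the same fit/transform interface. A minimal sketch, keeping the same parameters:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same interface as CountVectorizer, but weights terms by TF-IDF
tfidf_vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)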

Model Training

We begin by initializing the Naïve Bayes classifier with its default settings. Next, we fit this classifier on our training data, which contains the labeled 1-star and 5-star text reviews vectorized in the previous step.

from sklearn import naive_bayes

# Initialize MultinomialNB and fit with training data
mnb = naive_bayes.MultinomialNB()
mnb.fit(X_train, y_train)
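Once fitted, the model can score new text passed through the same vectorizer. The review below is a made-up example, used only to illustrate the prediction call:

# Hypothetical review string, for illustration only
sample = vectorizer.transform(["The food was amazing and the service was fantastic"])
print(mnb.predict(sample))  # prints the predicted star rating (1 or 5)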

Model Evaluation

After training, we evaluate the model's performance using confusion matrices for both the training and testing datasets. A confusion matrix shows the accuracy of predictions by comparing predicted labels against actual labels. Here, the confusion matrix is further normalized by converting the counts to percentages within each true-label category. This allows for a more meaningful evaluation of the output in cases where the classes are imbalanced.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import metrics

# Get predictions from the trained model
y_train_pred = mnb.predict(X_train)
y_test_pred = mnb.predict(X_test)

def confusionmat_plot(y_actual, y_predict, ax):
    cm_lab = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    cm_count = metrics.confusion_matrix(y_actual, y_predict)
    cm_count = ['{0}'.format(value) for value in cm_count.flatten()]

    cm_perc = metrics.confusion_matrix(y_actual, y_predict, normalize='true')
    cm_perc = ['{0:0.2f}%'.format(100 * value) for value in cm_perc.flatten()]

    labels = [f'{l}\n{c}\n{p}' for l, c, p in zip(cm_lab, cm_count, cm_perc)]
    labels = np.asarray(labels).reshape(2, 2)

    sns.heatmap(metrics.confusion_matrix(y_actual, y_predict), fmt='s', annot=labels, ax=ax, cmap='Blues')
    ax.set(ylabel="Actual", yticklabels=["1-Star", "5-Star"], xlabel="Predicted", xticklabels=["1-Star", "5-Star"])

fig, axes = plt.subplots(ncols=2, figsize=(15, 6))
confusionmat_plot(y_train, y_train_pred, axes[0])
axes[0].set(title="Training Data")

confusionmat_plot(y_test, y_test_pred, axes[1])
axes[1].set(title="Testing Data")

Our analysis revealed that the model produced a slightly higher proportion of false negatives than false positives, erroneously classifying more 5-star reviews as 1-star. A false negative occurs when the model predicts the negative class (here, 1-star) for an instance that is actually positive (5-star). This imbalance suggests the model has a tendency to overlook true positive cases.

To follow up on this, we can do an error analysis by examining the feature importance to identify patterns or common characteristics leading to these errors. This would help us identify the limitations of our model and how to improve on it.

Assessing Feature Importance

Feature importance in the Naïve Bayes model can be gauged using the odds ratio, defined as:
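Odds Ratio = P(Class1 | Word) / P(Class2 | Word)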

Here, a value >1 suggests that, given the presence of a particular word in the text, membership in Class1 is more likely than membership in Class2. Conversely, a value <1 indicates that membership in Class2 is more likely given the presence of that word.

Combining this with Bayes’ theorem:
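P(Class1 | Word) = P(Word | Class1) × P(Class1) / P(Word)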

This converts the formula to:
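Odds Ratio = [P(Word | Class1) × P(Class1)] / [P(Word | Class2) × P(Class2)]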

Since P(Word) cancels out and the priors are constant across all words (i.e., P(Class1) does not change from word to word), the odds ratio can be simplified and, for the purpose of ranking words, approximated by:
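Odds Ratio ≈ P(Word | Class1) / P(Word | Class2)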

The MultinomialNB model in scikit-learn provides the log conditional probability of each word for each class through the feature_log_prob_ attribute (i.e., log(P(Word|Class1))). By subtracting the log conditional probabilities of one class from the other, we can approximate the log of the odds ratio above using:
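log(Odds Ratio) ≈ log(P(Word | Class1)) − log(P(Word | Class2))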

By ranking these log odds ratio values, we can then identify the top features for each class. Here, we will compute it for both 1-star and 5-star:

# Get the log conditional probability of each word for each class
feat_prob = pd.DataFrame(mnb.feature_log_prob_, index=["Star1_CondProb", "Star5_CondProb"], columns=feat_vocab).T
feat_prob.reset_index(inplace=True)

# Compute log odds: log(P(word|class1)) - log(P(word|class2)) = log(P(word|class1) / P(word|class2))
feat_prob["Odds_1Star"] = feat_prob.Star1_CondProb - feat_prob.Star5_CondProb
feat_prob["Odds_5Star"] = feat_prob.Star5_CondProb - feat_prob.Star1_CondProb

feat_prob.sort_values(by=["Odds_1Star"], ascending=False, inplace=True, ignore_index=True)
print('Top 10 features most predictive of 1-star reviews:')
print(feat_prob.loc[:, ["index", "Odds_1Star"]].head(10))

feat_prob.sort_values(by=["Odds_5Star"], ascending=False, inplace=True, ignore_index=True)
print('\nTop 10 features most predictive of 5-star reviews:')
print(feat_prob.loc[:, ["index", "Odds_5Star"]].head(10))

Insights from Feature Analysis

In general, the top features for the 1-star and 5-star reviews match our expectations. For 1-star reviews, negative words such as "pissed", "zero stars", and "worst experience" predominate. For 5-star reviews, positive terms like "delectable", "not disappoint", and "must try" stand out. However, the terms most indicative of 5-star reviews also appear to be highly food-related, such as "melted in", "richness", and "buttermilk". Thus, they may not generalize well beyond this specific context, and may contribute to the higher rate of false negatives observed in the confusion matrix.

Conclusion

By analyzing the feature importance of the Naïve Bayes classifier, we gain valuable insight into the keywords that drive our model's predictions, as well as into the contextual limitations of our model.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
