Part 2: Dive into Bernoulli Naive Bayes

GridflowAI
10 min read · Oct 12, 2023


Before we dive into the intricacies of Bernoulli Naive Bayes (BNB), it’s essential to acknowledge the foundation we’ve built in our first blog post. Part 1 of this series explored the core concepts of Naive Bayes, including conditional probability and Bayes’ theorem. It also introduced the fundamental idea of hypotheses and evidence, laying the groundwork for understanding how Naive Bayes operates.

We examined how Bayes’ theorem bridges the gap between hypotheses and evidence, enabling us to refine our predictions based on new data. By simplifying this mathematical principle into practical terms, we can adjust our beliefs when presented with fresh information.

So, if you haven’t had the chance to explore Part 1, we encourage you to refer to that blog for a comprehensive introduction to the essential concepts that underpin our journey into Naive Bayes.

In Part 2, we’ll delve deeper into the specialized world of Bernoulli Naive Bayes, a variant tailored for binary data. Let’s continue our exploration!

The Bernoulli Variant

When dealing with machine learning and classification tasks, especially in the realm of text data, the Bernoulli Naive Bayes (BNB) algorithm often shines. Designed specifically for binary data, this variant is best suited for datasets where features showcase either the presence (1) or absence (0) of certain attributes.

Dissecting Binary Data Through Examples

Consider the vast number of emails that flow into our inboxes. Using a binary representation, we can categorize these emails based on the presence or absence of specific words or phrases that are common in spam. We first construct a vocabulary from the corpus of emails; the presence or absence of each vocabulary word in a given email is then encoded as 1 or 0, leading to the representation sketched below.
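As a rough illustration of what this encoding looks like (the vocabulary and example emails below are made up, not drawn from a real spam corpus):

# A minimal sketch of binary (presence/absence) encoding for emails.
# The vocabulary and example emails are hypothetical, for illustration only.
vocabulary = ["free", "winner", "offer", "meeting", "invoice"]

emails = [
    "You are a winner claim your free offer now",
    "Please find the invoice attached before our meeting",
]

for email in emails:
    words = set(email.lower().split())
    # 1 if the vocabulary word appears in the email, 0 otherwise
    vector = [1 if word in words else 0 for word in vocabulary]
    print(vector)

# Output:
# [1, 1, 1, 0, 0]
# [0, 0, 0, 1, 1]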

This binary representation makes the data a natural fit for Bernoulli Naive Bayes. Another example can be seen in customer churn prediction.

In the telecommunications industry, businesses aim to predict whether a customer will stop using their service. This can be modeled with binary features: each attribute, such as Usage, Frequency, Complaints, or Purchase History, is converted to a binary value representing high/low, frequent/infrequent, and so on, as sketched below. From this binary representation, a Bernoulli Naive Bayes classifier can predict whether a customer is likely to churn (Yes/No).
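Here is a rough sketch of that binarization step. The column names and thresholds below are hypothetical; real cut-offs would be domain specific.

import pandas as pd

# Hypothetical raw customer data; columns and thresholds are illustrative only.
raw = pd.DataFrame({
    "monthly_usage_gb": [1.2, 18.5, 0.4, 25.0],
    "complaints_last_year": [0, 3, 1, 0],
    "recent_purchase": [True, False, False, True],
})

binary = pd.DataFrame({
    # 1 = high usage, 0 = low usage (threshold chosen arbitrarily here)
    "Usage": (raw["monthly_usage_gb"] > 10).astype(int),
    # 1 = has complained at least once, 0 = never complained
    "Complaints": (raw["complaints_last_year"] > 0).astype(int),
    # 1 = made a recent purchase, 0 = did not
    "Purchase History": raw["recent_purchase"].astype(int),
})

print(binary)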

Distinguishing BNB from Bernoulli Distribution

While both Bernoulli Naive Bayes and the Bernoulli distribution share the name “Bernoulli”, they serve distinct purposes. BNB is a machine learning algorithm designed for classification, especially apt for text data. The Bernoulli distribution, on the other hand, is a basic probability distribution modeling a single binary outcome: success (1) or failure (0). Their common ground is binary data: BNB models each binary feature, conditioned on the class, as a Bernoulli random variable.
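To make the distinction concrete, here is a minimal sketch of the Bernoulli distribution itself, using scipy.stats purely for illustration:

from scipy.stats import bernoulli

# A Bernoulli random variable with success probability p
p = 0.3
print(bernoulli.pmf(1, p))  # P(X = 1) = p = 0.3
print(bernoulli.pmf(0, p))  # P(X = 0) = 1 - p = 0.7

# Bernoulli Naive Bayes reuses this idea per feature and per class:
# P(feature_i = 1 | class) is estimated from the training data, and
# P(feature_i = 0 | class) = 1 - P(feature_i = 1 | class).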

Broad Applications of BNB

BNB’s utility isn’t confined to text. Its applications span various industries:

  • Cybersecurity: Email/SMS Spam Detection
  • E-commerce: Sentiment Analysis of Product Reviews
  • Information Retrieval: Document Categorization
  • Telecommunications: Customer Churn Prediction
  • Manufacturing: Fault Detection in Industrial Equipment
  • Healthcare: Medical Diagnosis
  • Finance: Credit Scoring

Step-by-Step with BNB: A Simplified Walkthrough

For those keen on the nitty-gritty, let’s explore a hands-on approach with BNB using a simple dataset.

1. Data Preparation: Begin with a set of binary data. Each row represents a data sample, and each column represents a feature.
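Since the dataset is easiest to picture as code, here is one possible arrangement consistent with the counts used in the steps below (10 samples, 6 in Class 0 and 4 in Class 1). The exact rows are illustrative; only the totals matter for the calculations.

# One possible binary dataset consistent with the counts used in this walkthrough.
# Columns: (Feature 1, Feature 2, Class). The exact rows are illustrative.
data = [
    # Class 0: 6 samples, 3 with Feature 1 = 1, 3 with Feature 2 = 1
    (1, 1, 0),
    (1, 1, 0),
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 0),
    (0, 0, 0),
    # Class 1: 4 samples, 2 with Feature 1 = 1, 1 with Feature 2 = 1
    (1, 1, 1),
    (1, 0, 1),
    (0, 0, 1),
    (0, 0, 1),
]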

2. Training Phase — Class Priors: Calculate the prior probability of each class from the training data:

  • Total samples = 10
  • Class 0 samples = 6
  • Class 1 samples = 4

Thus:

P(Class 0) = 6/10 = 0.6

P(Class 1) = 4/10 = 0.4

3. Training Phase — Feature Likelihoods: Compute the conditional probability of each feature, given the class, based on its presence (1) or absence (0):

For Feature 1:

1. With Class 0: 3 out of 6 samples have Feature 1 = 1

P(Feature 1=1 | Class 0) = 3/6 = 0.5

2. With Class 1: 2 out of 4 samples have Feature 1 = 1

P(Feature 1=1 | Class 1) = 2/4 = 0.5

For Feature 2:

1. With Class 0: 3 out of 6 samples have Feature 2 = 1

P(Feature 2=1 | Class 0) = 3/6 = 0.5

2. With Class 1: 1 out of 4 samples have Feature 2 = 1

P(Feature 2=1 | Class 1) = 1/4 = 0.25

4. Prediction: Using the trained model, classify new samples.

Suppose a new sample arrives with Feature 1 = 1 and Feature 2 = 0. Calculate the likelihood of this feature combination under each class and multiply it by the class prior to obtain the unnormalized posterior probability.

For Class 0:
P(Feature 1=1, Feature 2=0 | Class 0) = P(Feature 1=1 | Class 0) × P(Feature 2=0 | Class 0)
= 0.5 × 0.5
= 0.25

Unnormalized Posterior: P(Class 0) × Likelihood = 0.6 × 0.25 = 0.15

For Class 1:
P(Feature 1=1, Feature 2=0 | Class 1) = P(Feature 1=1 | Class 1) × P(Feature 2=0 | Class 1)
= 0.5 × 0.75
= 0.375

Unnormalized Posterior: P(Class 1) × Likelihood = 0.4 × 0.375 = 0.15

Normalize these probabilities:

Z = 0.15 (Class 0) + 0.15 (Class 1) = 0.3

P(Class 0 | Features) = 0.15 / 0.3 = 0.5

P(Class 1 | Features) = 0.15 / 0.3 = 0.5

Making the prediction

The class with the larger posterior probability becomes the prediction. In this particular example the two posteriors happen to tie at 0.5, so the model has no preference between the classes; such a tie is broken arbitrarily in practice (for example, by taking the first class), and with slightly different feature counts one posterior would clearly dominate.

With this, you’ve just employed Bernoulli Naive Bayes to classify data. It’s this systematic, probabilistic approach, grounded in Bayes’ theorem, that allows BNB to effectively tackle classification challenges involving binary data.
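To double-check the arithmetic, here is a small sketch that reproduces the hand calculation in Python. The probabilities are hard-coded from the steps above; note that scikit-learn’s BernoulliNB applies Laplace smoothing by default, so its estimates would differ slightly.

# Reproduce the hand calculation for a sample with Feature 1 = 1, Feature 2 = 0.
priors = {0: 0.6, 1: 0.4}        # P(Class)
p_f1 = {0: 0.5, 1: 0.5}          # P(Feature 1 = 1 | Class)
p_f2 = {0: 0.5, 1: 0.25}         # P(Feature 2 = 1 | Class)

sample = {"feature_1": 1, "feature_2": 0}

unnormalized = {}
for c in (0, 1):
    # Bernoulli likelihood: p if the feature is present, 1 - p if it is absent
    lik_1 = p_f1[c] if sample["feature_1"] == 1 else 1 - p_f1[c]
    lik_2 = p_f2[c] if sample["feature_2"] == 1 else 1 - p_f2[c]
    unnormalized[c] = priors[c] * lik_1 * lik_2

Z = sum(unnormalized.values())
posteriors = {c: v / Z for c, v in unnormalized.items()}

print({c: round(v, 4) for c, v in unnormalized.items()})  # {0: 0.15, 1: 0.15}
print({c: round(v, 4) for c, v in posteriors.items()})    # {0: 0.5, 1: 0.5}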

Bernoulli Naive Bayes Implementation on a Text Classification Problem

In the field of Natural Language Processing (NLP), text classification is a common task where machine learning models categorize text documents into predefined categories. In this section, we’ll compare the performance of two popular algorithms for text classification: Bernoulli Naive Bayes and Logistic Regression.

The Text Classification Task

Text classification involves training a model to assign text documents to predefined categories or classes. In this specific case, we have a dataset of text documents categorized into three classes: “Food,” “Sports,” and “Technology.” The goal is to develop models that can accurately classify new text documents into one of these categories.

Sample Data

Here is a sample of the data in this dataset:

Document,Category
Smartphones have become an essential part of our lives.,Technology
Apple's latest iPhone features impressive camera technology.,Technology
LeBron James leads the Lakers to a thrilling victory.,Sports
The Super Bowl was filled with excitement and big plays.,Sports
Discover the secret to the perfect chocolate chip cookie recipe.,Food
How to make a delicious homemade pizza from scratch.,Food
Artificial intelligence is transforming various industries.,Technology
World Cup 2022: Excitement builds as teams prepare to compete.,Sports
Indulge in a mouthwatering three-course meal at our restaurant.,Food
Blockchain technology is revolutionizing finance and beyond.,Technology
New advancements in quantum computing are on the horizon.,Technology
An exciting Formula 1 race took place over the weekend.,Sports
Savor the flavors of exotic spices in our international cuisine.,Food
Machine learning algorithms are powering recommendation systems.,Technology
The Olympic Games captivate the world with incredible athleticism.,Sports

Preprocessing

Each document is transformed into a binary vector using CountVectorizer with binary=True: each position in the vector indicates the presence (1) or absence (0) of a particular word from the corpus vocabulary. The resulting sparse matrix is converted to a dense array, and this word-presence data is joined back onto the original dataset. Finally, the combined dataset is split into features (the word vectors) and the target variable (the document categories) in preparation for model training.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset of documents and their categories
df = pd.read_csv('/content/bernouli-text.csv')


# Initialize the CountVectorizer with binary=True for presence/absence encoding
vectorizer = CountVectorizer(binary=True)

# Fit and transform the documents
X = vectorizer.fit_transform(df['Document'])

# Convert the sparse matrix to a dense array
word_vector = X.toarray()

# Get the feature names (words) in the order they appear in the array
feature_names = vectorizer.get_feature_names_out()

# Create a new DataFrame with the word vectors
word_vector_df = pd.DataFrame(word_vector, columns=feature_names)

# Combine the word vector with the original dataset
preprocessed_df = pd.concat([df, word_vector_df], axis=1)


# Split the data into features (X) and the target variable (y)
X = preprocessed_df.drop(['Document', 'Category'], axis=1)
y = preprocessed_df['Category']

Fitting Bernoulli Naive Bayes

The data is split into training and test sets, then a Bernoulli Naive Bayes classifier is initialized and trained on the training portion. After training, the classifier predicts categories for the test set. The model’s accuracy is calculated and displayed, followed by a detailed classification report with precision, recall, and F1-score for each category.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Bernoulli Naive Bayes classifier
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Make predictions on the test set
bnb_predictions = bnb.predict(X_test)
# Calculate and print the accuracy of the Bernoulli Naive Bayes classifier
bnb_accuracy = accuracy_score(y_test, bnb_predictions)
print("Bernoulli Naive Bayes Accuracy:", bnb_accuracy)
print("Classification Report for Bernoulli Naive Bayes:")
print(classification_report(y_test, bnb_predictions))

Fitting Logistic Regression

Logistic Regression is another widely used algorithm for text classification; it models the probability that a given document belongs to each class. Here the model is set up with an increased iteration count (max_iter=1000) to ensure convergence during training. Once initialized, it is trained on the training data and evaluated on the test set.

# Initialize and train the Logistic Regression classifier
lr = LogisticRegression(max_iter=1000) # max_iter is set to ensure convergence
lr.fit(X_train, y_train)

# Make predictions on the test set
lr_predictions = lr.predict(X_test)

# Calculate and print the accuracy of the Logistic Regression classifier
lr_accuracy = accuracy_score(y_test, lr_predictions)
print("Logistic Regression Accuracy:", lr_accuracy)
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, lr_predictions))

Interpretation of Results

Bernoulli Naive Bayes

  • Accuracy: The model correctly predicts categories with an impressive accuracy of approximately 91.67%. This means that out of every 100 predictions it makes, around 92 are correct.
  • Precision and Recall: A high precision indicates that when the model predicts a category, it is highly likely to be correct. Meanwhile, high recall implies that the model can identify most of the actual instances of each category. Both these metrics being high suggests a robust model.
  • F1-scores: The well-balanced F1-scores signal an equilibrium between precision and recall. This means the model doesn’t heavily favor one over the other, ensuring a consistent performance across different categories.

Logistic Regression

  • Accuracy: This model achieves an accuracy of about 83.33%. While this is still commendable, it trails behind the Bernoulli Naive Bayes model.
  • Precision and Recall: The slightly lower values compared to the Bernoulli Naive Bayes indicate that the Logistic Regression model might either be making more false positive predictions or missing out on some positive instances.
  • F1-scores: The lower F1-scores suggest an imbalance between precision and recall. This could indicate a scenario where improving one metric comes at the cost of the other, potentially requiring further tuning for specific use cases.

Why Bernoulli Naive Bayes Performs Better

Bernoulli Naive Bayes outperforms Logistic Regression in this specific text classification task due to the following reasons:

1. Binary Features: Bernoulli Naive Bayes is designed for binary features, which match the presence/absence word encoding used in this text classification task.

2. Independence Assumption: The model’s assumption that features are conditionally independent given the class works effectively for text data.

3. High Precision and Recall: Bernoulli Naive Bayes achieves higher precision and recall values across all classes, indicating fewer misclassifications and better capturing of actual instances.

4. Balanced F1-Scores: The F1-scores for Bernoulli Naive Bayes are well-balanced, showcasing an effective trade-off between precision and recall.

Conclusion

In this journey into the world of Naive Bayes, we’ve explored how the Bernoulli variant excels at handling binary data, making it an algorithm of choice for many classification tasks, especially in the domain of text. The flexibility of BNB extends far beyond text classification, finding application across diverse industries.

We’ve also delved into the step-by-step implementation of Bernoulli Naive Bayes using a simplified dataset, highlighting its systematic approach in assigning categories to binary data.

Furthermore, we’ve compared the performance of Bernoulli Naive Bayes to that of Logistic Regression in the context of text classification. The results favored Bernoulli Naive Bayes, primarily due to its compatibility with binary features and its assumption of feature independence.

It’s worth noting that while Bernoulli Naive Bayes shines in our text classification scenario, the choice of algorithm should always be tailored to the specific dataset and problem at hand. Experimentation and exploration remain key as we continue to unravel the intricacies of other Naive Bayes variants in our ongoing exploration.

So, stay tuned for the next installment as we venture deeper into the realm of Naive Bayes, uncovering the power and versatility of these algorithms one variant at a time!

About Me:
I’m Sudeep Joel, navigating the exciting world of artificial intelligence as a graduate student at Arizona State University. Apart from my academic pursuits, I have a penchant for penning down my insights on AI and ML. To learn more about my journey or connect with me, check out my LinkedIn profile.

