Text-Classification using Naive Bayes Classifier (NBC) from scratch

Gauravthorat
17 min read · Apr 16, 2023


Img src : https://medium.com/text-classification-algorithms/text-classification-algorithms-a-survey-a215b7ab7e2d [1]

GitHub : https://github.com/G-G-Thorat/DM_Assn_2/blob/main/nbc_gt.ipynb

Homepage : https://g-g-thorat.github.io/

Goal

The primary aim of this article is to guide you through the process of understanding the NBC for text classification, and constructing the classifier from scratch using Python. To accomplish this, we will utilize the Rotten Tomatoes Reviews Dataset [5], which is available on Kaggle, as our text corpus for classification purposes.

By the end of this post, you’ll not only grasp the mechanics of the Naïve Bayes Classifier (NBC) approach but also gain insight into its underlying principles. To achieve this goal, we’ll take you through the methodology step by step. But before that, let’s explore some fundamental concepts that will serve as a foundation for our discussion.

You can access my implementation of the Naive Bayes Classifier from scratch in my Kaggle notebook, titled “NBC_GT”, uploaded to my GitHub as well.

So, let’s not delay for another second and dive into the journey of understanding this interesting concept.

Introduction

“The power of machine learning comes from the fact that it can find patterns in data that we cannot see ourselves.” — Demis Hassabis

The aforementioned quote is a source of motivation as we embark on the journey of learning about the Naïve Bayes Classifier (NBC) through this blog post.

The intersection of computer science, artificial intelligence, and linguistics has given rise to the vast field of natural language processing (NLP). This field encompasses a diverse range of topics with practical applications in the real world, such as named entity recognition, machine translation, and machine question answering. Each of these topics employs a distinctive approach to processing textual data, as outlined in [1].

Nevertheless, before delving into more complex NLP applications, it’s crucial to understand how basic processes, such as text classification, are executed. This knowledge provides a foundation for comprehending more advanced concepts and techniques in NLP.

Let’s explore what text classification entails.

What does the term “Text Classification” mean?

Text classification, also known as text tagging or text categorization, is the process of sorting text into meaningful categories. This involves utilizing Natural Language Processing (NLP) techniques to automatically analyze text and assign it to specific categories based on its content using a set of predefined tags, as explained in [14].

Some examples of text classification, as cited in [14], include:

  • Sentiment analysis: a technique for identifying whether a text expresses positive or negative sentiment towards a particular subject (e.g., monitoring brand reputation).
  • Language detection: the process of determining the language in which a text is written (e.g., to automatically route support tickets to the appropriate team based on the language).

The most widely used techniques for text classification include Support Vector Machines (SVM), Deep Learning, and Naive Bayes Classifiers. In this blog, we will take a closer look at the Naive Bayes Classification algorithm. Let’s get started.

Naive Bayes Classifier

The Naive Bayes classification family employs Bayes’ Theorem and probability theory to forecast the appropriate tag for a given text, such as a news article or a customer review, as noted in [11]. Since they are probabilistic, these algorithms calculate the probabilities of each tag for a given text and assign the tag with the highest probability [11]. Bayes’ Theorem is utilized to determine these probabilities.

Before delving into Bayes’ Theorem, it’s essential to familiarize ourselves with the common terminologies it employs.

Probabilities

Probability is a branch of mathematics that deals with the measurement and quantification of the likelihood of an event occurring or a statement being true. The probability of an event is represented by a number between 0 and 1, with 0 indicating impossibility and 1 indicating certainty. As the probability of an event increases, so does the likelihood of it occurring.

Consider a fair six-sided dice, with each face having a unique number from 1 to 6. The probability of rolling any one of the numbers on the dice is 1/6, as there are 6 equally likely outcomes. The probability of rolling any number less than or equal to 3 is 3/6 or 1/2, as there are only three numbers that satisfy this condition. Similarly, the probability of rolling an even number is 3/6 or 1/2, since there are three even numbers on the dice.

If we roll the dice twice, the probability of getting two sixes in a row is (1/6) * (1/6) = 1/36, since the probability of rolling a six on the first roll is 1/6, and the probability of rolling another six on the second roll is also 1/6.

Probability theory is useful in many fields, including statistics, machine learning, and finance, to name just a few.

Img src : Quora

Conditional Probabilities

Conditional probability refers to the probability of an event or outcome occurring given that a prior event or outcome has already happened. It is calculated by dividing the probability of both events occurring together by the probability of the prior event, i.e. P(A|B) = P(A and B) / P(B). The concept of conditional probability is important in various fields, including statistics, machine learning, and artificial intelligence.

For example, in the context of machine learning, conditional probability can be used to predict the likelihood of a certain outcome based on the occurrence of certain events or factors. In natural language processing, for instance, the probability of a certain word or phrase appearing in a sentence can be determined based on the occurrence of certain words or phrases in the sentence or in a larger corpus of texts.

Overall, the concept of conditional probability is a fundamental principle in probability theory and has numerous applications in various fields. Its understanding is crucial for accurately predicting outcomes and making informed decisions.

Img src : Packt

Let’s consider an example. Suppose we have a deck of cards, and we draw one card randomly. There are four suits in a deck of cards: spades, hearts, diamonds, and clubs, each with 13 cards. What is the conditional probability of drawing a heart card given that the card drawn is red?

There are 26 red cards in the deck, half of which are hearts. Therefore, the probability of drawing a red card is 26/52, or 0.5. Now, we need to find the probability that the card drawn is a heart given that it is red. Since we know that there are 13 heart cards in the deck, the probability of drawing a heart card given that the card drawn is red is:

P(heart|red) = P(red|heart) x P(heart) / P(red)

where P(red|heart) is the probability of drawing a red card given that the card drawn is a heart, P(heart) is the probability of drawing a heart card, and P(red) is the probability of drawing a red card.

Since all heart cards are red, P(red|heart) = 1. The probability of drawing a heart card is 13/52, or 0.25. We have already found that the probability of drawing a red card is 0.5. Therefore, the conditional probability of drawing a heart card given that the card drawn is red is:

P(heart|red) = 1 x 0.25 / 0.5 = 0.5

This means that the probability of drawing a heart card given that the card drawn is red is 50%.

Let’s address the elephant in the room and talk about Bayes theorem, an important application of probability.

Bayes Theorem

Bayes’ Theorem is a mathematical concept that allows us to calculate the probability of an event occurring based on prior knowledge or information [12]. In the context of text classification, Bayes’ Theorem is used to calculate the probability that a piece of text belongs to a particular category or class based on the occurrence of certain words or phrases within the text.

The formula for Bayes’ Theorem is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

where A and B are events, P(A) and P(B) are the probabilities of events A and B occurring respectively, P(B|A) is the conditional probability of B given A, and P(A|B) is the probability of A given B.

Img src : Byjus

In the context of text classification, A represents a particular class or category (e.g., positive or negative sentiment), while B represents a piece of text. We can use Bayes’ Theorem to calculate the probability that the text belongs to class A given the occurrence of certain words or phrases within the text.

To do this, we need to calculate the conditional probability of the occurrence of those words or phrases given class A (P(B|A)). We also need to calculate the prior probability of class A (P(A)) and the prior probability of the occurrence of those words or phrases (P(B)).

For example,

Let’s say a rare disease affects 1 in 10,000 people. A medical test has been developed to detect this disease, but it’s not perfect. The test correctly identifies the disease 99% of the time, and it incorrectly identifies it 1% of the time for people who don’t have the disease.

Suppose a person tests positive for the disease. What is the probability that they actually have the disease?

Using Bayes’ theorem, we can calculate this as follows:

P(A) = Probability of having the disease = 1/10,000

P(B|A) = Probability of testing positive given that the person has the disease = 0.99

P(B) = Probability of testing positive = (0.99 x 1/10,000) + (0.01 x 9,999/10,000) ≈ 0.0101

P(A|B) = Probability of having the disease given that the person tested positive = ?

Now, we can use Bayes’ theorem to calculate P(A|B):

P(A|B) = (P(B|A) x P(A)) / P(B) = (0.99 x 1/10,000) / 0.0101 ≈ 0.0098

So, even though the person tested positive, the probability of them actually having the disease is only about 0.98%.
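To make the arithmetic concrete, here is a minimal Python sketch of the same calculation (the variable names are chosen purely for illustration):

```python
# Bayes' theorem applied to the rare-disease example
p_disease = 1 / 10_000            # P(A): prior probability of having the disease
p_pos_given_disease = 0.99        # P(B|A): probability of a positive test if diseased
p_pos_given_healthy = 0.01        # probability of a positive test if healthy

# P(B): total probability of testing positive
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P(A|B): probability of actually having the disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # ~0.0098, i.e. about 0.98%
```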

Moving ahead,

Additive Smoothing

Additive smoothing, also known as Laplace smoothing, is a technique used to address the problem of zero probability in Bayesian statistics and machine learning. When a particular feature or word in a text has never been seen before in the training data, it can cause problems in calculating the probability of a particular category or tag for that text. Laplace smoothing adds a small value (usually 1) to the frequency count of each feature in order to avoid zero probability issues [8].

The formula for Laplace smoothing is:

P(word | class) = (count(word, class) + 1) / (count(class) + vocabulary_size)

where:

  • count(word, class) is the number of times the word occurs in documents of the given class
  • count(class) is the total number of words appearing in documents of the given class
  • vocabulary_size is the total number of unique words in the vocabulary

For example, suppose we have a dataset of customer reviews that we want to classify as positive or negative. We count the frequency of each word in the reviews to determine which words are most strongly associated with positive or negative reviews. However, if a word has never appeared in the training data, it will have a frequency count of zero. This will cause problems when calculating the probability of a review being positive or negative, as the probability of a particular category will be zero if any of the words in the review have a frequency count of zero.

Laplace smoothing solves this problem by adding 1 to the frequency count of each word, including words that have never appeared in the training data. This ensures that no probability value will ever be zero, and it smooths out the probabilities across all possible features, preventing overfitting.
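As a minimal sketch of the formula above (the function name and the example numbers are hypothetical, used only for illustration):

```python
def smoothed_word_prob(word_count, class_word_total, vocab_size, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing.

    word_count       -- times the word occurs in documents of the class
    class_word_total -- total number of words in documents of the class
    vocab_size       -- number of unique words in the vocabulary
    alpha            -- smoothing constant (1 = classic Laplace smoothing)
    """
    return (word_count + alpha) / (class_word_total + alpha * vocab_size)

# A word never seen in a class still gets a small, non-zero probability
print(smoothed_word_prob(0, 25_000, 5_000))   # 1/30,000 instead of 0
```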

Now comes the good part.

My Contribution

As part of my contribution, I have implemented a Naive Bayes Classifier (NBC) from scratch to classify the Rotten Tomatoes Review dataset. Let me elaborate on this further.

  1. My primary contribution in this project involved gaining a deeper understanding of the Naive Bayes Classifier (NBC) and its key terminologies, including Bayes theorem, probability, conditional probability, and Laplace smoothing. I also took responsibility for merging the Rotten-Tomatoes Review dataset and dividing it into three subsets for training, development, and testing, as required for NBC.
  2. Building the vocabulary list and calculating the probability of each word’s occurrence in the sentence and its conditional probability was another significant aspect of my contribution. I conducted experiments to assess the effect of Laplace smoothing on the accuracy of the classifier.
  3. To accomplish Laplace smoothing, I modified the alpha values to avoid zero probability and used values such as 1, 0.1, and 0.01. I then compared the impact of smoothing for each parameter.
  4. To better understand the data and apply Bayes theorem and smoothing effectively, I also visualized the data at each step. Another significant part of my contribution involved finding the accuracy on the development dataset and offering the flexibility to evaluate it with or without stop words removed.
  5. Finally, I determined the optimal smoothing technique for my NBC algorithm by assessing the dataset’s accuracy under different smoothing parameters.

Allow me to guide you through my code where I have implemented the Naive Bayes Classifier from scratch. Let’s dive right in…

The initial step is to merge and preprocess the data by removing any unwanted characters or noise —

Processing Data:

Load Data:

Step 1: We were given the Kaggle Rotten Tomatoes Reviews dataset [5]. We downloaded it and found a single .csv file in it, named rt_reviews.csv. It is loaded as follows:
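A minimal loading sketch, assuming pandas and that rt_reviews.csv sits in the working directory (the encoding argument may need adjusting depending on how the file was exported):

```python
import pandas as pd

# The file has two columns: Freshness (the label) and Review (the text)
reviews = pd.read_csv("rt_reviews.csv", encoding="ISO-8859-1")
print(reviews.shape)
print(reviews.head())
```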

Cleaning the Data:

Step 2: This step involves converting each sentence to lowercase and eliminating any punctuation or special characters present in the text.
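A small cleaning sketch along these lines (the column name Review follows the dataset; the helper name is mine):

```python
import re

def clean_text(text):
    """Lowercase the sentence and replace punctuation / special characters with spaces."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

reviews["Review"] = reviews["Review"].astype(str).apply(clean_text)
```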

Splitting the Dataset:

Step 3: The dataset will be divided into three separate groups: training data, test data, and development data.
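One possible way to perform the split (an 80/10/10 train/development/test split is assumed here for illustration):

```python
# Shuffle once, then carve out train / development / test portions
reviews = reviews.sample(frac=1, random_state=42).reset_index(drop=True)

n = len(reviews)
train_df = reviews.iloc[: int(0.8 * n)]
dev_df   = reviews.iloc[int(0.8 * n): int(0.9 * n)]
test_df  = reviews.iloc[int(0.9 * n):]
```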

To organize the data and make it easier to work with:

Step 4: Separate the data into two categories: features and targets. This will enable us to handle each dataset type (training, development, and testing) more effectively.
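For example, with the splits above (names such as X_train and y_train are placeholders, not necessarily those used in the original notebook):

```python
# Features are the review texts, targets are the Freshness labels
X_train, y_train = train_df["Review"], train_df["Freshness"]
X_dev,   y_dev   = dev_df["Review"],   dev_df["Freshness"]
X_test,  y_test  = test_df["Review"],  test_df["Freshness"]
```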

Data Representation:

Step 5: Visualize the data as a bar graph showing the distribution of values in the “Freshness” column.
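A minimal plotting sketch, assuming matplotlib:

```python
import matplotlib.pyplot as plt

# Bar chart of the label distribution in the training data
counts = y_train.value_counts()
plt.bar(counts.index, counts.values)
plt.xlabel("Freshness")
plt.ylabel("Number of reviews")
plt.title("Label distribution in the training set")
plt.show()
```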

Build Vocabulary and Word Count:

Step 6: Moving ahead, we build a vocabulary for the dataset, since a Naive Bayes classifier needs one to prepare the training data for classification. The vocabulary is a list of the unique words that occur in the training data, along with their frequency counts; it converts the textual data into a numerical form that the algorithm can process. We need it because Naive Bayes is a probabilistic algorithm that works with the probabilities of words in the text, and to calculate these probabilities we must know how often each word appears in the training data. Building a vocabulary ensures we have a complete and consistent list of all the words in the training data, which helps the classifier accurately predict the class of new text.

Import necessary files:

Vocabulary Building Function:
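A sketch of such a function, building on the splits defined earlier (the structure of the counts is my own choice, not necessarily that of the original notebook):

```python
from collections import Counter, defaultdict

def build_vocabulary(texts, labels):
    """Count how often each word occurs overall and per class."""
    vocab = Counter()
    class_word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        words = text.split()
        vocab.update(words)
        class_word_counts[label].update(words)
    return vocab, class_word_counts

vocab, class_word_counts = build_vocabulary(X_train, y_train)
print(len(vocab), "unique words in the training vocabulary")
```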

Calculate Probabilities

Word Probabilities:

Step 7: For each sentence class, we will calculate the probability of each word and its conditional probability given the class. We will also determine the prior probability of each sentence class.

Prior Probability

Probability for each word

Conditional Probability
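A hedged sketch of how these three quantities could be computed from the counts above:

```python
# Prior probability of each class: fraction of training reviews with that label
priors = (y_train.value_counts() / len(y_train)).to_dict()

# Probability of each word in the whole training corpus
total_words = sum(vocab.values())
word_probs = {w: c / total_words for w, c in vocab.items()}

# Conditional probability P(word | class), unsmoothed for now
cond_probs = {}
for label, counter in class_word_counts.items():
    class_total = sum(counter.values())
    cond_probs[label] = {w: c / class_total for w, c in counter.items()}
```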

NLTK — Natural Language Toolkit

NLTK stands for Natural Language Toolkit. It is a Python library that provides tools and resources for natural language processing (NLP). NLTK offers a range of functionalities such as tokenization, stemming, part-of-speech tagging, parsing, semantic reasoning, and more. It also includes a vast collection of corpora, lexical resources, and grammars for numerous languages, making it a comprehensive toolkit for NLP tasks. NLTK is widely used in research and industry for various applications such as sentiment analysis, text classification, machine translation, and speech recognition.

Stop words are commonly used words that are ignored by search engines during indexing and retrieval of search results. These words include “the,” “a,” “an,” and “in” as examples [13]. To avoid occupying unnecessary storage space and processing time in our database, it is essential to keep track of these stop words and remove them. As mentioned in [13], maintaining a record of the stop words can help eliminate them easily. NLTK, a Natural Language Toolkit in Python, provides a list of stop words in 16 different languages that can be utilized for this purpose.

Step 8: To analyze the path ahead, we first remove stop words and calculate the accuracy with and without them.
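A stop-word removal sketch using NLTK (the helper name is mine):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                 # only needed once
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    """Drop NLTK English stop words from an already-cleaned sentence."""
    return " ".join(w for w in text.split() if w not in stop_words)

X_train_nostop = X_train.apply(remove_stopwords)
```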

Before removing Stopwords

After removing Stopwords

Step 9: We can now apply the computed probabilities and calculate the conditional probability for a single sentence from the development set, since doing this for the entire development set at once would take much longer.
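A classification sketch for one sentence, using the priors and conditional probabilities computed above (log probabilities are used to avoid numerical underflow):

```python
import math

def classify(sentence, priors, cond_probs):
    """Return the class with the highest (log) posterior for one sentence."""
    scores = {}
    for label in priors:
        score = math.log(priors[label])
        for w in sentence.split():
            p = cond_probs[label].get(w)
            if p:                         # skip unseen words when no smoothing is applied
                score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(classify(X_dev.iloc[0], priors, cond_probs), "| true label:", y_dev.iloc[0])
```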

Smoothing

In the context of statistics and data analysis, smoothing is a technique used to reduce the noise or variability in a data set by creating a simplified representation of the data that captures its general trends or patterns.

Smoothing involves applying a mathematical function or algorithm to the data, which averages out the values of neighboring points or data points over a certain interval or window. This results in a smoother curve or line that more accurately represents the underlying trend or pattern in the data, and reduces the effects of random fluctuations or errors.

Smoothing can be particularly useful when working with noisy data, where the data points may contain random or spurious fluctuations that obscure the underlying pattern. By smoothing the data, it is possible to get a clearer picture of the true trend or pattern, which can help in making predictions or identifying important features in the data.

Step 10: To implement smoothing, we create the following function:
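A sketch of what such a function could look like, extending the unsmoothed conditional probabilities with a smoothing parameter alpha:

```python
def smoothed_cond_probs(class_word_counts, vocab, alpha):
    """Recompute P(word | class) with additive smoothing parameter alpha."""
    probs = {}
    vocab_size = len(vocab)
    for label, counter in class_word_counts.items():
        denom = sum(counter.values()) + alpha * vocab_size
        probs[label] = {w: (counter[w] + alpha) / denom for w in vocab}
    return probs
```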

Step 11: Now we will experiment with smoothing alpha values of 0 (no smoothing), 0.01, 0.1, and 1, as follows:
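A possible experiment loop over the development set (alpha = 0 corresponds to the unsmoothed probabilities used earlier; the accuracy helper is mine):

```python
def accuracy(X, y, priors, cond_probs):
    """Fraction of sentences whose predicted label matches the true label."""
    preds = [classify(s, priors, cond_probs) for s in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

for alpha in [0.01, 0.1, 1]:
    probs = smoothed_cond_probs(class_word_counts, vocab, alpha)
    print(f"alpha={alpha}: dev accuracy = {accuracy(X_dev, y_dev, priors, probs):.3f}")
```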

Alpha = 0 / None

Alpha = 0.1

Alpha = 0.01

Alpha = 1

On the basis of the accuracies, I compared the results:

From the comparison, it is observed that alpha = 1 is the most suitable value for the classifier.

Step 12: The final step is to calculate the accuracy on the test dataset to check whether the classifier is reliable.

We have received an accuracy of 85% on the test dataset.

Challenges Faced and Solutions:

Developing a Naive Bayes Classifier (NBC) is not a simple task, as it requires a clear understanding of the statistical approach that it employs. Overcoming the challenges that arise during the development process is crucial to ensure the classifier’s accuracy.

I encountered several difficulties during the development process, which I will describe briefly. However, I was able to find effective solutions to overcome these issues.

  1. Initially, I faced some difficulties in comprehending certain terminologies, such as conditional probability and Laplace smoothing, which are essential to developing a Naive Bayes Classifier. To overcome this challenge, I decided to take an Udemy course [7] that provided a comprehensive understanding of these concepts. Additionally, I referred to reliable sources such as [8] and [9], which helped me gain a clearer picture of these important facts.
  2. I also encountered difficulties in dividing the dataset into training, test, and development sets. To overcome this challenge, I referred to a blog post [6], which provided various methods for splitting the data. The author’s insights and recommendations helped me successfully partition the dataset for training, testing, and development purposes.
  3. During the development of the Naive Bayes Classifier, I faced a setback when I realized that I had not defined the functions correctly to calculate the probability of each word and the conditional probability of each word with respect to each sentence type. To overcome this hurdle, I decided to reorganize the code and segregate the necessary functions. This made it easier to reference and use the functions in later code snippets, enabling me to continue the development process smoothly.
  4. One of the major challenges I faced during the development of the Naive Bayes Classifier was conducting Laplace smoothing experiments. The primary difficulty was in defining the alpha values, which were required to adjust the probabilities of words that did not appear in the training data. To overcome this challenge, I referred to a reliable source [8] to gain a better understanding of the concept. After comprehending the significance of alpha values, I defined them as 1, 0.1, and 0.01, and continued the experimentation process.
  5. While preparing the dataset for the Naive Bayes Classifier, I encountered a minor issue when cleaning up the data to remove punctuation and other special symbols from the sentences. To overcome this issue, I referred to a reliable source [10], which provided useful insights on how to handle such scenarios.

Conclusion:

Based on the above observations, it can be concluded that Naive Bayes classifiers (NBCs) have demonstrated impressive performance in a variety of real-world scenarios, particularly in document categorization and spam filtering, despite their seemingly simplistic assumptions. These classifiers require only a modest amount of training data to estimate the necessary parameters.

Compared to more complex techniques, naive Bayes learners and classifiers can operate at incredible speeds. This is due to the fact that each distribution can be estimated as a one-dimensional distribution, as the class conditional feature distributions are decoupled. Consequently, the problems caused by the curse of dimensionality can be resolved more efficiently. [10]

References:

[1] Text Classification: https://medium.com/text-classification-algorithms/text-classification-algorithms-a-survey-a215b7ab7e2d

[2] ValueError: ‘label’ must be of length ‘x’ using matplotlib, link: https://stackoverflow.com/questions/49991633/valueerror-label-must-be-of-length-x-using-matplotlib

[3] GeeksForGeeks, Remove punctuation from string, link: https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

[4] Wikipedia, Bayes’ Theorem, link: https://en.wikipedia.org/wiki/Bayes%27_theorem

[5] Kaggle dataset, Rotten Tomatoes Reviews Dataset, link: https://www.kaggle.com/datasets/ulrikthygepedersen/rotten-tomatoes-reviews

[6] Towards Data Science, How to split a dataset into training and testing sets, link: https://towardsdatascience.com/how-to-split-a-dataset-into-training-and-testing-sets-b146b1649830

[7] Udemy Tutorial, Naive Bayes Classifier a pure statistical approach to ML, link: https://www.udemy.com/course/naive-bayes-classifier-a-pure-statistical-approach-to-ml/learn/lecture/22130428#overview

[8] Towards Data Science, Laplace Smoothing in NB Algorithm, link: https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece

[9] GeeksForGeeks, Naive Bayes Classification, link: https://www.geeksforgeeks.org/naive-bayes-classifiers/

[10] StackOverflow, Remove punctuations in pandas, link: https://stackoverflow.com/questions/39782418/remove-punctuations-in-pandas

[11] MonkeyLearn, Practical Explanation of Naive Bayes Classifier, link: https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/

[12] Stanford Encyclopedia of Philosophy, Bayes’ theorem, link: https://plato.stanford.edu/entries/bayes-theorem/

[13] GeeksForGeeks, Removing Stop Words with NLTK IN Python, link: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

[14] MonkeyLearn, What is Text Classification? link: https://monkeylearn.com/what-is-text-classification/
