News Category Dataset Classification using Naive Bayes Classifier — from Scratch

Retty George
Apr 19, 2022


The goal is to build a text classifier for the News Category Dataset using a Naive Bayes classifier. The dataset contains 200,853 news records belonging to 41 different categories.

Find the link to the Google Colab notebook here

Before going deep into the project, let's see what text classification is.

Text classification is a machine learning technique for classifying unstructured text into a set of specified categories. Text classifiers can categorize and organize nearly any sort of text, including documents, medical studies, and files, as well as text obtained from the internet. [2]

For example, news articles can be categorized by topic, help requests by urgency, chat conversations by language, brand mentions by sentiment, and so on.

Here I am using a Naive Bayes classifier for text classification.

Let's see what a Naïve Bayes classifier is.

A Naive Bayes classifier is a probabilistic machine learning model based on Bayes' theorem. The classifier gets its name from the "naive" assumption that each feature in a class is independent of the others.

Bayes Theorem:

Bayes' theorem gives the probability of an event based on prior knowledge of conditions that may be related to that event.

Using Bayes' theorem, we can calculate the probability of A occurring given that B has occurred. Here B stands for the evidence, while A stands for the hypothesis. The predictors/features are assumed to be independent: the presence of one feature has no bearing on the presence of another. That is why the model is considered naïve.
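Written out, the theorem is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A|B) is the posterior probability of the hypothesis given the evidence, P(B|A) is the likelihood, P(A) is the prior, and P(B) is the probability of the evidence.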

Reasons to use Bayes’ Theorem:

  1. The theorem provides a valuable perspective for understanding and evaluating many learning algorithms.
  2. It calculates explicit hypothesis probabilities and is resistant to noise in the input data.
  3. It reduces the likelihood of misclassification in statistical categorization.

Let’s start coding….

  1. First step is to load the data: we need to load the given JSON file using pandas
  2. Split the data into train and test datasets
  3. Check the unique categories present in the train dataset
  4. Calculate the number of records in each of the 41 categories
  5. Plot a bar graph of the number of records in each category vs. the category
  6. Calculate the probability of each category in the train dataset
  7. Split the headlines of each category into individual words, store them in a list, and remove all the stopwords and low-frequency words
  8. Build the vocabulary list by counting each word in the 41 categories and storing the counts in a dictionary
  9. Calculate the probability of each word in the vocabulary list and store it in a dictionary
  10. Calculate the conditional probability of each word in the vocabulary list and store it in a dictionary
  11. Now take all the headlines in the train dataset and calculate the probability that each headline belongs to a particular category; do this for all 41 categories
  12. Take the maximum probability from the step above; that is the category the headline belongs to according to our Naive Bayes classifier
  13. Compare the predicted category with the actual category and calculate the accuracy on the train dataset
  14. Repeat the above steps for the test dataset and calculate the accuracy
  15. Split the data into k folds and then calculate the accuracy on the test dataset
  16. Apply smoothing and then calculate the accuracy

Loading the data

We load the JSON file using pandas' read_json, which converts JSON into a pandas DataFrame.
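A minimal sketch of this step, assuming the dataset file is the Kaggle file News_Category_Dataset_v2.json, which stores one JSON record per line:

```python
import pandas as pd

# Each line of the file is one JSON record, so we pass lines=True.
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)

print(df.shape)    # about 200853 rows
print(df.columns)  # includes 'category' and 'headline'
```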

Now let's split the data into test and train datasets

After loading the data, we have to divide the dataset into train, test, and dev datasets.

We used a simple Python split to divide the input dataset as follows:

Train dataset — 60%

Test dataset — 20%

Development dataset — 20%
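A sketch of the split, assuming we shuffle first and slice by row position (the exact mechanism in the notebook may differ):

```python
# Shuffle so the split is not biased by the file's ordering.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

n = len(df)
train_df = df.iloc[: int(0.6 * n)]               # 60% train
test_df  = df.iloc[int(0.6 * n) : int(0.8 * n)]  # 20% test
dev_df   = df.iloc[int(0.8 * n) :]               # 20% development
```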

Check the unique categories present in the train dataset

We used the pandas unique function to print all the unique categories in the train dataset. There are 41 unique categories in total in the News Category Dataset.
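For example (assuming the train_df from above):

```python
categories = train_df["category"].unique()
print(len(categories))  # 41
print(categories)       # e.g. 'POLITICS', 'ENTERTAINMENT', 'WELLNESS', ...
```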

Calculate the number of records in all the 41 categories

The next step is to calculate the number of records in each of the 41 categories. Politics has the highest number of records, followed by Entertainment and Wellness, while Money has one of the lowest counts, along with Culture & Arts and Environment.
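A sketch of the count and the bar graph, using pandas value_counts and matplotlib:

```python
import matplotlib.pyplot as plt

# Count the records per category and plot them.
category_counts = train_df["category"].value_counts()
category_counts.plot(kind="bar", figsize=(14, 5))
plt.xlabel("Category")
plt.ylabel("Number of records")
plt.title("Records per category in the train dataset")
plt.show()
```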

Calculate the probability of each category in the train dataset

In order to calculate the conditional probability of each word in the headlines of all 41 categories, the first step is to calculate the prior probability of each category. We have already calculated the total number of records in the train dataset and the count of records in each of the 41 categories, so the prior probability of a category is simply its record count divided by the total.
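As a sketch:

```python
# Prior probability of each category:
#   P(category) = count(category) / total number of train records
total = len(train_df)
category_priors = {
    category: count / total
    for category, count in train_df["category"].value_counts().items()
}
```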

Split the headlines of each category into individual words, store them in a list, and remove all the stopwords and low-frequency words

The next step is to split all the headlines in the train dataset into individual words in order to calculate the conditional probability of each word. After that we need to do text preprocessing and remove the stopwords.

Text pre-processing

It is the process of preparing text data so that machines can use it to perform tasks like analysis, prediction, etc.

What are some examples of stop words?

Stop words are words that are typically filtered out before natural language text is processed. They are the most common words in any language (for example, articles, prepositions, pronouns, and conjunctions) and do not add much information to the text. The words "the," "a," "an," "so," and "what" are examples of stop words in English.

Why are stop words removed?

Stop words abound in any human language. By removing these terms, we strip the low-level information from our text, allowing us to focus on the important information. In other words, removing such words has no negative impact on the model we train for our purpose.
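A sketch of the tokenization and filtering step. The stopword list and the minimum frequency cutoff below are illustrative choices, not necessarily the ones used in the notebook:

```python
from collections import Counter

# Illustrative stopword list; the notebook may use a larger one.
STOPWORDS = {"the", "a", "an", "and", "or", "so", "what",
             "in", "on", "of", "to", "is", "are", "for"}
MIN_FREQ = 5  # assumed cutoff for low-frequency words

def tokenize(headline):
    """Lowercase a headline, split it into words, strip punctuation,
    and drop stopwords."""
    words = [w.strip(".,!?'\"") for w in headline.lower().split()]
    return [w for w in words if w and w not in STOPWORDS]

# Split every headline in the train dataset into words.
all_words = []
for headline in train_df["headline"]:
    all_words.extend(tokenize(headline))

# Keep only the words that occur often enough.
freq = Counter(all_words)
vocabulary = {w for w, c in freq.items() if c >= MIN_FREQ}
```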

Build the vocabulary list and calculate the conditional probability list

The next step is to build a vocabulary list and calculate the conditional probability of each word, for example:

P["the" | Money] = count of "the" in Money category / number of words in Money category

Calculate the conditional probability of each word
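A sketch of the count and conditional-probability dictionaries, assuming the tokenize function and vocabulary from above:

```python
# Count of each vocabulary word within each category.
word_counts = {category: Counter() for category in category_priors}
for headline, category in zip(train_df["headline"], train_df["category"]):
    for word in tokenize(headline):
        if word in vocabulary:
            word_counts[category][word] += 1

# Conditional probability of each word:
#   P(word | category) = count of word in category / total words in category
cond_prob = {}
for category, counts in word_counts.items():
    total_words = sum(counts.values()) or 1  # guard against an empty category
    cond_prob[category] = {w: counts[w] / total_words for w in vocabulary}
```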

Calculate the probability of each headline

Here we calculate the probability of each headline belonging to a particular category, do this for all 41 categories, and then take the maximum: the category with the highest probability is the predicted category for that headline.

Based on the predicted and actual categories, we calculate the accuracy on the train dataset.
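A sketch of the prediction and accuracy calculation. Multiplying many small probabilities underflows quickly, so the sketch sums log probabilities, which is the numerically safer equivalent; the notebook may multiply the raw probabilities instead:

```python
import math

def predict(headline):
    """Return the category with the highest (log) probability for a headline."""
    words = [w for w in tokenize(headline) if w in vocabulary]
    best_category, best_score = None, float("-inf")
    for category, prior in category_priors.items():
        score = math.log(prior)
        for w in words:
            p = cond_prob[category].get(w, 0.0)
            if p == 0.0:
                score = float("-inf")  # zero probability without smoothing
                break
            score += math.log(p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Accuracy on the train dataset.
correct = sum(predict(h) == c
              for h, c in zip(train_df["headline"], train_df["category"]))
print("Train accuracy:", correct / len(train_df))
```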

Laplace smoothing

In Naive Bayes, Laplace smoothing solves the problem of zero probability. Using Laplace smoothing, P(w'|positive) can be written as follows.
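In symbols (a standard form of the smoothed estimate; the quantities are defined below):

```latex
P(w' \mid \text{positive}) = \frac{\text{count}(w', \text{positive}) + \alpha}{N + \alpha K}
```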

Here,
alpha represents the smoothing parameter,
K represents the number of dimensions (features) in the data, and
N represents the number of reviews with y=positive

Even if a word is not present in the training dataset, its probability will not be zero as long as we choose alpha != 0.

Calculate the accuracy of the test dataset

We tested our model with different smoothing values, and the accuracy is as follows.

Model 1 (smoothing value = 0): accuracy 38.11%

Model 2 (smoothing value = 1): accuracy 65.71%

Model 3 (smoothing value = 0.01): accuracy 60.38%

K-Fold

In k-fold cross-validation, each fold serves as the test set exactly once throughout the process, so every record is used for both training and testing. This helps prevent overfitting and produces a more generalized, reliable estimate of the model's performance than a single train/test split.

Let's begin with a standard K value. With K = 5, we split the dataset into 5 folds and run the train/test procedure five times: in each run one fold is held out for testing while the remaining four are used for training.
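A sketch of a manual 5-fold loop, assuming the shuffled DataFrame from earlier; retrain below is a hypothetical helper that reruns the training steps (priors, word counts, conditional probabilities) on the given folds:

```python
K = 5
fold_size = len(df) // K
accuracies = []

for i in range(K):
    # Fold i is the test set; the remaining folds form the train set.
    test_fold = df.iloc[i * fold_size : (i + 1) * fold_size]
    train_folds = df.drop(test_fold.index)

    retrain(train_folds)  # hypothetical helper wrapping the training steps above
    correct = sum(predict(h) == c
                  for h, c in zip(test_fold["headline"], test_fold["category"]))
    accuracies.append(correct / len(test_fold))

print("Mean 5-fold accuracy:", sum(accuracies) / K)
```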

Google Colab Notebook Link

My Contribution

My contribution involves implementing the Naive Bayes classifier from scratch using Python and pandas (100% contribution). It was a bit challenging to implement the classifier without the use of any machine learning library.

My contribution also involves understanding and implementing the Laplace smoothing technique and comparing the performance of the classifier using different smoothing values.

Challenges and solutions

The main challenge was implementing the Naive Bayes classifier from scratch using Python and pandas. Splitting the headlines into individual words, building the vocabulary list, and calculating the conditional probabilities, all from scratch, was challenging.

References:

  1. https://www.theguardian.com/media/2015/dec/16/newspapers-now-the-least-popular-medium-for-news-says-ofcom-study
  2. https://memegenerator.net/instance/54909188/game-of-thrones-brace-yourselves-bayesian-analysis-is-coming
  3. https://monkeylearn.com/text-classification/
  4. https://blog.clairvoyantsoft.com/mlmuse-naivety-in-naive-bayes-classifiers-9c7f6ba952bf
  5. https://www.kaggle.com/code/irfanmansuri/news-data
  6. https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
  7. https://medium.com/analytics-vidhya/na%C3%AFve-bayes-algorithm-5bf31e9032a2
  8. https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a
  9. https://towardsdatascience.com/laplace-smoothing-in-na%C3%AFve-bayes-algorithm-9c237a8bdece
  10. https://www.analyticsvidhya.com/blog/2022/02/k-fold-cross-validation-technique-and-its-essentials/
