Supervised Machine Learning Algorithm Demonstration: Naive Bayes

Sasani Perera
7 min read · Jul 8, 2023


Naive Bayes is a machine learning algorithm that is commonly used for classification tasks. It is based on a mathematical theorem called Bayes’ theorem, named after the 18th-century mathematician Thomas Bayes. This theorem provides a way to update probabilities or beliefs based on new evidence.

The goal of Naive Bayes is to determine the probability or likelihood of a data point belonging to a specific category or class. It works by considering the prior knowledge we have about the categories and combining it with the evidence provided by the data point to calculate the probability of it belonging to each category.

Here’s how it works step by step:

  1. Prior Knowledge: Before we start making predictions, we have some initial knowledge or assumptions about the categories. This knowledge is represented by the prior probabilities, which are the probabilities of each category occurring without considering any evidence from the data point.
  2. Features and Independence Assumption: Naive Bayes makes a simplifying assumption called the “naive” assumption. It assumes that the features or characteristics of the data point are independent of each other, meaning that they don’t influence each other’s probability. Although this assumption may not always hold true in real-world scenarios, it simplifies the calculations and allows for efficient computation.
  3. Conditional Probabilities: Naive Bayes calculates the conditional probabilities for each feature given a specific category. These probabilities represent how likely a particular value of a feature is in the given category. These probabilities can be estimated from the training data by counting occurrences or using other probability estimation techniques.
  4. Bayes’ Theorem: Naive Bayes applies Bayes’ theorem to calculate the posterior probabilities of the categories given the features of the data point. Bayes’ theorem mathematically relates the conditional probabilities, prior probabilities, and evidence (features) to calculate the updated probabilities of the categories. It provides a way to adjust our initial beliefs based on the evidence provided by the data.
Bayes' theorem of conditional probability
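Written out for a class C and a feature vector x, the theorem reads:

$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)}$$

where P(C) is the prior probability of the class, P(x | C) is the likelihood of the features given the class, P(x) is the evidence, and P(C | x) is the posterior probability we are after.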

5. Maximum Probability: Finally, Naive Bayes predicts the category for the data point by selecting the category with the highest probability. It assigns the data point to the category that is most likely based on the calculated probabilities.

Naive Bayes is widely used in various domains, including text classification, spam filtering, sentiment analysis, and more. It is known for its simplicity, efficiency, and ability to handle large feature spaces. However, the naive assumption of feature independence can be a limitation in certain scenarios where dependencies exist between the features.

Spam filtering

In summary, Naive Bayes is a classification algorithm that calculates the probabilities of a data point belonging to different categories using Bayes’ theorem. It leverages prior knowledge, makes the naive assumption of feature independence, and computes posterior probabilities to make predictions.
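To make these steps concrete, here is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier on a tiny made-up dataset. The class ToyGaussianNB, the toy arrays, and the variable names are purely illustrative; later in this article we will use scikit-learn's GaussianNB, which implements the same idea properly.

import numpy as np

class ToyGaussianNB:
    """Minimal Gaussian Naive Bayes: prior times a product of per-feature Gaussians."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # 1. prior knowledge: class frequencies in the training data
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        # 3. per-feature Gaussian parameters, estimated separately for each class
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.vars_ = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes_}
        return self

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {}
            for c in self.classes_:
                # 2. + 4. naive assumption: sum per-feature log-likelihoods, add log prior
                log_lik = -0.5 * np.sum(
                    np.log(2 * np.pi * self.vars_[c])
                    + (x - self.means_[c]) ** 2 / self.vars_[c]
                )
                log_posteriors[c] = np.log(self.priors_[c]) + log_lik
            # 5. maximum probability: pick the most likely class
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)

# tiny made-up example: two features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])
print(ToyGaussianNB().fit(X, y).predict(np.array([[1.1, 2.0], [4.0, 4.0]])))  # [0 1]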

Let us start training a model with an example data set, diabetes.csv.

In this demonstration, we will train a model to detect whether a patient is diabetic or not. By now we have an idea of how to work in Google Colab, add the necessary .csv files, and read them.

  1. Understanding the data

We upload our data file to Colab and read it using pandas.read_csv.

# import the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the dataset
data = pd.read_csv('/content/diabetes.csv')

Then we take a look at the data we just loaded with pandas.DataFrame.head, pandas.DataFrame.tail, pandas.DataFrame.shape, etc.
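For instance, a quick first look could be:

data.head()   # first five rows
data.tail()   # last five rows
data.shape    # (number of rows, number of columns)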

2. Detect and treat any possible missing values

Then we must see if there are any missing values in the data set.

data.isna()        # NaN check for every cell
data.isna().any()  # does any column contain NaN values?
data.info()        # column types and non-null counts
data.describe()    # summary statistics for each column

We can see that, although there are no missing values, there are unusual 0.00 values in the columns 'Glucose', 'BloodPressure', 'SkinThickness', and 'Insulin'. These must be wrong, because a person's glucose level or blood pressure cannot be 0, so we must replace these erroneous values.

Missing or wrong values are common in real-world problems, where data is aggregated over a long stretch of time from disparate sources, and reliable machine learning modeling demands careful handling of missing data. One strategy is to impute the missing values with the mean, median, or mode.

First, we replace the 0.0 values with NaN values.

Then we compute the median of each of the columns 'Glucose', 'BloodPressure', 'SkinThickness', and 'Insulin', and impute those medians in place of the NaN values.
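A minimal way to do both steps in pandas, assuming the column names above, is:

# columns where a value of 0 really means a missing measurement
cols_with_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin']

# replace the impossible 0 values with NaN
data[cols_with_zero] = data[cols_with_zero].replace(0, np.nan)

# impute the NaN values with each column's median
data[cols_with_zero] = data[cols_with_zero].fillna(data[cols_with_zero].median())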

Our new column medians look the same as before: they have not changed at all, which is expected, because imputing a column with its own median leaves that median unchanged.

3. Outlier detection and treatment

An outlier is a data point that is unusually high or low compared to the other nearby data points. It stands out because it doesn’t follow the general pattern of the rest of the data in a dataset or graph.

Outlier of a data set

Identifying and dealing with outliers is a crucial task in data preprocessing. Outliers can have a detrimental impact on statistical analysis and the training of machine learning algorithms, leading to lower accuracy. Therefore, it is essential to detect and handle outliers effectively.

Boxplots are a great way of detecting outliers. Once the outliers have been detected, they can be capped at the 5th and 95th percentiles.

# using boxplots to find outliers
plt.figure(figsize=(20, 15))

plt.subplot(4, 4, 1)
plt.title('Pregnancies')
sns.boxplot(data['Pregnancies'])

plt.subplot(4, 4, 2)
plt.title('Glucose')
sns.boxplot(data['Glucose'])

plt.subplot(4, 4, 3)
plt.title('BloodPressure')
sns.boxplot(data['BloodPressure'])

plt.subplot(4, 4, 4)
plt.title('SkinThickness')
sns.boxplot(data['SkinThickness'])

plt.subplot(4, 4, 5)
plt.title('Insulin')
sns.boxplot(data['Insulin'])

plt.subplot(4, 4, 6)
plt.title('BMI')
sns.boxplot(data['BMI'])

plt.subplot(4, 4, 7)
plt.title('DiabetesPedigreeFunction')
sns.boxplot(data['DiabetesPedigreeFunction'])

plt.subplot(4, 4, 8)
plt.title('Age')
sns.boxplot(data['Age'])
Box plots

The little dots we can see in these box plots are the outliers of the data set.

Percentile capping is an approach used to handle outlier values by replacing them with specific percentiles. Observations below a lower limit are replaced with the 5th percentile value, while observations above an upper limit are replaced with the 95th percentile value from the same dataset. This technique helps to mitigate the impact of outliers on data analysis.

# cap each column at its chosen lower and upper percentiles
# (Insulin uses tighter limits than the other columns)
limits = {
    'Pregnancies': (0.05, 0.95),
    'Glucose': (0.05, 0.95),
    'BloodPressure': (0.05, 0.95),
    'SkinThickness': (0.05, 0.95),
    'Insulin': (0.20, 0.85),
    'BMI': (0.05, 0.95),
    'DiabetesPedigreeFunction': (0.05, 0.95),
    'Age': (0.05, 0.95),
}
for col, (low, high) in limits.items():
    data[col] = data[col].clip(lower=data[col].quantile(low),
                               upper=data[col].quantile(high))

Now the box plots show the dataset with far fewer outliers.

However, some outliers are still visible in the 'Insulin' plot. To remove those, we would need to look at the dataset more thoroughly and adjust the percentiles accordingly.

4. Model training

We use train_test_split to separate the training data from the test data. Here, the dependent variable, y, is 'Outcome', which indicates whether the patient is diabetic or not, and the independent variables, x, are all the other columns, i.e. everything except 'Outcome'.

from sklearn.model_selection import train_test_split

x = data.drop(['Outcome'], axis=1)
y = data['Outcome']
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=40)
x and y variables
x-Train and y-Train data sets
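A quick way to check the resulting variables and splits is:

# sanity check on the shapes of the full data and the splits
print(x.shape, y.shape)
print(xtrain.shape, xtest.shape, ytrain.shape, ytest.shape)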

Now we create a Gaussian Naive Bayes classifier model using sklearn.naive_bayes.

from sklearn.naive_bayes import GaussianNB
# create a Gaussian Naive Bayes classifier
model = GaussianNB()

Then we train the model by fitting it to our x-Train and y-Train data.
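In code, training is a single call:

# train the classifier on the training split
model.fit(xtrain, ytrain)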

5. Predicting

Now we predict the outcomes corresponding to the x-Test data set.
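For example, storing the predictions in a variable (here called ypred, a name chosen for this sketch):

# predict the outcome for every row of the test split
ypred = model.predict(xtest)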

6. Accuracy test

So our model has an accuracy of 76.62%.
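One way to obtain this score, assuming the ypred variable from the previous step, is scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# fraction of test patients classified correctly
print(accuracy_score(ytest, ypred))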

Complete code: Predicting_Diabetic_Patients.ipynb

In the next article, we will train a model with Decision Tree.

Thank you and Happy Reading!

Follow For More.
