Applying Gaussian Naïve Bayes Classifier in Python: Part One

The Naïve Bayes classifier is one of the most effective machine learning algorithms used in machine learning projects, including distributed MapReduce implementations built on Apache Spark. At its core, Naïve Bayes is a linear, supervised learning method that works as a probabilistic classifier. For purely numeric problems, algorithms such as K-Nearest Neighbors or K-Means clustering are often chosen instead, but the Naïve Bayes classifier works effectively for classifying emails, texts, symbols, and names, and it is not unusual to apply it to numeric data as well. It also scales well to high-dimensional datasets: it predicts the probability of each class from the feature vector together with a prior distribution over the classes, which helps it cope with the curse of dimensionality, even for text classification on continuous streams of big data.

There are three types of Naïve Bayes classifiers. When handling real-time data with a continuous distribution, the Gaussian Naïve Bayes classifier assumes that the data is generated from a Gaussian (normal) distribution. The Multinomial Naïve Bayes classifier can be applied to event models in which the events follow a multinomial distribution; in this situation, the features are frequencies or counts. In the third scenario, when the features are Boolean, they are treated as the outcomes of independent Bernoulli trials, and a Bernoulli Naïve Bayes classifier can be applied.
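As a quick illustration of these three variants, here is a minimal sketch using scikit-learn's GaussianNB, MultinomialNB, and BernoulliNB classes; the tiny feature matrices and labels below are invented purely for demonstration.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous features: Gaussian Naive Bayes
X_cont = np.array([[1.2, 3.4], [0.9, 2.8], [4.5, 7.1], [5.0, 6.3]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Count (frequency) features: Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

# Boolean features: Bernoulli Naive Bayes
X_bool = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 0]]))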

When dealing with the Gaussian Naïve Bayes classifier, the resulting model offers high performance and a high training speed, with the ability to predict the probability that a feature vector belongs to class Zk. For the ith observation, the aim of the Naïve Bayes classifier is to compute the following probability:

Prob(Zk | X(i)).

Here X(i) is the feature vector of the ith observation, with one component per feature, for example [X(i,0), X(i,1), X(i,2), X(i,3)]. Applying Bayes' rule, the probability can be represented as follows:

Prob(Zk | X(i)) = Prob(Zk) Prob(X(i) | Zk) / Prob(X(i)).
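To see the rule in action with concrete numbers, here is a small hand-worked sketch in Python for two classes and a single observation; the prior and likelihood values are assumptions chosen only to illustrate the arithmetic.

# Illustrative numbers only: two classes Z0 and Z1 for one observation X(i)
prior = [0.6, 0.4]          # Prob(Zk)
likelihood = [0.02, 0.10]   # Prob(X(i) | Zk)

# The evidence Prob(X(i)) is the sum of prior * likelihood over all classes
evidence = sum(p * l for p, l in zip(prior, likelihood))

# Posterior Prob(Zk | X(i)) = Prob(Zk) * Prob(X(i) | Zk) / Prob(X(i))
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
print(posterior)            # [0.2307..., 0.7692...], sums to 1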

In words, the posterior probability is the prior probability of the class multiplied by the likelihood, divided by the evidence computed from the data. The joint probability can be represented as:

Prob(Zk, X(i,0), …, X(i,n)) = Prob(X(i,0), …, X(i,n) | Zk) Prob(Zk).

The likelihood factor of this product can be rephrased with the chain rule and expressed as follows:

Prob(X(i,0), …, X(i,n) | Zk) = Prob(X(i,0) | Zk) Prob(X(i,1), …, X(i,n) | Zk, X(i,0)).

By applying the definition of conditional probability repeatedly, the joint probability can be expanded as:

Prob(Zk, X(i,0), …, X(i,n)) = Prob(Zk) Prob(X(i,0) | Zk) Prob(X(i,1) | Zk, X(i,0)) …

The naïve conditional independence assumption states that, once the class is known, each feature is independent of the others, so every conditional factor simplifies to:

Prob(X(i,j) | Zk, X(i,0), …, X(i,j-1)) = Prob(X(i,j) | Zk).
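Putting the pieces together, the posterior is proportional to the prior times the product of the per-feature likelihoods: Prob(Zk | X(i)) is proportional to Prob(Zk) Prob(X(i,0) | Zk) … Prob(X(i,n) | Zk). Below is a minimal sketch of that computation; the prior and per-feature likelihood values are invented purely to illustrate the arithmetic.

import numpy as np

# Illustrative values only: class priors Prob(Zk) and per-feature likelihoods Prob(X(i,j) | Zk)
prior = np.array([0.5, 0.5])
feature_likelihoods = np.array([
    [0.2, 0.4, 0.1],   # Prob(X(i,j) | Z0) for j = 0, 1, 2
    [0.3, 0.1, 0.5],   # Prob(X(i,j) | Z1) for j = 0, 1, 2
])

# Prob(Zk | X(i)) is proportional to Prob(Zk) * product over j of Prob(X(i,j) | Zk)
unnormalized = prior * feature_likelihoods.prod(axis=1)
posterior = unnormalized / unnormalized.sum()
print(posterior)   # normalized posterior over the two classes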

When applying the Gaussian Naïve Bayes classifier, the data for each label is assumed to be drawn from a Gaussian (normal) distribution.
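In other words, each per-feature likelihood Prob(X(i,j) | Zk) is modeled by a normal density whose mean and variance are estimated from the training samples of class Zk. A minimal sketch of that estimation, using a tiny invented one-dimensional training set:

import numpy as np

def gaussian_pdf(x, mean, var):
    # Normal density used by Gaussian Naive Bayes for Prob(X(i,j) | Zk)
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Invented feature values for the training samples of a single class Zk
samples = np.array([2.1, 1.9, 2.4, 2.0])
mean, var = samples.mean(), samples.var()

# Likelihood of a new feature value under the fitted Gaussian
print(gaussian_pdf(2.2, mean, var))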

The data can be generated with the following code:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 100 two-dimensional points in two clusters and plot them
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.show()

[Figure: Multivariate Gaussian distribution of the generated data for Gaussian Naive Bayes classification]
[Figure: Data visualization of the Gaussian Naive Bayes model]
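Continuing from the generated blobs, scikit-learn's GaussianNB can be fit directly on X and y to predict labels and class probabilities for new points; this is a short sketch, and the two test points below are arbitrary.

from sklearn.naive_bayes import GaussianNB

# Fit Gaussian Naive Bayes on the blobs generated above
model = GaussianNB()
model.fit(X, y)

# Predict labels and class probabilities for two arbitrary new points
X_new = [[-6, -14], [1, 1]]
print(model.predict(X_new))
print(model.predict_proba(X_new).round(3))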

I’ve written the code in Python and uploaded the code and dataset to a GitHub repository. Check it out at https://github.com/GPSingularity/Machine-Learning-in-Python

[Figure: Coding in PyCharm]

References

Boschetti, A., & Massaron, L. (2016). Python Data Science Essentials (2nd ed.). Birmingham, England: Packt Publishing.

VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data (1st ed.). Sebastopol, CA: O’Reilly Media.