Understanding the Gaussian Classifier

Rina Buoy · Published in The Startup · 5 min read · Jun 12, 2019
Normal Distribution (Wikipedia)

Experience is a comb which nature gives us when we are bald. ~Proverb

Overview

The normal distribution is perhaps one of the most commonly used statistical distributions. It is basically everywhere; chances are that we have already used it, directly or indirectly. In this article, we discuss the univariate and multivariate normal distributions, and show how we can derive a generative (more on that later) Gaussian classifier using Bayes’ theorem. In my opinion, classifiers like the Gaussian classifier are simple yet intuitive and interpretable.

The main reference for this article is Machine Learning: A Probabilistic Perspective by Kevin P. Murphy.

Normal/Gaussian Distribution

In real life, there are always situations in which we have uncertainty and doubt about some phenomenon. For example, if someone asks us about the weight of a random adult, we cannot give a deterministic answer because we know that not all adults have the same weight. The best answer we can give is the average weight, and the term ‘average’ already embeds our uncertainty about adult weight. Formally, uncertainty about some phenomenon leads to the concept of a random variable (typically denoted by x), which is mathematically modelled by a probability distribution.

A normal distribution is one class of distributions, and it has a bell shape, as shown in the image below.

Bell Shape of Normal Distribution (Wikipedia)

The probability density function of a univariate normal distribution is given by:
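f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)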

Technically, we call this the probability density of x given the mean μ and the variance σ².

A univariate distribution is suitable when we want to express our uncertainty over a single quantity, like adult weight. What if we want to express our uncertainty over multiple quantities jointly (joint random variables, e.g., adult weight and height)? This leads to the multivariate normal distribution, whose equation is given below:
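N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)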

Σ is the covariance matrix and D is the dimension of x. The function symbols N and f are used interchangeably.

The effect of the covariance matrix (i.e., of the correlation among the random variables) on the shape of the distribution is shown below:

Kevin Murphy’s slide

Maximum Likelihood Estimate (MLE) of Mean and Variance

Imagine that we are given a set of adult weights and we want to find a normal distribution model (a mean and a variance) to represent the data. To find the optimal mean and variance, i.e., those which maximize the likelihood of the data, maximum likelihood estimation (MLE) is used. The image below illustrates the concept of MLE: the bell-shaped red curves are theoretical normal distribution functions with varying values of the mean (increasing from left to right). The likelihood of the observed data peaks at the mean value which coincides with the sample average. The same process applies to the estimate of the variance.

StatQuest: https://www.youtube.com/watch?v=XepXtl9YKwc&t=107s

Therefore, the MLE of the mean and the variance are:
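\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu})^2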

Enough background theory. Let’s dive into the Gaussian classifier.

Gaussian Classifier

Let’s imagine that we are given a training dataset which falls into two classes (1 and 2, i.e., binary classification). The objective is to predict the class to which a new data point belongs. In other words, for a given new data point x, we want to estimate p(y = 1|x) and p(y = 2|x); x is then assigned to whichever class has the highest probability (which makes sense, right?). Bayes’ theorem can help us estimate p(y = 1|x) and p(y = 2|x):
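p(y = c \mid x) = \frac{p(x \mid y = c)\, p(y = c)}{p(x)}, \qquad p(x) = \sum_{c'} p(x \mid y = c')\, p(y = c')

where: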

p(y = c|x) is the class posterior, i.e., p(y = 1|x) and p(y = 2|x).

p(x|y = c) is the class-conditional density, which is assumed to be a Gaussian/normal distribution.

p(y = c) is the class prior, which is the ratio (# samples in class c) / (# total samples).

For the purpose of classification, we are only interested in the numerator, so we can ignore the normalisation constant, whose role is to make the class posterior a valid probability distribution. For details, the curious reader can refer to the reference above.

All terms in the numerator can be estimated from the training dataset.

It is called a ‘Gaussian’ classifier because of the assumption that p(x|y = c) is a Gaussian distribution. It is also known as Gaussian discriminant analysis (though, despite that name, it is not a discriminative model).

There are two variants of the Gaussian classifier, depending on whether the covariance matrices of the classes are assumed to be equal or not. This assumption affects the shape of the class boundary: a shared covariance matrix leads to a linear boundary (linear discriminant analysis), while separate covariance matrices lead to a quadratic boundary (quadratic discriminant analysis), as the log posterior below shows.
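Dropping the terms that do not depend on the class c, the log posterior of class c is:

\log p(y = c \mid x) = \log p(y = c) - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}(x-\mu_c)^\top \Sigma_c^{-1}(x-\mu_c) + \text{const}

This expression is quadratic in x. When all classes share the same Σ, the quadratic term is identical across classes and cancels out when we compare posteriors, so the decision boundary becomes linear in x.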

Binary Gaussian Classifier Implementation in Python

Now let’s make it real. To apply all the above theory, and for the sake of simplicity, we implement a Gaussian classifier for simple binary classification in Python. We create a GaussianClassifier class with two methods, train() and predict().
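The original implementation is embedded as a gist; below is a minimal sketch consistent with the description in the text (the class name GaussianClassifier and the train()/predict() methods come from the article; fitting a separate covariance matrix per class, i.e., the quadratic variant, and the small ridge term added for numerical stability are my assumptions):

```python
import numpy as np

class GaussianClassifier:
    """Gaussian classifier: one Gaussian fitted per class via MLE."""

    def train(self, X, y):
        self.classes_ = np.unique(y)
        n, d = X.shape
        self.priors_, self.means_, self.covs_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / n        # class prior p(y = c)
            self.means_[c] = Xc.mean(axis=0)     # MLE of the mean
            # MLE of the covariance (bias=True divides by N);
            # a small ridge keeps the matrix invertible
            self.covs_[c] = np.cov(Xc, rowvar=False, bias=True) + 1e-6 * np.eye(d)

    def _log_gaussian(self, X, mean, cov):
        # log N(x | mean, cov) evaluated for every row of X
        d = X.shape[1]
        diff = X - mean
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

    def predict(self, X):
        # pick the class maximising log p(y = c) + log p(x | y = c)
        scores = np.column_stack([
            np.log(self.priors_[c]) + self._log_gaussian(X, self.means_[c], self.covs_[c])
            for c in self.classes_
        ])
        return self.classes_[np.argmax(scores, axis=1)]
```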

We test our implementation on the breast cancer dataset. We use 67% of the data as the training set and keep the rest as the test set, and we print out the accuracy score and the confusion matrix.
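A test harness matching that description might look like the following (the scikit-learn loader for the breast cancer dataset and the random_state value are assumptions; the article does not show its exact code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# 67% training / 33% test split, as described in the text
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, random_state=0)

clf = GaussianClassifier()
clf.train(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
```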

Here are the resulting confusion matrix and accuracy score, which are not bad at all.

Generative vs Discriminative

A Gaussian classifier is a generative approach in the sense that it models the class-conditional distribution of the inputs, p(x|y = c), together with the class prior, and derives the class posterior via Bayes’ theorem. Because it models the input distribution, we can generate new samples in the input space with a Gaussian classifier.

In contrast, a method like logistic regression (see another article of mine: https://medium.com/@rinabuoy13/logistic-regression-with-pytorch-b67ebd75a814) is discriminative, since it models the class posterior directly. So not all classification methods are alike.
