Statistics in 5 minutes: Bayes’ Theorem

Recently, I became fired up about studying statistics and found Bayes’ Theorem to be one of the simplest yet most elegant theorems in statistics. To understand it, we will work through an example and the probability notation that leads us to Bayes’ Theorem. This article is based on Neural Networks for Pattern Recognition by Bishop, a great book for learning basic statistics and machine learning.

We will start by building a specific example. Let’s say we want to classify the handwritten letters ‘a’ and ‘b’ automatically using a computer algorithm. To keep the example simple, we classify the data into only these two categories. We have m data points, and we want to classify each of them as ‘a’ or ‘b’ by looking at its features. For example, we can consider the height-to-width ratio: we can expect ‘b’ to have a higher height-to-width ratio than ‘a’, and, apart from some exceptions, this is indeed the general tendency.

We can write this in probabilistic notation. We let X^l denote the event that a data point falls into feature bin l (for example, a particular range of height-to-width ratios), and we denote class ‘a’ by C_1 and class ‘b’ by C_2. We can formalize the concept of prior probabilities with P(C_1), the probability that a character belongs to class ‘a’. Naturally, P(C_2), or 1 - P(C_1), is the probability that a character belongs to class ‘b’.

(Figure: classification of the letters depending on the feature X.)
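As a rough illustration, here is a minimal Python sketch (using made-up labels, not the data from the book) of how the prior probabilities could be estimated by simply counting how often each class appears:

```python
# Minimal sketch with made-up labels: estimate the prior probabilities
# P(C_1) and P(C_2) by counting how often each class appears.
labels = ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'a']  # hypothetical training labels

m = len(labels)                    # total number of data points
p_c1 = labels.count('a') / m       # P(C_1): fraction of characters labelled 'a'
p_c2 = labels.count('b') / m       # P(C_2) = 1 - P(C_1)

print(p_c1, p_c2)                  # 0.625 0.375
```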

Next, we introduce the conditional probability P(X^l|C_k), which specifies the probability of X^l given the class C_k. For each class C_k, summing the conditional probabilities over all feature bins {X^1, X^2, X^3, …, X^l} gives 1.

(Figure: the blue square denotes P(X^1|C_1).)
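To make this concrete, here is a small sketch (using hypothetical height-to-width ratios already discretized into bins) of how the class-conditional probabilities P(X^l|C_k) could be estimated, together with a check that they sum to 1 over the bins for each class:

```python
from collections import Counter

# Hypothetical data: (feature bin index l, class label) pairs, where the bin
# index comes from discretizing the height-to-width ratio.
data = [(0, 'a'), (1, 'a'), (1, 'a'), (2, 'b'), (2, 'a'), (3, 'b'), (3, 'b'), (2, 'b')]

def conditional_probabilities(data, class_label, num_bins=4):
    """Estimate P(X^l | C_k) for every bin l, given a class C_k."""
    bins_in_class = [l for l, c in data if c == class_label]
    counts = Counter(bins_in_class)
    total = len(bins_in_class)
    return [counts[l] / total for l in range(num_bins)]

p_x_given_a = conditional_probabilities(data, 'a')
p_x_given_b = conditional_probabilities(data, 'b')

print(p_x_given_a, sum(p_x_given_a))  # the probabilities sum to 1 for each class
print(p_x_given_b, sum(p_x_given_b))
```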

Lastly, we introduce the joint probability P(X^l, C_k), which specifies the probability that X^l and C_k occur together. Note that this differs from the conditional probability: the joint probability is measured over all events, rather than only over those events in which the conditioning class has occurred.

(Figure: the blue square denotes the joint probability P(C_1, X^1).)
(Figure: formula for calculating the joint probability.)
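A short sketch (again with hypothetical data) showing that the joint probability, estimated directly as a fraction of all data points, agrees with the product P(X^l|C_k) P(C_k):

```python
# Sketch with hypothetical data: the joint probability P(X^l, C_k) counted
# directly equals the product P(X^l | C_k) * P(C_k).
data = [(0, 'a'), (1, 'a'), (1, 'a'), (2, 'b'), (2, 'a'), (3, 'b'), (3, 'b'), (2, 'b')]
m = len(data)

l, k = 1, 'a'  # look at bin X^1 and class C_1 ('a')

p_joint = sum(1 for b, c in data if b == l and c == k) / m              # P(X^1, C_1)
p_prior = sum(1 for _, c in data if c == k) / m                         # P(C_1)
p_cond = (sum(1 for b, c in data if b == l and c == k)
          / sum(1 for _, c in data if c == k))                          # P(X^1 | C_1)

print(p_joint, p_cond * p_prior)   # both give 0.25
```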

Now we have all of the tools needed to understand Bayes’ Theorem. Bayes’ Theorem can be derived by writing the joint probability in two ways and manipulating the resulting formulas. The joint probability can be written as:

P(X^l, C_k) = P(X^l|C_k) P(C_k)    (2)

P(C_k, X^l) = P(C_k|X^l) P(X^l)    (3)

Since P(X^l, C_k) and P(C_k, X^l) describe the same event, we can set equations (2) and (3) equal to each other. Rearranging the result gives Bayes’ Theorem:

P(C_k|X^l) = P(X^l|C_k) P(C_k) / P(X^l)

The term on the left, P(C_k|X^l), is called the posterior probability. This is the probability we want to find, and we obtain it by evaluating the right-hand side. In practice, the prior probability and the conditional probability are usually much easier to estimate than the posterior probability. This is one of the reasons Bayes’ Theorem is so powerful: we can simply plug the values of the conditional probability and the prior probability into the equation to find the posterior probability. The denominator P(X^l) can itself be computed from these quantities, since P(X^l) = P(X^l|C_1) P(C_1) + P(X^l|C_2) P(C_2).
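As a sketch with made-up numbers, this is how the posterior could be computed by plugging the prior and conditional probabilities into the theorem; the denominator P(X^l) is obtained by summing the numerator over both classes:

```python
# Sketch with made-up numbers: compute the posterior P(C_k | X^l) from the
# prior P(C_k) and the class-conditional probability P(X^l | C_k).
p_prior = {'a': 0.6, 'b': 0.4}      # hypothetical priors P(C_1), P(C_2)
p_x_given = {'a': 0.2, 'b': 0.7}    # hypothetical P(X^l | C_k) for one bin X^l

# P(X^l) = sum over classes of P(X^l | C_k) * P(C_k)
p_x = sum(p_x_given[c] * p_prior[c] for c in p_prior)

posterior = {c: p_x_given[c] * p_prior[c] / p_x for c in p_prior}
print(posterior)                    # {'a': 0.3, 'b': 0.7}
```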

Additionally, the posterior probability for each feature value can be used to minimize the classification error. In our example, the data are distributed such that P(C_2|X^l) is higher when X^l corresponds to a larger height-to-width ratio. We can interpret this as follows: the higher the height-to-width ratio, the more likely the character is a ‘b’ rather than an ‘a’. Thus, classifying a character as ‘b’ whenever we see a “high” height-to-width ratio, i.e. whenever P(C_2|X^l) exceeds P(C_1|X^l), minimizes the probability of misclassification.
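A final sketch (building on made-up posteriors): the decision rule that minimizes misclassification simply picks the class with the larger posterior for the observed feature bin:

```python
# Sketch with made-up posteriors: choose the class with the highest
# posterior probability P(C_k | X^l) for the observed feature bin.
posteriors_by_bin = {
    0: {'a': 0.95, 'b': 0.05},
    1: {'a': 0.80, 'b': 0.20},
    2: {'a': 0.40, 'b': 0.60},
    3: {'a': 0.10, 'b': 0.90},
}

def classify(bin_index):
    """Return the class with the highest posterior probability."""
    posterior = posteriors_by_bin[bin_index]
    return max(posterior, key=posterior.get)

print(classify(1))   # 'a' (low height-to-width ratio)
print(classify(3))   # 'b' (high height-to-width ratio)
```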
