# Statistics in 5 minutes: Bayes’ Theorem

Recently, I got fired up about studying statistics and found Bayes’ Theorem to be one of the simplest yet most elegant theorems in the field. To understand it, we will work through an example and introduce the probability notation that leads up to the theorem. This article is based on *Neural Networks for Pattern Recognition* by Bishop, a great book for learning basic statistics and machine learning.

We will start by building a specific example. Let’s say we want to classify the handwritten letters ‘a’ and ‘b’ automatically using a computer algorithm. To keep the example simple, we consider only these two categories. We have *m* data points, and we want to classify each one as ‘a’ or ‘b’ by looking at its features. For example, we can use the height-to-width ratio. We can expect ‘b’ to have a higher height-to-width ratio than ‘a’; apart from some exceptions, this will be the general tendency.

We can write this in probabilistic notation. We write *X^l* for a data point whose feature takes the value indexed by *l*, and we denote class ‘a’ by *C_1* and class ‘b’ by *C_2*. We formalize the concept of *prior probabilities* with *P(C_1)*, the probability that a character belongs to class ‘a’ before we look at any of its features. Since there are only two classes, *P(C_2)* = *1 - P(C_1)* is the probability that a character belongs to class ‘b’.

Next, we introduce the *conditional probability* *P(X^l|C_k)*, which specifies the probability of *X^l* *given* that the data point belongs to class *C_k*. For a fixed class *C_k*, summing *P(X^l|C_k)* over all feature values {X^1, X^2, X^3, …} gives 1.

Lastly, we introduce the *joint probability* *P(X^l, C_k)*, which specifies the probability that a data point has feature value *X^l* *and* belongs to class *C_k*. Note that this differs from the conditional probability: the joint probability is measured relative to all of the data, while the conditional probability is measured only relative to the data in class *C_k*.
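To make the three quantities concrete, here is a minimal sketch that estimates them from a small table of counts. The counts, the three ratio bins, and the variable names are all made up for illustration; they do not come from the book.

```python
# Hypothetical counts (invented for illustration): rows are classes,
# columns are height-to-width ratio bins X^1 (low), X^2 (medium), X^3 (high).
counts = {
    "a": [40, 15, 5],   # class C_1
    "b": [4, 12, 24],   # class C_2
}
total = sum(sum(row) for row in counts.values())  # m = 100 data points

# Prior P(C_k): fraction of all data points that belong to class k.
prior = {k: sum(row) / total for k, row in counts.items()}

# Conditional P(X^l | C_k): fraction of class-k data points that fall in bin l.
conditional = {k: [n / sum(row) for n in row] for k, row in counts.items()}

# Joint P(X^l, C_k): fraction of ALL data points that are in bin l and class k.
joint = {k: [n / total for n in row] for k, row in counts.items()}

print("prior:", prior)
print("P(X^l | b):", conditional["b"])
print("P(X^l, b):", joint["b"])
```

In this toy table the conditional probabilities of each class sum to 1 across the bins, while the joint probabilities sum to 1 only across all bins and both classes together.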

Now we have all of the tools to understand Bayes’ Theorem. It can be derived by writing the joint probability in two different ways:

$$P(X^l, C_k) = P(X^l|C_k)\,P(C_k) \tag{2}$$

$$P(X^l, C_k) = P(C_k|X^l)\,P(X^l) \tag{3}$$

Then we set the right-hand sides of equations (2) and (3) equal to each other. Dividing both sides by *P(X^l)* gives Bayes’ Theorem:

$$P(C_k|X^l) = \frac{P(X^l|C_k)\,P(C_k)}{P(X^l)}$$
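As a quick numerical sanity check, the sketch below computes the posterior in two ways: through Bayes’ Theorem, and directly as the fraction of data points in a bin that belong to a class. The counts and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical counts (invented for illustration): three height-to-width
# ratio bins per class, from low (index 0) to high (index 2).
counts = {"a": [40, 15, 5], "b": [4, 12, 24]}
total = sum(sum(row) for row in counts.values())
n_bins = 3

def posterior_via_bayes(k, l):
    """P(C_k | X^l) = P(X^l | C_k) * P(C_k) / P(X^l)."""
    prior = sum(counts[k]) / total
    conditional = counts[k][l] / sum(counts[k])
    evidence = sum(counts[c][l] for c in counts) / total  # P(X^l)
    return conditional * prior / evidence

def posterior_direct(k, l):
    """Fraction of the data points in bin l that belong to class k."""
    return counts[k][l] / sum(counts[c][l] for c in counts)

# The two routes should agree for every class and every bin.
for l in range(n_bins):
    for k in counts:
        assert abs(posterior_via_bayes(k, l) - posterior_direct(k, l)) < 1e-12

print(round(posterior_via_bayes("b", 2), 3))  # 24/29, about 0.828
```

Both routes agree because, with counts, the joint probability can be factored either by class first or by bin first, which is exactly the step used in the derivation.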

The term on the left, *P(C_k|X^l)*, is called the *posterior probability*. This is the quantity we want, and we find it by evaluating the right-hand side. In practice, the prior and conditional probabilities are usually much easier to estimate than the posterior probability itself. This is one of the reasons why Bayes’ Theorem is so powerful: we can simply plug the conditional and prior probabilities into the equation and obtain the posterior probability.
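One detail that may help when plugging in values: the denominator *P(X^l)* does not have to be estimated separately. Because every data point belongs to exactly one class, it can be written in terms of the same conditional and prior probabilities. This is the standard law of total probability, written here in this article’s notation:

$$P(X^l) = \sum_{k} P(X^l|C_k)\,P(C_k)$$

With this denominator, the posterior probabilities *P(C_k|X^l)* sum to 1 over the classes for any fixed *X^l*.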

Additionally, the posterior probability for each feature value can be used to minimize the classification error. In our example, the data are distributed so that *P(C_2|X^l)* is higher when *X^l* is larger. In other words, the higher the height-to-width ratio, the more likely a character is to be ‘b’ rather than ‘a’. Thus, classifying a character as ‘b’ whenever we see a “high” height-to-width ratio, that is, whenever *P(C_2|X^l)* exceeds *P(C_1|X^l)*, minimizes our probability of misclassification.
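The decision rule can be sketched in a few lines. The example below uses invented counts and hypothetical helper names, assigns each ratio bin to the class with the larger posterior, and then compares the resulting number of errors against every other possible assignment of bins to classes.

```python
from itertools import product

# Hypothetical counts (invented for illustration): three height-to-width
# ratio bins per class, from low (index 0) to high (index 2).
counts = {"a": [40, 15, 5], "b": [4, 12, 24]}
total = sum(sum(row) for row in counts.values())
n_bins = 3

def score(k, l):
    """Numerator of Bayes' Theorem: P(X^l | C_k) * P(C_k).

    The denominator P(X^l) is the same for every class, so it can be
    skipped when we only need to know which posterior is largest.
    """
    prior = sum(counts[k]) / total
    conditional = counts[k][l] / sum(counts[k])
    return conditional * prior

# Decision rule: assign each bin to the class with the largest posterior.
decision = [max(counts, key=lambda k: score(k, l)) for l in range(n_bins)]

def errors(rule):
    """Number of data points misclassified if bin l is labelled rule[l]."""
    return sum(counts[k][l] for l in range(n_bins) for k in counts if k != rule[l])

# Brute-force check over all 2^3 possible labelings of the bins.
best_possible = min(errors(rule) for rule in product(counts, repeat=n_bins))

print(decision)                         # ['a', 'a', 'b']
print(errors(decision), best_possible)  # 21 21
```

On this toy data, no other labeling of the bins makes fewer mistakes than the maximum-posterior rule, which matches the claim above.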