
# Everyone Can Understand Machine Learning — Naive Bayes Classification

## Let the algorithm guess the sex of a person

Naive Bayes is one of the most popular classification algorithms in the classic machine learning area. It is based entirely on the famous Bayes Theorem from probability theory.

Don’t be scared by the words “machine learning” or “algorithm”. In this article, I’ll introduce how the Naive Bayes Classification works as a machine learning algorithm.

I always try my best to explain machine learning algorithms in plain words without any scientific expressions in this “Everyone Can Understand Machine Learning” series. However, I would expect you to have at least basic probability knowledge to understand this one. Don’t worry, you don’t have to be a statistician, just know something like the probability of tossing a coin and getting heads is 50% :)

In this article, I will first introduce what the Bayes Theorem is. Then, I’ll give a simple example that uses Bayes’ rule to correct our “first intuition”. After that, I’ll explain why the machine learning algorithm is called “Naive” Bayes. Finally, I will give a practical example to explain how a machine can guess the sex of a person.

Bear in mind, you don’t need to be a professional to understand this article. So, relax and enjoy the reading!

# What is Bayes Theorem?

To understand Bayes Theorem, we first need to look at what a “normal” probability is. Let’s say we put 5 red balls and 5 blue balls into a bag. If I ask you how much chance you have of getting a red ball by randomly picking one ball from the bag, it is obviously 50%. Specifically, it is 5/(5+5) = 0.5.

This kind of probability is what we are familiar with. However, in practice it is not uncommon to be agnostic about some of the facts. For example, what if we don’t know how many red balls and blue balls are in the bag?

Let’s consider a practical case as follows. Suppose there is a disease called “X disease”. From general medical reports, we know that 1 in 10,000 people in our population has this disease. Now, there is a medical test called “X-Test” that can test whether a person has the X disease or not, and its accuracy is 99.9%. Then, suppose a person has been tested with the X-Test and got a positive result. How much chance is there that this person really has the X disease? Is it going to be 99.9% just because the accuracy of the test is 99.9%?

No! Don’t use your intuition to decide this probability. This is not a normal probability; it is an “inversed” probability. This problem is somewhat equivalent to asking:

I have a bag with 10 balls. I have picked up a ball from the bag and put it back 10 times, and every time it was a red ball. How much chance is there that the bag contains only red balls?

Normal probability can’t help in this case. However, we can derive the inversed probability in the following way:

1. The X-Test accuracy is 99.9%, so there is a 0.1% chance of making a mistake. A mistake means that a person without the X disease is tested positive.

2. Suppose there are 10,000 people. If we apply the X-Test to all of them, there will be 10,000 * 0.1% = 10 people who don’t have the X disease but whose results are positive by mistake.

3. We can expect that there will be 1 person who really has the X disease, because we know that 1 in 10,000 has this disease.

4. So, there will be 11 people who test positive in the X-Test, but only one of them really has the X disease.

5. Therefore, if one is tested positive, there is only a 1/11 chance (about 9%) that this person really has the X disease.
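The five steps above can be checked with a quick back-of-the-envelope calculation in Python. The numbers are exactly the ones used in the steps:

```python
# Reproduce the X disease reasoning by direct counting.
population = 10_000
disease_rate = 1 / 10_000   # 1 in 10,000 has the disease
test_accuracy = 0.999       # X-Test is 99.9% accurate

# Step 2: people without the disease who are wrongly flagged positive.
false_positives = population * (1 - test_accuracy)   # about 10 people

# Step 3: people who really have the disease (the test catches them).
true_positives = population * disease_rate           # 1 person

# Steps 4-5: among everyone who tests positive, how many are really sick?
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(f"{p_disease_given_positive:.1%}")  # about 9%, i.e. 1/11
```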

The result, 9%, is kind of counter-intuitive, right? But if my explanation makes sense to you, that is the correct probability based on what we know. This is a typical application of Bayes Theorem.

# Formulate the Bayes Theorem

However, the example above is just an explanation. Before we can use the Bayes Theorem to train our machine learning model, we need to generalise the rule to something that can be applied to other problems.

Let’s suppose that “Pos” stands for the event of a person being tested positive. This is actually the only “attribute” we have in this case. Then, let’s use “T” for True Positive, referring to a person who really has the X disease, and “F” for False Positive, referring to a person who does not actually have it.

One more thing we need to know is the notation for “conditional probability”. Say, `P(Pos|T)` means the probability of a person being tested positive given that this person does have the X disease. Similarly, `P(Pos|F)` stands for the probability of a person being tested positive given that this person does not have the X disease.

Let’s summarise what we have known already.

`P(Pos|T) = 99.9% = 0.999`

`P(Pos|F) = 0.1% = 0.001`

`P(T) = 1/10000 = 0.0001`

`P(F) = 9999/10000 = 0.9999`

Then, the formula of Bayes Theorem is as follows.

`P(T|Pos) = P(Pos|T) * P(T) / (P(Pos|T) * P(T) + P(Pos|F) * P(F))`

`P(F|Pos) = P(Pos|F) * P(F) / (P(Pos|T) * P(T) + P(Pos|F) * P(F))`

Don’t be scared by the formula. We don’t need to understand how it is derived. We just need to know that this is the Bayes Theorem used by most Naive Bayes Classification algorithms. All we need to do is plug the known facts into this formula and calculate the results.

The first equation calculates `P(T|Pos)`, which is exactly “the probability that a person really has the X disease given that this person is tested positive”. The second one calculates “the probability that a person does not have the X disease given that this person is tested positive”. Let’s plug the known numbers into the first equation.
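To double-check, here is the same plug-in calculation in Python, using only the four known probabilities listed above:

```python
# Plug the known probabilities into Bayes Theorem directly.
p_pos_given_t = 0.999    # P(Pos|T)
p_pos_given_f = 0.001    # P(Pos|F)
p_t = 0.0001             # P(T)
p_f = 0.9999             # P(F)

# P(T|Pos) = P(Pos|T)P(T) / (P(Pos|T)P(T) + P(Pos|F)P(F))
numerator = p_pos_given_t * p_t
evidence = p_pos_given_t * p_t + p_pos_given_f * p_f
p_t_given_pos = numerator / evidence

print(f"P(T|Pos) = {p_t_given_pos:.4f}")  # roughly 0.0908, i.e. about 9%
```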

We can treat this example as a machine learning problem that classifies whether a person has the X disease or not. In this case, we only have one attribute, which is the test result. However, we may have multiple attributes in practice. I will use an example with multiple attributes in the case study later on.

# Why “Naive”?

Before the case study of the machine learning algorithm, let me explain why Naive Bayes Classification is called “Naive”.

If you have some basic probability and statistics knowledge, you may have already noticed that all the 10,000 people in the previous example are “independent”. Basically, we can say that the X-Test result of each person is independent of the others.

However, independence generally cannot be guaranteed in practice. For example, one of the typical applications of Naive Bayes Classification is to classify whether a restaurant review is positive or negative. In this case, the Naive Bayes Classifier:

- Treats every word in the review comment equally and independently
- Does not consider the order of the words

Although word order does actually matter in most human languages, the algorithm ignores it. That’s why it is called “naive”. In fact, even though it is naive, it has been proven to work very well in some circumstances, and it has become one of the most popular classification algorithms.
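As a rough sketch of what the “naive” assumption looks like in code, the snippet below scores a tiny review by multiplying per-word probabilities. The words and probability values are invented purely for illustration; a real classifier would estimate them from labelled training reviews:

```python
# Hypothetical per-word probabilities (made up for illustration).
p_word_given_positive = {"great": 0.05, "food": 0.04, "slow": 0.005}
p_word_given_negative = {"great": 0.005, "food": 0.04, "slow": 0.05}

def class_score(words, p_word_given_class, p_class):
    # Naive Bayes multiplies per-word probabilities, ignoring word
    # order and treating every word as independent of the others.
    score = p_class
    for w in words:
        score *= p_word_given_class.get(w, 1e-6)  # tiny floor for unseen words
    return score

review = ["great", "food"]
pos = class_score(review, p_word_given_positive, p_class=0.5)
neg = class_score(review, p_word_given_negative, p_class=0.5)
print("positive" if pos > neg else "negative")  # prints "positive"
```

Notice that shuffling `review` would not change the scores at all; that is exactly the naivety being described.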

Additionally, there are two types of Naive Bayes Classification in general.

- Multinomial Naive Bayes, which deals with discrete-valued attributes
- Gaussian Naive Bayes, which deals with continuous-valued attributes

In the case study of this article, we will focus on the discrete scenario for convenience.

# Case Study — Guessing the Sex of People

As promised by the title of this article, we can now train our model to be able to “guess” the sex of a person, given the person’s height, weight and shoe size. Here is our training data.

Then, let’s look at the question, which is the data point to be classified.

So, the problem we need to solve is “the probability that a person is a **Male/Female** **given** that the person’s **height is High**, **weight is Medium** and **shoe size is Medium**”. Based on the Bayes formula we have shown in the section above, the equation we need to solve for this problem is as follows.

`P(Male|Height=High, Weight=Medium, Shoe=Medium) = P(Height=High, Weight=Medium, Shoe=Medium|Male) * P(Male) / P(Height=High, Weight=Medium, Shoe=Medium)`

Again, don’t be scared by the equation, let me explain.

In this case, we only need to calculate the conditional probability of either Male or Female, since we only have two classes. For example, the conditional probability of Female must be 100% minus that of Male.

But wait, the term `P(Height=High, Weight=Medium, Shoe=Medium|Male)` might look unfamiliar. Don’t worry, don’t be tricked by the conditional probability. Please see below:

`P(Height=High, Weight=Medium, Shoe=Medium|Male) = P(Height=High|Male) * P(Weight=Medium|Male) * P(Shoe=Medium|Male)`

`P(Height=High, Weight=Medium, Shoe=Medium|Female) = P(Height=High|Female) * P(Weight=Medium|Female) * P(Shoe=Medium|Female)`

It is not difficult to understand this. Take the first equation: the conditional probability on the left is “the probability that a person’s height is High, weight is Medium and shoe size is Medium given that the person is a Male”. From the sentence, it is not difficult to figure out that the 3 attributes are in an “AND” relationship. That is, the person must satisfy all 3 conditions. Therefore, the probability can be decomposed into the multiplication of the 3 conditional probabilities on the right.

Now, let’s figure out the “known probability” from our training data.

OK. We have got enough known probabilities to solve the problem. Now, let’s plug in these numbers to the equation.

Therefore, the person has an 85.71% probability of being Male. As mentioned above, we don’t need to calculate the probability of the person being Female, because it must be 14.29%, which is 100% minus 85.71%.
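If you would like to see the whole case study as code, here is a minimal Multinomial-style sketch. Note that the training rows below are hypothetical stand-ins (the article’s actual training table is not reproduced here), so the output will not match the 85.71% above; the counting logic, however, is exactly the one described in this section:

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (height, weight, shoe size) -> sex.
training = [
    (("High", "Medium", "Medium"), "Male"),
    (("High", "High", "High"), "Male"),
    (("Medium", "Medium", "Medium"), "Male"),
    (("High", "Medium", "High"), "Male"),
    (("Low", "Low", "Low"), "Female"),
    (("Medium", "Low", "Medium"), "Female"),
    (("Medium", "Medium", "Low"), "Female"),
    (("High", "Medium", "Low"), "Female"),
]

# Count class priors and per-attribute conditional frequencies.
class_counts = Counter(label for _, label in training)
attr_counts = defaultdict(Counter)  # class -> Counter of (attr index, value)
for attrs, label in training:
    for i, value in enumerate(attrs):
        attr_counts[label][(i, value)] += 1

def posterior(attrs, label):
    # P(label) * product over attributes of P(attribute value | label)
    p = class_counts[label] / len(training)
    for i, value in enumerate(attrs):
        p *= attr_counts[label][(i, value)] / class_counts[label]
    return p

query = ("High", "Medium", "Medium")
scores = {label: posterior(query, label) for label in class_counts}
total = sum(scores.values())
for label, s in scores.items():
    print(label, f"{s / total:.2%}")
```

With these made-up rows, the person comes out as Male with a 90% probability; with the article’s own training table, the same procedure yields the 85.71% reported above.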

# Summary

In this article, I have introduced the Bayes Theorem, which is utilised by one of the most popular machine learning algorithms, Naive Bayes Classification.

Hopefully you now understand what Bayes Theorem is, why Naive Bayes is “naive”, as well as how the machine learning algorithm works to guess the sex of a person from height, weight and shoe size.

The approach used in this article’s case study is called Multinomial Naive Bayes, which can only be utilised when the attributes are discrete or categorical, such as “high”, “medium” and “low”. Keep an eye on my story list and I will introduce Gaussian Naive Bayes in a future article!