Machine Learning Deep Dive #1: Bayesian Decision Theory

Buse Bilgin
Published in turkcell · 6 min read · Nov 19, 2022

Welcome to the first article of the “Machine Learning Deep Dive” biweekly series. Each article will consist of a theoretical summary based on Ethem Alpaydın’s Machine Learning book. For each method, there will be a separate GitHub repository containing sample code and examples prepared from scratch and/or using open-source Python libraries. I wish you pleasant reading!

Introduction: For Those Who Miss Probability Theory

Naive Bayes is a simple, powerful, and fast classification algorithm, and a good starting point for machine learning! It is based on Bayes’ theorem, which explains the connection between conditional probabilities of statistical quantities. The model rests on a strong (and rarely realistic) assumption: every pair of features is conditionally independent given the class. Nevertheless, the method performs surprisingly well even on data where this assumption does not hold. So, let’s dive in!

Courtesy: https://data-flair.training/blogs/bayes-theorem-data-science/
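For concreteness, the conditional-independence assumption mentioned above means that the class-conditional likelihood of a feature vector x = (x₁, …, x_d) factorizes into per-feature terms (a standard statement of naive Bayes, not specific to this article):

p(x_1, \ldots, x_d \mid C) = \prod_{j=1}^{d} p(x_j \mid C)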

To understand the Bayesian decision algorithm, we first need to understand the theory in the background (“winter is coming” for those who don’t like probability theory). Think of a random process whose outcome we cannot predict exactly in any given trial, such as flipping a coin. In such cases, we can only talk about probabilities. The outcome is controlled by what we call unobservable variables (the composition of the coin, its initial position, the force applied, and so on). Since these variables cannot be observed, they are not information we can use to predict future results. The best we can do is to build a model on the observable variables; in the coin-toss scenario, the observable variable is the result of the experiment: heads or tails. Now our goal is crystal clear: to create an algorithm that predicts the outcome of future experiments from the results of previous ones.

If we toss the coin 10 times and 6 of them come up heads, we estimate the probability of heads as P(H) = 0.6. A probability calculated from the results of previous events is called the prior probability. Although it gives us important insight about the previous experiments, it is insufficient on its own, because it does not take the observation from the current experiment into account when making a prediction. We need to enrich the calculation with additional information to make our future predictions more reliable.
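As a quick sketch (illustrative numbers only, not taken from the article’s repository), the prior for the coin example can be estimated directly from the observed counts:

```python
# Estimating the prior probability of heads from past tosses.
# Illustrative numbers; not code from the article's repository.
n_tosses = 10
n_heads = 6

prior_heads = n_heads / n_tosses   # P(H) = 0.6
prior_tails = 1 - prior_heads      # P(T) = 0.4

print(f"P(heads) = {prior_heads:.1f}, P(tails) = {prior_tails:.1f}")
```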

There is another quantity in probability theory that we frequently encounter and that can solve our problem: the likelihood. The main difference between probability and likelihood is that likelihood is attached to hypotheses, while probability is attached to potential outcomes. The class likelihood p(x|C) tells us how probable it is to observe the value x when the event belongs to class C. When we combine the prior probability with the class likelihood, we can see the picture as a whole, which is exactly what Thomas Bayes did! And here comes Bayesian decision theory!
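In symbols, with prior P(Cᵢ), class likelihood p(x|Cᵢ), and evidence p(x), Bayes’ rule gives the posterior probability of class Cᵢ given the observation x:

P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{p(x)}, \qquad p(x) = \sum_{k=1}^{K} p(x \mid C_k)\, P(C_k)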

We’re already familiar with the quantities in the numerator, but there is a new player in the denominator: the evidence. The evidence p(x) is the marginal probability of observing x regardless of which class it belongs to. It normalizes the posterior probabilities so that they sum to 1. After computing the posteriors, we choose the class with the highest posterior probability when making predictions.
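As a toy sketch in Python (hypothetical priors and likelihoods, not from the article’s repository), the posterior computation for a two-class problem with a single observed value looks like this:

```python
import numpy as np

# Hypothetical two-class example: priors P(C_k) and class likelihoods p(x | C_k)
# for one observed value x. The numbers are illustrative only.
priors = np.array([0.6, 0.4])        # P(C1), P(C2)
likelihoods = np.array([0.2, 0.7])   # p(x | C1), p(x | C2)

evidence = np.sum(likelihoods * priors)          # p(x), the normalizing constant
posteriors = likelihoods * priors / evidence     # P(C_k | x), sums to 1

print("Posteriors:", posteriors)
print("Predicted class:", np.argmax(posteriors) + 1)
```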

Losses and Risks: Asymmetric Situations

In some scenarios, decisions are not equally good or costly. Consider the familiar scenario of the COVID-19 pandemic: diagnosing a healthy person as positive and diagnosing an infected person as negative have totally different practical consequences. When this is the case, we need additional machinery that takes this kind of cost into account: actions (α) and losses (λ). In other words, we define a risk function that, for a chosen action αᵢ, accumulates the losses λᵢₖ incurred when the input actually belongs to class Cₖ.
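In the usual notation, the expected risk of taking action αᵢ for input x is the loss-weighted sum of the posteriors:

R(\alpha_i \mid x) = \sum_{k=1}^{K} \lambda_{ik}\, P(C_k \mid x)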

Our goal is to choose the action with the minimum expected risk. What about the loss parameter λ? The loss can be defined in a number of ways. One popular approach is the “0/1 loss”. The 0/1 loss assigns the loss values with a very simple rule: if the prediction is correct, there is no loss (λ = 0); if not, the loss is set to λ = 1. This choice simplifies the risk function considerably.
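In the standard form, the 0/1 loss and the resulting risk are:

\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ 1 & \text{if } i \neq k \end{cases}
\qquad \Rightarrow \qquad
R(\alpha_i \mid x) = \sum_{k \neq i} P(C_k \mid x) = 1 - P(C_i \mid x)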

So, in order to minimize risk, we should select the most probable class. As you can guess, this is one of the simplest post-processing steps for going from posteriors to risks and taking the action that minimizes them.

But sometimes this approach is not enough. In some cases, the cost of making a mistake can be very high, so our goal may be to drive the error rate toward zero. Think about a shipping process: none of us wants an ordered product to be sent to another address because of an error in the machine-learning model. In such cases, we define an additional action: reject (αₖ₊₁). The number of possible actions increases from K to K+1: actions 1 through K choose the usual classes, while action K+1 is the reject option. Rejecting incurs a loss of λ, a value between 0 and 1.
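The corresponding loss definition, in the standard form with the reject option, is:

\lambda_{ik} = \begin{cases} 0 & \text{if } i = k \\ \lambda & \text{if } i = K + 1 \\ 1 & \text{otherwise} \end{cases}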

Isn’t this equation just a combination of the loss definitions we have seen before? The risk of rejecting follows the same general relation as our first risk definition, while the risk of choosing a class is the same as in the 0/1 loss calculation. The optimal decision rule follows directly.
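Plugging these losses into the risk gives R(αᵢ|x) = 1 − P(Cᵢ|x) for choosing class Cᵢ and R(αₖ₊₁|x) = λ for rejecting, so the rule can be written as:

\text{choose } C_i \ \text{ if } \ P(C_i \mid x) > P(C_k \mid x) \ \ \forall k \neq i \ \ \text{ and } \ P(C_i \mid x) > 1 - \lambda; \qquad \text{reject otherwise.}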

This strategy only makes sense if 0 < λ < 1. If λ = 0, we always reject, because rejecting is then as cheap as a correct classification. If λ ≥ 1, we never reject, because a rejection is at least as expensive as an error.

Python Implementation: Let’s Get Our Hands Dirty!

Let’s write some code now! I have prepared a GitHub repository with the Bayesian decision algorithm implemented both from scratch and using the predefined functions of the Scikit-Learn library. You can analyze the performance by comparing the results of the two implementations. You can find the repository here.
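As a minimal sketch of the scikit-learn route (using GaussianNB on the Iris dataset as a stand-in; the actual repository may use different data and settings), the library side could look like this:

```python
# A minimal scikit-learn sketch (illustrative; the article's repository may differ).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a simple dataset and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# GaussianNB estimates class priors and per-feature Gaussian likelihoods,
# then predicts the class with the highest posterior probability.
model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Posteriors of the first test sample:", model.predict_proba(X_test[:1]))
```

The from-scratch version follows the same logic as the toy posterior computation shown earlier, just estimated from training data instead of hand-picked numbers.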

If you want to talk about machine learning or about my article, you can contact me via my LinkedIn account. Stay tuned for the second article of the series!
