Modeling Uncertainty: Bayesian Methods for Machine Learning

Ajay Narayanan · Published in DataSeries · Apr 27, 2019

Two schools of thought

Everyone practicing the gentle art of machine learning tends to have a different definition of what machine learning is. But a true machine learning practitioner will tell you that there are two distinct yet closely related ways of looking at a machine learning problem: the classical statistical (frequentist) approach and the probabilistic (Bayesian) approach. Although both approaches often end up giving the same or similar results, conceptually they formalize a given problem in fundamentally different ways.

Let me explain the difference between the two with a very simple example. Let's say we are tossing a coin. An ardent believer in classical statistics, a frequentist, would tell you that the outcome will be either heads or tails, each with a probability of 1/2, and there is nothing more we can do about it. A believer in Bayesian statistics, on the other hand, would say: give me details (prior probabilities and beliefs) about all the initial parameters of the toss, and I will predict the outcome! In this post, I want to explore the latter school of thought. How can we model the uncertainties of the world from an arbitrary amount of data? How can we represent real-world scenarios and beliefs as a concise structure? How can we use this structure for inference? Read on to find out.

The coin example using Bayes' Theorem
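To make the coin example concrete, here is a minimal Python sketch of a single Bayes' theorem update. The two hypotheses about the coin and all the numbers are my own illustrative assumptions, not from the original figure:

```python
# A minimal sketch of Bayes' theorem applied to a coin toss.
# Hypotheses and numbers are illustrative assumptions, not from the article.

# Two competing hypotheses about the coin:
#   "fair"   -> P(heads) = 0.5
#   "biased" -> P(heads) = 0.8
likelihood_heads = {"fair": 0.5, "biased": 0.8}

# Prior beliefs before seeing any toss.
prior = {"fair": 0.7, "biased": 0.3}

def update(prior, likelihood, observed_heads=True):
    """One Bayes-rule update: posterior is proportional to likelihood * prior."""
    unnormalized = {
        h: (likelihood[h] if observed_heads else 1 - likelihood[h]) * p
        for h, p in prior.items()
    }
    evidence = sum(unnormalized.values())  # P(data), the normalizer
    return {h: v / evidence for h, v in unnormalized.items()}

posterior = update(prior, likelihood_heads, observed_heads=True)
print(posterior)  # belief in "biased" rises after seeing heads
```

After seeing one head, the belief in the biased coin rises from 0.30 to roughly 0.41; more observations would sharpen the posterior further.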

Think Bayesian!

Overview of the Bayesian thought process

Bayesian thinking is a beautiful and elegant way of looking at data. We, as humans, apply the same reasoning techniques laid down by the principles of Bayesian thinking. Let me explain these principles with a rather simple example:

Suppose one day, during your daily walk in the park, you see a man running past you quickly. The question we want to answer is: why is the man running? There are many possible answers to such a question. Let's list four possible scenarios:

  1. He is in a hurry: Simple, yet elegant reason for why the man could be running. He might be late for his weekly presentation at the office, or for the bus he takes daily to work.
  2. He is exercising: Another explanation could be that he is jogging in the park. Reasonable.
  3. He always runs: Maybe he has a quirky condition in which he always runs, or he just likes jogging a lot.
  4. He has seen a dragon in the park: With all the GoT fans out there, I won't be surprised! Bear with me on this.

Now, let's discuss the principles of Bayesian thinking with respect to the above scenario:

1. Use prior knowledge: I want to apologize to the GoT fans who believe in dragons, but that's just not happening. We can eliminate the 4th option outright, because our prior knowledge tells us that dragons don't exist. Using what we already know to rule out impossible explanations is the first principle.

2. Use your observations wisely: The second principle of Bayesian thinking is to prefer the answers that best explain your observations. When you observe the man closely, you see that he is not wearing sports shoes and is not dressed for exercise. We use this observation to eliminate option 2.

3. Don’t make extra assumptions: The final principle of Bayesian thinking is to avoid answers that make unnecessary assumptions. The explanation that the man always runs assumes far more about him than the alternatives, so we can safely eliminate option 3.

So, using the principles of Bayesian thinking, we have arrived at the conclusion that our man is simply late for the office, something we can all relate to!

How to define a Bayesian model?

The most convenient way to represent a scenario as a Bayesian model is called a Bayesian Network, or Bayesian Belief Network. A Bayesian Network is a graph in which the nodes are random variables and a directed edge indicates that one random variable has a direct influence on another. Let’s understand the structure of a Bayes Net with a simple example:

An example of a Bayesian Network to model real-world scenarios

The above graph shows a very simple Bayesian Network for the scenario of a garden. In this scenario, two events can result in the grass being wet: either the sprinkler was on, or it rained. If it is raining, the sprinkler is less likely to be turned on, so, in a sense, the rain has a direct effect on the sprinkler. This scenario is encoded in the Bayesian Network above. Now, let’s understand some key properties of a Bayesian Network and why it is a useful representation of the Bayesian thought process.
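As a concrete aside (not part of the original article), here is a minimal sketch of how this sprinkler network could be built and queried with the pgmpy library. All probability values are invented for illustration, and the model class is named BayesianNetwork in recent pgmpy releases (older versions call it BayesianModel):

```python
# A sketch of the sprinkler network using pgmpy (assumed installed: pip install pgmpy).
# All probability values below are illustrative, not from the article.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Edges encode direct influence: Rain -> Sprinkler, Rain -> GrassWet, Sprinkler -> GrassWet.
model = BayesianNetwork([("Rain", "Sprinkler"),
                         ("Rain", "GrassWet"),
                         ("Sprinkler", "GrassWet")])

# P(Rain): assume a 20% chance of rain.
cpd_rain = TabularCPD("Rain", 2, [[0.8], [0.2]])  # row 0 = no, row 1 = yes

# P(Sprinkler | Rain): the sprinkler is unlikely to be on when it rains.
cpd_sprinkler = TabularCPD("Sprinkler", 2,
                           [[0.6, 0.99],   # sprinkler off | rain = no, yes
                            [0.4, 0.01]],  # sprinkler on  | rain = no, yes
                           evidence=["Rain"], evidence_card=[2])

# P(GrassWet | Sprinkler, Rain): columns are (S=off,R=no), (S=off,R=yes), (S=on,R=no), (S=on,R=yes).
cpd_grass = TabularCPD("GrassWet", 2,
                       [[1.0, 0.2, 0.1, 0.01],   # dry
                        [0.0, 0.8, 0.9, 0.99]],  # wet
                       evidence=["Sprinkler", "Rain"], evidence_card=[2, 2])

model.add_cpds(cpd_rain, cpd_sprinkler, cpd_grass)
assert model.check_model()

# Query: how likely is rain, given that the grass is wet?
inference = VariableElimination(model)
print(inference.query(["Rain"], evidence={"GrassWet": 1}))
```

Querying P(Rain | GrassWet = wet) like this also hints at "explaining away": if we additionally learned that the sprinkler was on, our belief in rain would drop.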

Properties of a Bayesian Network

  1. A Bayesian Network is a Directed Acyclic Graph (DAG): this means that a Bayesian Network cannot contain directed cycles.
  2. The joint probability distribution can be read directly off the graph: the joint probability distribution of the random variables factorizes according to the structure of the graph, using the formula
    P(X1, …, Xn) = Π_i P(Xi | Par_G(Xi))
    where X1, …, Xn are the random variables in the graph and Par_G(Xi) denotes the parents of node Xi. Using this formula, the joint probability distribution for the network shown above is
    P(G, S, R) = P(G | S, R) · P(S | R) · P(R)
    where G = grass is wet, S = sprinkler is on, and R = rain. (A runnable sketch of this factorization follows the list.)
  3. Inference and learning are made easy by the Bayesian Network representation: it turns out that there are a number of efficient learning and inference algorithms that exploit the structure of Bayesian Networks.
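To make property 2 concrete, here is a small plain-Python sketch that computes the joint P(G, S, R) from its three factors and recovers P(R = yes | G = wet) by brute-force enumeration. The probability values are illustrative assumptions (the same ones as in the pgmpy sketch above), not figures from the article:

```python
from itertools import product

# Illustrative CPTs (True = yes/on/wet); same numbers as the pgmpy sketch above.
P_R = {True: 0.2, False: 0.8}                              # P(R)
P_S_given_R = {True: 0.01, False: 0.4}                     # P(S = on | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # P(G = wet | S, R)

def joint(g, s, r):
    """P(G, S, R) = P(G | S, R) * P(S | R) * P(R), the factorization from the graph."""
    p_g = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    p_s = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    return p_g * p_s * P_R[r]

# Inference by enumeration: P(R = yes | G = wet) = P(G = wet, R = yes) / P(G = wet).
wet_and_rain = sum(joint(True, s, True) for s in (True, False))
wet = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(wet_and_rain / wet)
```

Enumeration like this is exponential in the number of variables; the efficient algorithms mentioned in property 3, such as variable elimination, exploit the graph structure to avoid that blow-up.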

Naive Bayes Model as a Bayesian Network

The naive Bayes model is one of the machine learning models that makes direct use of the concepts described above. It has found applications across many fields and remains an accepted benchmark model for real-world tasks like spam detection, classification of newspaper articles, and sentiment analysis of text reviews. In this section, we will represent the naive Bayes model as a Bayesian Belief Network and analyze the main features of the model from that perspective.

The Bayesian Belief Network structure of the Naive Bayes Classifier

The graph above shows the Bayesian Network structure of the Naive Bayes Classifier. Here we have a class variable C and feature variables X1, …, Xn. The class variable directly influences the features that we observe. Let’s decode the structure of this Bayesian Network to understand some key properties of the Naive Bayes Classifier.

Observations made

  1. Using the factorization formula shown earlier, we can derive the joint probability distribution represented by the naive Bayes classifier:
    P(C, X1, …, Xn) = P(C) · P(X1 | C) · P(X2 | C) · … · P(Xn | C)
  2. In the formula above, P(C) is called the prior probability of the class. This probability encodes domain knowledge about how frequent each class is before any features are observed, and it also acts as a way of introducing regularization into the model to avoid overfitting.
  3. The probabilities P(X1 | C), …, P(Xn | C) are called the likelihoods of the feature values given the class. To classify an example, we combine them with the prior to obtain the posterior probability of the class, using Bayes' Theorem:
    P(C | X1, …, Xn) = P(C) · P(X1 | C) · … · P(Xn | C) / P(X1, …, Xn)

  4. From the structure of the graph, we can see clearly that, given the class C, all the Xi are assumed to be conditionally independent of each other. This is the "naive" assumption that gives the classifier its name, and we can deduce it just from the structure of the graph. (A small from-scratch sketch of the classifier follows below.)
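To tie these observations together, here is a from-scratch Python sketch of a toy spam filter that classifies by computing posterior ∝ prior × likelihoods, exactly the factorization in point 1. The training snippets and the add-one smoothing are my own illustrative assumptions, not from the original article:

```python
# A from-scratch naive Bayes sketch for a toy spam filter.
# The training "data" below is invented for illustration.
from collections import Counter, defaultdict

train = [("buy cheap pills now", "spam"),
         ("cheap pills cheap deals", "spam"),
         ("meeting schedule for tomorrow", "ham"),
         ("lunch tomorrow with the team", "ham")]

# Count classes and per-class word frequencies from the training data.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for _, counts in word_counts.items() for w in counts}

def posterior(text):
    """Normalized P(C | words), with P(C | words) proportional to P(C) * prod_i P(word_i | C)."""
    scores = {}
    for c in class_counts:
        p = class_counts[c] / len(train)                   # prior P(C)
        total = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing denominator
        for w in text.split():
            p *= (word_counts[c][w] + 1) / total           # likelihood P(w | C)
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior("cheap pills tomorrow"))  # spam should dominate
```

For the query "cheap pills tomorrow", the spam posterior dominates because the spam class assigns much higher likelihoods to "cheap" and "pills". Real implementations work with log-probabilities to avoid numerical underflow on long documents.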

Summary

In this post, I have given a gentle introduction to the school of thought known as Bayesian thinking for statistics and machine learning. Modeling uncertainty is a huge task, computationally as well as mathematically. Bayes' Theorem and Bayesian Networks help us model real-world uncertainties in a formal way, making it easy to perform inference, to learn from data, and to quantify randomness.
