Introduction to Bayesian Statistics for Data Science and Analytics (Part-1)

LSS · Published in Analytics Vidhya · 6 min read · Oct 29, 2020

Over the years, the data science domain has assumed great significance with the exponential growth of data. A solid foundation in the underlying mathematical concepts and statistics is vital to master data science and analytics. Bayesian statistics is a must-know for all data science and analytics professionals since data science has deep roots in the Bayesian approach.

In this article, we will look into:

1) What is Bayesian Statistics?

2) Bayesian statistics vs. Frequentist (Classical) statistics

3) Bayes’ theorem

The next article (Part-2) will deal with Bayesian inference and the diverse applications of Bayesian statistics in data science and analytics.

1. What is Bayesian Statistics?

Bayesian statistics is a mathematical approach that involves the application of probability (mostly conditional probability) to solve statistical problems.

This approach starts with initial “prior” beliefs (or probabilities) about an event, which are updated as new evidence emerges through data collection. The result is “posterior” beliefs, which form the basis for Bayesian inference. Often, people tend to overlook the prior probability of an event, whereas the posterior probability is always considered.

Before we actually delve deeper into Bayesian Statistics, let us briefly discuss Frequentist Statistics, the more popular version of statistics and the distinction between these two statistical philosophies.

2. Bayesian statistics vs. Frequentist (Classical) statistics

Firstly, we have to realize that there is a thin line between these two alternative approaches.

a) What is Frequentist (Classical) statistics?

It is the most widely used inferential method in statistics.

According to the Frequentist approach, the probability of an event is the long-run frequency of occurrence of that event when the experiment is repeated under the same conditions.

Let’s consider the example of tossing a coin to determine whether it’s fair or not. (Theoretically, the experiment should be repeated an infinite number of times, but in practice it can only be repeated a large, finite number of times.)

The following table represents the frequency of heads and tails

We can conclude that this is a fair coin, since the observed relative frequency of heads approaches 0.5.

It is evident from this observation that the result of an experiment is dependent on the number of times the experiment is repeated. This is a major drawback of the Frequentist approach.
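The long-run idea can be made concrete with a minimal Python sketch (the function name and the fixed seed are my own choices, not from the article). It simulates repeated tosses of a fair coin and shows the observed frequency of heads settling toward 0.5 as the number of trials grows:

```python
import random

def estimate_heads_probability(n_flips, seed=0):
    """Flip a fair coin n_flips times and return the observed frequency of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The estimate fluctuates for small samples and drifts toward 0.5 for large ones.
for n in (10, 100, 10_000):
    print(n, estimate_heads_probability(n))
```

This also illustrates the drawback noted above: the estimate for 10 tosses can be far from 0.5, so the result depends on how many times the experiment is repeated.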

b) Distinction between Frequentist and Bayesian approach

We have seen that the Frequentist definition of probability is based on the long-term frequency of the event occurring when the same experiment is repeated multiple times. This is in contrast to the Bayesian definition, according to which probability measures the degree of belief in the likelihood of a particular outcome.

In the case of some events, one approach makes more sense than the other. Frequentist statistics applies to events such as flipping a coin, rolling a die, or picking a card from a deck, which are random as well as repeatable, whereas the Bayesian approach allows us to assign probabilities to events that are neither random nor repeatable. For instance, under the Bayesian approach, it is acceptable to assign a probability to an event like Joe Biden winning the 2020 U.S. presidential race. Under the Frequentist approach, this wouldn’t make much sense, since we cannot perform repeated trials (the candidate only ever contests this particular election once), unless we resort to virtual trials.

One major advantage of the Bayesian approach is that it takes prior knowledge into consideration when calculating probability, by applying Bayes’ rule.

Most of us are familiar with Bayes’ Theorem in probability. Let’s delve deeper into this concept.

3. Bayes’ theorem

a) Conditional probability

A good grasp of the concept of conditional probability is essential to understand Bayes’ theorem.

Conditional probability is the probability of an event A, given that another event B has already occurred. This is represented by P(A|B) and can be defined as:

P(A|B) = P(A ∩ B) / P(B), where P(B) ≠ 0

Example: In a card game, suppose a player needs to draw a black card which is a King in order to win. It is given that Lily has drawn a black card. What is the probability of her winning the game?

Solution:

P(A ∩ B) = P(obtaining a black card which is a King) = 2/52

P(B) = P(picking a black card) = 26/52 = 1/2

Thus, P(A|B) = (2/52) / (1/2) = 4/52 = 1/13.
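The same arithmetic can be checked with a short Python snippet (variable names are illustrative, not from the article):

```python
# Conditional probability for the card example: P(win | black card drawn).
p_a_and_b = 2 / 52   # P(A ∩ B): two black Kings out of 52 cards
p_b = 26 / 52        # P(B): half the deck is black
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)   # 4/52, i.e. 1/13
```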

b) What is Bayes’ Theorem?

Let’s consider the following two equations from the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B)

and

P(B|A) = P(A ∩ B) / P(A)

From these two equations, we can conclude that:

P(A|B) = P(B|A) · P(A) / P(B)

This is Bayes’ Theorem.

Here,

Prior (P(A), the probability that event A occurs) refers to the preconceived beliefs we hold.

Likelihood (P(B|A), the probability of event B given that event A is true) refers to the probability of observing what we did, assuming our priors are true.

Posterior (P(A|B), the probability of A given that event B has already occurred) refers to the prior updated in light of what has been observed.

We can actually expand P(B) using the law of total probability:

P(B) = P(B|A) · P(A) + P(B|A′) · P(A′)

This can be substituted into Bayes’ theorem to obtain an alternative version, which is applied a lot in Bayesian inference:

P(A|B) = P(B|A) · P(A) / [P(B|A) · P(A) + P(B|A′) · P(A′)]

Example: A situation where Bayesian analysis is routinely used is the spam filter in your mail server. The message is scrutinized for the appearance of key words that make it likely the message is spam. Imagine that the evidence for spam is that the subject line of the mail contains the phrase “check this out”. We define the events:

• S which means the message is spam.

• C which means the subject line contains the phrase “check this out”.

Compute the conditional probability P(S|C) when 40% of emails are spam, 1% of spam emails have “check this out” in the subject line, and 0.4% of non-spam emails have this phrase in the subject line.

Solution: Using Bayes’ formula,

P(S|C) = P(C|S) · P(S) / P(C)

Now we have P(S) = 0.4 and P(C|S) = 0.01.

P(C) = P(C|S) · P(S) + P(C|S′) · P(S′) = 0.01 × 0.4 + 0.004 × 0.6 = 0.0064, where S′ denotes the event that the message is not spam.

Thus, P(S|C) = (0.01 × 0.4) / 0.0064 = 5/8 = 0.625.
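The whole calculation can be wrapped in a small Python helper (the function and argument names are my own, not a standard API) using the total-probability form of Bayes’ rule:

```python
def posterior(prior, likelihood, likelihood_complement):
    """Bayes' rule: P(A|B) = P(B|A)·P(A) / [P(B|A)·P(A) + P(B|A')·P(A')]."""
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence

# Numbers from the spam example: P(S) = 0.4, P(C|S) = 0.01, P(C|S') = 0.004.
p_spam_given_phrase = posterior(prior=0.4, likelihood=0.01, likelihood_complement=0.004)
print(p_spam_given_phrase)  # 0.625
```

Seeing the phrase raises the probability that the message is spam from the prior of 0.4 to a posterior of 0.625, which is exactly the prior-updating step described above.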

I’ll be back soon with the next article (Part-2) that deals with Bayesian inference and the diverse applications of Bayesian statistics in data science and analytics.
