The Frequentist vs Bayesian Debate

Omar Ayman
Dec 30, 2019 · 8 min read

This story is the sequel to

Some statistics

  • Probability distribution — The function that defines the probability of different outcomes/values of a random variable. The continuous probability distributions are described using probability density functions whereas discrete probability distributions can be represented using probability mass functions.
  • Conditional probability — This is a measure of probability $P(A|B)$ of an event A given that another event B has occurred.

Frequentist vs Bayesian debate

Frequentist Linear Regression

A simple linear model can be written as $y = bx + a$, where y is the point estimate (the label), x is the data point, b is the weight (slope), and a is what is called the bias.

The goal of learning a linear model from training data is to find the coefficients, β, that best explain the data. In frequentist linear regression, the best explanation is taken to mean the coefficients, β, that minimize the residual sum of squares (RSS). RSS is the total of the squared differences between the known values (y) and the predicted model outputs (ŷ, pronounced y-hat, indicating an estimate). The residual sum of squares is a function of the model parameters:

$$RSS(\beta) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} (y_i - \beta^T x_i)^2$$

The summation is taken over the N data points in the training set. The closed-form solution, expressed in matrix form, is

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
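As a rough illustration of the closed-form solution, here is a minimal NumPy sketch. The toy data (a line with slope 2 and intercept 1 plus noise) and all variable names are assumptions made for this example, not something from the original story.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Design matrix with a column of ones so the bias (intercept) is learned too.
X = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares estimate: solve (X^T X) beta = X^T y.
# Solving the linear system avoids forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the fitted coefficients.
residuals = y - X @ beta_hat
rss = float(residuals @ residuals)

print("bias, slope:", beta_hat)
print("RSS:", rss)
```

The fitted bias and slope should land close to the values used to generate the data, and any other choice of coefficients would give a larger RSS on this data.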

This approach corresponds to the maximum likelihood estimate of β (when the noise is assumed to be Gaussian). Next, I would like to give you the generic formula for any model fitted by maximum likelihood, not just linear regression.

Consider data points $X_0$ to $X_{10}$. The maximum likelihood estimate of the weights θ is

$$\theta_{ML} = \arg\max_{\theta} \prod_{i=0}^{10} p(X_i; \theta)$$

This product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow. Taking the logarithm of the likelihood does not change its arg max, but it conveniently transforms the product into a sum.

So the equation is transformed into this form:

$$\theta_{ML} = \arg\max_{\theta} \sum_{i=0}^{10} \log p(X_i; \theta)$$

This is why we so often hear that the model weights should maximize the log-likelihood of the labels: the “log” enters the conversation largely to alleviate the underflow problem.
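The underflow problem is easy to reproduce. The sketch below uses made-up per-example likelihoods (2000 data points, each with probability 0.01) purely for illustration.

```python
import math

# Hypothetical per-example likelihoods: 2000 data points, each with p = 0.01.
probs = [0.01] * 2000

# Multiplying them directly underflows: the true value is 1e-4000,
# far below the smallest positive double, so the result collapses to 0.0.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Summing the logs instead gives an ordinary, usable number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)
print(log_likelihood)
```

Once the product has collapsed to zero, every candidate θ looks equally bad, so the arg max can no longer be found. The sum of logs keeps the ordering between candidates intact.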

Bayesian statistics

Bayesian Thinking

Suppose that you are allowed to flip the coin $10$ times in order to determine the fairness of the coin. Your observations from the experiment will fall under one of the following cases:

  • Case 1: observing $5$ heads and $5$ tails.
  • Case 2: observing $h$ heads and $10-h$ tails, where $h\neq 10-h$.

If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the probability of observing heads is $0.5$ with more confidence. If case 2 is observed you can either:

  1. Neglect your prior beliefs since you now have new data, and decide that the probability of observing heads is $h/10$, depending solely on your recent observations.
  2. Adjust your belief according to the value of $h$ that you have just observed, and decide the probability of observing heads using both your prior belief and your recent observations.

The first method is the frequentist method, where we omit our beliefs when making decisions. However, the second method seems more reasonable, because $10$ coin flips are insufficient to determine the fairness of a coin. Therefore, we can make better decisions by combining our recent observations with the beliefs we have gained through past experience. This way of thinking, which uses our most recent observations together with our prior beliefs, is known as Bayesian thinking.
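The two options can be compared with a small sketch. It assumes the prior belief in a fair coin is encoded as a Beta(5, 5) distribution, which is one common choice and not something the story specifies; the function name and the choice of 8 heads are also made up for this example.

```python
def posterior_mean_heads(heads, flips, prior_heads=5.0, prior_tails=5.0):
    """Posterior mean of P(heads) under an assumed Beta(prior_heads, prior_tails) prior.

    The prior acts like pseudo-observations: 5 imaginary heads and 5 imaginary
    tails express a moderate belief that the coin is fair.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)


heads, flips = 8, 10  # hypothetical outcome for case 2

# Option 1: ignore prior beliefs, use only the observed frequency h/10.
frequentist_estimate = heads / flips

# Option 2: combine the prior belief with the observations.
bayesian_estimate = posterior_mean_heads(heads, flips)

print(frequentist_estimate)
print(bayesian_estimate)
```

With these numbers the frequency-only estimate is 0.8, while the combined estimate is pulled back toward 0.5, to 0.65. With more flips the data would dominate the prior and the two estimates would move closer together.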

Bayesian learning is now used in a wide range of machine learning models.

Bayes’ Theorem

Mathematical definition

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

How does Bayes’ Theorem allow us to incorporate prior beliefs?

Let A represent the event that we sell ice cream and B the type of weather. Then we might ask: what is the probability of selling ice cream on any given day, given the type of weather? Mathematically this is written as P(A = ice cream sale | B = type of weather), which corresponds to the left-hand side of the equation.

P(A) on the right hand side is the expression that is known as the prior. In our example this is P(A = ice cream sale), i.e. the (marginal) probability of selling ice cream regardless of the type of weather outside. P(A) is known as the prior because we might already know the marginal probability of the sale of ice cream. For example, I could look at data that said 30 people out of a potential 100 actually bought ice cream at some shop somewhere. So my P(A = ice cream sale) = 30/100 = 0.3, prior to me knowing anything about the weather. This is how Bayes’ Theorem allows us to incorporate prior information.
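To see the prior at work, here is a small numeric sketch of Bayes' theorem for the ice cream example. Only the prior of 0.3 comes from the text above; the other two probabilities are invented for illustration.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


p_sale = 0.3              # P(A): the prior from the text (30 of 100 people bought)
p_sunny_given_sale = 0.8  # P(B|A): assumed value, not from the text
p_sunny = 0.5             # P(B): assumed value, not from the text

p_sale_given_sunny = bayes_posterior(p_sale, p_sunny_given_sale, p_sunny)

print(round(p_sale_given_sunny, 2))
```

Under these assumed numbers, learning that the day is sunny raises the probability of a sale from the prior of 0.3 to about 0.48.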

Using Bayes’ theorem with distributions

So, back to machine learning. In the ice cream example, our prior knowledge was that the probability of selling ice cream is 0.3. But what if this was only a best guess, and the true value lies somewhere in a range, say from 0.25 to 0.3? That is the idea here: every estimate gets a distribution rather than a single value.

The same idea works in linear regression. Remember the equation y = bx + a? Suppose our model estimated that b lies in a range from 0.4 to 0.6; this leads to a range of predictions. Why would I use this? A domain expert who understands the data well has what is called prior knowledge. Given a range of plausible weights and a range of predictions, the expert can choose the parameters (weights) best suited to the problem at hand.

Bayesian methods are widely used in machine learning. Bayesian model averaging is a common supervised learning algorithm, and Naïve Bayes classifiers are common in classification tasks. Bayesian methods are also used in deep learning these days, where they can help models learn from small datasets.

Converting a classical frequentist linear model to Bayesian

The difference here is that instead of each parameter having a single value, each parameter now has its own distribution (a set of plausible values):

  1. b— Normal distribution
  2. a — Normal distribution
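To make the idea of a distribution over a and b concrete, here is a minimal NumPy sketch. It uses the textbook closed-form posterior for a linear model with Normal priors and a known noise level. The toy data, the Normal(0, 1) priors, and the noise standard deviation of 0.1 are all assumptions for this example, and this is not the PyMC3 approach mentioned for the next story.

```python
import numpy as np

# Hypothetical toy data generated from y = 0.5*x + 0.2 plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = 0.5 * x + 0.2 + rng.normal(scale=0.1, size=x.shape)

# Design matrix; columns correspond to a (bias) and b (slope).
X = np.column_stack([np.ones_like(x), x])

prior_precision = 1.0            # assumed prior: a, b ~ Normal(0, 1)
noise_precision = 1.0 / 0.1**2   # assumed known noise std of 0.1

# Conjugate update: the posterior over (a, b) is Normal(post_mean, post_cov).
post_cov = np.linalg.inv(prior_precision * np.eye(2) + noise_precision * X.T @ X)
post_mean = noise_precision * post_cov @ X.T @ y

# Standard deviations describe how uncertain we remain about a and b.
post_std = np.sqrt(np.diag(post_cov))

print("posterior mean (a, b):", post_mean)
print("posterior std  (a, b):", post_std)
```

Instead of a single number for each weight, the result is a mean and a spread, which is what lets us talk about a range for b and, in turn, a range of predictions.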


Once again, the logarithm is helpful for mitigating the underflow problem we talked about earlier. The divergence here is a measure of the difference between the distribution of the data and the distribution of the model. If you need more information about KL divergence, check this blog.
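As a rough sketch of what the divergence measures, the snippet below computes the KL divergence between discrete distributions. The three distributions and the helper name are made up for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


data_dist = [0.5, 0.5]    # hypothetical distribution of the data
good_model = [0.5, 0.5]   # a model that matches the data exactly
poor_model = [0.9, 0.1]   # a model that is far from the data

kl_good = kl_divergence(data_dist, good_model)
kl_poor = kl_divergence(data_dist, poor_model)

print(kl_good)
print(round(kl_poor, 4))
```

The divergence is zero when the model matches the data distribution and grows as the two drift apart. Note that it is not symmetric: KL(p || q) and KL(q || p) generally differ.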

In the next blog, we will explore implementing models based on Bayesian inference using the Python language and the PyMC3 probabilistic programming framework.

This chapter will be completed in another story, where I will talk about supervised and unsupervised algorithms from the inside!



Data Driven Investor

from confusion to clarity, not insanity

Written by Omar Ayman, deep learning researcher
