How This Frequentist Turned Bayesian

“Objectivity” is a misguided goal

Wicaksono Wijono
Analytics Vidhya
8 min read · Feb 14, 2020


This topic is overdone, but hear me out. Just two years ago, I would prattle on about the importance of correcting p-values, but now I am strongly Bayesian. Why? Partly because of McElreath’s excellent book, Statistical Rethinking. (If you plan on buying it, wait until March for the 2nd edition.)

As the title suggests, the book makes us rethink the statistical traditions that have been so ingrained. Have you seen a chart like this?

[Figure: a flowchart for choosing a statistical test. Source linked in the original.]

It’s hideous. The Twitter replies say it best:

No wonder a lot of people hate statistics. Introductory statistics courses teach formulas and rules. It’s procedural. Ritualistic. Statistics sounds like jumping through hoops.

It was not immediately apparent to me, but Bayesian statistics (together with causality) frees us from much of this ritual and lets us focus on the things that matter: human judgment and critical thinking. We are in a golden era of Bayesian statistics. Historically, computation was a major barrier, but now we have so many probabilistic programming options (Stan, Pyro, PyMC, Edward) that we can do so much with so little code.

Currently, there are two major barriers to Bayesian statistics. First, practitioners need to deeply understand probabilistic and statistical concepts; they cannot rely on flowcharts or checklists. Second, much of history has been dominated by frequentist statistics, and there is a very real mental barrier to switching away from that way of thinking.

Overcoming this mental barrier is what made me switch. Hopefully this article can convince the reader that Bayesian statistics is the more natural and human way of working with data. The rest of this article assumes familiarity with probability.

On the Subjectivity of Priors

History has not been kind to the Bayesian school of thought. Ronald Fisher, the father of modern statistics, actively tried to stomp it out. And the bad rep persisted. Freedman wrote:

What objects in the world correspond to probabilities? This question divides statisticians into two camps:

(i.) the “objectivist” school, also called the “frequentists”;

(ii.) the “subjectivist” school, also called the “Bayesians,” after the Reverend Thomas Bayes

And herein lies the biggest mental barrier. After all, who doesn’t want to be objective in their analysis?

But is “objectivity” truly the gold standard?

I will not comment on the differences between a confidence interval and a credible interval. However, I will note that Bayesian credible intervals constructed using “objective” non-informative priors end up looking very similar to frequentist confidence intervals. Also, frequentist statistics is hardly objective. And, to be clear, the “objectivist” label refers to the interpretation of probability, but I think it carries over to how frequentist statistics is seen as more impartial.

To illustrate my point about “objectivity”, let’s use the Animals dataset in R’s MASS package. It contains estimates of the average body mass and brain mass of some land animals, including dinosaurs. I will convert all masses to kilograms. Suppose we want to model brain mass as a function of body mass.

Consider the statement “We know nothing about how brain mass relates to body mass and want to be as objective as possible. Let the data speak for itself.” It sounds good. But the statement is saying that, prior to seeing any of the data, any of these regression lines are equally likely:

If the statement of “objectivity” were rephrased as:

  • “I think it’s equally likely that brain mass goes up or goes down as body mass increases.”
  • “I think it’s equally likely that brain mass goes up by 0.001kg or 10kg for every 1kg increase in body mass.”
  • “I think it’s equally likely that an animal with body mass of 10kg has a brain mass of 0.001kg or 10kg.”

Then it becomes clear how ridiculous the original statement is. We have prior understanding of how the world works. The Bayesian framework allows us to incorporate our understanding of the world into our model. For instance:

  • “I think on average brain mass should increase as body mass goes up.”
  • “I think brain mass should go up by around 1% for every 1% increase in body mass.”
  • “I think brain mass should account for around 5% of body mass, but I am highly uncertain.”

These are much more reasonable statements than the “objective” statements. Let’s encode our beliefs into statistical notation:
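One way to write those beliefs down, McElreath-style, on the log-log scale (the specific prior values here are illustrative, chosen only to match the three statements above):

```latex
\log(\text{brain}_i) \sim \text{Normal}(\mu_i,\ \sigma) \\
\mu_i = \alpha + \beta \log(\text{body}_i) \\
\alpha \sim \text{Normal}(\log 0.05,\ 1) \\
\beta \sim \text{Normal}(1,\ 0.5) \\
\sigma \sim \text{Exponential}(1)
```

Centering β at 1 encodes “brain mass goes up by around 1% per 1% increase in body mass,” and centering α at log 0.05 encodes “brain mass is around 5% of body mass,” with wide uncertainty on both.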

We can translate these prior beliefs into a visual aid:
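The visual aid is just a prior predictive simulation: draw parameters from the priors and look at the regression lines they imply, before touching any data. A minimal numpy sketch (the priors mirror the illustrative values above; plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
n_lines = 100

# Draw parameters from the (illustrative) priors, on the log-log scale.
alpha = rng.normal(np.log(0.05), 1.0, size=n_lines)   # intercept: brain ~5% of body
beta = rng.normal(1.0, 0.5, size=n_lines)             # slope: ~1% per 1% of body mass

# Evaluate each implied regression line over a grid of body masses (kg).
log_body = np.log(np.logspace(-2, 5, 50))             # 10 g up to 100,000 kg
log_brain = alpha[:, None] + beta[:, None] * log_body # shape (n_lines, 50)

# Each row of log_brain is one plausible regression line before seeing data;
# plotting these rows against log_body reproduces the ballpark picture.
print(log_brain.shape)  # (100, 50)
```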

Before seeing any data, this is the ballpark of what we think the regression line will look like. This looks much more reasonable than the “objective” assumptions! We overlay the actual data:

So our guess is reasonable, but still a miss. No worries, we can run MCMC to update our estimates! This is what we think the regression line might look like, after seeing the data:

The red line is the MLE. Note how it sits slightly below most of the posterior lines. The Bayesian estimate is, in effect, a weighted average of the MLE and our prior belief.

It’s not perfect, but that’s why modeling is iterative. Perhaps the relationship isn’t linear? I will not go into refining the model because hopefully I have made my point: a well-chosen prior might make more sense than an “objective” stance.

Note how the focus changed from “which statistical test should I use?” to “what do I know about the world?” and “what assumptions make sense?”

How do we choose a good prior? Sometimes we don’t have the background knowledge. It’s uncomfortable. But in the next section hopefully I can convince you that a weakly regularizing prior is better than no prior.

With the right dataset, we can use Bayesian hierarchical models, which learn their priors from the data. As Gelman remarks:

On the theoretical side, hierarchical models allow a more “objective” approach to inference by estimating the parameters of prior distributions from data rather than requiring them to be specified using subjective information.

And that’s why I love hierarchical models so much. They work well when you have domain knowledge, and they hold up decently even when you don’t.
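Schematically, the idea looks like this (a sketch with hypothetical regression coefficients β_j; the hyperpriors are illustrative):

```latex
y_i \sim \text{Normal}(x_i^\top \beta,\ \sigma) \\
\beta_j \sim \text{Normal}(\mu_\beta,\ \tau), \quad j = 1, \dots, p \\
\mu_\beta \sim \text{Normal}(0,\ 1), \qquad \tau \sim \text{Exponential}(1)
```

The hyperparameters μ_β and τ are estimated jointly with everything else, which is the sense in which the prior is learned from the data.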

L1 and L2 Regularization

In practice, people do not value “objectivity” as much as they profess. They, knowingly or not, prefer biased estimates!

We know that elastic net (including ridge and lasso regression) empirically makes better predictions than MLE. The MLE, under standard model assumptions, results in unbiased coefficient estimates. However, we can reduce total error by allowing for some bias:

[Figure: bias–variance tradeoff. Source linked in the original.]
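The message behind that picture can be written down directly: for any estimator, the expected squared error decomposes as

```latex
\mathbb{E}\big[(\hat\theta - \theta)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\text{bias}^2} \;+\; \underbrace{\operatorname{Var}(\hat\theta)}_{\text{variance}}
```

so accepting a little bias is worth it whenever it buys a large enough drop in variance.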

This is done by adding a penalty term to the objective function. Bayesian statistics is all about adding bias in the hope that it will reduce total error. For simplicity, assume Gaussian errors. Under MLE, we want to maximize the Gaussian likelihood:
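Reconstructing the formula (assuming i.i.d. Gaussian errors with variance σ²):

```latex
L(\beta, \sigma^2) \;=\; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\big(y_i - x_i^\top \beta\big)^2}{2\sigma^2}\right)
```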

Which is equivalent to maximizing the log-likelihood:
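That is (reconstructed):

```latex
\log L(\beta, \sigma^2) \;=\; -\frac{n}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2
```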

The convention is to minimize the negative log-likelihood, i.e. minimize the sum of squares.

In ridge regression, we want to maximize this quantity:
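A reconstruction of that quantity (for fixed σ², maximizing it is the same as minimizing the familiar penalized sum of squares):

```latex
-\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2 \;-\; \lambda \sum_{j=1}^{p} \beta_j^2
```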

The first term corresponds to the MLE objective function, while the second L2 term adds a penalty proportional to the squared magnitude of the coefficients. We can convert this to a likelihood function:
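Dividing by 2σ² and exponentiating (a reconstruction) gives

```latex
\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2\right) \times \prod_{j=1}^{p} \exp\!\left(-\frac{\lambda \beta_j^2}{2\sigma^2}\right)
```

The first factor is proportional to the Gaussian likelihood; each factor in the second product is an unnormalized Normal density in β_j.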

And as it turns out, ridge regression is Bayesian regression with a Normal(0, σ²/λ) prior on all the coefficients except the intercept.
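A quick numpy check of that equivalence, using simulated data (a sketch; no intercept, for simplicity). The closed-form ridge solution and the posterior mean (which equals the MAP for a Gaussian posterior) coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
sigma = 1.0
y = X @ true_beta + rng.normal(scale=sigma, size=n)

lam = 3.0  # ridge penalty

# Frequentist view: minimize ||y - Xb||^2 + lam * ||b||^2  (closed form).
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Bayesian view: likelihood y ~ N(Xb, sigma^2 I), prior b ~ N(0, (sigma^2/lam) I).
# The posterior is Gaussian, so its mean and mode (MAP) agree.
prior_var = sigma**2 / lam
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / prior_var)
beta_map = post_cov @ (X.T @ y) / sigma**2

print(np.allclose(beta_ridge, beta_map))  # True: ridge is MAP under this prior
```

Note that σ² cancels out of the Bayesian formula, leaving exactly the ridge normal equations.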

What about lasso? The objective function is:
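In the usual minimization form (reconstructed):

```latex
\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j|
```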

We incorporate an L1 term to the MLE objective. And if we turn this into a likelihood function:
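Again dividing by 2σ² and exponentiating (a reconstruction):

```latex
\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2\right) \times \prod_{j=1}^{p} \exp\!\left(-\frac{\lambda |\beta_j|}{2\sigma^2}\right)
```

This time each factor in the second product is an unnormalized Laplace density with scale 2σ²/λ.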

This is Bayesian regression with a Laplace(0, 2σ²/λ) prior on all the coefficients except the intercept, where we use the posterior mode (MAP) for predictions.

In other words, elastic net is not “objective.” We are saying “I think most of the coefficients will be rather small in magnitude.” And it pays to not be “objective” as it leads to better predictions.

Selecting λ using cross-validation can be viewed as a form of empirical Bayes, i.e. we estimate the prior from the data. Empirical Bayes is a deeply philosophical topic (is a prior calculated from the data even a prior?), so professors ask that we don’t ask too many questions about the rationale. We just know it works really well in practice.

As λ → 0, the coefficients go towards the MLE. It is very unlikely that the chosen λ will be 0, suggesting that a weakly regularizing prior should have benefits over a noninformative prior, even when we don’t have domain knowledge.

I suspect regularized regression is the most commonly used machine learning algorithm in practice. Therefore, most people already use Bayesian regression for its enhanced predictive performance relative to the MLE. It is only a short mental step to accept that, perhaps, regression should be Bayesian by default. Hierarchical regression then relaxes the restriction that every coefficient shares one fixed prior: groups of coefficients can share a prior whose parameters are themselves partially estimated from the data.

In Closing

Repeat after me: being “objective” is not always the best.

It depends on the situation, of course. If we want to infer a causal effect, we might care about obtaining unbiased estimates. But, in many cases, we already prefer biased over unbiased estimates to get “better” point estimates (in terms of total error).

Hopefully this article helped you to break free from the procedural rituals of frequentist statistics.

You are free now.

You can focus on building models, challenging assumptions, and thinking critically about the world.
