Understanding Bayesian Linear Regression model output in R using mtcars

R Train Data
Feb 2, 2024 · 10 min read


The Bayesian Workflow

The Bayesian workflow is a way of thinking about probability and uncertainty that involves updating beliefs as new evidence becomes available.

The process works like this: We start with initial beliefs about a situation and express those beliefs as probability distributions. We call this the prior. We then gather new data and assess the likelihood of observing that data given our beliefs. Finally, combining our prior beliefs with the likelihood of the data gives us our updated beliefs, which we call the posterior.
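In symbols, this is Bayes' theorem, stated here in its standard proportional form (nothing specific to the models below):

p(parameters | data) ∝ p(data | parameters) × p(parameters)

that is, posterior ∝ likelihood × prior.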

Getting Started

In this article, we use the mtcars dataset and the rstanarm package to demonstrate Bayesian linear regression.

The mtcars dataset is built into R and provides information on various aspects of 32 different car models. It was extracted from the 1974 Motor Trend US magazine road tests and covers 1973–74 model-year cars.
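Before modeling, it helps to glance at the two columns we will use (output omitted here):

data(mtcars)
head(mtcars[, c("mpg", "wt")])  # mpg in miles per gallon, wt in 1,000 lbs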

We start by creating a Bayesian linear regression model against our dataset.

library(rstanarm)
data(mtcars)
stan_glm(mpg ~ wt, data = mtcars)
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 32
 predictors:   2
------
            Median MAD_SD
(Intercept) 37.3    2.0
wt          -5.3    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.1    0.4

When we print the model output, we see the coefficients for the intercept and slope. The intercept has a median value of 37.3 mpg. The slope has a median value of -5.3. We interpret the slope by saying that, for each unit increase in wt, the model predicts a decrease of approximately 5.3 mpg.

Unlike ordinary least squares point estimates, the coefficients in a Bayesian linear regression are summaries of entire posterior distributions, so each one comes with a measure of spread in addition to a measure of center. In our case, the MAD_SD (a scaled median absolute deviation) of the intercept and slope is 2.0 and 0.6, respectively.
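If we save the model as an object, we can check that the printed summaries are simply the median and (scaled) median absolute deviation of the posterior draws. This is a sketch; the draws are random, so the numbers will wobble slightly from run to run.

fit <- stan_glm(mpg ~ wt, data = mtcars)
draws <- as.data.frame(fit)  # one row per posterior draw

median(draws$wt)  # ~ -5.3, the printed Median
mad(draws$wt)     # ~ 0.6, the printed MAD_SD (R's mad() is scaled to be comparable to a standard deviation)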

Let’s graph our regression model so we can see what it looks like. We use the geom_abline() layer in ggplot2 to draw a line with the slope and intercept of our fitted model.

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = 37.3, slope = -5.3, color = '#1e466e') +
  ggtitle("mtcars") +
  labs(subtitle = "mpg ~ wt")

Moving On to New Data

The power of Bayesian analysis is the ability to transfer what we learned from previous models or different domains into our new modeling process.

Perhaps the best way to illustrate the Bayesian workflow is by creating a new dataset. Here we create a record of plausible wt and mpg values for cars from the 1990s. Unlike the values in mtcars, these wt and mpg values are somewhat made up, which in this case is itself an argument for Bayesian statistics: we may not want to trust the new data entirely on its own.

new_cars_1990s <- data.frame(
  car = c("Toyota Camry", "Honda Accord", "Ford Taurus", "Chevrolet Lumina", "Nissan Maxima",
          "Mitsubishi Galant", "Subaru Legacy", "Volkswagen Passat", "Buick Regal", "Mercedes-Benz E-Class",
          "Acura Legend", "BMW 3 Series", "Lexus LS 400", "Ford Mustang", "Chevrolet Corvette",
          "Mazda RX-7", "Toyota Supra", "Nissan 240SX", "Honda Prelude", "Volkswagen Golf",
          "Volvo 850", "Audi 100", "Mitsubishi Eclipse", "Pontiac Grand Am", "Ford Explorer",
          "Jeep Cherokee", "Toyota 4Runner", "Chevrolet Blazer", "GMC Jimmy", "Nissan Pathfinder"),
  wt = c(3.2, 2.9, 3.5, 3.8, 3.2, 3.1, 3.0, 3.2, 3.6, 3.9,
         3.3, 3.0, 3.7, 3.5, 3.4, 2.8, 3.1, 3.2, 2.9, 2.8,
         2.5, 3.4, 3.0, 2.7, 3.1, 4.0, 4.2, 3.8, 4.5, 4.3),  # Hypothetical weights in 1,000 lbs, matching mtcars wt
  mpg = c(28, 30, 25, 23, 27, 26, 29, 28, 24, 22,
          26, 28, 24, 18, 15, 19, 22, 21, 25, 26,
          32, 23, 26, 28, 24, 17, 16, 18, 15, 16)  # Hypothetical miles per gallon
)

The Prior

We start with the prior. When we are specifying priors for a model like mpg ~ wt, we are choosing priors for the parameters of our model, which in our case means one parameter for the intercept and one for the slope.

In our first run-through, because we did not specify priors explicitly, stan_glm() chose default priors for the intercept and the slope on its own. We can use the prior_summary() function to see which priors were chosen.

data(mtcars)
fit <- stan_glm(mpg ~ wt, data = mtcars)

prior_summary(fit)
Priors for model 'fit' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 20, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 20, scale = 15)

Coefficients
  Specified prior:
    ~ normal(location = 0, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 0, scale = 15)

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.17)
------

We see that the prior for the intercept is a normal distribution with location 20 and scale 2.5, adjusted to a scale of 15, while the prior for the wt coefficient is also a normal distribution, with location 0 and an adjusted scale of 15. For a normal distribution, the location is the mean and the scale is the standard deviation.
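Where do the adjusted scales come from? By default, rstanarm autoscales its priors to the data. Based on the package's prior-distributions vignette (an assumption worth verifying against your installed version), the intercept scale is multiplied by the standard deviation of the outcome, and a coefficient's scale by sd(y)/sd(x):

# Rough check of the adjusted scales reported by prior_summary()
2.5 * sd(mtcars$mpg)                  # ~15.1, the adjusted scale for the intercept
2.5 * sd(mtcars$mpg) / sd(mtcars$wt)  # ~15.4, the adjusted scale for the wt coefficient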

We know that a normal distribution is defined by its mean and standard deviation, and the location and scale reported by prior_summary() map directly onto those two parameters, so the priors need no conversion. The posterior output of stan_glm(), on the other hand, summarizes each parameter with a median and a median absolute deviation rather than a mean and standard deviation. For a normal distribution the two descriptions are related: the median equals the mean, and the standard deviation can be recovered from a raw median absolute deviation with the qnorm() function and z-scores.


# Example values: a median of 20 and a raw median absolute deviation of 15
median_value <- 20
mad_value <- 15

# For a normal distribution, MAD = sigma * qnorm(0.75), so sigma = MAD / qnorm(0.75)
z_score_75 <- qnorm(0.75)
sigma_estimate <- mad_value / z_score_75

# The median of a normal distribution equals its mean
mean_estimate <- median_value

cat("Estimated Mean:", mean_estimate, "\n")
cat("Estimated Standard Deviation:", sigma_estimate, "\n")
Estimated Mean: 20 
Estimated Standard Deviation: 22.23903

We see that the median is the same as the mean, which is expected for a normal distribution, and that the implied standard deviation (about 22.2) is larger than the raw median absolute deviation. For the priors themselves no such conversion is needed, since the reported scale of 15 is already a standard deviation, so we can graph the two prior distributions directly to see what was chosen for the intercept and slope.


prior_mean <- 20
prior_sd <- 15  # the adjusted scale (standard deviation) reported by prior_summary()

x_values <- seq(prior_mean - 4 * prior_sd, prior_mean + 4 * prior_sd, length.out = 1000)
y_values <- dnorm(x_values, mean = prior_mean, sd = prior_sd)

a <- ggplot(data.frame(x = x_values, y = y_values), aes(x = x, y = y)) +
  geom_line(color = '#1e466e', size = 1) +
  labs(title = "Prior for Intercept")

prior_mean <- 0
prior_sd <- 15

x_values <- seq(prior_mean - 4 * prior_sd, prior_mean + 4 * prior_sd, length.out = 1000)
y_values <- dnorm(x_values, mean = prior_mean, sd = prior_sd)

b <- ggplot(data.frame(x = x_values, y = y_values), aes(x = x, y = y)) +
  geom_line(color = '#1e466e', size = 1) +
  labs(title = "Prior for wt")

library(patchwork)
a + b

We see that a standard deviation of 15 is large relative to the numbers in our dataset. It’s not surprising that stan_glm() chose two distributions with wide spreads; this reflects the absence of information we brought to the model.

Generally, when choosing priors, we should strike a balance between encoding what we already know and staying open to empirical evidence. Practically, this means the width of these priors matters a great deal. If the distributions were very narrow, it would mean we had finely tuned previous beliefs, and it would take a lot of new information to move our view. Wide prior distributions, by contrast, leave open the possibility of changing our minds. At the extreme, we could even put in a rough guess for the center and allow practically infinite width, like a uniform distribution.
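rstanarm supports that extreme directly: per the package documentation, setting a prior argument to NULL uses a flat (improper uniform) prior. A minimal sketch:

# Flat (uniform) priors on both the intercept and the slope
stan_glm(mpg ~ wt, data = mtcars,
         prior_intercept = NULL,
         prior = NULL)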

Let’s now create a model for our new_cars_1990s dataset. We put in the information from our previous model as priors.

stan_glm(mpg ~ wt, data = new_cars_1990s,
         prior_intercept = normal(location = 37.3, scale = 2),
         prior = normal(location = -5.3, scale = 0.6))
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 30
 predictors:   2
------
            Median MAD_SD
(Intercept) 43.6    2.0
wt          -5.6    0.6

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.7    0.6

Here we would say that, for two otherwise comparable cars, the car that weighs 1,000 lbs more is expected to get about 5.6 fewer mpg, on average. We have effectively updated our understanding from our previous model: instead of thinking that an extra 1,000 lbs of weight is associated with about 5.3 fewer mpg, we now believe it is associated with about 5.6 fewer mpg.

Compare this to the output if we had run the Bayesian analysis without carrying over any old information, relying instead on the default priors. In that case we would have estimated 7.1 fewer mpg for each additional 1,000 lbs of weight, instead of 5.6.

stan_glm(mpg ~ wt, data = new_cars_1990s)
stan_glm
 family:       gaussian [identity]
 formula:      mpg ~ wt
 observations: 30
 predictors:   2
------
            Median MAD_SD
(Intercept) 47.2    4.2
wt          -7.1    1.3

Auxiliary parameter(s):
      Median MAD_SD
sigma 3.3    0.4

The Likelihood

The prior works in conjunction with the likelihood in order to get the posterior distribution. So far, we have been demonstrating how changing the priors changes the posteriors, but we haven’t shown how the likelihood function is involved in making that change.

The likelihood function describes the probability of observing the data given particular parameter values. So for the parameter values our priors consider plausible, we ask how likely the actual data would be.

In our case, stan_glm() assumes a Gaussian likelihood because we are modeling a continuous response variable; the likelihood could take a different form depending on the assumed distribution of the response given the predictors. Here’s the equation for the Gaussian likelihood in our linear regression.
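Writing β0 for the intercept, β1 for the slope, and σ for the residual standard deviation, the standard Gaussian likelihood of the n observations is:

L(β0, β1, σ) = ∏ᵢ (1 / √(2πσ²)) · exp( −(mpg_i − (β0 + β1 · wt_i))² / (2σ²) )

with the product running over the i = 1, …, n data points.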

Once we have the likelihood of each data point, we multiply those likelihoods together and, following Bayes’ formula, multiply the result by the prior in order to obtain (up to a normalizing constant) the posterior distribution.
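As a toy illustration of that multiplication, here is a grid-approximation sketch for the slope alone, with the intercept and sigma pinned at rough values from our mtcars model. This is not how rstanarm computes the posterior (it uses MCMC sampling), but it makes the prior-times-likelihood arithmetic concrete.

# Grid of candidate slope values
slope_grid <- seq(-10, 0, length.out = 401)

# Prior for the slope: the adjusted default prior from earlier
prior <- dnorm(slope_grid, mean = 0, sd = 15)

# Likelihood of the mtcars data at each candidate slope,
# holding the intercept and sigma fixed for illustration
likelihood <- sapply(slope_grid, function(b) {
  prod(dnorm(mtcars$mpg, mean = 37.3 + b * mtcars$wt, sd = 3.1))
})

# The posterior is proportional to prior * likelihood; normalize so it sums to 1
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)

slope_grid[which.max(posterior)]  # close to the -5.3 reported by stan_glm()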

As a final note, in practice we typically work with the log-likelihood. Logarithms turn multiplication into addition, so working with the log-likelihood avoids the numerical underflow that comes from multiplying many small probabilities in finite-precision (64-bit floating-point) arithmetic.
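A quick way to see the difference, using the point estimates from above:

# Per-observation densities at the fitted point estimates
dens <- dnorm(mtcars$mpg, mean = 37.3 - 5.3 * mtcars$wt, sd = 3.1)

prod(dens)      # the likelihood: a very small number, prone to underflow with more data
sum(log(dens))  # the log-likelihood: well-scaled and numerically stable
                # equivalently: sum(dnorm(..., log = TRUE))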

The Posterior

Just like with the priors, the parameters for the slope and intercept in the posterior are distributions, not point estimates. We can report the middle of these distributions as our estimates.
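We can also report the spread of these distributions directly as credible intervals. For example, on the mtcars model we saved earlier as fit, posterior_interval() returns central intervals computed from the posterior draws:

posterior_interval(fit, prob = 0.9)  # 90% credible intervals for (Intercept), wt, and sigma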

Let’s take a look at the Bayesian regression line for our new_cars_1990s dataset. At first glance, we might think it doesn’t look like a terribly great fit: the residuals are heavily weighted toward the low side, with several points falling far below the line. But this is not entirely unexpected, because our regression line is informed by both mtcars and new_cars_1990s.

intercept <- 43.6
slope <- -5.6

ggplot(new_cars_1990s, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = intercept, slope = slope, color = '#1e466e') +
  labs(title = "mtcars") +
  labs(subtitle = "Bayesian Regression Lines")

To go further, we can also save the output of our stan_glm() function as an object, coerce this object to a dataframe, and extract and graph the posterior distributions in order to show our uncertainty around our point estimates.

fit <- stan_glm(mpg ~ wt, data = new_cars_1990s,
                prior_intercept = normal(location = 37.3, scale = 2),
                prior = normal(location = -5.3, scale = 0.6))
sims <- as.data.frame(fit)

a <- ggplot(sims, aes(x = wt)) +
  geom_histogram(fill = '#1e466e') +
  ggtitle("mtcars") +
  labs(subtitle = "posterior distribution of wt coefficient")

b <- ggplot(sims, aes(x = `(Intercept)`)) +
  geom_histogram(fill = '#1e466e') +
  ggtitle("mtcars") +
  labs(subtitle = "posterior distribution of intercept")

ggplot(sims, aes(x = sigma)) +
  geom_histogram(fill = '#1e466e') +
  ggtitle("mtcars") +
  labs(subtitle = "posterior distribution of sigma")

library(patchwork)
a + b

Finally, we can sample rows at random from these posterior draws. Plotting the sampled rows as x and y coordinates on a scatterplot, each point represents one plausible pair of intercept and slope values, that is, one plausible set of regression coefficients. So a point such as intercept = 40 and wt = -5 becomes the line y = 40 + (-5 * x).

Now we graph a collection of regression lines instead of one. By doing this, we are effectively illustrating our level of inferential uncertainty in our Bayesian Linear Regression.

library(dplyr)

# Number of random posterior draws to use
num_combinations <- 50

# Sample random combinations of intercept and slope from the posterior draws
random_combinations <- sims %>%
  select(`(Intercept)`, wt) %>%
  slice_sample(n = num_combinations, replace = TRUE)

d <- ggplot(random_combinations, aes(x = `(Intercept)`, y = wt)) +
  geom_point(color = '#1e466e') +
  ggtitle("mtcars") +
  labs(subtitle = "Random sampling of regression coefficients")

e <- ggplot(new_cars_1990s, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = random_combinations$`(Intercept)`, slope = random_combinations$wt, color = '#1e466e') +
  labs(title = "mtcars") +
  labs(subtitle = "Bayesian Regression Lines")

d + e

When we overlay these regression lines on top of our new_cars_1990s dataset, we see a systematic pattern in the residuals, because on its own new_cars_1990s favors a steeper slope of about -7.1. But because we are committed to a Bayesian workflow, our best-fit line is made with consideration to both mtcars and new_cars_1990s.
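To see that pattern directly, we can plot the residuals of the posterior-median line against wt (a quick sketch using the point estimates from above):

resids <- new_cars_1990s$mpg - (43.6 - 5.6 * new_cars_1990s$wt)

ggplot(data.frame(wt = new_cars_1990s$wt, residual = resids), aes(x = wt, y = residual)) +
  geom_point(color = '#1e466e') +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "mtcars") +
  labs(subtitle = "Residuals of the posterior-median line")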
