Introduction to Bayesian Modelling with Stan and R
For the past month, I have been learning how to apply Bayesian statistics using Stan. For those who aren’t familiar with the Bayesian way of thinking, it’s an approach to statistics where you start from a prior belief and update it with the actual data (formally, the posterior is proportional to the likelihood times the prior). Genius, right?
Let’s take a simple example where you have one column of data containing the weights of dogs. Assume the weights follow a normal distribution and you want to estimate the parameters of that distribution. I say “assume” because, strictly speaking, a normal distribution spans from negative infinity to positive infinity, while a weight can never be negative. Recall that a normal distribution has two parameters: the mean (mu) and the standard deviation (sigma).
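To get a feel for that assumption, here’s a quick sketch in base R that plots the density we will use to simulate the data below. The mean of 10 and standard deviation of 5 match the simulation that follows; the plot labels are my own choices.

# Plot the density of a Normal(10, 5) distribution.
# Note that some of its mass sits below zero, even though
# real dog weights can never be negative.
curve(dnorm(x, mean = 10, sd = 5), from = -10, to = 30,
      xlab = "Weight", ylab = "Density",
      main = "Assumed distribution of dog weights")
abline(v = 0, lty = 2) # anything left of this line is impossible in reality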
The following code will generate some data. We will also store it in a named list called “my_data”. This is the data format that Stan takes.
# Generate 10 data points following a normal distribution
set.seed(2019)
weights <- rnorm(10, 10, 5)
my_data <- list(N = length(weights),
                x = weights)

> round(weights, 0)
 [1] 14  7  2 15  4 14  6 13  3  8
Awesome! You have some data. Now, as I mentioned previously, you can provide your prior beliefs to the model. These are known as priors, and they encode your statistical assumptions about the data. Choosing an appropriate prior can be a challenge in itself, but one naive way (and generally not the best!) is to specify a uniform distribution and let the data shape the posterior distribution. In Stan, this is what happens when you don’t specify any priors for your parameters. If you’re wondering what Stan is, it’s a probabilistic programming language for Bayesian analysis. Stan is the current state-of-the-art platform for performing Bayesian inference [1], and it has interfaces for many languages, but here I will focus on R.
For this example, we have two parameters that we want to estimate (mu and sigma). To keep things simple, we will stick with uniform priors. But a standard deviation cannot go below 0, so we must build this constraint into our model definition.
Here’s what the code for the Stan model looks like:
model <- "
data {
int N; // # of observations
vector[N] x; // vector of weights
}
parameters {
real mu; // What we want to estimate
real<lower=0> s; // What we want to estimate with a condition
}
model {
target += normal_lpdf(x | mu, s); // Likelihood
}"
Great, now we can draw samples from the posterior distribution of the parameters.
library(rstan)

fit <- sampling(stan_model(model_code = model), data = my_data)
plot(fit)
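The plot gives a quick visual of the posterior intervals. To see the numbers behind it (the ones summarized below), you can print the fitted object with rstan:

# Print posterior summaries (mean, sd, quantiles) for both parameters
print(fit, pars = c("mu", "s"))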
You now have a posterior distribution for both parameters of your model. It says that:
- mu has a mean of 8.44 and a standard deviation of 1.86
- sigma has a mean of 5.80 with a standard deviation of 1.63
These aren’t exactly the values we used to generate the data, but that’s because the sample was too small. The result is still credible, though, because the posterior intervals cover the true values: 10 for mu and 5 for sigma. I re-ran this example using 50 numbers instead of just 10 (the rerun is sketched after these results). Here’s the output:
- mu has a mean of 9.63 and a standard deviation of 0.77
- sigma has a mean of 5.32 with a standard deviation of 0.55
The intervals are narrower and closer to the expected values.
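For reference, here’s a minimal sketch of that rerun; everything stays the same except the amount of simulated data. The name fit50 is my own, and I don’t know which seed produced the numbers above, so treat the set.seed value here as illustrative.

# Same simulation as before, but with 50 observations
set.seed(2019)
weights <- rnorm(50, 10, 5)
my_data <- list(N = length(weights),
                x = weights)

fit50 <- sampling(stan_model(model_code = model), data = my_data)
print(fit50, pars = c("mu", "s"))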
So why do we need Stan? Can’t we solve this analytically? For a basic model, yes, but not for complicated ones. Also, there are packages like rstanarm that offer Bayesian inference out of the box, but they aren’t fully customizable. As a side note, if you are familiar with Prophet, the open-source forecasting package released by Facebook, it uses Stan in the backend [2]. You might be wondering: if I don’t plan on building the next state-of-the-art forecasting library, why should I care? I think you should care because Stan gives you full control over how you model the data.
Someone once told me that many people try to fit the data to a specific machine learning model. The difference with Bayesian inference is that we fit the model to the data.
Below, I provide a few examples that grow in complexity for those who wish to learn how to use Stan.