A first look at computation

We will be digging far deeper into this as the semester goes on, but having already seen a first Bayesian inference step in the 3rd seminar, it’s time for us to try to Compute Things™.

The tool of choice is going to be (because it’s what BDA3 likes) Stan, accessed through the RStan package and — preferably — in an RMarkdown document written in RStudio. If you have taken my lower-level statistics courses, this will all seem awfully familiar.

Stan is its own language, specifically designed for probabilistic and Bayesian computing. As luck would have it, knitr and RStudio both understand Stan, so we can simply put the Stan model specification in a separate code block.

At a minimum, a Stan model has:

  • a data section, specifying what information you will be providing to the model
  • a parameters section, specifying what quantities you are trying to estimate
  • a model section, specifying how everything is connected, and which probability distributions are used throughout.

There is a lot more that can be done — specifying custom written functions, computing transformations on parameters, specifying extra information you want to be able to extract, etc. But for now we will focus on these three and a minimal example.
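For reference (we will not need these yet), each of those capabilities has its own named block in the Stan language; a sketch of the full skeleton, with all bodies elided, looks like:

functions {
  // custom written functions
}
data { ... }
transformed data { ... }
parameters { ... }
transformed parameters {
  // transformations computed from parameters
}
model { ... }
generated quantities {
  // extra information you want to be able to extract
}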

Our example

From the seminar discussion, we met the problem of judging whether or not an infection rate of 10% was plausible, having seen 2 infected in a sample of 20 randomly chosen individuals.

Similarly, in the BDA3 text, there was a classical example estimating whether or not 50% was a plausible rate of female births in Paris in the 1700s based on observations of total numbers of male and female births over several years.

Both of these do roughly the same thing: estimate a probability parameter for a Binomial model. So our model looks approximately the same in both cases:

Given

Observed total number of trials N, and observed successes S.

Estimand

Probability 𝜃

Model

S ~ Binomial(N, 𝜃)

Prior

𝜃 ~ Beta(1,1) (= Uniform)

As we will see, specifying it like this will give us the corresponding Stan code almost immediately for free.

Translating to Stan

First off, Stan code blocks start with ```{stan, output.var="model"}. This makes RStudio handle writing the code out to a file and running the Stan compiler over it automatically. The resulting model object is written to the variable name specified by output.var in the code block initializer.
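For example, the complete chunk delimiters look like this; the Stan code we write below goes between them, and the compiled model lands in the variable model:

```{stan, output.var="model"}
// Stan model code goes here
```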

The three sections are specified as follows:

data {
...
}
parameters {
...
}
model {
...
}

Here, within the { ... }, statements are ;-terminated, and comments are included using //.

For both the data and the parameters block, the contents are a sequence of variable declarations. For our purposes, we will use the data types int and real. We will also be using a really cute feature — Stan allows you to state some of your assumptions about the variables outright, as constraints on the declarations.

Vectors are possible (actually quite common) — but not needed here, so we won’t include those.

Now to translate from our spec to the Stan model. The Given will fit in data, the Estimand will fit in parameters, and the Prior and the Model both fit in model. We know that both N and S are non-negative integers, that S cannot be larger than N, and that 𝜃 is a real number in the interval [0,1]. Transcribing everything looks like this:

data {
  int<lower=0> N;              // total number of trials
  int<lower=0,upper=N> S;      // number of successes; at most N
}
parameters {
  real<lower=0,upper=1> theta; // the probability we are estimating
}
model {
  theta ~ beta(1,1);           // uniform prior on [0,1]
  S ~ binomial(N, theta);      // binomial likelihood
}

This forms a complete Stan program. Running the code block (remember to close it with ```) generates a compiled model object, available in any R code block afterwards.
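Before we can sample, the rstan package needs to be loaded in an R code block. A minimal setup sketch; the mc.cores option is an optional convenience that lets the chains run in parallel:

library(rstan)
# Optional: run the MCMC chains in parallel on all available cores
options(mc.cores = parallel::detectCores())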

Training the model is done using the sampling function in rstan. It takes the model as its first argument, and also the argument data=list(N=20,S=2) — handing the model the data is done through a list with named entries. Under the hood this launches a Markov chain Monte Carlo (MCMC) sampling process (which we will talk more about later in the semester) and will print out piles of information about the process.

Once the model is trained, we can do various things with it. For starters, a really useful information dump comes from the print function. It can look like this:

fit = sampling(model, data=list(N=20,S=2))
# skipped LOTS of printouts here
print(fit, digits=3)
Inference for Stan model: stan-1173d2bab4c1d.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean    sd    2.5%    25%    50%    75%  97.5% n_eff  Rhat
theta   0.134   0.002 0.071   0.030  0.080  0.124  0.177  0.297  1372 1.000
lp__   -9.296   0.020 0.720 -11.432 -9.494 -9.012 -8.822 -8.763  1363 1.002

Samples were drawn using NUTS(diag_e) at Tue Sep 17 20:54:42 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

There is a lot of information packed here. From what we have learned so far in this study group, the key numbers are mean, 50% (median) and the interval with endpoints 2.5% and 97.5% (the 95% central posterior interval).

For 2 infected among 20 observations, we have a mean infection rate estimate of 13.4%, a median of 12.4%, and a central posterior interval of (3.0%, 29.7%). There is no good reason here to reject 10% as a possible value (unless, that is, we doubt the uniform prior — in which case redefining the model and rerunning the sampling will give us new information).
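As an aside, the Beta prior is conjugate to the binomial likelihood, so the exact posterior here is Beta(1+S, 1+N-S) = Beta(3, 19). This gives a handy sanity check on the sampler; a quick sketch, assuming the fit object from above:

# Posterior draws of theta from the fitted Stan model
theta_draws = rstan::extract(fit)$theta
quantile(theta_draws, c(0.025, 0.5, 0.975))

# The same quantiles from the exact conjugate posterior Beta(3, 19)
qbeta(c(0.025, 0.5, 0.975), 1 + 2, 1 + 20 - 2)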

There are also plenty of functions for plotting interesting depictions of the data. Beyond the point-interval estimates we extracted here, the stan_ plotting functions for ggplot2 are quite useful. RStan includes a quick way of setting up the plots through quietgg, so that the command quietgg(stan_dens(fit)), for instance, produces a density plot of the samples from the posterior distribution of 𝜃. We get a pretty wide spread of values that peaks quite close to 10% but spreads out quite far — as could also be seen in the (3.0%, 29.7%) posterior interval.
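A couple of plotting sketches along those lines; stan_hist is a sibling of stan_dens in rstan, and the optional pars argument picks out which parameters to plot:

quietgg(stan_dens(fit, pars = "theta"))  # density of the posterior draws
quietgg(stan_hist(fit, pars = "theta"))  # histogram of the same draws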

Over to you

To check whether you have understood this, I will let you do the next example on your own. BDA3 describes Laplace working on data on births in Paris from 1745 to 1770. The observations were:

  • # girls born: 241,945
  • # boys born: 251,527
  • # total born: 493,472

The question Laplace approached was whether or not 50% was a reasonable estimate for the proportion of girls born.

Your task: Fit a binomial model with some prior that you like (a uniform or some other beta prior is easiest — they only require you to adapt the code here slightly). Then check whether 50% falls within the posterior interval. Draw a histogram or a density plot of the random draws from the posterior distribution. Summarize everything by offering an argument about whether or not a 50% female birth rate can be rejected.
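As a starting point, and nothing more, since the prior choice and the conclusion are up to you: handing the Laplace data to the same model, treating a girl born as a success, would look like

fit_laplace = sampling(model, data = list(N = 493472, S = 241945))
print(fit_laplace, digits = 4)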

Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis
