Model Selection (continuing 6.7.2 and 2.11.13)

We have previously looked at a dataset with fatal airplane accidents. First, we created two different models for the count of fatal accidents:

  1. FatAcc ~ Poisson(𝜃)
  2. FatAcc ~ Poisson(PassengerMiles * 𝜃)

Next, we performed posterior predictive checks on both models. The models assume, among other things, independence from year to year and the absence of a time trend (beyond the exposure model's increases in total passenger miles flown). The posterior predictive checks already gave us reason to abandon Model 2: the no-time-trend assumption was shown to be false.

Nevertheless, as an example for how to do model selection, we shall press on!

Bayesian Data Analysis (3rd ed.) ultimately recommends the leave-one-out cross-validated information criterion (LOO-IC), with the Watanabe-Akaike information criterion (WAIC) as a decent, more lightweight alternative.

Both are implemented in the R package loo, whose functions all expect you to provide log likelihoods for your data from each of your MCMC samples. The easy way to do this is to embed the log likelihood computation directly in the Stan code, so that as you sample, you immediately compute the log likelihoods as well. Alternatively, you could use something like scipy.stats to get access to likelihood functions and compute them yourself while iterating over the sampled parameter values.
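For the second route, a minimal R sketch of computing the log likelihood matrix by hand might look like this, assuming fit is a fitted Model 1 object and FatAcc holds the observed counts (both names are assumptions for illustration):

library(rstan)
# One theta value per posterior draw
theta_draws = extract(fit)$theta
# Rows are posterior draws, columns are observations
log_lik = sapply(FatAcc, function(y) dpois(y, lambda=theta_draws, log=TRUE))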

The Stan code might end up looking like this for Model 1:

data {
  int<lower=0> N;
  int<lower=0> FatAcc[N];
}
parameters {
  real<lower=0> theta;
}
model {
  theta ~ gamma(25.55, 1.07);
  FatAcc ~ poisson(theta);
}
generated quantities {
  int<lower=0> FatAccPred[N];
  real log_lik[N];
  for(n in 1:N) {
    FatAccPred[n] = poisson_rng(theta);
    log_lik[n] = poisson_lpmf(FatAcc[n] | theta);
  }
}

We have added as generated quantities not only the FatAccPred we used in the last post, but also the array log_lik. The name is a convention that makes it easy to use functions from loo to extract the data for later processing: loo::extract_log_lik will look for a generated quantity named log_lik.

We fill the array with the log likelihoods. Since the Poisson distribution is discrete, the log likelihood function has the _lpmf suffix (for log probability mass function) instead of the continuous _lpdf suffix. Likelihood functions in Stan take the data value at which to evaluate first, separated from the parameters by a | (rather than the function-argument comma one might have expected).

Similarly, for Model 2, the Stan code might look like this:

data {
  int<lower=0> N;
  int<lower=0> FatAcc[N];
  real<lower=0> MilesE12[N];

  real<lower=0> MilesE12new;
}
parameters {
  real<lower=0> theta;
}
model {
  theta ~ gamma(25.55, 1.07);
  for(n in 1:N)
    FatAcc[n] ~ poisson(MilesE12[n]*theta);
}
generated quantities {
  int<lower=0> FatAccPred;
  int<lower=0> FatAccPreds[N];
  real log_lik[N];
  FatAccPred = poisson_rng(MilesE12new*theta);
  for(n in 1:N) {
    FatAccPreds[n] = poisson_rng(MilesE12[n]*theta);
    log_lik[n] = poisson_lpmf(FatAcc[n] | MilesE12[n]*theta);
  }
}

Regardless of model, the next step is to use sampling to run the MCMC calculation and get a random sample from the posterior, which, given how these two models were written, also produces predictions and log likelihoods along the way.
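Concretely, fitting Model 2 might look like the following sketch; the file name model2.stan and the workspace variables FatAcc, MilesE12 and MilesE12new are assumptions for illustration, not names from the original posts:

library(rstan)
model2 = stan_model("model2.stan")
# Data list matching Model 2's data block
accident_data = list(N=length(FatAcc), FatAcc=FatAcc,
                     MilesE12=MilesE12, MilesE12new=MilesE12new)
fit2 = sampling(model2, data=accident_data)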

With the log likelihoods generated, extract_log_lik will pull the data out. In order to use LOO-IC, we will need the relative effective sample sizes, which are calculated by comparing different chains of the MCMC output. Adding the argument merge_chains=FALSE to extract_log_lik will produce the data in a shape that allows the function relative_eff to calculate these relative effective sample sizes.

Important note: relative_eff takes likelihoods, not log likelihoods, so we need to exponentiate before using it.

Put it all together, and the command sequence (after creating the Stan model itself) looks like this:

fit = sampling(model, data=...)
log_lik = extract_log_lik(fit, merge_chains=FALSE)
r_eff = relative_eff(exp(log_lik))

With these quantities in place, we can compute WAIC and LOO-IC as follows:

waic(log_lik)
loo(log_lik, r_eff=r_eff)

There is also a function loo_compare that summarizes the comparison for us.
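As a sketch of how that might look here, assuming we have repeated the extraction steps for both models (the names log_lik1, r_eff1 and so on are assumptions):

loo1 = loo(log_lik1, r_eff=r_eff1)
loo2 = loo(log_lik2, r_eff=r_eff2)
# Prints elpd_diff and se_diff for each model, relative to the best one
loo_compare(loo1, loo2)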

Returning to the fatal accidents case, WAIC and LOO-IC produce the same values for the information criterion, resulting in the following:

For Model 1:

Computed from 4000 by 10 log-likelihood matrix

         Estimate  SE
elpd_loo    -30.0 1.8
p_loo         0.8 0.3
looic        60.1 3.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

And for Model 2:

Computed from 4000 by 10 log-likelihood matrix

         Estimate  SE
elpd_loo    -40.3 4.4
p_loo         2.3 0.7
looic        80.6 8.7
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Recall that for information criteria, lower is better. Model 1's looic of 60.1 is well below Model 2's 80.6, and the gap is large relative to the standard errors, so this computation tells us to prefer the first model: the exposure model is not a good enough fit to be preferred here.

Over to you

Just like in the previous posts, it is now your task to compare the models as fitted on the fatality counts from the accidents — these blog posts only deal with the number of accidents, not the number of fatalities.

--

Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis
