Posterior Predictive Checking (6.7.2)
Our previous look at exercise 2.11.13 continues in exercise 6.7.2 with posterior predictive checking of the models we implemented.
The question posed is whether, for each of the models we constructed, we can check
- the assumption of independence between the years, and
- the assumption of a lack of a time trend
by picking appropriate test statistics and performing a posterior predictive check.
The source exercise poses two models each for two quantities (the number of fatal accidents and the number of passenger deaths) over a decade: either as independent Poisson distributions with a common constant rate, or as independent Poisson distributions with rate proportional to the total number of passenger miles travelled that year. As in the previous post, we will leave the passenger death counts as an exercise for the reader and focus on the accident counts.
For all of our solutions, we will be doing replication predictions: predictions under the exact circumstances that come as input to our models. For the constant rate model, this just means drawing at random from that distribution, while for the exposure model, where rates are proportional to the passenger miles travelled, it means reusing each year's exposure to calculate the rates.
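For intuition, a single replication under the constant rate model amounts to the following (a toy sketch in plain R, drawing the rate from the prior rather than from a fitted posterior):

```{r}
theta = rgamma(1, shape = 25.55, rate = 1.07)  # one draw of the accident rate
yrep = rpois(10, theta)                        # ten replicated yearly counts
```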
Our modified models look like this for the constant rate model:
```{stan output.var="model"}
data {
  int<lower=0> N;
  int<lower=0> FatAcc[N];
}
parameters {
  real<lower=0> theta;
}
model {
  theta ~ gamma(25.55, 1.07);
  FatAcc ~ poisson(theta);
}
generated quantities {
  int<lower=0> FatAccPred[N];
  for(n in 1:N)
    FatAccPred[n] = poisson_rng(theta);
}
```
And for the exposure model, our Stan code ends up looking like this:
```{stan output.var="expmodel"}
data {
  int<lower=0> N;
  int<lower=0> FatAcc[N];
  real<lower=0> MilesE12[N];
  real<lower=0> MilesE12new;
}
parameters {
  real<lower=0> theta;
}
model {
  theta ~ gamma(25.55, 1.07);
  for(n in 1:N)
    FatAcc[n] ~ poisson(MilesE12[n]*theta);
}
generated quantities {
  int<lower=0> FatAccPred;
  int<lower=0> FatAccPreds[N];
  FatAccPred = poisson_rng(MilesE12new*theta);
  for(n in 1:N)
    FatAccPreds[n] = poisson_rng(MilesE12[n]*theta);
}
```
In both of these models, the main change is an array in the generated quantities block that generates as many replicated values as we have input years.
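Concretely, fitting might look like this (a sketch, assuming the Stan chunks above compiled into model and expmodel, an accidents data frame with FatAcc and MilesE12 columns, and taking MilesE12new = 0.8, i.e. 8e11 passenger miles, an assumed value for the prediction year):

```{r}
library(rstan)

# Constant rate model only needs the yearly counts.
fit = sampling(model,
               data = list(N = nrow(accidents),
                           FatAcc = accidents$FatAcc))

# The exposure model also needs the exposures, plus one for the
# year we want a new prediction for (in units of 1e12 miles).
expfit = sampling(expmodel,
                  data = list(N = nrow(accidents),
                              FatAcc = accidents$FatAcc,
                              MilesE12 = accidents$MilesE12,
                              MilesE12new = 0.8))
```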
After running sampling to fit the models and draw from the posterior (and thereby the generated quantities), we use extract to get hold of the random draws:
```{r}
FatAccPred = extract(fit, permuted=TRUE)$FatAccPred
expFatAccPreds = extract(expfit, permuted=TRUE)$FatAccPreds
```
Independence
While random variables can be dependent without any Pearson correlation, a significantly non-zero Pearson correlation is evidence against independence. An alternative is to use Spearman or Kendall correlation, which avoids getting stuck on whether a dependency is linear or non-linear. So one way to check for dependencies between the years is to look at the correlations between adjacent years, the lag 1 autocorrelations. We will use Spearman.
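As a quick illustration of why a rank correlation is convenient here (toy data, not from the exercise): a monotone but strongly non-linear trend gets full credit from Spearman while Pearson understates it.

```{r}
x = 1:10
z = exp(x)                       # monotone, but very non-linear in x
cor(x, z)                        # Pearson: well below 1
cor(x, z, method = "spearman")   # Spearman: exactly 1, ranks agree
```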
Using sapply we can comfortably calculate our statistic on the simulated data. The following code calculates the lag 1 autocorrelation for the observed data and for each simulated replication under the constant rate model.
```{r}
y = cor(accidents$FatAcc[2:nrow(accidents)],
        accidents$FatAcc[1:(nrow(accidents)-1)],
        method="spearman")
yrep = sapply(1:(dim(FatAccPred)[1]), function(i) {
  cor(FatAccPred[i, 2:(dim(FatAccPred)[2])],
      FatAccPred[i, 1:((dim(FatAccPred)[2])-1)],
      method="spearman")
})
```
For the exposure model, we simply replace every instance of FatAccPred with expFatAccPreds.
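Spelled out, that replacement gives (a sketch; the observed statistic y is unchanged, since it is computed from the data either way):

```{r}
expyrep = sapply(1:(dim(expFatAccPreds)[1]), function(i) {
  cor(expFatAccPreds[i, 2:(dim(expFatAccPreds)[2])],
      expFatAccPreds[i, 1:((dim(expFatAccPreds)[2])-1)],
      method="spearman")
})
```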
We calculate a p-value (tail area) as the rank of y among all the draws divided by the number of draws, folded to the smaller tail. A nice plot can be produced using ggformula and ggplot2:
```{r}
library(ggformula)  # gf_* plotting verbs; attaches ggplot2
library(glue)

p = rank(c(y, yrep))[1]/dim(FatAccPred)[1]
p = min(p, 1-p)
data.frame(yrep=yrep) %>%
  gf_rug(~yrep, alpha=0.05) %>%
  gf_density(~yrep, fill="blue", alpha=0.25) %>%
  gf_vline(xintercept = y) %>%
  gf_labs(title=glue("one-sided tail area p = {p}")) +
  theme_light()
The code to do this for the second model is pretty much identical; only yrep (and the count of draws it is ranked against) needs to be computed from expFatAccPreds instead.
It would seem that the assumption of independence between the years is not obviously unreasonable.
Time trend
The second question was whether there is a time trend in the data that we ignored when modeling. An immediate candidate test statistic is the correlation between the fatal accident count and the year. We will use Spearman correlation again to allow for non-linear trends. A lot of the code stays the same; only the calculations of y and yrep change somewhat. For the constant rate Poisson model, we get
```{r}
y = cor(accidents$FatAcc,
        accidents$Year,
        method="spearman")
yrep = sapply(1:(dim(FatAccPred)[1]), function(i) {
  cor(FatAccPred[i,],
      accidents$Year,
      method="spearman")
})
```
Calculating p-values and plotting the distributions we get for our first model:
And for the second, exposure Poisson, model:
Over to you, dear reader
Now it’s your turn. Having dealt with the Passenger Death variable before, perform posterior predictive checks to see whether the data is compatible with these two modeling assumptions.