Posterior Predictive Checking (6.7.2)

Our previous look at exercise 2.11.13 continues in exercise 6.7.2 with posterior predictive checking of the models we implemented.

The question posed is whether, for each of the models we constructed, we can check

  1. the assumption of independence between the years, and
  2. the assumption of a lack of a time trend

through picking appropriate test statistics and performing a posterior predictive check.

The source exercise poses two models each for two quantities (# of fatal accidents; # of passenger deaths) over a decade: either as independent Poisson distributions with a single constant rate, or as independent Poisson distributions with rate proportional to the total number of passenger miles travelled that year. As in the previous post, we will leave the passenger death count as an exercise for the reader and focus on the accident counts.
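
In symbols (the notation here is mine, not the book's): writing $y_i$ for the fatal accident count and $m_i$ for the passenger miles travelled in year $i$ (scaled as in the MilesE12 data below), the two models are

$$y_i \sim \operatorname{Poisson}(\theta) \qquad \text{or} \qquad y_i \sim \operatorname{Poisson}(m_i\,\theta),$$

with the years independent of each other in both cases.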

For all of our solutions, we will be doing replication predictions: predictions under the exact circumstances that come as input to our models. For the constant rate model, this just means drawing at random from the fitted distribution, while for the exposure model, where rates are proportional to the passenger miles travelled, it means reusing each year's exposure to calculate the rates.

Our modified models look like this for the constant rate model:

```{stan output.var="model"}
data {
int<lower=0> N;
int<lower=0> FatAcc[N];
}
parameters {
real<lower=0> theta;
}
model {
theta ~ gamma(25.55, 1.07);
FatAcc ~ poisson(theta);
}
generated quantities {
int<lower=0> FatAccPred[N];
for(n in 1:N)
FatAccPred[n] = poisson_rng(theta);
}
```

And for the exposure model, our Stan code ends up looking like this:

```{stan output.var="expmodel"}
data {
int<lower=0> N;
int<lower=0> FatAcc[N];
real<lower=0> MilesE12[N];

real<lower=0> MilesE12new;
}
parameters {
real<lower=0> theta;
}
model {
theta ~ gamma(25.55, 1.07);
for(n in 1:N)
FatAcc[n] ~ poisson(MilesE12[n]*theta);
}
generated quantities {
int<lower=0> FatAccPred;
int<lower=0> FatAccPreds[N];
FatAccPred = poisson_rng(MilesE12new*theta);
for(n in 1:N)
FatAccPreds[n] = poisson_rng(MilesE12[n]*theta);
}
```

In both of these models, the main change is the addition of an array in the generated quantities block that draws as many replicated values as we have input years.
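
Before we can check anything, we need fitted model objects. As a sketch of the fitting step, assuming (as in the previous post) a data frame accidents with columns Year, FatAcc and MilesE12, and that knitr has compiled the chunks above into the stanmodel objects model and expmodel, the calls might look like this:

```{r}
library(rstan)

fit = sampling(model,
               data = list(N = nrow(accidents),
                           FatAcc = accidents$FatAcc))
expfit = sampling(expmodel,
                  data = list(N = nrow(accidents),
                              FatAcc = accidents$FatAcc,
                              MilesE12 = accidents$MilesE12,
                              # hypothetical exposure for the next, unobserved year
                              MilesE12new = 0.8))
```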

After running sampling to fit each model, which also produces random draws of the generated quantities alongside the posterior draws, we use extract to get hold of them:

```{r}
FatAccPred = extract(fit, permuted=TRUE)$FatAccPred
expFatAccPreds = extract(expfit, permuted=TRUE)$FatAccPreds
```
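
Each of these is a matrix with one row per posterior draw and one column per year: with rstan's defaults of four chains of 1,000 post-warmup iterations each, a 4000 × 10 matrix of replicated decades.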

Independence

While random variables can be dependent without showing any Pearson correlation, a significantly non-zero correlation is evidence against independence. To avoid getting stuck on whether the dependencies are linear, we can use Spearman or Kendall correlation instead. Thus, one way to check for dependencies between the years is to look at the correlations between adjacent years: the lag 1 autocorrelations. We will use Spearman.

Using sapply we can comfortably calculate our statistic row by row on the simulated data. The following code calculates the autocorrelation of the data and of each simulated decade for the constant rate model.

```{r}
y = cor(accidents$FatAcc[2:nrow(accidents)],
        accidents$FatAcc[1:(nrow(accidents)-1)],
        method="spearman")
yrep = sapply(1:(dim(FatAccPred)[1]), function(i) {
  cor(FatAccPred[i, 2:(dim(FatAccPred)[2])],
      FatAccPred[i, 1:(dim(FatAccPred)[2]-1)],
      method="spearman")
})
```

For the exposure model, we simply replace every instance of FatAccPred with expFatAccPreds, as sketched below.
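
A sketch of that substitution for the replicated statistic (the name expyrep is my choice; y only involves the observed data, so it is unchanged):

```{r}
# Lag 1 Spearman autocorrelation of each replicated decade under the
# exposure model.
expyrep = sapply(1:(dim(expFatAccPreds)[1]), function(i) {
  cor(expFatAccPreds[i, 2:(dim(expFatAccPreds)[2])],
      expFatAccPreds[i, 1:(dim(expFatAccPreds)[2]-1)],
      method="spearman")
})
```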

We calculate a p-value (tail area) using a standard formula: the rank of the observed value among all the values, divided by the number of draws. A nice plot can be produced using ggformula and ggplot2:

```{r}
library(ggformula)  # gf_* plotting functions, built on ggplot2
library(magrittr)   # the %>% pipe
library(glue)

p = rank(c(y, yrep))[1]/dim(FatAccPred)[1]
p = min(p, 1-p)
data.frame(yrep=yrep) %>%
  gf_rug(~yrep, alpha=0.05) %>%
  gf_density(~yrep, fill="blue", alpha=0.25) %>%
  gf_vline(xintercept = y) %>%
  gf_labs(title=glue("one-sided tail area p = {p}")) +
  theme_light()
```
Posterior predictive distribution, and observed data value, for the lag 1 autocorrelation under the constant rate Poisson model. Arguably, since we did not decide in advance whether we were looking for a remarkably high or a remarkably low value, we should double the p-value, yielding p ≈ 0.17275 and no major indication of autocorrelation.
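
Since this density-rug-vline pipeline recurs for every combination of statistic and model below, it can be worth wrapping it in a small helper. The function ppc_plot here is my own convenience wrapper, not part of the original workflow:

```{r}
# Hypothetical convenience wrapper: density and rug of the replicated
# statistic, the observed value as a vertical line, and the one-sided
# tail area in the title.
ppc_plot = function(y, yrep) {
  p = rank(c(y, yrep))[1]/length(yrep)
  p = min(p, 1-p)
  data.frame(yrep=yrep) %>%
    gf_rug(~yrep, alpha=0.05) %>%
    gf_density(~yrep, fill="blue", alpha=0.25) %>%
    gf_vline(xintercept = y) %>%
    gf_labs(title=glue("one-sided tail area p = {p}")) +
    theme_light()
}
```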

The code to do this for the second model is pretty much identical, only requiring the replicated statistic to have been computed from expFatAccPreds, as above.

Posterior predictive distribution, and observed data value, for the lag 1 autocorrelation under the exposure Poisson model. With this graph and this tail area, we have no reason to reject independence based on the correlation: model and data behave quite similarly.

It would seem that the assumption of independence between the years is not obviously unreasonable.

Time trend

The second question was whether there was a time trend in the data that we ignored when modeling. For this, an immediate candidate is to look at correlations between the fatal accident count and the year. We will use Spearman correlation again to allow for non-linear trends. A lot of the code looks the same, only the calculations of y and yrep change somewhat. For the constant rate Poisson model, we get

```{r}
y = cor(accidents$FatAcc,
        accidents$Year,
        method="spearman")
yrep = sapply(1:(dim(FatAccPred)[1]), function(i) {
  cor(FatAccPred[i,],
      accidents$Year,
      method="spearman")
})
```

Calculating p-values and plotting the distributions we get for our first model:

Density plot for the Spearman correlation between the year and the constant rate Poisson model. Neither the p-value nor the plot is terribly dramatic, especially after doubling the p-value to account for the fact that we did not start out looking for a particular tail. A time trend in the data, relative to this model, is not obvious.
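
For the exposure model, the replicated time trend statistic is recomputed from expFatAccPreds in the same way; a sketch, again with a name of my own choosing:

```{r}
# Spearman correlation between year and each replicated decade under the
# exposure model; the observed statistic y stays the same.
expyrep = sapply(1:(dim(expFatAccPreds)[1]), function(i) {
  cor(expFatAccPreds[i,],
      accidents$Year,
      method="spearman")
})
```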

And for the second, exposure Poisson, model:

Density plot for the Spearman correlation between the year and the exposure rate Poisson model. After controlling for the amount of travelling done in any year, the lack-of-time-trend assumption looks far less convincing: the data shows a clear decreasing trend, while the simulations all produce increasing trends. That is to be expected, since the exposure keeps going up as air travel becomes more and more popular.

Over to you, dear reader

Now it’s your turn. Having dealt with the Passenger Death variable before, perform posterior predictive checks to see whether the data is compatible with these two modeling assumptions.

--

Mikael Vejdemo-Johansson
CUNY CSI MTH594 Bayesian Data Analysis
