As part of my Algorithms in Journalism course at Columbia, I re-implemented your analysis to explore a question: what happens if we look at the last 20 payments, rather than the last 10 or 15 as in your analysis? The notebook is here: https://github.com/jstray/lede-algorithms/blob/master/week-6/stormy-daniels-payments-simulation.ipyn

The answer is that the likelihood of some subset of 20 random payments summing to within $1 of $130,000 is about 80%! So if we look just a little farther back through the payment stream than you did, it's very likely we'll find a subset that gets close. The sensitivity to this free parameter casts doubt on the robustness of this type of statistical test.
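To make that sensitivity easy to probe, here is a minimal sketch of the kind of Monte Carlo estimate involved. The payment model below (a lognormal) and the helper names are stand-ins I made up for illustration; the actual notebook draws from the observed payment stream. It uses a meet-in-the-middle trick so the 2^n subsets are never enumerated all at once:

```python
import numpy as np

def subset_sums(amounts):
    """All 2^k subset sums of a short list of amounts."""
    sums = np.zeros(1)
    for a in amounts:
        sums = np.concatenate([sums, sums + a])
    return sums

def hit_probability(n_payments, target=130_000, tol=1.0,
                    trials=2_000, rng=None):
    """Monte Carlo estimate of the chance that SOME subset of
    n_payments random payments sums to within tol of target."""
    rng = rng or np.random.default_rng(0)
    hits = 0
    for _ in range(trials):
        # HYPOTHETICAL payment model -- the real analysis should
        # resample from the actual FEC payment amounts instead.
        payments = rng.lognormal(mean=9.5, sigma=1.2, size=n_payments)
        # Meet in the middle: subset sums of each half, then for
        # each left-half sum, binary-search the sorted right-half
        # sums for a partner within tol of the target.
        half = n_payments // 2
        left = subset_sums(payments[:half])
        right = np.sort(subset_sums(payments[half:]))
        lo = np.searchsorted(right, target - left - tol)
        hi = np.searchsorted(right, target - left + tol)
        if np.any(hi > lo):
            hits += 1
    return hits / trials
```

Rerunning `hit_probability` for n_payments = 10, 15, 20 is exactly the free-parameter sweep I mean: the estimate climbs steeply as the window widens.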

What is the “correct” number of payments to use as a baseline for simulation? In other words, what is the universe of events that we’re using to calculate the probability of this particular thing happening? In a previous comment you noted that “the set was fixed at ten because we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments,” which has a pleasant “let’s let the data dictate the model” kind of rationale.

But the whole concept of frequentist inference is that we reason about the statistics of the process, independent of the observed data, so it’s not clear to me that this argument makes sense — or that any argument about the “right” or “objective” number of payments to check with this method can really be solid.

I’d prefer to see a fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without the Stormy payoff. Then the result would be expressed as a Bayes factor, which I would find much easier to interpret — and this approach would also use all available data and force us to make a bunch of domain assumptions explicit.
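As a purely illustrative toy of what the Bayes-factor framing looks like — fake data, a made-up lognormal likelihood, and arbitrary priors, none of which reflects the real FEC stream — the computation is just a ratio of marginal likelihoods, each averaged over its prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# HYPOTHETICAL "observed" payment stream; the real analysis
# would use the actual FEC records.
observed = rng.lognormal(9.5, 1.2, size=20)

def log_lognormal_pdf(x, mu, sigma):
    """Log density of a lognormal at x."""
    return (-np.log(x) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - (np.log(x) - mu) ** 2 / (2 * sigma ** 2))

def log_marginal(data, mus, sigmas):
    """Monte Carlo marginal likelihood: average the data
    likelihood over prior draws of (mu, sigma)."""
    ll = np.array([log_lognormal_pdf(data, m, s).sum()
                   for m, s in zip(mus, sigmas)])
    # log-mean-exp for numerical stability
    return np.logaddexp.reduce(ll) - np.log(len(ll))

# Two toy models with different priors on the payment-size
# parameter -- the choices here are arbitrary placeholders for
# real domain assumptions about the two generation processes.
n = 5_000
m1 = log_marginal(observed, rng.normal(9.5, 0.5, n), rng.uniform(0.5, 2.0, n))
m0 = log_marginal(observed, rng.normal(8.0, 0.5, n), rng.uniform(0.5, 2.0, n))
log_bf = m1 - m0  # log Bayes factor; > 0 favors model 1
```

The point is not these particular likelihoods — it’s that every assumption (the payment model, the priors, what “with the payoff” means) has to be written down, rather than hiding inside a choice like “check the last 10 payments.”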

Most fundamentally, I worry that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules, and what is typical campaign practice, for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what applying domain knowledge to significance testing looks like, see FiveThirtyEight’s critique of statistical tests for tennis fixing: https://fivethirtyeight.com/features/why-betting-data-alone-cant-identify-match-fixers-in-tennis/