# The Statistics Skirmishes

## And the limitations of Twitter for science discussion

“Today I speak to you of war. A war that has pitted statistician against statistician for nearly 100 years. A mathematical conflict that has recently come to the attention of the ‘normal’ people.” – Kristin Lennox

#### Table of Contents

- Preamble
- Combatants
- Conversation #1: a significant interaction
- Conversation #2: alternative measures of evidence
- Conversation #3: priors
- Alternatives to Twitter
- Appendix: timeline of recent events

### Preamble

During April 2019 the ‘Statistics Wars’, fuelled partly by a Bayes versus Frequentist divide, played out in prolonged scuffles on Twitter (I have gathered the Tweets below). Aside from the comments sections of rarefied blogs that few people are aware of, Twitter is, unfortunately, where combatants face off in this war of attrition. The initial exchange was prompted partly by an opinion piece in Nature which over the weekend following publication garnered the highest Altmetric score ever, “with an attention score of over 12,000”. The debate is obviously contentious and I’d suggest the reader gauge their own ‘prior’ and note if it shifts after reading these leading experts. Whether you are indifferent, ignorant, amused or stubborn, there is something to gain from them.

Bearing in mind how much of the literature the participants in this exchange have consumed and produced during their careers to date, each Tweet is compacted knowledge and to the point. There is personality and humour within them too, which is a wonderful thing; personality is too often subdued in textbooks and public lectures (aside from the gem that is Stephen Senn, of course). This is a rare opportunity, then, for those of us who do not make it onto the ‘important’ table at the conference gala event, to eavesdrop on the people-with-names who might be overheard in effusive and suddenly combative discussion. After all, the mockery alone is entertaining:

Dichotomania.com: More than 1,645 signatories (p-value < .05) call to ban p-values: The following 1,646 scientists, statisticians, doctors, librarians, plumbers, and YouTube stars from 117 countries and 2 planets are signatories for tar and feathering frequentists

Despite practising statistics for 20 years I can never shake the feeling of being an apprentice, an insatiable onlooker, with so much to observe and absorb and enjoy. (Likely because I’ve moved around a stupid amount within the field: government, academia, industry, clinical trials, epidemiology, machine learning … perpetually on the verge of becoming an expert). For as long as I can remember, I have turned immediately to the Letters section of the science magazine to watch with voyeuristic glee the pedantic academics bicker with one another as if their reputation were at stake. To me, this was the scientific method in action; the rest of the magazine offered the assurance of ‘breakthroughs’ and had the feel of marketing. But motivation and ideas, and hence change, are incited by competition and disagreement.

Thus, if it takes Twitter to elicit a lively confrontation among these thinkers then I can bear it. But let’s quickly retrieve it, while we can, and place it where we might share it and reread it (it does warrant a careful re-reading). My initial motivation for collecting the Tweets was to display them for colleagues, especially clinical colleagues who do not have the time or inclination to pore over and reconstruct disjointed Tweets. But I can imagine there are many more who do not access Twitter but who, like me, see value in the dialogue (no matter how intransigent and therefore futile the exchange appears). This exercise (extracting and displaying Tweets) makes Twitter’s limitations apparent. Hence, at the bottom of this piece I will suggest alternative platforms for conversation in the hope that some might be enticed to engage in a more prolonged discussion.

A few things to note about how to read the conversations that follow:

- if you are not familiar with the Statistics Wars you might first read the timeline of recent events in the Appendix below
- an asterisk (⁕) indicates that the conversation forks (bifurcates)
- a sideways pointing finger (☞) indicates a response to the preceding comment
- an upward pointing finger (☝) indicates a response to the comment where the conversation last forked; two fingers (☝☝) indicates two forks above; etc
- if it’s not clear what the Tweet is a response to, click on the author’s name and it will take you to that comment on Twitter

### Combatants

“If those with a high level of statistical training make frequent interpretation errors, could frequentist statistics be fundamentally flawed? Yes!” *The Bayesian*: Frank Harrell (Twitter, Blog, Book)

“[T]he severe testing philosophy … brings into focus the crux of the issues around today’s statistics wars without presupposing any one side of the debate.” *The Philosopher*: Deborah G. Mayo (Twitter, Blog, Book)

“I do find compelling the arguments that at least some of the problems would have been mitigated by teaching a dualist approach to statistics.” *The Dualist*: Sander Greenland (Twitter, Blog, Book)

“The ASA statement failed to come to any conclusions about what should be done. That is why it’s unlikely to have much effect on users of statistics.” *The Reformer*: David Colquhoun (Twitter, Blog, Book)

(With a cameo appearance by Stephen Senn.)

### Conversation #1: a significant interaction

**Frank Harrell** Confidence intervals cannot be interpreted in such a helpful way […]. As problematic as confidence intervals are, they are still far better than p-values.

☞ **Sander Greenland** Don’t identify P-values with sig testing (a misuse of P that’s part of the problem). Agreed, it’s sad 40 years after Rothman, Altman, Bland etc. started we still have to campaign for estimates, but Ioannidis’s incoherent and ill-motivated defense of stat sig leaves us no choice. *[Ioannidis: see **Wikipedia**]*

☞⁕ **Frank Harrell** Ioannidis’ solution is highly problematic. But the fact that significance testing and p-values are instantly abuse-able speaks to problems with their foundations.

☞ **Deborah G. Mayo** Not at all, their foundations are clear, it’s classical Bayesian foundations that have been rejected, except for staunch subjectivists (Excursion 6 of SIST). *[SIST=Statistical Inference as Severe Testing, Deborah’s book]*

☞ **Frank Harrell** Not at all. The defense of Bayes by staunch subjectivists is complete. The foundations of frequentist inference are unclear, motivated by oversimplifying the problem because of lack of computing tools at the time, and don’t provide the inputs for optimal decision making.

☞ **Sander Greenland** The foundations of frequentist decision theory are quite clear: Under broad conditions, Bayes procedures are admissible and complete: https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730345 … I marvel at how interlinked freq and Bayes are, yet how some still fight as if each require exclusive religious loyalty.

☞ **Frank Harrell** Yes it is amazing how deeply felt the probability issue is. What is intolerable to me is the small minority who were taught in grad school that “Bayes is just wrong” who continue to believe their professors. I was taught the same but eventually rejected my professors’ beliefs.

☞ **Sander Greenland** Yes, but now we have a vocal minority claiming freq is just wrong. It’s crazy because valid freq and Bayes don’t address the same question and can be combined like concrete & rebar for more strength than either alone. Plus sound inference often requires more, e.g. causal models.

☞ **Frank Harrell** Bayes is imperfect. I embrace Bayes because it has the least number of serious problems.

☞ **Deborah G. Mayo** 1 is enough when it’s fatal

☞ **Sander Greenland** Admitting Bayes is imperfect, can you bring yourself to admit that circumspect frequentism (not voodoo like “significance” and “confidence”) addresses some (not all) of its imperfections? Causal modeling addresses others. And cognitive psych could address the big problem: Humans.

☞ **Frank Harrell** Yes it addresses some of the imperfections. But forgive me for not getting excited about indirect reasoning, reliance on unobservables, sample space complexity, “proof” by contradiction, multiplicity, asymptotics, and non-predictive mode probabilities.

☞ **Sander Greenland** That’s misleading. Asymptotics are just math tools used by everyone eg Laplace for Bayes. Freq does not need proof by contradiction. Unobservables are a problem for everyone, ignoring them is the peril — everyone needs error probes for as some Bayesians know all models are wrong.

☞ **Frank Harrell** Not surprising that we disagree on all of those points.

☞ **Sander Greenland** You deny that Bayes stats can use asymptotic approximations? That frequentists can bypass them? That refutation is not proof by contradiction? That like every method Bayesian analyses can suffer from bias from uncontrollable selection, confounding, and measurement error? Really??

☞ **Frank Harrell** You’re not going to change on these points and neither will I so not clear this is fruitful. Bayes can be done with zero asymptotics. A small portion of frequentist also can. Not interested in refutation. Measurement error is a problem for everyone, and selection usually is.

☞ **Sander Greenland** Not clear why you care about asymptotic approximations given you admit “measurement error is a problem for everyone, and selection usually is.” Why should I then care about being exact? And “Not interested in refutation”? For many that just says you aren’t interested in science.

☞ **Frank Harrell** Reaching extreme conclusions about others’ motivations seems to be your style. And I don’t care or need asymptotic approximations. Signing off.

☞ **Deborah G. Mayo** I don’t see the reference to motivations here.

☞ **Frank Harrell** Re-read. “Not interested in science”.

☞ **Sander Greenland** I was referring to Popperians, for whom refutation is the core of scientific reasoning. Leading critics of stat sig include Popperians! I’m no Popperian, but Popper had a lot of good points which have led many since to take refutation as a central part of scientific research…

☞ **Deborah G. Mayo** Only those w/ a superficial philosophical understanding of Popper would oppose stat tests–and even those I know are happy to (try to) use severe tests for their own purposes.

☞ **Sander Greenland** Rubbish. Among scientists I worked with Charles Poole was one of the most well-read in Popper among anyone I’ve met, and decried stat tests (as developed so far) as ‘caricatures of falsification’. Stat tests are just data processing algorithms, far from complete scientific tests.

☞ **Deborah G. Mayo** But you can use them to falsify statistically, & Popper endorsed Fisherian tests & the requirements for them to work (not isolated, check assumps, solve Duhemian problems, avoid fallacies), regretted not employing stat tests more in his phil sci.

☝ **Sander Greenland** Not really: P-values have been around 300 years, a quick checking tool that a century ago got blown up and promoted as a magical significance detector. Making this a foundations issue misses the real problem: human incompetence and demands for firm answers from vague information.

☞ **Frank Harrell** Agree that the last part is all-important. The approach to decision making in the face of uncertainty is the next key; predictive mode vs. what-could-have-happened mode.

☞ **Deborah G. Mayo** What could have happened, & how this tool would have handled it, will always be essential for rooting out poor/good tests.

☞ **Frank Harrell** In your world view. Not mine.

☝ **Deborah G. Mayo** p-values can be abused by taking stat sig as substantive & by failing to take account of selection effect, stopping rules, multiple testing. Bayesians (w/ exceptions) think that permitting these, ignoring their impact, makes them go away. It doesn’t.

☞⁕ **Frank Harrell** Using the Bayesian approach these completely go away if your choice of prior is not vetoed by someone of importance. Interpretation of one piece of evidence, unlike in the sample space world, does not depend on what other pieces were examined.

☞ **Sander Greenland** In human brains interpretation very much depends on what other pieces were examined. Even in the Bayes world, you haven’t said why I the reader should trust your prior, knowing you may well have picked it to favor the conclusion you want.

☞ **Frank Harrell** Bayes doesn’t require you to trust my prior. It requests the decision maker to. More to the point, anyone who unjustifiably chooses a prior to favor themselves will be and should be distrusted.

☞ **Deborah G. Mayo** That requires being able to discern they’d chosen a biased prior, e.g., by uncovering selection effects, multiple testing, data-dependent stopping rules. But you say those are irrelevant for inference once the data are in hand, right?

☞⁕ **Frank Harrell** Yes irrelevant. A biased (e.g., overly optimistic) prior is uncovered by just looking at the prior.

☞ **Deborah G. Mayo** I wish the ASA doc would openly admit that “alternative measures of evidence” (like yours) renders selection effects, multiple testing, stopping rules irrelevant. Why not just come out with it? *[ASA doc: in 2016 the American Statistical Association released a **statement** on p-values; see Appendix below for timeline]*

☞ **Frank Harrell** Already covered mathematically by Jim Berger for many of those cases: *[**Bayesian Multiplicity Control**]* Multiplicity comes from chances one gives data to be extreme, not from chances for assertions to be true.

☞ **Deborah G. Mayo** Error statisticians hold the first (adding “even if generated by Ho”) and not the second. Ironically, it’s Bayesian PPVs that are all about chances H is true (from an urn of H’s w/a given PREV of true H’s). *[PPV: positive predictive value. PREV: prevalence. Ho: null hypothesis.]*

☞ **Frank Harrell** There is an important distinction: Bayesian PPV comes from the prior and data and “chances” H is true are not a real part of it.

☝☝ **Deborah G. Mayo** Understanding the capabilities of any method is a matter of seeing what they would do with other data: Else we are cut off from criticizing someone from trying and trying again & all the other QRPs that cause replication problems. *[QRP: questionable research practice]*

☞ **Frank Harrell** I can see that being true in many but not all contexts. In many experiments we don’t have the luxury or ethics to repeat.

☞ **Deborah G. Mayo** I was alluding to hypothetical or simulated or writing down sampling distns. We can see certain data dredging methods would allow inferring H for any data. Not a matter of literally applying a tool over & over–else we’d never understand properties of methods & measurements.

☞ **Frank Harrell** This is a general issue. Your approach deals with hypotheticals. Bayes deals with observables.

☝ **Deborah G. Mayo** Oh Please! If merely looking at a prior would reveal if it’s overly optimistic we wouldn’t have dozens of rival priors & ways to get them. Bayesians would all recommend the non biased one, right? What we get in fact is a prior w/very convincing “spin”.
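Mayo’s “trying and trying again” point above can be made concrete with a small simulation of my own (not from the thread): under a true null hypothesis, peeking at the data after every batch and stopping at the first p < 0.05 pushes the false-positive rate far above the nominal 5%. It is exactly the kind of property you only see by asking what a method would do with other data.

```python
import math
import random

def z_pvalue(mean, n, sigma=1.0):
    """Two-sided p-value for H0: mu = 0, known sigma (normal z-test)."""
    z = abs(mean) * math.sqrt(n) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def one_trial(looks=10, batch=10, alpha=0.05):
    """Simulate null data; test after each batch, stop at first p < alpha."""
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += random.gauss(0, 1)
            n += 1
        if z_pvalue(total / n, n) < alpha:
            return True  # declared 'significant' even though H0 is true
    return False

random.seed(1)
sims = 4000
peeking_rate = sum(one_trial() for _ in range(sims)) / sims
fixed_rate = sum(
    z_pvalue(sum(random.gauss(0, 1) for _ in range(100)) / 100, 100) < 0.05
    for _ in range(sims)) / sims
print(f"single look at n=100: {fixed_rate:.3f}")   # close to 0.05
print(f"10 peeks, stop early: {peeking_rate:.3f}") # well above 0.05
```

The single fixed-sample test holds its nominal error rate; the “peek and stop” procedure, applied to the very same data-generating process, roughly quadruples it.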

### Conversation #2: alternative measures of evidence

Raj Mehta, MD starts the conversation by asking: “How would you explain the appropriate logic, when one wants to interpret a confidence interval to estimate effect sizes for real world application?”

⁕ **Frank Harrell** This is what makes many people become Bayesian. There is no interpretation of confidence intervals that is simultaneously (1) useful and (2) completely correct. Look at what [Sander Greenland] describes — compatibility intervals. Inference indirect: set of non-rejected true values.

☞ **Sander Greenland** This is just the usual confirmationist-Bayes mistake: The inference is direct against values OUTSIDE the interval: the model and data have poor compatibility for those values. For those inside you need more info to better discriminate. This is modus tollens/refutational logic…

☞⁕ **Frank Harrell** Even for outside the interval that is indirect.

☞ **Sander Greenland** If by indirect you mean learning from error, yes. Bayesian dogmatism is remarkable in the way it precludes cornerstones of learning — luckily good analysts aren’t so rigid even if they claim to be rigid adherents of some “philosophy.”

**Sander Greenland** …and as such perfectly valid and recognized in common sense as the process of elimination (eg see Sherlock Holmes). You need to (re)read Popper: Conjectures & Refutations etc. Even DeFinetti recognized this is how stat science (as opposed to Bayes dogma) works:…

☞ **Sander Greenland** As Senn (in Dicing with Death, p. 89) quoted DeFinetti vol. 1 p. 141: “The acquisition of a further piece of information…acts always and only in the way we have just described: suppressing the alternatives that turn out to be no longer possible.” Suppressing = falsifying… *[Senn: Stephen Senn]*

☞ **Sander Greenland** …and the amount of falsifying information against a model in a given dimension (eg departure from “no effect”, linearity, etc) can be measured by the surprisal/Shannon S-value s = -log(p) for that departure. Bayes addresses a different query: How should I bet under the model?…

☞ **Sander Greenland** …smart applied Bayesians like Box etc (as opposed to Bayes religionists) have long recognized that before betting, check whether your betting model looks incompatible with data; eg don’t use a fairness model to bet on dice if the FREQUENCIES from rolling them refute fairness!

☝ **Deborah G. Mayo** indirect if looking for degrees of belief, direct if looking for falsifications & corroborations (severely tested claims). How do you falsify w/a Bayesian belief? Only if you ADD a falsification rule, but still have to show its properties.

☞⁕ **Frank Harrell** Luckily I don’t need to falsify at all.

☞ **Sander Greenland** Unfortunately, that means you won’t be able to learn at all (as DeFinetti knew — see my earlier repost of Senn’s quote from him).

☞ **Frank Harrell** Call it what you will but I’m focusing on decision making, and for 9/10 of what I learn falsification plays no role.

☞ **Deborah G. Mayo** Do you decide on the probability? If you have a full Bayesian account, lay it out, or if you have, please give a link to an article or blog. Or if it matches an existing account (Lindley?), say whose. Twitter responses does not an account make.

☞ **Frank Harrell** The decision maker, with her (often secret) utility function in hand, decides upon how high a probability is needed for action.

☞⁕ **Sander Greenland** True if the decision maker is that sophisticated and is basing the decision only on your analysis and data (was never the case in my experiences). But you didn’t answer Mayo’s question, eg where did your prior and your data model come from? Why should we trust your prior? etc.

☞ **Sander Greenland** For an example where a stock informative prior (with a data-killing null spike as pushed by some Bayesians) contradicts real background info yet ends up driving published conclusions see *[**Vitamin E, Mortality, and the Bayesian Gloss**]* How will decision makers know if they got bad Bayes from bad priors?

☞ **Deborah G. Mayo** Did you reanalyze this w/o the spike to the null?

☞ **Sander Greenland** Yes and found the evidence favored harm from high-dose synthetic E-supplements, unsurprising but troweled over by the spiked prior. There was no room allotted for details in my comment; the editor (Goodman, then a spiker) discouraged my presenting my analysis as a separate note.

☝☝ **Deborah G. Mayo** I think that says it all right there. I take it you have an exhaustive set of hyps, thys, models for all science perhaps w/ a low prior catchall? (anything left out has prob 0 forever).

☝ **Deborah G. Mayo** So you use an agent’s utilities as a way to get the dichotomy that stat sig at a given small p-value is to provide, only now we don’t know if it’s the utility of the consumer? the researcher? the drug co, etc. or if it’s fairly applied across diff cases w/ diff posteriors/utiles.

☞⁕ **Sander Greenland** If there are multiple stakeholders with conflicting utilities, there is no universally optimal decision rule. Hence even sophisticated decisions will come down to judicial imperatives directing who loses and who wins under prevailing civil laws (which are not mathematical!)…

☞ **Frank Harrell** That sort of imperative is often there, independent of the statistical method used.

☝ **Sander Greenland** Stats only deals with highly idealized decision models that may be fine in Frank’s apps (he hasn’t given examples) but break down fast in controversial areas. I find med-stat controversies often trace back more to value/utility differences than to math or methodologic disputes…

☞ **Sander Greenland** E.g., contrast “industrial” views of science publishing as selective release and dissemination of information as determined by those in editorial power (esp. those who demonize false positives with no valid balancing against false negatives, costs, or graduated assessments)…

☞ **Sander Greenland** …against (perhaps idealistic and antiquated?) preference for publishing as an outcome-nonselective (even though critically assessed) source of valid public information. This split about censoring of information flow is one source of the Ioannidis vs. AGM Nature-Comment dispute.

☝☝☝☝☝ **Deborah G. Mayo** What about the SEV interpretation? e.g., for a CI for mean mu: data indicate mu > CI-lower because if mu ≤ CI-lower, w/ high prob we would have gotten a larger diff than we did. Likewise for grounding mu < CI-upper. See SIST or a bit in a recent post *[CI: confidence interval; ‘recent post’: **If you like Neyman’s confidence intervals then you like N-P tests**]*

☞ **Sander Greenland** My sense is that if there really is a basis for taking only one direction of variation (df) as nonrandom, the best a frequentist can offer is a P-value function or compatibility (“confidence”) distribution. From that every other usual statistic (Ps, SEVs, CIs) can be read off. *[SEV: severity function, see **Error Statistics** by Mayo & Spanos]*

☞ **Deborah G. Mayo** Maybe you should call it the level of incompatibility.

☞ **Sander Greenland** “It”? That would be 1-P on a 0/1 scale and s = -log_b(p) on an additive-information scale.
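Greenland’s P-value function, his S-values, and Mayo’s SEV can indeed all be read off one curve. Here is a minimal sketch of my own for a normal mean with known standard error; the observed numbers and the helper names (`Phi`, `p_value`, `s_value`, `severity_gt`) are illustrative, not anything from the thread:

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

xbar, se = 2.0, 1.0  # observed mean and its standard error (made-up numbers)

def p_value(mu0):
    """Two-sided p for H0: mu = mu0 -- the 'compatibility' of mu0 with the data."""
    return 2 * (1 - Phi(abs(xbar - mu0) / se))

def s_value(mu0, base=2):
    """Greenland-style surprisal: bits of information against mu = mu0."""
    return -math.log(p_value(mu0), base)

def severity_gt(mu1):
    """Mayo-style severity for the claim mu > mu1: the probability of a
    result less in accord with the claim (a smaller mean), were mu = mu1."""
    return Phi((xbar - mu1) / se)

# The 95% CI limits are exactly the mu0 with p_value(mu0) = 0.05 ...
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
print(round(p_value(lo), 3), round(p_value(hi), 3))  # both about 0.05
# ... and severity for 'mu > lower limit' is the one-sided coverage, ~0.975
print(round(severity_gt(lo), 3))
print(round(s_value(0.0), 2))  # bits of information against mu = 0
```

One function of μ₀, three readouts: the CI is its 0.05 level set, the S-value is its log transform, and SEV is its one-sided counterpart.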

☝☝⁕ **Sander Greenland** BTW “this is what makes people become” anything is no argument for becoming anything. We can talk of why people become suicide bombers etc but the choice may only reflect their cognitive blindnesses. Example: your claim about (1) and (2) is false, a form of Bayesian blindness…

☞ **Sander Greenland** …check out this discussion and others of confirmation vs falsification/refutation at Gelman’s blog (link: https://statmodeling.stat.columbia.edu/2014/09/05/confirmationist-falsificationist-paradigms-science/). For me the take-away msg is: conf and falsif are essential complementary modes of reasoning, with falsification the less intuitive one…

☞ **Sander Greenland** …some twist difficulty into a reason to reject falsification, as if human limits should be the arbiter of what is valid. That’s perverse: Newtonian theory was nonintuitive compared to impetus theory, and relativity even less intuitive. Nature doesn’t care about our limitations.

☝ **Frank Harrell** I guess no one else is as smart as you. Most of us have great difficulties understanding confidence intervals.

☞ **Sander Greenland** No one is smart. We feel smart and it’s a cognitive illusion. But with good instruction and practice we can become skilled in using hammers, saws, skis (I took that up at 50), algebra, calculus, frequentist tools, Bayesian tools, etc. Don’t let your age stop your learning.

### Conversation #3: priors

**David Colquhoun** Talking of priors, I’ve just been re-reading Casella and Berger (1987). They seem to think it’s OK to choose a prior that makes FPR similar to p value. They have a lot more faith in ability of experimenters to guess the result than I have. Most bright ideas turn out to be wrong *[FPR: false positive rate. Casella and Berger (1987): full citation below]*

☞ **Sander Greenland** Indeed, but that never stopped them from being adopted and then defended to the death, like significance testing. For that matter, neither has being a dumb idea staunched adoption, eg adopting P<0.05/P>0.05 as a universal criterion for reporting presence/absence of association.

☞ **David Colquhoun** Sure. It’s interesting to me since Casella & R. Berger gets cited as though it showed errors in J. Berger & Sellke. I don’t think it does.

⁕ **Sander Greenland** I never saw it cited that way. In 3 1987 pubs CB made the good points that 1-sided P-values bound particular posterior probabilities, and that prior spikes as in BS & BD 1987 make no sense. I saw nothing in CB about an FPR as such. What did they say that you think is a problem?

☞⁕ **David Colquhoun** Every time I cite Berger Sellke, Mayo says “what about CB”. I was thinking especially of Casella, G. and Berger, R. L. [Testing Precise Hypotheses]: Comment. Statistical Science 2(3), 344–347. 1987.

☞⁕ **Deborah G. Mayo** They make it clear the spike is for the Bayesian to avoid an embarrassingly small drop when Ho is rejected! (Also in SIST). But as Berger made clear for the 100th time a couple of weeks ago, it’s wrong w/ 1-sided tests, or w/o special high belief in Ho. *[**The “P-values overstate the evidence against the null” fallacy**]*

☞ **Stephen John Senn** 2/2 also for drug developers comparing a new treatment to a standard it is clearly inappropriate since to give a high probability of their relative potency being 1 is bonkers. In many such cases you can regard yourself as carrying out two one sided tests.

☞ **David Colquhoun** Drug development is different because it’s a multi-stage process. Judging by the number of drugs that fail to live up to promise, surely no prior P(H1) greater than 0.5 can be justified for initial screen at least?

☞ **Stephen John Senn** Start thinking about this David! The standard drug has already proved itself efficacious. It’s on the market. The new one has had an experimental dose fixed based on animal studies. Why should it be evens they are equipotent?

☞ **David Colquhoun** I am talking about the case where there is no good prior information. As I said, drug development is multi-stage so there MAY be such info.

☞ **David Colquhoun** Because that is, only too often, what happens (apart from “me too” drugs).

☞ **Stephen John Senn** So take an example I worked on. Formoterol in asthma: we could have taken 6, 12 or 24 micro g into development as the standard dose or indeed at an earlier stage some other dose altogether. How can the probability that each of these is equipotent to salbutamol be 50%?

☞ **David Colquhoun** 1/2 formoterol and salbutamol are closely related, so you have prior info

☝ **Sander Greenland** Beware of “obvious intuitive appeal”: pi0=0.5 is a huge bolus of fake information that creates irreversible null bias. In most med-res problems what is obvious from the context is that pi0 should be zero because many paths for association are left open by the actual information.

☞ **David Colquhoun** I think you are much more optimistic about “positive” outcomes than I. Most bright ideas turn out to be wrong.

☝☝ **Sander Greenland** Insufficient. See Casella Berger, Reconciling Bayesian and Frequentist Evidence in 1-sided Testing w/disc JASA 1987;82:106–135, and the discussion of it by Greenland-Poole/Gelman in Epidemiology 2013;24:62–78. Then how to look at P-evidence without Bayes: *[**Valid P-Values Behave Exactly as They Should**]*

☞ **David Colquhoun** Yes, but by “reconciling” they mean choosing a prior which makes p value close to posterior. That’s cheating IMO

☞ **Sander Greenland** No it is perfectly valid. A truly objective subjective (subjunctive, per Senn) Bayesian analysis can and should point out what priors give P-values Bayesian meaning. A scientist can then check whether any prior she would accept as reasonable is in such a class. Same for CIs etc.

☝☝☝ **David Colquhoun** For FPR read P(H0 | data). I simply can’t agree that spike priors make no sense. The LR is much simpler (no Bayes) and it’s ~3 when p=0.05 – surely that is sufficient to say that p values exaggerate evidence against H0 *[LR: likelihood ratio]*

☞ **Sander Greenland** Sufficient, unnecessary, & misleading. You don’t need any Bayes to gauge the evidence. Eg look at -log2(p): it’s 4.3 at p=0.05, hardly more surprising under H than 4 heads in a row for a fair coin. Credible priors need a data basis; show me real RCT data justifying a point spike.

☞ **Frank Harrell** Spike priors do not make sense from what we know about beliefs, science, and what works most of the time.

*[…]*
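The specific numbers traded in this exchange are easy to reproduce. In this sketch of mine (not from the thread), the -e·p·ln(p) bound is the Sellke, Bayarri and Berger (2001) calibration, and the spike-prior posterior is the Berger & Sellke style calculation the thread keeps circling:

```python
import math

p = 0.05
z = 1.96  # two-sided z-score corresponding to p = 0.05

# Greenland's S-value: surprisal in bits, comparable to coin flips.
s = -math.log2(p)
print(f"s-value: {s:.2f} bits")  # between 4 heads (s=4) and 5 heads (s=5)

# Sellke-Bayarri-Berger bound: for p < 1/e the Bayes factor in favour of H0
# is at least -e*p*ln(p), so the evidence against H0 is at most its inverse.
bf_bound = -math.e * p * math.log(p)
print(f"evidence against H0 at most ~{1 / bf_bound:.1f} to 1")

# Minimum Bayes factor for a point null against any alternative, exp(-z^2/2),
# combined with the 50:50 spike prior debated above: prior odds are 1, so
# posterior odds equal the Bayes factor.
min_bf = math.exp(-z * z / 2)
post_H0 = min_bf / (1 + min_bf)
print(f"P(H0 | p=0.05) >= {post_H0:.3f} with a 0.5 spike")
```

On these calibrations a just-significant p = 0.05 is weak evidence (a few bits, odds of a few to one, a posterior null probability above 0.12 even under the most anti-null alternative), which is the arithmetic behind both Colquhoun’s “LR ~3” and Greenland’s “4.3 bits”.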

☞ **David Colquhoun** but what about “most bright ideas turn out to be wrong”?

☞ **Sander Greenland** OK, here’s one of many big bright ideas in stats that turned out to be wrong beyond wrong: Using point-mass spikes in priors for complex physiologic responses to drugs so poorly understood that we were willing to invest in a human RCT to get inevitably noisy information.

☞ **David Colquhoun** why is that wrong? Most drug candidates fail

☞ **Sander Greenland** I’ve explained that in previous responses; so has Stephen. We have NO prior information the drug never affects anyone at all in any way; a .5 spike means instead half our prior information says exactly that! That’s false information (fake news). And it’s totally unnecessary!…

☞ **Sander Greenland** …you have completely confused your real prior information (“most candidates fail”) with a spike. Your real information is counting noisy failures to pass screening studies; its valid representation requires accounting for that with distributions for errors and uncertainties…

☞⁕ **David Colquhoun** It’s fascinating that you seem to be far more optimistic about drug discovery than I, given that I spent much of my life teaching pharmacology :-)

☞ **David Colquhoun** I’d guess that even among approved drugs around half have marginal benefits. Luckily the other half are quite useful.

☞ **Sander Greenland** Then use 0.5 to mix your marginal-benefit prior with your quite-useful prior when judging approved drugs. Better yet write down a prior including them both since there is no sharp boundary, accounting for the censoring of follow-up for unapproved drugs. In other words, get real.

☞ **David Colquhoun** In any case, a spike prior is not essential for my conclusions. Sellke et al and V. Johnson use priors that are designed to maximise the odds in favour of rejection of the null. It turns out that, nevertheless, they reject the null hypothesis much less often than the p-value

☞ **Sander Greenland** That’s a totally fallacious comment. P-values don’t reject anything, they just sit there. The rejection is by a culture that took poorly thought-out remarks by Fisher and enshrined 0.05 as a magic universal decision rule. That a mass psychosis gripped science is hard to face…

☞ **David Colquhoun** of course. That’s why I said “p values as commonly misinterpreted”!!

☝ **Sander Greenland** I am not a bit optimistic: Your comment only shows you haven’t understood a word I’ve said. You can be as pessimistic given your real experience, which is like mine profoundly noisy; just use a small variance for skepticism. A spike is a god-given certainty hence unacceptable.

☞ **David Colquhoun** If you use a prior with a small variance, centred on zero, you’ll get much the same answer as me (Berger & Delampady). *[Berger & Delampady: **Testing precise hypotheses**]*

☞ **Sander Greenland** Depends on what you mean by “small”: If small enough that the spike result is an approximation, the result is tautological. If we actually go through past studies to build a well-informed prior, we may find that isn’t the case at all. You have to do that to see and if you do…

☞ **Sander Greenland** …then why bother with the approximation? which as BD themselves admit in sec. 2.3 breaks down with large amounts of real prior data. Your claims all seem to pivot on vast experience over many decades and trials, so is exactly the case where spikes break down in honest theory. *[BD: Berger & Delampady, see above]*

☞ **David Colquhoun** On the contrary, I’m talking only about the case where there is NO hard prior info - the vast majority of published p values

☞ **Sander Greenland** If there is NO hard prior info WTF are you doing by inserting the hardest of hard info: a point mass!! This isn’t rational frequentism, Bayes or science; like 0.05 it’s a pernicious meme that has frozen some minds into pre-1940 statistics; a case study in arrested development.

☞ **David Colquhoun** There is no point in continuing this if you are going to get abusive. The statistics wars are getting out of hand. I’m asking questions because I’m interested in your views.

☞ **Sander Greenland** Sorry but indeed there is no point in continuing this if you don’t read what I actually wrote and aren’t open to the idea you & cites have huge holes in your reasoning re Bayes, spikes & P. Those holes delineate serious gaps in applied stat theory, holes many experts don’t teach.

☞ **David Colquhoun** I’m entirely open to other approaches - I look at 3 of them in *[**The False Positive Risk**]*. Do you think you could call the spike maximally skeptical?

☞ **Sander Greenland** I for one don’t think so since maximally skeptical is 100% spike at the null and that indeed looks like what we see in many topics eg among homeopathy skeptics. I do think there are cases like that where a big spike can be argued from physics, but 100% violates Cromwell’s rule…

☞ **Sander Greenland** …which is to say it forbids learning, period. Furthermore it arguably violates an opposing view’s right to a fair trial. But this is too deep for Twitter and I think irrelevant for everyday med research or anywhere spikes have no such physical basis (unlike in Jeffreys apps).

☞ **David Colquhoun** Obviously I don’t favour 100% spikes (except for homeopathy, where it’s true by definition !)

☞ **Sander Greenland** Sorry to be pedantic but no it isn’t true by definition even if the truth is that homeopathy is pure noise/nonsense, in which case the null is simply true (as a contrasting example 2=2 is true by definition of ‘=’).

☞ **David Colquhoun** the definition I had in mind is that the dilutions are such that the “real” pill and the placebo are identical, so FPR is 100%

☞ **Sander Greenland** OK, in my logician’s lingo that’s not “by definition,” that’s instead a deduction from a physical model for the dilutions, one in which the chance of even a single molecule of the “medicine” remaining in the homeopathic pill is slight.

☞ **David Colquhoun** It still really puzzles me why, if you say a spike 50% prior on the null gives the null an “unfair” advantage, then why do UMPBT give similar answers? I’d be grateful if you could explain that. *[UMPBT: uniformly most powerful Bayesian test]*

☞ **Stephen John Senn** Maybe I misunderstand. If so correct me. I was under the impression that the UMBPT *[sic]* plays with the alternative but that the formulation of the null is a spike. In that case, citing the UMBPT *[sic]* as the answer to those who object to a spike in the first place is irrelevant.

☞ **David Colquhoun** aha thanks for that. I re-read the 2 Johnson 2013 papers, and I think you are right. So it comes down to whether you think that a spike null prior = 0.5 gives an “unfair” advantage to the null. I think that under many circumstances it doesn’t. Most bright ideas turn out to be wrong *[Johnson 2013: **Uniformly most powerful Bayesian test**]*

☞ **Stephen John Senn** 1/2 Yes. IMO main difference between Johnson and you is that he adopts an alternative set that somehow gives rejection the best chance given the null assumption. From memory, you have the same null assumption but use a somewhat different alternative based on power considerations.

☞ **Stephen John Senn** 2/2 So basically, you are doing something that’s very similar, so it all boils down to the reasonableness of the lump prior. For much of what I do it would not be appropriate.

*[etc.]*
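For readers following the numbers rather than the rhetoric, the two quantities being fought over in the thread can be made concrete. Below is a minimal sketch in toy Python; the function names and the illustrative choices (α = 0.05, 80% power, a unit-variance normal “slab”) are my assumptions, not any participant’s actual analysis. It computes a Colquhoun-style false positive risk and the posterior probability of the null under the contested spike-and-slab prior.

```python
# Illustrative sketch only: toy numbers, not anyone's published analysis.
from math import exp, pi, sqrt

def fpr(alpha, power, prior_real):
    """Colquhoun-style false positive risk: among all 'significant'
    results, the long-run fraction that arise from true nulls."""
    false_pos = alpha * (1 - prior_real)   # true nulls crossing alpha
    true_pos = power * prior_real          # real effects detected
    return false_pos / (false_pos + true_pos)

def normal_pdf(x, mu=0.0, sd=1.0):
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def spike_posterior_null(z, prior_null=0.5, slab_sd=1.0):
    """Posterior P(null | z) with a point mass ('spike') at zero plus a
    N(0, slab_sd^2) slab on the effect; the observed z is N(effect, 1),
    so under the slab z is marginally N(0, 1 + slab_sd^2)."""
    like_null = normal_pdf(z)
    like_alt = normal_pdf(z, sd=sqrt(1.0 + slab_sd ** 2))
    num = prior_null * like_null
    return num / (num + (1 - prior_null) * like_alt)

print(f"FPR, 50-50 prior: {fpr(0.05, 0.8, 0.5):.3f}")        # ~0.059
print(f"FPR, 10% prior:   {fpr(0.05, 0.8, 0.1):.3f}")        # ~0.360
print(f"P(null | z=1.96): {spike_posterior_null(1.96):.3f}")  # ~0.351
```

With these toy numbers, a just-significant result under a 50-50 prior carries an FPR near 6%, but a 10% prior pushes it past a third; and a z of 1.96 (two-sided p ≈ 0.05) still leaves roughly 35% posterior probability on the null under the spike, which is the flavour of result Colquhoun draws from Berger & Delampady and Sellke et al. above.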

### Alternatives to Twitter

Some emails regarding the Nature paper referred to above surfaced on Prof. Andrew Gelman’s popular blog. They are more readable than Tweets, obviously, and useful for understanding the motivations of the authors. However, we seldom have access to such correspondence and, as I have discovered, it is difficult to rummage through Tweets.

Dr. Stan Young (an industry expert) commented on Gelman’s blog:

“I think a more open-ended and longer dialog would be more useful with at least some attention to willful and intentional misuse of statistics.”

I concur! Clearly Twitter is not apt for conversations of this kind: it is too impulsive, leading to outbursts of frustration. It is doubtful, too, whether people will notice a fleeting exchange before it is swept away from their feed or erased without detection. It is unlikely that opinion-leaders notice them; here is Dr. Milton Packer: “I do not have a Twitter account. I have never Tweeted.” It is possible to extract Tweets from an individual’s thread and roll them out on a blog page using the Thread Reader app, but no such thing exists for conversations, because they are too intricate, as indicated by the need for symbols (☝☞⁕): a single comment may have multiple responses, and even multiple responses from the same author.

Regarding the reproducibility crisis and p-values: statisticians have been reproducing the same arguments against p-values for decades. It has become tiresome. The American Statistician published 43 papers by “forward-looking statisticians” in its recent issue, but that is too much soapboxing and not enough dialogue. I said previously on Twitter that “the comments on a [Deborah Mayo/Stephen Senn] blog post are more interesting than anything you’ll read in a high impact factor stats jrnl”. As with the Tweets above, what we want is earnest and direct conversation. The ‘reproducibility crisis’ may be perpetuated by a number of factors, including idiosyncrasies and a lack of transparency. A consultant biostatistician can attest that different companies have different statistical habits, as do different disease areas; ideas migrate, but slowly it seems.

Pubpeer attracted some comments on the Nature paper, although it is not a platform for initiating conversations. The data methods message board created by Frank Harrell is an especially good place for discussing the practice of statistics. However, I would like to suggest a new platform specifically designed for long-form public dialogue: Letter. Conversations on Letter are mediated by written public letters between two participants. The style is reminiscent of Sam Harris’ published email correspondence with Noam Chomsky and Ezra Klein, and I would like to encourage more scientists to engage in this mode of public discourse.

Otherwise, feel free to share your thoughts on this debate in the comments section below, and get in touch if there is an error in the transcription. I am grateful that the authors of the Tweets left them intact and did not erase anything, which some people are inclined to do. @PaulBrownPhD

### Appendix: timeline of recent events

- *2015*: new call to retire p-values when only 1/3–1/2 of 100 studies replicated in the psychology reproducibility project
- *2016*: the American Statistical Association responds with a statement; Greenland et al. publish guidelines for p-values in Eur J Epidemiol
- *2017*: Assem et al.: “Outcomes reported in … abstract 3 x the odds of being significant … compared to text” in the CCTC journal
- *2018*: Nature, Benjamin et al.: redefine statistical significance as 0.005; Ioannidis responds and rejects the idea in JAMA
- *2019*: 800 signatories (aimed at journal editors); Ioannidis responds in Nature; Twitter blows up; over the weekend the Nature paper garners 18k tweets; the current issue of The American Statistician has >40 papers on the topic

Prof. Mayo’s book was recently published with the subtitle “How to Get Beyond the Statistics Wars”.