Andromeda and ‘appalling science’: a response to Hardwicke and Ioannidis

David Spiegelhalter
Published in WintonCentre
Jan 26, 2020 · 9 min read


In March this year I was one of the 850+ signatories supporting the article in Nature calling for the idea of ‘statistical significance’ to be retired. Soon after publication, all these signatories received an invitation from Tom Hardwicke and John Ioannidis (henceforth referred to as H&I) to take part in a survey on our “experiences as a signatory of the petition to Abandon (Retire) Statistical Significance”. I chose not to complete it, partly due to the researchers having already expressed a clear view on the issue.

The results of the survey were published in October with accompanying commentaries. There was only a 29% response rate, suggesting many other signatories may have felt similarly to me.

I was interested to see the results for question 9, which is given below along with the number of responses to each of the options.

Q9. In a recent pre‐registered randomized trial comparing two septic shock treatments with formal power calculations for determining the necessary sample size to evaluate the primary outcome of mortality, the mortality was 43% for the older treatment and 35% for the newer one and the P‐value was .06. What do you think about the statement ‘the new treatment clearly reduced mortality’?

  • A. I strongly agree. The treatment clearly reduced mortality. Claiming that there is no difference between the two treatments is appalling science n=29 (12%)
  • B. I mostly agree n=55 (22%)
  • C. I tend to agree n=68 (27%)
  • D. I tend to disagree n=45 (18%)
  • E. I mostly disagree n=25 (10%)
  • F. I strongly disagree. Claiming that there is a difference between the two treatments is appalling science n=8 (3%)
  • Not answered n=18 (7%)

It is worth thinking about how you might have answered this. Note the most common answer is C: ‘I tend to agree that the treatment clearly reduced mortality’.

This example, and the reference to ‘appalling science’, rang a bell with me.

The ANDROMEDA study

The question is presumably based on the ANDROMEDA study comparing a novel peripheral perfusion–targeted resuscitation strategy with a standard lactate level–targeted strategy in the (historically intractable) high-mortality context of septic shock, which was published in JAMA in February 2019.

The protocol for this study specified a target enrolment of 420 patients, giving 90% power to detect a reduction in 28-day mortality from 45% to 30%, at a significance level of 5% (ie observing P < 0.05 in a two-sided test, or equivalently a 95% confidence interval for the hazard ratio excluding 1). In terms of hazard ratios, this corresponds to an alternative hypothesis of log (1-0.30) / log (1-0.45) = 0.60**. The eventual 28-day mortality was 74/212 patients (34.9%) in the peripheral perfusion group and 92/212 patients (43.4%) in the lactate group, and a formal survival analysis gave an estimated hazard ratio of 0.75 [95%CI, 0.55 to 1.02]; 2-sided P-value = .06. Thus the results, although apparently favourable to the new treatment, did not quite achieve the traditional threshold of P < 0.05 to declare ‘statistical significance’.
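For anyone who wants to check these numbers, the sketch below (my own illustration in Python, using only the published summary figures quoted above rather than the trial's patient-level data) reproduces the alternative-hypothesis hazard ratio and an approximate standard error for the observed log hazard ratio.

```python
import numpy as np

# Alternative hypothesis: reduce 28-day mortality from 45% to 30%.
# Under proportional hazards, HR = log S1(t) / log S2(t),
# i.e. log(1 - 0.30) / log(1 - 0.45).
hr_alt = np.log(1 - 0.30) / np.log(1 - 0.45)
print(f"Alternative-hypothesis hazard ratio: {hr_alt:.2f}")  # ~0.60

# Observed result: HR = 0.75 with 95% CI 0.55 to 1.02.
# The width of the CI gives an approximate standard error on the log scale.
log_hr = np.log(0.75)
se_log_hr = (np.log(1.02) - np.log(0.55)) / (2 * 1.96)
print(f"log HR = {log_hr:.3f}, approximate SE = {se_log_hr:.3f}")
```

These two quantities are all that is needed for the approximate Bayesian calculations later in this post.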

The paper’s discussion section appropriately says ‘a peripheral perfusion–targeted resuscitation strategy did not result in a significantly lower 28-day mortality when compared with a lactate level–targeted strategy’, but the conclusion in the abstract of the paper went much further:

Among patients with septic shock, a resuscitation strategy targeting normalization of capillary refill time, compared with a strategy targeting serum lactate levels, did not reduce all-cause 28-day mortality. [My emphasis]

This firm conclusion of ‘no effect’ found its way into the media, with headlines such as Peripheral Perfusion Fails to Cut Septic Shock Mortality and Peripheral perfusion-targeted resuscitation does not lower mortality in septic shock patients.

To me, this confident claim of ‘no effect’, when the results pointed towards benefit, perfectly exemplified the gross misuse of statistical significance, where an arbitrary threshold of 0.05 is used to dichotomise results into either a ‘discovery’ or ‘no effect’. So I was prompted to tweet -

The tweet has had 290 retweets and a long list of comments. I also used the term ‘appalling science’ when talking to a journalist from the Financial Times. So I suspected that my comments might have influenced the wording of Q9.

I remain highly critical of the conclusion expressed in JAMA that there was ‘no effect’ based on a two-sided P-value of 0.06, which incidentally corresponds to a one-sided P-value of 0.03 for the alternative hypothesis that the new treatment is beneficial. But note that I did not at any time suggest that there was clear benefit.

What might I have answered for Q9?

Had I filled in the survey, I would have had great difficulty in answering this question. I would strongly disagree with Option A, that ‘the new treatment clearly reduced mortality’, as it is only a single study and I don’t have enough background knowledge to draw such a strong conclusion. But, as my tweet shows, I would certainly agree with the second part of option A, that ‘claiming there is no difference between the two treatments is appalling science’. This might point me towards Option F, but I don’t think I would go as far as ‘Claiming that there is a difference between the two treatments is appalling science’. The question seems to be posed around two extreme alternatives, that the treatment either ‘clearly reduced mortality’ or had no effect. Both of these conclusions seem unjustified, and so I would have had to join the 18% that did not answer.

I was then somewhat surprised to find the following personal reference in H&I’s report of the survey:

Similarly, Question 9 probed a statement by an extremely influential statistician (also a signatory), David Spiegelhalter. In a Financial Times interview, Spiegelhalter, a towering giant in Bayesian statistics, described as ‘appalling science’ the conclusion of no difference in a large randomized trial of sepsis treatment that was published in JAMA. The trial involved formal power calculations and mortality was 43% and 35%, respectively, in the two arms with P = .06. About one‐third of signatories agreed with the interpretation in JAMA. If one has high scepticism and a low prior for effectiveness of sepsis treatments (many have been tested, but almost all have failed, despite frequent earlier promises) than the conclusion of no difference would be appropriate.

To be honest I find it difficult to understand what this commentary is getting at, but, after a flattering build-up, H&I appear to believe that my conclusions are misguided, and even that I was suggesting there was clear benefit. My response is:

  • H&I say that ‘about one-third of signatories agreed with the interpretation in JAMA’. First, this should refer to ‘survey respondents’ rather than ‘signatories’, especially given the 29% response rate. But my major concern is that responses D to F appear to have been interpreted as supporting JAMA’s conclusion that there is ‘no effect’, which seems entirely unjustified: just because one tends to disagree with the conclusion that there is a clear effect does not mean one believes there is no effect!
  • As described above, the question focuses on two extreme conclusions, ‘clear benefit’ and ‘no effect’, neither of which seem tenable. H&I appear to be suggesting that ‘no difference’ is the appropriate conclusion. This seems to embody the sort of dichotomy that the Nature letter was trying to highlight and condemn.
  • H&I also appear to suggest a Bayesian perspective should conclude that there is no difference, which I would dispute using the reasoning below.

A Bayesian perspective

So what Bayesian view might be reasonable? First, a rather naive uniform prior on the log-hazard ratio would lead to a 97% posterior probability that the new treatment was beneficial, which might even be interpreted as ‘clear benefit’. But, as H&I correctly point out, there is a long history of failed innovations in this context, and a substantial degree of scepticism is appropriate.
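To see where that 97% comes from, here is my own rough normal approximation on the log hazard ratio scale (an illustration based on the published summary figures, not a reanalysis of the patient-level data):

```python
import numpy as np
from scipy.stats import norm

# Approximate likelihood for the log hazard ratio, reconstructed from
# the published HR of 0.75 with 95% CI 0.55 to 1.02.
log_hr = np.log(0.75)
se_log_hr = (np.log(1.02) - np.log(0.55)) / (2 * 1.96)

# With a (locally) uniform prior on the log hazard ratio, the posterior is
# just the normalised likelihood, so the probability that the new treatment
# is beneficial is P(log HR < 0).
p_benefit_flat = norm.cdf(0, loc=log_hr, scale=se_log_hr)
print(f"P(benefit | flat prior) = {p_benefit_flat:.2f}")  # ~0.97
```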

One way of incorporating such scepticism is to use a prior with a ‘spike’ on ‘no effect’, and a distribution over potential effects were this ‘null hypothesis’ false, which leads naturally to a Bayes factor approach. But since every treatment must have some effect on average, if only minor and of no clinical importance, my preference is to use a continuous prior, centred on ‘no effect’, and expressing scepticism about substantial treatment effects. As we suggested back in 1994, an indicative sceptical prior might, for example, express the judgement that the alternative hypothesis chosen for the study was grossly optimistic: this is shown in Figure 1 below, where the prior probability that the treatment has a hazard ratio of 0.60 or lower is set to 1 in 500**. This might be interpreted as the naïve posterior distribution (with a locally uniform prior) arising from a fictional initial study with around 126 events, 63 in each treatment arm, and hence showing no effect at all. This sceptical prior exerts a severe handicap on any subsequent data.

Figure 1: Bayesian analysis of ANDROMEDA mortality data, using a sceptical prior that assigns a 0.2% probability to the effect of the new treatment being more extreme than the alternative hypothesis of a hazard ratio of 0.60. The likelihood summarises the information from the trial data: with a locally uniform prior distribution, this would lead to a 97% probability of treatment benefit. Combining the sceptical prior and the likelihood using Bayes’ theorem gives the final posterior distribution, which expresses a 92% probability that the treatment reduced average mortality, and a 0.2% probability that this effect is more extreme than the alternative hypothesis of 0.60.

This sceptical prior would lead to a posterior probability of 92% that the treatment reduced average mortality, and 0.2% that this effect is more extreme than the alternative hypothesis of 0.60 (unchanged from the prior)**.
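The whole conjugate-normal calculation is simple enough to reproduce. The sketch below (again my own illustration, working entirely with normal approximations to the published summary figures) derives the sceptical prior from the 1-in-500 condition, checks its interpretation as a fictional null study of roughly 126 events, and combines it with the likelihood.

```python
import numpy as np
from scipy.stats import norm

# Likelihood: normal approximation to the log hazard ratio from the trial,
# HR = 0.75 with 95% CI 0.55 to 1.02.
log_hr = np.log(0.75)
se_lik = (np.log(1.02) - np.log(0.55)) / (2 * 1.96)

# Sceptical prior: centred on 'no effect' (log HR = 0), with only a 1-in-500
# prior probability that the true hazard ratio is at or beyond the
# alternative hypothesis of about 0.60.
hr_alt = np.log(1 - 0.30) / np.log(1 - 0.45)
sd_prior = np.log(hr_alt) / norm.ppf(0.002)
print(f"Prior SD on the log HR scale: {sd_prior:.3f}")

# Interpretation as a fictional initial study showing no effect at all:
# var(log HR) is roughly 4 / (number of events) with equal allocation, so the
# implied number of events is about 4 / sd_prior^2 (~125, close to the ~126
# quoted above).
print(f"Implied events in fictional study: {4 / sd_prior**2:.0f}")

# Conjugate-normal update: precisions add, and the posterior mean is the
# precision-weighted average (the prior mean of zero contributes nothing).
prec_post = 1 / sd_prior**2 + 1 / se_lik**2
mean_post = (log_hr / se_lik**2) / prec_post
sd_post = np.sqrt(1 / prec_post)

# Posterior probability of any benefit, and of an effect at least as
# extreme as the alternative hypothesis.
p_benefit = norm.cdf(0, loc=mean_post, scale=sd_post)               # ~0.92
p_extreme = norm.cdf(np.log(hr_alt), loc=mean_post, scale=sd_post)  # well under 1%
print(f"P(benefit)      = {p_benefit:.2f}")
print(f"P(HR <= {hr_alt:.2f}) = {p_extreme:.4f}")
```

This reproduces, to rounding, the 92% probability of benefit quoted above, while the probability of an effect as extreme as the alternative hypothesis stays at a fraction of a percent, essentially unchanged from the prior.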

Clearly deeper research and background knowledge could lead to a more reasonable sceptical prior, but based on this indicative analysis I would finally conclude:

  • there is some evidence for benefit, but more research is needed to replicate the study and allow a combined analysis.
  • it is appalling science to claim there is ‘no effect’.
  • I strongly suspect that, had the data been trivially different and P turned out to be 0.04, the conclusions and coverage would have been entirely different, with claims of a ’discovery’. This would have been as bad as the claims of ‘no effect’. Note that a Bayesian analysis would have been minimally affected.
  • the wording of H&I’s Q9, and their interpretation of the results, do not seem to add clarity to the issues, and instead appear to reflect precisely the dichotomisation that the Nature paper was trying to counter.

*****************

Before publication of this blog, I asked H&I for any comments, and John kindly replied as follows:

Thank you for the excellent blog on this contentious issue. Our survey was an effort to try to see what our colleagues who signed the Nature petition think, and how much consensus or diversity of opinion and interpretation exists. We tried to mostly show the results of the survey without much commenting from us, since people may interpret them in different ways. However, it is probably fair to say that there is substantial diversity of opinion for this particular question (and for many other questions on the survey) among those people who responded. We need to recognize that people who sign a petition often have very different things in mind and they may agree more on some core issues but disagree on many others. The petition silences this residual, often prominent, disagreement.

Of course we never argued that the responses to the survey would be the “truth”. The signatories who replied are a non-random sample of the signatories, and the signatories are a non-random <0.01% sample of the 35 million people who author scientific papers.

My personal opinion for what little (infinitesimally little indeed) is worth is that JAMA handled the study pretty appropriately. I would have voted for option E probably. Treatments for septic shock are a classic example where we have run zillions of trials and we know that almost nothing works, despite early hype for several treatments of this sort. Your prior is not sufficiently skeptical for this kind of situation. A tall spike on the null would probably be appropriate. Now, you argue that “But since every treatment must have some effect on average, if only minor and of no clinical importance”, but for this type of treatments, this statement is spurious. I feel far more comfortable to just consider that the vast majority of these treatments have no effect. If so, your “skeptical prior distribution” is not really skeptical for these circumstances, it allows for a substantial proportion of treatments to achieve 10% or more relative risk reductions, while based on what we have seen I would have thought this to be extremely unlikely.

Thank you for the opportunity to comment.

Cheers!

John

** Added 29th January. This alternative hypothesis comes from the fact that, if S1(t) and H1(t) are the cumulative survival and cumulative hazard under treatment 1, then S1(t) = exp(-H1(t)), and so under a proportional hazards model, the hazard ratio HR = H1(t)/H2(t) = log S1(t) / log S2(t). In my initial posting I mistakenly used HR = log 0.45 / log 0.30 = 0.66 instead of the correct log 0.7/ log 0.55 = 0.60, and used a sceptical prior with 1% probability of being more extreme than the incorrect alternative hypothesis 0.66. I have revised the graphic, but kept the same sceptical prior as previously and hence the Bayesian posterior is unchanged.
