The most illegal blog post in the history of peer review

Jordan Anaya
Feb 12, 2018

Update 20180214: The author is apparently still working on responding to the reviewer comments.

I’m about to do something extremely naughty. Shit, I might get in trouble. But peer review is currently stuck in the dark ages, where previously open peer review reports can get redacted.

To help save peer review someone has to push the envelope, and ASAPbio meetings just aren’t cutting it.

I already publicly posted my review for this paper, which proved to be too controversial for bioRxiv to handle.

Well, that was nothing. Because now I’m going to post the reviews of the other reviewers.

This is a bit more of a complicated topic. Before, there was no doubt I had the right to post my own review. I wrote it, and I never told anyone I would keep it confidential. Sure, journals probably want you to believe you don’t own your reviews. But you do, so fuck them.

Now the question is whether it is okay to post the reviews of the other reviewers. This could very well be considered private correspondence that shouldn’t be made public. And for journals that don’t have open peer review reports this is a strong argument: these reviews were never going to see the light of day unless one of the reviewers decided to post their review somewhere (to the chagrin of the journal).

But PeerJ has open peer review, so if the paper had been accepted these reviews would have been publicly posted. Just think about that. If the editor had read our reviews and decided to accept the paper, these reviews would already be public. If a journal has open peer review, why should the status of the manuscript determine whether the reviews get posted?

An even thornier question is whether I should name the reviewers. I think it’s fine, because as I said, these reviews easily could have been public already in an alternate universe. The only issue is that attaching someone’s name to text without a public source would require me to provide evidence that yes, they said these things. And it doesn’t seem worth the trouble to take a bunch of screenshots of the emails.

***Unfortunately it is easy to identify one of the reviewers based on his review, so I guess I should hedge by saying these may or may not be actual reviews***

Reviewer 1

Basic reporting

The grammar is mostly satisfactory, although there are exceptions, “Deeper looks at trials…”, “in in”, “Factors effecting”, “is application”, “If this is, than the precision”.

The spelling is poor, but could be corrected easily enough, for instance “catgorize”, “calcuating”, “calcuation”, “don”, “nessesarily”, “nessearily”, “nessecarily”, “nessearily”, “nessesitated”, “nessessity”, “distribtuions”, “futher”, “jounrnal”, “dicuss”, “aluded”, “multplication”, “presense”, “possiblity”, “retracte”, “erroroneous”, “stategy”.

The references and background are OK, as are the figures, tables and shared data.

The article is logical and understandable.

Experimental design

The design is OK, although improvements could be made:
1) There is no need to assume five variables per trial when simulating the effect of correlation. The author could write code that allows simulation of correlation using the number of variables per trial and their mean (sd) and group numbers reported in the appendices I provided i.e. repeat analyses for 5087 trials a number of times.

2) There is no need to assume a correlation of 0.33 between variables. The author could write code that repeats the simulation above with various correlation coefficients. Importantly one of these simulations should use a correlation coefficient of zero, as well as — for instance — 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45 and 0.50.

3) The behaviour of Stouffer’s method does not have to be indirectly inferred from the distribution of p values for the first variable reported for each study. The author could explore the behaviour of Stouffer’s method with simulated data. However, I don’t think the author should waste his time doing so because Stouffer’s method does generate a uniform distribution of aggregated p values, irrespective of the number of variables, as long as the simulated p values are themselves uniformly distributed. As the author speculates, the distribution of the 5087 trial p values isn’t uniform because the individual variable p values aren’t distributed randomly between trials, whether due to correlation, stratification or some other mechanism. I would suggest that the author runs a simulation that uses the individual p values for 29,789 variables, and allocates them randomly between the 5087 trials. I would anticipate that this would result in a uniform distribution of 5087 ‘trial’ p values.

4) The author’s categorisation of retracted studies isn’t correct:
i) Sato is not an anaesthetist, he investigated bone mass and associated topics;

ii) Studies retracted for anaesthetic authors haven’t been uniformly for data fabrication: most of the trial retraction notices for trials by Boldt have been justified due to “lack of ethical approval”, not because of fabrication. Not all studies by Fujii have been retracted and they have not all been recommended to be retracted.

If the author wants to be correct he will have to review each retraction notice published in the journal retracting the trial. The author’s results might be affected by correcting the categorisation of these retracted articles. The author might have to run more than one analysis because it is unclear how one might categorise some studies: for instance, although most of Boldt’s trials haven’t been retracted for data fabrication, a recent review by a German university suggested that his trials should be treated as if they did contain fabricated data.
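To make points 1) to 3) above concrete, here is a minimal sketch of the kind of simulation the reviewer describes. It is an illustration only, not the manuscript's code or Carlisle's, and it rests on assumptions that are not in the review: every trial has the same number of variables (5), the variables are equally correlated normal z-scores, p-values are one-sided, and 5000 simulated trials stand in for the real 5087. The function names (`simulate_trial_p`, `stouffer`, `run_sweep`) are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, for p <-> z conversions
EPS = 1e-12                # keeps p strictly inside (0, 1) for inv_cdf


def simulate_trial_p(n_vars, rho, rng):
    """One-sided p-values for n_vars baseline variables with pairwise correlation rho."""
    shared = rng.gauss(0.0, 1.0)  # component common to every variable in the trial
    z_scores = [
        rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
        for _ in range(n_vars)
    ]
    return [STD_NORMAL.cdf(z) for z in z_scores]


def stouffer(p_values):
    """Stouffer's method: turn each p into a z-score, sum, rescale, convert back."""
    z_scores = [STD_NORMAL.inv_cdf(min(max(p, EPS), 1.0 - EPS)) for p in p_values]
    combined_z = sum(z_scores) / len(z_scores) ** 0.5
    return STD_NORMAL.cdf(combined_z)


def fraction_below(p_values, alpha=0.05):
    """Share of p-values under alpha; about alpha if they are uniform."""
    return sum(p < alpha for p in p_values) / len(p_values)


def run_sweep(n_trials=5000, n_vars=5, rhos=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), seed=1):
    """Map each rho to (fraction as simulated, fraction after reshuffling)."""
    rng = random.Random(seed)
    results = {}
    for rho in rhos:
        trials = [simulate_trial_p(n_vars, rho, rng) for _ in range(n_trials)]
        as_simulated = fraction_below([stouffer(t) for t in trials])
        # Reallocate every p-value to a random trial, breaking within-trial correlation.
        pooled = [p for t in trials for p in t]
        rng.shuffle(pooled)
        reshuffled = [pooled[i:i + n_vars] for i in range(0, len(pooled), n_vars)]
        after_shuffle = fraction_below([stouffer(t) for t in reshuffled])
        results[rho] = (as_simulated, after_shuffle)
    return results


if __name__ == "__main__":
    for rho, (before, after) in run_sweep().items():
        print(f"rho={rho:.1f}  simulated: {before:.3f}  reshuffled: {after:.3f}")
```

Under these assumptions the share of trial p-values below 0.05 should sit near 0.05 when rho is zero and grow as rho increases, while the reshuffled share should stay near 0.05 throughout. That is the pattern the reviewer anticipates in point 3. Real baseline tables would differ in the number of variables per trial and in how they are correlated, so this only shows the mechanism.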

Validity of the findings

The findings reflect the extent of the experimental design. If the author adds the analyses I’ve suggested the findings will be added to, although I expect the current findings will be unaffected.

With the exceptions I’ve mentioned above the findings are sound.

The author slightly misrepresents my work and the consequences of my work:
1) No trial has been retracted because of an unlikely distribution of baseline data i.e. ‘a small CM p-value’. Trials have only been retracted after investigations by the employing authorities of the authors of those trials. My papers analysing the data in Fujii’s trials and Saitoh’s trials have only been published after those authors had already been found guilty of research misconduct by their employers. Therefore the phrase “contain retractions that occurred directly as a result of being identified by the CM” is incorrect. The retractions either occurred indirectly as a result of the CM (because my work stimulated a more extensive review of the work of Fujii and Saitoh) or decisions to retract trials (or ask for their retraction) had already been made by the time I published.
2) I have not myself suggested that a small CM p-value is due to fabrication, and I hope that the author would make this explicit. I have listed benign reasons for small CM p-values, including problems with the assumptions of the model, such as correlation, which I explored in the appendices of my 2015 paper (using a range of correlation coefficients, as I’ve suggested above). I have also not suggested that journals should screen submissions and reject on the basis of a small CM, although I do support the practice of asking for raw data and I support the author writing “If journals choose to implement screening procedures prior to publication, authors of papers would be able to respond to the results of the CM by reanalyzing the raw data, thus potentially giving more definitive answers as to the validity of certain assumptions made by the CM”, such as correlation (or rather, its absence).

Comments for the Author

Thanks for your work. I’ve made some comments that I hope you will find useful and that I hope will improve your paper, at least a little.

I would appreciate it if you might make it explicit that I was aware of the various flaws in the model and why ‘benign’ mechanisms might account for most (or perhaps all) of the disparity between the observed and expected distribution of trial p values. As you know, I discussed a range of such mechanisms in my paper. I did consider using an arbitrary correlation coefficient across all 5087 trials to make the results of the model better match the observed distribution, but I think that there is value in assuming no correlation as the default because the analysis of results assumes no correlation with baseline variables (or only if stratification has been implemented), or even between results (if more than one is reported).

The ‘threshold’ p-values I mentioned in my paper were for illustration, illustrating as you point out the problems of any threshold in identifying one category of paper from another.

I don’t know whether all my comments will be passed on to you in full. Please feel free to email me if you’d like to discuss them further.

Reviewer 3

Basic reporting

The primary problem this paper suffers from is its structure. There is some redundancy in the language that should be removed. My full recommendations to change this for resubmission are outlined below.

Experimental design

The explanation of the procedure performed is not clear from the text as it stands.

Validity of the findings

Difficult to determine at present.

Comments for the Author

This paper is a critical re-analysis of the method presented by Carlisle (2017) and earlier to identify discrepancies in clinical trial reporting (amongst other things). This is an under-served area of research in general, and critical evaluation of the tools used, as well as their development and extension, is absolutely necessary at present. My opinion is that this manuscript should be retained, and improved as much as possible.

That being said, the primary problem it suffers from is language, structure and presentation. At present, even to someone familiar with the technique and the area — and this is a very small population of researchers — the manuscript is opaque. For meta-scientific methods to be more widely received, it is absolutely critical that anything written about them is accessible to people with normal statistical competency and no prior familiarity with the techniques and area.

Thus, the paper: it is written more as a stream of consciousness than a structured manuscript. I have several points to make regarding this. My ideal scenario is that having remedied this, we can move onto a closer evaluation of the method, and then to publication.

Consider the below:

“Methods and Results. To assess the properties of the method proposed in (Carlisle, 2017), I carry out both theoretical and empirical evaluations of the method. Simulations suggest that the method is sensitive to assumptions that could reasonably be violated in real randomized controlled trials. This suggests that deviation for expectation under this method can not be used to measure the extent of fraud or error within the literature, and raises questions about the utlity [sp] of the method for propsective [sp] screening. Empirically analysis of the results of the method on a large set of randomized trials suggests that important assumptions may plausibly be violated within this sample. Using retraction as a proxy for fraud or serious error, I show that the method faces serious challenges in terms of precision and sensitivity for the purposes of screening, and that the performance of the method as a screening tool may vary across journals and classes of retractions.”

This is an overview rather than a summary. If someone not intimately familiar with the issues read it, for them it would be close to totally uninformative. It does not state (a) what the simulations are, (b) what the empirical analysis is, and (c) what the loss of precision and sensitivity is.

* This has only the briefest of descriptions of what the method at the center of the paper is. I would strongly recommend including a worked (hypothetical) example — a typical table of values, the theory and calculations made to them, the aggregation of p-values, the necessary transformation (Stouffer’s Z). Launch an explanation for the problems outlined in terms of this example.

* The method in question was the genesis of investigations resulting in approx. 250 retractions. Since it has previously identified serious malfeasance successfully, surely this effectiveness deserves some mention.

* The point about the ‘danger’ of false positives is, to my mind, overstated. Techniques for screening summary statistics are not framed in terms of absolute methods of defining errors, they are presented as *probabilistic* techniques. How they are perceived when pointed out is the problem. Thus, false positives cannot practically be eliminated. They should, of course, be minimised.
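For a sense of what the worked example requested in the first point above might look like, here is a minimal sketch of the aggregation step only. The table is made up, the variable names are hypothetical, and it assumes a convention in which each baseline p-value is one-sided and a value near 1 means the groups are more similar than chance would suggest. It does not show how the per-variable p-values are obtained from reported means and standard deviations.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, for p <-> z conversions

# Hypothetical baseline table: one p-value per variable for the between-group comparison.
baseline_p = {"age": 0.91, "weight": 0.85, "height": 0.97, "heart rate": 0.88}


def stouffer(p_values):
    """Return (combined Z, combined p) using Stouffer's Z transformation."""
    z_scores = [STD_NORMAL.inv_cdf(p) for p in p_values]
    combined_z = sum(z_scores) / len(z_scores) ** 0.5
    return combined_z, STD_NORMAL.cdf(combined_z)


combined_z, trial_p = stouffer(list(baseline_p.values()))
for name, p in baseline_p.items():
    print(f"{name:>10}: p = {p:.2f}, z = {STD_NORMAL.inv_cdf(p):+.2f}")
print(f"combined Z = {combined_z:.2f}, trial p = {trial_p:.4f}")
```

None of the four made-up p-values is remarkable on its own, but because they all lean the same way the combined trial p-value ends up more extreme than any single input. That is both the appeal of the aggregation and the reason its independence assumption matters.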

This needs to clearly outline, in a numbered or lettered list, the analyses performed. Almost no readers will retrieve the necessary code and run it as described, so the description needs to be much more general / explicit.

This starts with a description of the method which should be mentioned in the introduction, and then proceeds to a combination of introducing both methodological details and results.

This mostly follows, with some redundancy.

Addressing all of the above should permit a readable manuscript, from where we can proceed to technical questions. I strongly recommend the author consults an external editor or co-author in order to present the information above more coherently.

Other points:
* spelling and capitalisation

268 the CM don not
277 based on it retraction status.
349 without nessecarily being obvious
379 First, It may be that this heterogeneity arises from heterogeniety
399 performance of the CM is overestimated in in the journals
406 to identify some retracte trials … erroroneous
419 nessesitated

This is not a comprehensive list. The whole manuscript needs to be carefully proofread. Ensure all references are included.