I knew bioRxiv wouldn’t post my peer review, and that’s the problem

Update 20180214: The author is apparently still working on responding to the reviewer comments.

I recently noticed a tweet from one of the directors of bioRxiv:

The tweet was good timing, because I had recently performed a review for a journal and wanted to make the review public. Where better to post a review than on the only version of the paper that is public? So I went ahead and attempted to post my review:

However, my comment was not approved and I received this email:

This wasn’t surprising — bioRxiv likes to remain uncontroversial and invents policies whenever they need to. In this case bioRxiv isn’t the problem. The problem is that it is controversial to publicly post a peer review in the first place.

Now, before people accuse me of just being a rabble-rouser and doing a disservice to the author, I’d like to point out that I could have posted my review as soon as I wrote it. I didn’t think it was fair to post criticisms of a paper before the author had a chance to respond, so I waited for the peer review process to play out and only posted my review once it was clear the author wasn’t going to respond to it.

In the current peer-review system, if a paper gets rejected, the reviews are lost to the publication gods, and when the paper is submitted to a new journal an entirely new set of reviewers has to review it. This wasteful cycle continues until eventually the paper is published somewhere…anywhere.

The Brian Wansink saga has caused a lot of hand-wringing about the quality of peer review. How could THESE papers pass peer review?

The answer is maybe they didn’t, at least not initially. Maybe they were rejected by several journals before eventually getting published. Maybe diligent reviewers noticed all the calculation errors, discrepancies in the methods, data reuse and text-recycling, or just plain uselessness of the research. But these damning reviews didn’t follow the papers to their next destinations.

It is as if a public figure, such as the president, wanted to claim to be in perfect health. All they would have to do is visit doctors until they found one who said what they wanted. And because of HIPAA the other doctors wouldn’t be able to come out and reveal their findings.

That’s where we’re at in the peer-review ecosystem — it is basically a HIPAA violation to post your diagnosis of a paper.

Now, it is a little more complicated than that, because bioRxiv will allow you to post your review as long as you don’t mention what journal you reviewed for. But why this distinction?

I kind of understand why the author (and maybe even the journal) wouldn't want people to know where the paper was sent. No one wants sloppy seconds, so if it is made public knowledge that your paper was rejected by a journal like PLOS One, then it will have a permanent stink attached to it.

But this is a vestige of the prestige system of journals. The fact that it is more embarrassing for people to know what journal your paper was rejected from than the content of the criticisms should tell us something.

Although PeerJ has open peer review, it may be against their policy to post reviews for papers which did not complete the review process. But for a revolution you need revolutionaries.

I reviewed this paper, and this is my review:

Basic reporting

There are numerous typos and grammatical issues throughout the article, but for the most part the text was understandable. I noted some of these typos in the attached PDF.
 
The author is clearly well-versed in the literature of anomaly detection, and has a good grasp of the issues the field faces. I realize that this article was likely written before these references were available, but the following two references are glaring omissions:
 
“Errors and Integrity in Seeking and Reporting Apparent Research Misconduct”, DOI: 10.1097/ALN.0000000000001875
 
“An Appraisal of the Carlisle-Stouffer-Fisher Method for Assessing Study Data Integrity and Fraud”, DOI: 10.1213/ANE.0000000000002415
 
I’d like to see at least a mention of these two other critiques of the Carlisle Method.
 
The article structure is acceptable, and the code was made publicly available.
 
The results were relevant to the research question.

Experimental design

I have absolutely no idea what the scope of PeerJ is (I’ve submitted an article that was turned away for not being within the “scope”), so I’m not sure if this article falls within the scope. However, I do note that PeerJ has published at least one article that I believe is similar: https://peerj.com/articles/3068/
 
This article seeks to shed light on how results from the Carlisle Method should be interpreted by both performing simulations and analyzing actual outcomes. Similar to other researchers, the author identifies various pitfalls of the method.
 
The simulations performed by the author were not performed to a high technical standard (see “general comments for the author” below for more details).
 
Although the code was provided, the methods were not described sufficiently in the text or figures (see “general comments for the author” below for more details).

Validity of the findings

I generally agree with the author’s assertion that the Carlisle method should be used with caution, which is echoed by various other commenters.
 
The code provided by the author contains at least one blatant error (in function “generate_lnorm_summary_stats” rnorm should be rlnorm).
 
See “general comments for the author” below for more information.

Comments for the Author

Upon opening this manuscript I was shocked to see the number of spelling errors. Sending out a manuscript for review in this state does not reflect well on the editorial staff at PeerJ. I am curious to know at what stage in the publication process the author and PeerJ were planning on fixing the manuscript. To see some of what I’m talking about you can view the attached PDF.
 
Although typos are inconsequential as long as they don’t affect the meaning of the text, in my experience sloppiness in one aspect of a paper suggests that there could be sloppiness in other aspects of the paper. As a result, the typos put me on high alert for other problems, and upon looking at the code provided by the author I immediately spotted a blatant error.
 
The function “generate_null_summary_stats” is identical to “generate_lnorm_summary_stats”. It appears the author meant to copy and paste his code and change “rnorm” to “rlnorm”, but forgot to do so. As a result, the statement in the text: “First, I simulate data from log-normal distributions instead of normal” and accompanying Figure 1 A-C are in error.
 
Unfortunately I don’t program in R, so it is time-consuming for me to check every line of the author’s code, and as a result I was unable to check the rest of the author’s code for errors.
 
However, at this point I was concerned enough by the author’s work that I decided to run some of my own analyses (in Python) and come to my own conclusions about Carlisle’s Method (CM).
 
As someone who performs a lot of anomaly detection with various techniques, I am very familiar with how rounding can introduce uncertainty, and was happy to see the author investigate the impact of rounding on CM. However, when I went to reproduce the author’s Figure 1D, I was unable to do so using the author’s stated parameters of 5000 trials, 20 draws, mean=100, sd=10, 10% extreme rounding.
 
Upon close inspection of the author’s code I discovered the discrepancy and found it very troubling.
 For 90% of the trials the author did this:
 generate_null_summary_stats, 20, 100, 10, 2
 
 and for 10% of the trials the author did this:
 generate_null_summary_stats, 20, 10, 1, 0
 
In addition to changing the rounding from 2 decimals to 0 (“extreme rounding”) the author changed the mean to 10 and the sd to 1. With this information I was able to reproduce the author’s Figure 1D.
 
In the text the author did not mention that he changed his simulation parameters to obtain his results, which is odd considering that the author admitted in the text that he “chose the parameters for the simulation intentionally to make the point that correlation can result in a similar p-value distribution” when discussing Figure 3. This suggests the author may not realize that the impact of rounding on CM is dependent upon the simulation parameters.
 
Whether or not rounding is “extreme” is a function of all of the simulation parameters (mean, sd, sample size, rounding decimals). For example, with a small enough mean and sd, or large enough sample size, even 2 digit rounding is “extreme” and affects CM.
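
One rough way to put a number on how "extreme" a given rounding is would be to compare the rounding step with the standard error of the mean. The sketch below is only an illustration under that assumption: the function name and the index itself are hypothetical, not something taken from the manuscript or from Carlisle, and the index deliberately leaves out the mean, so it is at best a partial picture.

```python
import math


def rounding_severity(sd: float, n: int, decimals: int) -> float:
    """Ratio of the rounding step to the standard error of the mean.

    Hypothetical index for illustration only. Values far below 1 suggest
    that rounding barely moves the reported mean relative to its sampling
    noise; values near or above 1 suggest that rounded group means will
    frequently coincide.
    """
    step = 10.0 ** (-decimals)           # e.g. 0.01 for 2 decimals, 1 for 0 decimals
    standard_error = sd / math.sqrt(n)   # sampling noise of a group mean
    return step / standard_error


# The two parameter sets quoted from the author's code:
print(rounding_severity(sd=10, n=20, decimals=2))
print(rounding_severity(sd=1, n=20, decimals=0))

# A small sd with a large sample size makes even 2-decimal rounding coarse:
print(rounding_severity(sd=0.05, n=2000, decimals=2))
```

Under this index the first parameter set sits far below 1 and the other two sit well above it, which matches the point that the number of decimals alone does not determine whether rounding is "extreme".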
 
As the author showed, it is not hard to find a situation where rounding causes the p-value distribution to deviate from a flat distribution. This highlights the importance of stating one’s simulation parameters, and the need to use parameters which are as close as possible to those that will be encountered in clinical trials.
 
For example, if every single clinical trial is in the “sweet spot” of parameters where rounding doesn’t matter, then a simulation showing problems with rounding is pointless and the equivalent of fearmongering.
 
As a result, when I performed my simulations I used the exact same parameters found in the trials, i.e. I bootstrapped the means, sds, sample sizes, and rounding decimals. The sample sizes in the trials range from a handful of participants to thousands. The means and sds range from tenths to hundreds of thousands.
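
A minimal sketch of what bootstrapping trial parameters could look like is given below. It is not the code used for this review. It assumes whole rows (mean, sd, sample size, decimals) are resampled together and that values are drawn from a normal distribution, and the rows shown are made-up placeholders rather than entries from the Carlisle data set.

```python
import random

# Placeholder rows, NOT real trial data: (mean, sd, sample size, decimals).
example_rows = [
    (0.4, 0.1, 12, 1),
    (71.3, 9.8, 240, 1),
    (152000.0, 31000.0, 35, 0),
    (5.62, 1.07, 1800, 2),
]


def bootstrap_parameters(rows, n_sims, seed=0):
    """Resample whole rows with replacement.

    Keeping each row intact means every simulated trial uses a mean, sd,
    sample size, and rounding that actually occurred together.
    """
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n_sims)]


def simulate_reported_mean(mean, sd, n, decimals, rng):
    """Draw n normal values and report the mean rounded as a paper would."""
    draws = [rng.gauss(mean, sd) for _ in range(n)]
    return round(sum(draws) / n, decimals)


rng = random.Random(1)
for mean, sd, n, decimals in bootstrap_parameters(example_rows, n_sims=5):
    reported = simulate_reported_mean(mean, sd, n, decimals, rng)
    print(mean, sd, n, decimals, reported)
```

Resampling the four parameters independently instead of row by row would be a different design choice, and it could create combinations that never appear in a real trial.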
 
I’m not sure if the author, Carlisle himself, or other commenters realize the impact rounding can have under some of the more extreme parameters. Indeed, in my simulations I found that even 2 digit rounding can be “extreme”.
 
As a result, I think the author should dedicate much more time in the manuscript to the impact of rounding instead of a portion of one figure. There is some set in the four dimensional space of means, sds, sample sizes, and rounding decimals where rounding is insignificant. It will likely be difficult to describe this set perfectly, but it should be easy to come up with general guidelines.
 
The effect of rounding can be easily seen when calculating ANOVAs. When the means are equal, an ANOVA immediately spits out a p-value of 1.0 regardless of the other parameters. Rounding results in means appearing to be equal when they aren’t, and this is reflected in a pileup of p-values at 1.0 (which is demonstrated in the author’s Figure 1D).
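
The pileup can be illustrated without computing any p-values by counting how often the rounded group means coincide. The sketch below assumes two arms with 20 normal draws each (the manuscript's exact design may differ) and uses the two parameter sets quoted above. The function name is hypothetical.

```python
import random


def tie_fraction(mean, sd, n, decimals, trials=5000, seed=0):
    """Fraction of simulated two-arm trials whose rounded group means are equal.

    Equal reported means give a between-group sum of squares of zero, so an
    ANOVA computed from the summary statistics returns p = 1.0.
    """
    rng = random.Random(seed)
    ties = 0
    for _ in range(trials):
        mean_a = sum(rng.gauss(mean, sd) for _ in range(n)) / n
        mean_b = sum(rng.gauss(mean, sd) for _ in range(n)) / n
        if round(mean_a, decimals) == round(mean_b, decimals):
            ties += 1
    return ties / trials


# mean=100, sd=10, rounded to 2 decimals: ties should be rare
print(tie_fraction(100, 10, 20, 2))

# mean=10, sd=1, rounded to 0 decimals: most trials should tie
print(tie_fraction(10, 1, 20, 0))
```

With the second parameter set the standard error of each mean is about 0.22, so both arms round to 10 in the large majority of trials, and every such tie lands at p = 1.0.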
 
The question of course is whether a different method for calculating p-values can get around this problem. I’m not familiar with Monte Carlo simulations, and didn’t get a chance to implement my own, but the author’s Figure 1E is exactly what I would expect. “Extreme” rounding causes the simulator to get the same result every time, so it just throws up its hands and spits out a p-value of 0.5. When Carlisle and the author choose whichever of the ANOVA and Monte Carlo p-values is closest to 0.5, they are effectively choosing the Monte Carlo over the ANOVA.
 
I would argue that both the Monte Carlo and ANOVA are equally useless in these situations, and these p-values should simply be ignored and not included in the aggregate p-value calculated with Stouffer’s method. It is not hard to imagine that there are fraudulent studies in the Carlisle data set that are not detected as such because some of the measures had parameters that produced uninformative p-values, and these useless p-values masked the trend that would have been observed by looking only at the informative p-values (which would have correctly identified the researchers as frauds).
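
The masking effect can be sketched with a plain, equally weighted, one-sided version of Stouffer's method. This is an assumption for illustration: Carlisle's exact implementation may differ, and the p-values below are invented, not taken from any trial.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer(p_values):
    """Combine one-sided p-values with Stouffer's method (equal weights).

    Each p-value must lie strictly between 0 and 1; a p-value of exactly
    1.0 would have to be dropped or clamped first, because the inverse
    normal CDF is undefined at 0.
    """
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - _STD_NORMAL.cdf(combined_z)


informative = [0.01, 0.02, 0.03]   # invented values showing a consistent trend
uninformative = [0.5] * 6          # invented values from measures swamped by rounding

print(stouffer(informative))
print(stouffer(informative + uninformative))
```

Each p-value of 0.5 contributes a z-score of zero but still increases the divisor, so the combined result moves toward 0.5 and the trend in the informative measures is diluted.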
 
In fact, the rounding situation may be even worse than I currently suspect. In my simulations I did not take into account that a mean such as 1, if reported to 0 decimals, could have been any value between 0.5 and 1.5. I’m not sure if Carlisle’s Monte Carlo method takes this additional uncertainty into account.
 
Another hidden source of uncertainty occurs when standard errors are converted to standard deviations. In the data set Carlisle performed the conversions for us, and they are easy to spot since they are reported to high precision. The uncertainty I just mentioned with rounding, i.e. 1 could be anything between 0.5 and 1.5, is expanded when this range gets multiplied by the root of the sample size.
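
As a small worked example of this compounding, the sketch below computes the interval of true values consistent with a rounded report and then scales a standard error interval by the root of the sample size (SD = SE × √n). The function names and the sample size of 100 are illustrative assumptions.

```python
import math


def rounding_interval(reported: float, decimals: int) -> tuple[float, float]:
    """Range of unrounded values that would be reported as `reported`."""
    half_step = 0.5 * 10.0 ** (-decimals)
    return reported - half_step, reported + half_step


def sd_interval_from_se(reported_se: float, decimals: int, n: int) -> tuple[float, float]:
    """Range of standard deviations implied by a rounded standard error."""
    low, high = rounding_interval(reported_se, decimals)
    root_n = math.sqrt(n)
    return low * root_n, high * root_n


print(rounding_interval(1, 0))         # (0.5, 1.5)
print(sd_interval_from_se(1, 0, 100))  # (5.0, 15.0)
```

A standard error reported as 1 with 100 participants therefore corresponds to a standard deviation anywhere between 5 and 15, even though the converted value of 10 looks precise.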
 
I would like to see rounding uncertainty, including that compounded by SE to SD conversion, incorporated in the simulations by the author.
 
I realize that simulations performed by other commenters, or performed by Carlisle himself, were likely similar to the author’s simulations: they repeatedly took the same number of draws from the same box without thinking about what parameters they are likely to encounter in clinical trials, or considering the uncertainty caused by rounding, or the effects rounding has on ANOVAs and Monte Carlos. But I would argue that these other simulations have all been meaningless.
 
If we are going to compare the results of simulations to the results observed with clinical trial data to determine which clinical trials are unusual and warrant investigating, we need faithful simulations. This is not optional. Given the importance of the simulation parameters, it is also imperative that the author clearly describe his simulations and clearly note the parameters in each figure.
 
In addition to clearly describing the default simulations, I would appreciate it if the author provided more details in the manuscript about the simulations with correlation and stratification. I’m not familiar with how clinical trials are designed, so I don’t know if there are well-established stratification procedures. I’d be interested to hear how the author decided on his stratification, and if he based his decisions on any relevant literature. I would also appreciate it if the author provided more details in the text about the Monte Carlo method he adapted from Carlisle.
 
I enjoyed thinking about the author’s presented precision-recall curves. Unfortunately I’m not familiar with these types of plots and didn’t get a chance to reproduce them myself, but I found it strange that the author relied upon Carlisle’s calculated p-values when the author had calculated his own p-values in previous analyses. No offense to Carlisle, but I would not rely upon his or anyone else’s calculations. For example, in NEJM trial number 452, Carlisle has a fatal error in the “measure” column. In addition, Carlisle in his text states that he used a t-test, ANOVA, and Monte Carlo to calculate p-values. This is simply a silly statement, given that an ANOVA with two samples is equivalent to a t-test, and a t-test can’t be performed when there are more than 2 groups. I find Carlisle’s description of his calculations lacking and wouldn’t rely upon his results if I were the author.
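
The equivalence between a two-group ANOVA and a t-test can be checked directly from summary statistics: the F statistic equals the square of the pooled-variance (Student's) t statistic. The sketch below uses arbitrary illustrative numbers, and the equivalence does not hold for Welch's unequal-variance t-test.

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t statistic (equal variances) from summary statistics."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def one_way_f(means, sds, ns):
    """One-way ANOVA F statistic from per-group means, sds, and sizes."""
    total_n = sum(ns)
    k = len(means)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s**2 for n, s in zip(ns, sds))
    return (ss_between / (k - 1)) / (ss_within / (total_n - k))


# Arbitrary illustrative summary statistics for two groups
t = pooled_t(101.2, 9.5, 20, 98.7, 10.4, 25)
f = one_way_f([101.2, 98.7], [9.5, 10.4], [20, 25])

print(t**2, f)  # the two values agree up to floating-point error
```

Because the two statistics carry the same information when there are two groups, listing the t-test and the ANOVA as separate methods adds nothing for two-arm trials.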
 
I found the implications of the precision-recall curves for the different journals very provocative. The idea that CM may perform better in some journals is conveniently consistent with my pet hypothesis that the trial parameters can severely impact the performance of CM. For example, perhaps in some fields the measures have means and sds such that rounding problems are not manifested.
 
I’d be interested in seeing a more detailed comparison of the different journals. For example, do certain journals tend to have larger sample sizes? If the author is interested in testing my rounding hypothesis he could perhaps separate the trials into different bins based on how “extreme” the rounding is (which again is a function of all 4 sample parameters) and then perform his precision-recall analysis.
 
 __________________
 
This paper appears as a preprint, so I already consider it “published”. Whether or not PeerJ wants to “publish” it is equivalent to asking if they want to have their brand associated with the work. Personally, if I were to have my brand associated with this work I would need the following to happen:
 
1. Fix up all the typos and grammar issues in the manuscript.
 
2. Perform more advanced simulations using the actual trial parameters, ideally also taking into account rounding uncertainty. These simulations can be used to see which p-values are uninformative, and it would be interesting to see how removing them affects the precision-recall curves.
 
3. Clearly describe the simulations performed, both in the text and the figures.
 
4. Perform a more thorough analysis of the differences observed in the different journals.
 
If the author is interested in dedicating more time and energy to this manuscript I believe it could provide important insights and user guidelines for CM which are conspicuously absent in other published critiques of CM.