Guest post: Matthew Stephens on Biostatistics pre-review and reproducibility

Editor’s note: This is a post by Matthew Stephens, who is the first person to have a paper pre-reviewed by Biostatistics for publication.

Today saw the publication in Biostatistics of my paper, “False Discovery Rates: A new deal”. I’m obviously very excited to see this work appear in a journal. However, I’m particularly pleased to see it appear in Biostatistics, because of the leading role that the journal and its editors have taken in promoting “reproducibility”, an issue that has been occupying a fair amount of my time recently.

The paper’s path to Biostatistics was an interesting and, at least at the
current time, unusual one. Once I had a complete draft that I was happy with, I posted it on the preprint server biorxiv.org. However, I was undecided about which journal to submit it to. I had considered Biostatistics, mostly because of its focus on reproducibility, but the paper was longer than they allow, and I knew from experience that they were strict on this, since they had previously refused to review a different paper I had submitted there based on length alone. So I decided to take a little time to think about it and consider my options.

This was on January 29th, 2016.

One week later (Feb 6th) I received an unexpected, indeed quite astonishing, email from the editors of Biostatistics. It said that they had seen my preprint, thought it could be suitable for Biostatistics, had had it reviewed by two referees (reports attached), and that they would like to offer conditional acceptance for publication in Biostatistics if I could satisfactorily address the referee concerns. Of course, once I had had time to recover from the shock, I gratefully accepted their offer!

My experience could be held up as one illustration of the benefits of using
preprint servers. And indeed when people ask me about preprints, and whether I worry about journals not wanting to publish material that has appeared as a preprint, I do tell them this anecdote. However, it does not yet seem like a generally viable strategy to simply deposit your work as preprints and wait for the journal editors to come calling. Indeed such a day may never come. But at the same time it is clear that publishing models are changing, and the kinds of experience I just described may at least become more common, if not the norm. One could certainly see benefits for both journals and authors. Journals that want to publish good papers could do worse than scan the preprint servers for interesting-looking material (much the way many journal editors, at least in the biological sciences, currently scout out conferences looking for interesting work). And although the author reaction might depend on the reputation of the journal, it is certainly easy to see the attraction of this system to authors.

I mentioned above my interest in reproducibility, and I wanted to take this
opportunity to say a few words about what motivates this. For me, the importance of reproducibility is not simply, or even perhaps primarily, its role in checking the integrity of a piece of work. Rather, I see reproducing a piece of work as the first step towards building on it. That is, reproducible research is “extensible research”. If I give a new graduate student a statistical methods paper, either from my lab or elsewhere, and she has ideas for interesting next steps, think how much more easily and quickly she will be able to test those ideas if she is first able to reproduce the key results from the paper. If, when a student graduates from my lab, all that they leave is a written thesis and published journal papers, then it is very challenging for us or anyone else to take the next step. If you want people to build on your work, I think it is worth investing some effort in making that easier.

At the same time, it must be acknowledged that the effort required to do
extensible and reproducible research is non-trivial. Indeed, it requires not
just effort, but also skills and knowledge that are not (yet) a key part of most
graduate curricula. During my work on the “New Deal” paper, I tried to learn skills to help with this endeavor, including version control (which I had used before, but not in a very systematic way) and literate programming tools such as knitr/R Markdown, which I found a great boon.

Even then, despite my efforts to become more systematic and organized, I found that some law of thermodynamics tended to get the better of me, and my file directories eventually degenerated into a state of which I was not proud, with figures and code spread out over several different subdirectories. In my experience from working with others, this is almost inevitable without (or often even with!) careful training. Inspired by the work of John Blischak, a graduate student at the University of Chicago, I reorganized my work into a state that is a little less anarchic; you can see it here (https://github.com/stephenslab/ash). My ultimate goal is that all work from my lab should be published along with a repository along these lines. But from personal experience I know that this is not easy. And if you try to reproduce my work I can almost guarantee you will encounter issues on which you will want or need my input. Please do feel free to ask for help by opening an issue on the appropriate GitHub page, so that your experiences can benefit everyone!

With thanks to the Editors of Biostatistics, and to all the people who have
(often unknowingly) inspired me with their examples and thoughts on
reproducible, extensible and open research, including John Blischak, Carl
Boettiger, Karl Broman, David Donoho, Rich FitzJohn, Tim Flutre, Roger Peng, Hadley Wickham, Greg Wilson, and Yihui Xie.