My views about Science

nowadays known as Open and Reproducible Science

I got invited to the TGAC - AllBio: Open Science & Reproducibility Best Practice workshop. On their form they asked me three questions. I got a bit carried away with the answers and that’s how this post was born.

Please name three topics you believe are a priority towards a roadmap for open science and reproducibility of Bioinformatics?

Good practices, tooling and training

Possible subtopics for discussion:

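docker run -v "$PWD":/data genetictest /data/genome.fastq

where genetictest (Docker image names must be lowercase) would be a container with all the tools and pipelines set up that would just print the result, and -v mounts the current directory so the container can read the input file.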

  • More Software Carpentry, workshops, teaching, hackathons. Most biologists never learned how to code and aren’t aware of current best practices. The more people we have trying to do bioinformatics the right way (not just pushing the button of a commercial black box), the more vocal the community will be.
  • What is Mozilla Science Lab doing and how can they help? Any other organisation?

More incentives and awareness for open/reproducible science in academic organisations

That is, how can we make the “non computational people” who are in charge of important science decisions understand why it takes so long to get and process data before you actually start getting scientific results and answers?

Science should work as an open source project

Code should be seen as a worthy scientific output and scientific results should be treated the same way code is in open source.
That is, without the tools you can’t do science. Give people time to work on the code and reward them accordingly (include code usage metrics in grant and results evaluation).

Scientific results should be published right after being created and not only when people think they’re “good” (highly subjective). Once released, they should stay available forever (Figshare/GitHub/Dat) and should be incrementally fixed/improved. At some point, they should be labeled “relevant” (equivalent to “stable” in code) and pushed to a journal or website for more peer review.

Papers should be seen as merely “snapshots” or “blog posts” of scientific results, with the purpose of facilitating peer review by scientists not actively following the project. Ending the current publication status quo doesn’t mean ending peer review. We still need to know which results we can trust more. But papers shouldn’t be seen as the only worthy end product of research, leaving anyone who wishes to build upon a result with the task of rebuilding the computational methods from scratch.

In an Open Source model for Science, everyone could contribute to improving a result and get attribution from an open and transparent system. Open Source has already solved most of our problems; what we need is a cultural shift in Science.

How and why are you involved in efforts related to open science and reproducibility?

TL;DR: I’m working on a project called Bionode, building reusable bioinformatics Node.js modules that work on the server and browser. I’m currently collaborating with BioJS to avoid duplicating efforts. BioJS intends to represent all biological data on the web. I’m also working with Dat, which aims to be a Git for data and sharing of large datasets with reproducible workflow tools (e.g. importing, converting).

With this work and these collaborations I hope to gather the necessary tools to do large-scale research in population genomics.

Long answer:

Why:

var phd = science
if (science !== open + reproducible) {
  console.log('BULLSHIT!')
}

It’s crazy that we call projects and research “science” when we can’t reproduce them or see the data. Despite some progress, open/reproducible science advocates are still seen as the “hippies” or “rebels” by many academics. It’s even crazier when you think that the web was mostly invented for scientists to share their results, not for cat pictures (I have nothing against cat pictures).

Personally, I feel like I don’t trust much of anything. At some point I just compromise and make the assumption that a specific dataset is right and specific software works, and try to build something that I can have some degree of confidence in. I just want everything I build on top of previous work to be open, reproducible and clear. In six months, I want to be able to look back and quickly understand what I have done. If anyone finds errors in my work, it should be trivial to fix them and rerun the experiment. If I actually did everything right but the data/software I used was the wrong one, it should be easy to swap it without breaking everything. We definitely need better modularity in science.
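To make the modularity point a bit more concrete, here is a minimal sketch in plain JavaScript. Every name in it (parseFasta, gcContent, sequenceLength, runPipeline) is hypothetical and made up for illustration; none of it is part of Bionode or any existing library. The idea is only that each step is a small function with explicit input and output, so a wrong step can be swapped without touching the rest.

```javascript
// Hypothetical sketch: all function names are illustrative assumptions,
// not part of Bionode or any real library.

// Step 1: turn FASTA text into an array of { id, sequence } records.
function parseFasta (text) {
  return text
    .split('>')
    .filter(Boolean) // drop the empty chunk before the first '>'
    .map(function (entry) {
      var lines = entry.trim().split('\n')
      return { id: lines[0], sequence: lines.slice(1).join('') }
    })
}

// Step 2a: compute the fraction of G/C bases for each record.
function gcContent (records) {
  return records.map(function (r) {
    var gc = (r.sequence.match(/[GC]/gi) || []).length
    return { id: r.id, gc: r.sequence.length ? gc / r.sequence.length : 0 }
  })
}

// Step 2b: an alternative analysis step with the same input shape.
function sequenceLength (records) {
  return records.map(function (r) {
    return { id: r.id, length: r.sequence.length }
  })
}

// A pipeline is just a list of steps applied in order.
function runPipeline (steps, input) {
  return steps.reduce(function (data, step) { return step(data) }, input)
}

var fasta = '>seq1\nGGCC\n>seq2\nATGC\n'

var result = runPipeline([parseFasta, gcContent], fasta)
console.log(result)
// [ { id: 'seq1', gc: 1 }, { id: 'seq2', gc: 0.5 } ]

// Swapping one step only changes the list, not the other steps.
var lengths = runPipeline([parseFasta, sequenceLength], fasta)
console.log(lengths)
// [ { id: 'seq1', length: 4 }, { id: 'seq2', length: 4 } ]
```

If the parser or the analysis turned out to be wrong, only that one function would need replacing, and rerunning the experiment would be a single call to runPipeline.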

I also strongly believe that the best scientific output format should be dynamic figures, like in this amazing blog post by Jure.

How:

I did bioinformatics for four years before starting my PhD. During that period I had the chance to see how things can go wrong if you don’t make them reproducible from the beginning. Even if some quickly hacked scripts save you some time getting to a result, you’ll lose a lot more time and money downstream figuring out whether that result is true and debugging it. I like the reproducibility debt/technical debt metaphor from Titus Brown, and I think right now the reproducibility debt is crazy high in science. I will try to keep that debt low in my PhD.

What would you like to see from this workshop?
  • What other people who share my goals think and are doing.
  • Great discussions about these topics.
  • New ideas on how to accelerate the move to open and reproducible science, and which other tools/projects out there are contributing to this.
  • How can we all work/collaborate on this?
  • Lightning talks to introduce everybody.
  • All the output from this on the web: talks, documents (start with etherpad/hackpad/gdocs then move to GitHub).
  • If we make everything public during this workshop and get enough attention on social media, a short public online discussion with people who couldn’t attend but would like to give their opinion could have a strong impact (perhaps on the last day on Gitter / IRC / Hangout / whatever channel).

Conclusion

These are some of my thoughts. Some I strongly believe (e.g. reproducibility, open source); others might just be crazy ideas. What do you think?

Answer me with another blog post or ping me on Twitter (@bmpvieira).