It’s safe to say that if you have ever had to use a bioinformatics pipeline, you had a rough ride at some point.
At InSilico DB we run bioinformatics pipelines at scale, on 100,000s of samples. Even though those pipelines are open source and freely available, there is significant overhead in configuring, running, documenting, and updating them.
The folks who gather, for example, at the Genome Informatics conference at CSHL get all the new data first and heroically write scripts to crunch it. Luckily, bioinformaticians are a very open crowd and publish their scripts on GitHub. But as these tools trickle down from the developers to the users of the pipelines, a big challenge remains: compiling them and running them in your own environment. A recent review of the travails involved was compiled from “hackathons and workshops of the EU COST action SeqAhead”.
So, when we began hearing the words bioinformatics and Docker together, we got really excited. For example:
- Titus Brown at UC Davis has been experimenting with Docker to publish his lab’s bioinformatics results
- Paolo Di Tommaso is working on Nextflow, a Docker-based workflow management system specialised in bioinformatics
- This paper: An introduction to Docker for reproducible research, with examples from the R environment
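The appeal is easy to sketch: instead of asking every user to compile a tool chain locally, each pipeline step can be packaged as an image and run identically anywhere Docker is installed. A minimal, hypothetical example (the base image, tool choice, and paths below are illustrative, not an actual InSilico DB pipeline step):

```dockerfile
# Hypothetical image for a single pipeline step: read alignment.
FROM ubuntu:14.04

# Install the aligner from the distribution's package repository,
# so every user gets the exact same binary — no local compilation.
RUN apt-get update && \
    apt-get install -y bowtie2 && \
    rm -rf /var/lib/apt/lists/*

# The host mounts its data directory at /data when running the container.
WORKDIR /data
ENTRYPOINT ["bowtie2"]
```

A user would then run something like `docker run --rm -v $(pwd):/data my-aligner -x genome_index -U reads.fastq -S out.sam` (image and file names hypothetical), and the whole install-and-configure ordeal collapses into a single command.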
We started dreaming about:
- supporting any type of data for which a corresponding pipeline is available, which would make so many of our users happy
- being completely transparent about the algorithms that were run
- allowing the bioinformatics-inclined user to tweak the pipelines without having to go into the guts of our cloud platform
What do you think: is this the moment when bioinformatics can be distributed to the masses?
My feeling is we’ll learn a lot from the upcoming [Bio in Docker] Symposium, 9–10 November 2015, Wellcome Building, London. If this resonates, we’d love to meet you there. To register: https://t.co/2SCCp6YEIt
Please recommend this post to reach more people!