Open Source, Common Workflow Language, Containers, & War of the Package Managers
Note: I’ve made an attempt to link to relevant sites for further info, but this article does assume familiarity with technical terms associated with bioinformatics software engineering.
BOSC, the Bioinformatics Open Source Conference, kicked off this morning as one of the COSIs (Communities of Special Interest) associated with ISMB/ECCB. Overall, popular themes included open source (unsurprisingly!), reproducibility, containerisation, package management, and a love/hate relationship with Docker.
Common Workflow Language
After the intro sessions, we went straight into a set of talks on Common Workflow Language (CWL), a specification that has been growing steadily more popular. Tools from the Rabix suite included visualisation libraries like CWL-SVG and TypeScript development libraries such as CWL-TS.
Global Alliance for Genomics and Health
While “GA4GH” might sound eerily like a Klingon delicacy, the Global Alliance for Genomics and Health produced quite a few interesting tool talks: the Tool Registry Service paves the way for sharing containerised workflow tools, the DockStore provides infrastructure for them, the Task Execution Schema helps manage bulk tasks, akin to docker run, and GA4GH/DREAM aims to try packaging and re-running containers of all types on all platforms, hoping to build a truly bullet-proof reproducible stack. If you’d like to help with this, join the GA4GH/DREAM challenge yourself!
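To make the "akin to docker run" comparison a bit more concrete, here is a minimal sketch of my own, not taken from the talks. It assumes a task can be described as a small dictionary with an image, a command, and some mounted volumes. The field names (name, image, command, volumes) and the example task are hypothetical and only loosely inspired by the idea of a task schema, so don't read them as the actual GA4GH specification.

```python
import shlex


def task_to_docker_argv(task):
    """Translate a task description into an equivalent `docker run` argv list.

    The task layout is a hypothetical illustration, not the GA4GH schema.
    """
    argv = ["docker", "run", "--rm"]
    # Each volume maps a host path to a path inside the container.
    for host_path, container_path in task.get("volumes", {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(task["image"])
    argv += task["command"]
    return argv


# Hypothetical example task: count the lines of a file inside a container.
example_task = {
    "name": "count-lines",
    "image": "alpine:3.6",
    "command": ["wc", "-l", "/data/reads.fastq"],
    "volumes": {"/tmp/input": "/data"},
}

argv = task_to_docker_argv(example_task)
# Print a shell-safe version of the command without running it.
print(" ".join(shlex.quote(part) for part in argv))
```

The appeal of describing tasks as data rather than as shell commands is that the same description can be handed to many different back ends, which is presumably what makes managing tasks in bulk practical.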
Morning Lightning Talks
The morning lightning talks included Kevin Sayers’ talk on NextFlow/CWL integrations: NextFlow is a bioinformatics workflow language designed to be human-readable and concise, while CWL is more machine-oriented. Stian Soiland-Reyes reminded us that metadata is extremely difficult to generate, because it’s boring, and suggested that providing visualisations such as Common Workflow Language Viewer might be an appropriate incentive.
Reproducible workflows for single-cell epigenomics are apparently an area that hasn’t seen much work yet, but Kieran O’Neill introduced Epigenomics-SCREW as a possible solution.
Next up came a talk on BioThings Explorer, which integrates genomic data on the fly using APIs and JSON-LD. Integrating disparate data via their identifiers, which may be labelled differently between data sources, is a tricky problem, so this short talk seemed very impressive.
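For a rough feel of why differently labelled identifiers are tricky, and how a JSON-LD-style context helps, here is a toy sketch of my own. It is not BioThings Explorer's code or API. It assumes each data source ships a small "context" mapping its local field name to a shared URI prefix; the source names, field names, and records below are made up for illustration, and the identifiers.org prefix is used purely as an example of a shared namespace.

```python
# Hypothetical contexts: each source uses a different field name for the
# same kind of identifier, but both map it to one shared URI prefix.
CONTEXTS = {
    "source_a": {"entrezgene": "http://identifiers.org/ncbigene/"},
    "source_b": {"gene_id": "http://identifiers.org/ncbigene/"},
}


def to_uris(source, record):
    """Expand a record's identifier fields to full URIs via the source's context."""
    context = CONTEXTS[source]
    return {context[key] + str(value) for key, value in record.items() if key in context}


def link_records(records_a, records_b):
    """Pair records from the two sources that expand to the same URI."""
    index = {}
    for rec in records_a:
        for uri in to_uris("source_a", rec):
            index.setdefault(uri, []).append(rec)
    pairs = []
    for rec in records_b:
        for uri in sorted(to_uris("source_b", rec)):
            for match in index.get(uri, []):
                pairs.append((uri, match, rec))
    return pairs


# Illustrative toy records only.
records_a = [{"entrezgene": 1017, "symbol": "CDK2"}]
records_b = [
    {"gene_id": "1017", "pathway": "cell cycle"},
    {"gene_id": "9999999", "pathway": "made-up entry with no match"},
]

pairs = link_records(records_a, records_b)
for uri, rec_a, rec_b in pairs:
    print(uri, rec_a["symbol"], rec_b["pathway"])
```

Once both records are expressed as the same URI, the join is trivial; the hard part in practice is agreeing on the contexts in the first place.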
Anil Thanki introduced discovery of homology and gene families using Galaxy tools such as GeneSeqToFamily and Aequatus. The last talk of the morning was on YAMP, by Alessia Visconti. “Yet another Metagenomic pipeline” addresses usability & difficult setup issues in other tools, using NextFlow and Docker to get there.
Lunch and BoFs
Developer tools and libraries for open science and reproducibility
After the break, Phil Ewels introduced us to MultiQC, a tool designed to generate static HTML reports from common biological data formats, born from the frustration of the command line being the only way to analyse data. MultiQC is a Python command line tool that doesn’t run pipelines itself; instead, it analyses the output from pipelines that have already been run. It’s designed to work offline if no internet is available, and uses Highcharts for interactive data visualisation, or can be configured to generate static matplotlib images for scenarios where the data are so big they’d likely cause an interactive browser diagram to grind to a halt. Future plans include expansion to MegaQC, analysing across multiple data sets in a broader way.
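The "analyse output that already exists" idea is easy to sketch. The snippet below is my own toy illustration and has nothing to do with MultiQC's real code or API: it assumes each sample has a plain-text log of "key: value" lines (a made-up format) and turns a set of such logs into one static HTML table.

```python
import html


def parse_log(text):
    """Parse 'key: value' lines from a (hypothetical) tool log into a dict."""
    stats = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            stats[key.strip()] = value.strip()
    return stats


def build_report(logs):
    """Render one HTML table row per sample from logs that already exist."""
    parsed = {sample: parse_log(text) for sample, text in logs.items()}
    # Use the union of all keys so samples with missing stats still line up.
    columns = sorted({key for stats in parsed.values() for key in stats})
    header = "".join(f"<th>{html.escape(c)}</th>" for c in ["sample"] + columns)
    rows = []
    for sample in sorted(parsed):
        cells = [sample] + [parsed[sample].get(c, "") for c in columns]
        rows.append("<tr>" + "".join(f"<td>{html.escape(v)}</td>" for v in cells) + "</tr>")
    body = "".join(rows)
    return f"<table><tr>{header}</tr>{body}</table>"


# Made-up logs standing in for the output of a finished pipeline run.
logs = {
    "sample_1": "total_reads: 1000\nmapped_reads: 950",
    "sample_2": "total_reads: 2000",
}

report = build_report(logs)
print(report)
```

Because the report is a single self-contained string of HTML, it can be written to a file and opened without a network connection, which matches the offline use case described in the talk.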
More lightning talks
Aditya Bharadwaj talked about a network visualiser called GraphSpace, Timothy Booth discussed detecting well-hopping duplicate reads, and Monther Alhamdoosh discussed ways to detect relevant genes when you have really big datasets to explore, using the EGSEA R package.
Brad Chapman eloquently introduced bcbio, designed to take analyses to the data files, rather than copying data around.
Before coffee break, Kees van Bochove described The Hyve’s worldwide open source work. The Hyve is also a BOSC sponsor.
Data Science and Visualisation: The package manager wars
Following a nice half-hour re-caffeination break, we entered the data science and visualisation section of the afternoon. Here lay many differences of opinion with regard to the best way to create reproducible software, the best package managers, and whether or not Docker is the solution to reproducibility (it’s not), or indeed even necessary.
Guix, which I learned is pronounced “geeks”, was introduced by Ricardo Wurmus as a package manager that competes with Bioconda and doesn’t require Docker (but can use it if you really wish).
Kei Ono discussed Cytoscape, which grew from an early single Java visualisation package to become an entire ecosystem of related tools, including NDex, which he described as a “GitHub for biological networks”.
Olga Vrousgou introduced the SPOT ontology toolkit, providing all your biological ontology lookup and data mapping needs. When Olga asked who knew about and used ontologies, hands went up across the room — it was definitely a popular subject!
The last talk before the keynote was a BioPython update. As shown above, they have a great new logo, created to loosely mimic the Python Software Foundation colours with the foundation’s permission. Project updates include the plan to move from their old custom licence to the 3-clause BSD licence (a long-winded task, as all past contributors must be contacted to confirm their acceptance), as well as plans to discontinue support for Python 2.7 by 2020.
Open Source Yourself: the keynote we were all waiting for
Madeleine Ball’s talk was inspiring. Her project, Open Humans, aims to break down barriers to gaining useful research data by open-sourcing genomes and personal health data. She discussed the concerns around sharing these types of data: whether it’s discovering you have a dangerous disease, finding out that your grandma wasn’t actually one of your ancestors, or something else, there are a lot of reasons people might prefer to keep this data a secret.
Nonetheless, a surprising number of people do share anyway. Open Humans makes sure that people who share their personal health data know what they’re getting into when they share it, making people complete a quiz that proves they understand the potential risks.
She also shared the impressive story of Dana Lewis, who hacked her own artificial pancreas system in order to create a more effective glucose monitor alarm, one she couldn’t accidentally sleep through when the built-in one was too quiet! Lewis went on to build a community around this as well: OpenAPS.
If Open Humans sounds exciting to you, consider applying for their small grant project for ideas up to $5000, or apply for their software developer vacancy.
That’s it for tonight folks! More after tomorrow’s talks.
Thirsty for more? Check out BOSC Day 2
Disclaimer: Any views expressed are my own, not necessarily those of PLOS.