Why hasn’t bioinformatics been democratized yet? 5 reasons.

Karl Sebby · Published in truwl · Apr 28, 2022

“Democratization of bioinformatics.” I’ve seen the phrase many times. The argument usually goes something like this:

“The decreasing cost and widespread availability of genomic sequencing technology has made collecting sequencing data accessible to nearly every lab and clinic, yet the ability to make sense of the data remains a significant bottleneck.”

One of the reasons I’m so familiar with these words is that I’ve written similar things myself, although I’ve avoided the term “democratization” and stuck with “accessibility” after reading a Tweet that compellingly explained why “democratization” was not the right word.

To complete the story you should probably also show the graph of genomics costs decreasing faster than Moore’s Law; I’ve included it here for your convenience. We have become very good at collecting data, and when you know how to do something well, you tend to do it a lot: the amount of genomics data being collected is doubling every 7 to 12 months. That means most genomics data in existence was generated in the last year (Gavin Belson explains something similar in the opening moments here). Although this seems like a lot (it is), we’re only beginning to scratch the surface.

The rapidly decreasing cost of genomic sequencing. (NOTE: Not very impressive if you make 2015 the starting point.)

As we continue to accumulate data, it has become well recognized that most scientists who have the lab experience to perform genomic assays and collect all of this sequencing data do not have the knowledge and training to use the advanced computational tools required to analyze it.

The National Human Genome Research Institute (NHGRI) has been broadcasting this issue in their SBIR/STTR Omnibus Solicitation at least since I started following it 6 years ago:

Some genomic data analysis and display tools have been developed that already are used in the community but would benefit from additional work to support broader dissemination, for example making them efficient, reliable, robust, well-documented, and well-supported, or for deploying them in containers or at scale in a cloud-based platform.

In other words, tools that have already been developed are

  1. hard to find — need dissemination;
  2. under-developed — making them efficient, reliable, robust;
  3. under-supported — well-documented, well-supported; and
  4. dependent on underlying infrastructure — deploying them in containers or at scale in a cloud-based platform.

As with all problems, the ‘bioinformatics bottleneck’ doesn’t exist in isolation but is connected to other issues and the status quo of the way we do things. I think of it as part of the larger reproducibility crisis that spans all of science. In 2016, Nature released results of a reproducibility survey. The opening two lines of that publication were bleak:

More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.

Not good science. Not good at all. Some of the lack of reproducibility I get. The reward system for scientists favors novelty and flashy headlines over rigor and reproducibility, so some results might not be reproducible in the first place; this has already been covered extensively elsewhere. But there are also technical issues. In biology, researchers don’t have access to identical equipment and biological samples, seemingly minor details are often left out of reported procedures, and some people have developed exceptional skill working with particular biological systems that is not easily replicated.

For computational experiments, however, none of these issues are insurmountable with current technology. Computational experiments provide a unique opportunity for reproducible research. Samples are not limited: data can be copied and shared (yes, there are privacy concerns to deal with). There is no specialized equipment or infrastructure that can’t be accessed with an internet connection, some know-how, and a credit card: it’s called the cloud. And computers can keep track of every individual command that was run; these commands can be shared in a notebook, found in command-line histories and log files, or collected into a workflow that specifies how to execute a series of operations in sequence and in parallel.
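As a toy illustration of that last point, every step of an analysis can be run through a small wrapper that records the exact command, when it ran, and a checksum of its input. This is a minimal sketch of the idea, not any particular platform’s tooling, and the tool and file names are hypothetical.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone

def sha256(path):
    """Checksum an input file so the exact data that was used can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def run_and_log(cmd, input_path, logfile="analysis_log.jsonl"):
    """Run one analysis step and append a provenance record to a JSON-lines log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": cmd,
        "input": input_path,
        "input_sha256": sha256(input_path),
    }
    subprocess.run(cmd, check=True)
    with open(logfile, "a") as log:
        log.write(json.dumps(record) + "\n")

# Hypothetical step: compress a FASTQ file while keeping the original.
# run_and_log(["gzip", "-k", "sample1.fastq"], "sample1.fastq")
```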

Then there is the compute environment. When I was an undergraduate student, I worked in a crystallography lab. My advisor told me stories about scientists who claimed to have ‘magic dust’ floating around in their labs that was particularly good for the nucleation of crystals. Apparently, when they moved labs they couldn’t get certain crystals to grow anymore. They ultimately transferred dust from the old lab to the new lab and their crystal growing powers were restored. The lab for computational experiments is a software environment, and software environments (or complete instructions for making them) can be packaged up and passed around, magic dust and all. These are software containers (e.g. Docker) and package managers (e.g. Conda). Once packaged up, the environments can be moved, unpacked or built, and used in a space with all the resources needed to support them.
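A container image or a Conda environment file is how you would actually package that ‘lab’. Just to show how describable the environment is, here is a minimal Python sketch (an illustration of the idea, not a replacement for Docker or Conda) that snapshots the interpreter and installed package versions to a lock-style file that could travel alongside an analysis.

```python
import sys
from importlib import metadata

def snapshot_environment(path="environment.lock.txt"):
    """Write the interpreter version and every installed package version to a file."""
    lines = [f"# python {sys.version.split()[0]}"]
    for dist in sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
        lines.append(f"{dist.metadata['Name']}=={dist.version}")
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    snapshot_environment()
```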

computational experiments provide a unique opportunity for reproducible research

In theory, all computational experiments can be made reproducible. And if the reproducibility problem were solved, the bioinformatics bottleneck could be solved too. To make a computational experiment reproducible you need the three pieces above: the samples (data and metadata), the exact procedure (software commands), and the lab (compute environment). As I wrote about recently, reproducing someone else’s experiment is how you confirm you are applying methods consistently before confidently applying them to new datasets. Once you reproduce someone else’s experiment, you have everything you need to do your own similar experiment: the pieces needed to reproduce someone else’s work are the same pieces needed to do your own.

There are several platforms aimed at democratizing access to bioinformatics. Many have gone by the wayside; others have been around for a while and are still going strong, like Galaxy, DNAnexus, Terra.bio, and Seven Bridges Genomics; and there continues to be a pipeline of newcomers like latch.bio, Datirium, biobox.io, and more. Although these platforms have made great strides in providing access to bioinformatics capabilities, the fact remains that many otherwise capable researchers cannot analyze data on their own.

As part of an STTR award we took advantage of the NIH Technology Niche Assessment Program, which pays for a market research report. Surveying the existing landscape of bioinformatics platforms, the report noted “Multiple open-source programs, commercial products, and research projects with no clear dominant design.” In other words, there’s a bunch of stuff out there, but no solution has hit the mark and become the conventional way of doing things. So what are the holdups? Here’s my take on some of the biggest:

Holdup #1: Method curation

Bioinformatics platforms such as those mentioned above do a lot to lower the expertise barrier. Doing an analysis from scratch requires a lot of heavy lifting to set things up, such as installing software and gaining access to appropriate compute resources. Many researchers are also unfamiliar with working from the command line, so graphical user interfaces (GUIs) are a big help. I think multiple platforms do a good job of simplifying all of these challenges; I have spoken to several biologists who said that Galaxy was a complete game changer and gave them the tools they needed to advance their projects. This all helps with issue #4 of NHGRI’s omnibus solicitation quoted above.

With ready-to-use environments and interfaces for kicking off compute jobs, what is still missing? Sometimes doing a thing isn’t the hard part; knowing what to do is, and that is the case with many bioinformatics analyses. Imagine doing your first ever RNA-seq experiment. You do the library prep, do the sequencing, and you have the fastq files in hand. Now what? Unless you have someone to provide support, most people will start with a Google search or dive into the literature. In short order they’d learn that there is no standardized way of doing things, so they end up picking a strategy from a source that seems trustworthy: it might be a set of tools to use, a pipeline, or something available on a platform that is used within their organization. Without considerable experience, they won’t know what all their options are, what content is available on which platforms, or how to evaluate what is right for their use case. There isn’t a centralized place to read reviews and see common questions like there is when trying to decide between products on Amazon. It would be nice to see what has worked for others so you can have an idea of what will work for you. I’d love to see a list of rated questions such as “Will this work with Oxford Nanopore data?” with answers from people who have tried it.
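To be clear, once a strategy is chosen, the mechanics are often the easy part. As a hedged example, here is roughly what one common RNA-seq quantification route looks like: QC with FastQC, then transcript quantification with Salmon. This is one reasonable choice among many, not the standard, and the file names, directory names, and parameter choices are placeholders.

```python
import subprocess
from pathlib import Path

reads = ["reads_1.fastq.gz", "reads_2.fastq.gz"]   # hypothetical paired-end FASTQ files
Path("qc").mkdir(exist_ok=True)                     # FastQC writes its reports here

# 1. Quality control on the raw reads.
subprocess.run(["fastqc", "-o", "qc", *reads], check=True)

# 2. Build a Salmon index from a reference transcriptome (hypothetical FASTA path).
subprocess.run(["salmon", "index", "-t", "transcripts.fa", "-i", "salmon_index"], check=True)

# 3. Quantify transcript abundance; '-l A' lets Salmon infer the library type.
subprocess.run(
    ["salmon", "quant", "-i", "salmon_index", "-l", "A",
     "-1", reads[0], "-2", reads[1], "-o", "quant_out"],
    check=True,
)

# Downstream, quant_out/quant.sf typically feeds differential-expression tools
# such as tximport and DESeq2 in R.
```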

Figuring out what to use continues to be a place where the expertise barrier needs to be lowered. What methods are out there? What is current? What should I use? Is there already work that I can leverage (e.g. availability on a platform, or an existing workflow, or notebook), or do I need to start with a blank canvas?

Holdup #2: Productionization of content

Platforms are all about content. GitHub is useful as a place to back up and collaborate on code with your team (I use GitHub as an example a lot), but what really makes it special is all the content that is there to learn from, use, and interact with. What’s Wikipedia without all the articles? Medium without the stories? The popularity of the R programming language is due to the packages… you get the picture. In the same way, bioinformatics platforms are useless without content, and the primary content on bioinformatics platforms is methods and data. I’m focusing on the methods here, as I think data access should be handled separately and outside of platforms. Currently, platforms are mostly limited to the technologies behind them, or to a specific research domain: Terra only supports pipelines written in the Workflow Description Language (WDL), One Codex is just for analyzing microbiome sequencing data, BaseSpace is only for Illumina data, and so forth. As an aside, shout out to Seven Bridges Genomics for announcing that they now support WDL and Nextflow workflows.

Why are platforms so limited with content, and why should end users have to keep figuring out different platforms that more or less have equivalent technologies behind them (web GUIs that launch jobs on HPC or cloud)? For some platforms, the answer is obvious: the developers and funders build platforms to support their own work, and Illumina is not going to dedicate resources to making PacBio data easier to analyze. In other cases, organizations might have plans to support a wider range of content but get busy enough with their starting points that they never get there. Have you ever started a project with parts A, B, C, and D, only to find that there are precursors needed just to make part A possible?

We need to have a universal way to access bioinformatics content that is consistent across topics, workflow languages, and technologies so we don’t have to figure things out from the start every time we want to do something a little different from what we’ve done before. And why stop with genomics? Let’s include mass spectrometry, flow cytometry, and whateverelse-omics-omitry as well.

This holdup is the biggest blocker to bioinformatics ‘democratization’, and tackling it is a monumental undertaking. Remember that most tools are already under-supported and under-developed. It takes a lot of work to take methods from the wild, work out some (many) kinks, and put them into a ready-to-use format. Nothing ever “just works”. This “productionization” (it’s a word, the internet said so) of methods is mostly done by dedicated teams that support particular platforms and, frankly, there is too much content out there for them to handle. Every time I open a journal there is a new sequencing method acronym. We need to move from dedicated production personnel to community production of ready-to-use methods, in a way that doesn’t add strain to already overburdened researchers and developers. Then everything needs to be deposited into a centralized repository that anyone can access and that avoids vendor lock-in. The Galaxy toolshed is a good example of this. I also really like the idea of the GA4GH Tool Registry Service and its reference implementation, Dockstore, but the amount of content needs to grow.

How do we do this? Can we define the starting and ending data files in an analysis, parse notebooks and terminal histories, and autogenerate workflow and compute-environment definitions? Is doing everything in notebooks (including launching workflows) and sharing those the way to go? Can we make a standardized way to share methods and complete analyses that accommodates the diverse ways people already work? Git works well for sharing code of all sorts; can we make something as universal for complete analyses? Are there reward systems that would motivate developers to share their methods in this way? See the often-referenced comic below about making new standardized ways to do things.

It takes a lot of work to take methods from the wild, work out some (many) kinks, and put them into a ready-to-use format.

Frustration with the incompatibility of current standards leads to even more standards.
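To make the registry idea above a little more concrete, here is a small sketch of querying the GA4GH Tool Registry Service that Dockstore exposes. The endpoint path and query parameters reflect my reading of the public TRS v2 API, so treat them as assumptions; the point is only that a centralized, standards-based registry can be searched programmatically rather than platform by platform.

```python
import requests

# Assumed public TRS v2 endpoint for Dockstore; adjust if the API differs.
TRS_TOOLS_URL = "https://dockstore.org/api/ga4gh/trs/v2/tools"

def list_tools(name_filter="rnaseq", limit=5):
    """Print a few registry entries whose name matches the filter."""
    response = requests.get(
        TRS_TOOLS_URL, params={"name": name_filter, "limit": limit}, timeout=30
    )
    response.raise_for_status()
    for tool in response.json():
        descriptor_types = [version.get("descriptor_type") for version in tool.get("versions", [])]
        print(tool.get("id"), tool.get("toolclass", {}).get("name"), descriptor_types)

if __name__ == "__main__":
    list_tools()
```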

Holdup #3: Confidence in results

Imagine that you could easily query all the available methods out there and make an informed decision about which methods to apply to your data. Imagine also that all the methods you could ever want were available in a ready-to-use format where you just had to specify your data, select parameters, and hit ‘go’, perhaps using the mythical bioinformatics keyboard below (I have completed a clichéd figure hat trick!!). Imagine that you ran the methods on your data and the analysis completed without errors. Would you trust the results? Would you feel comfortable returning a list of variants to a clinician, committing funds to scale up your sequencing program with this workflow, or heading in a new research direction?

Torsten Seemann’s bioinformatics keyboard, taken from https://twitter.com/torstenseemann/status/433448248921956352?s=20&t=Hzi46fFEfl0bve-1Vx921A

Not being sure you’re doing something ‘right’ can be a significant cause of angst and uncertainty. The Biostars Handbook even has a section that covers how to deal with the anxiety that comes with wondering if you screwed up your analysis. We’ve heard similar things from end users of bioinformatics pipelines who aren’t aware of all the inner workings. Even if things seemed to work just fine, they want some extra assurance that they actually did. So how can researchers be confident that the insights they obtain from their data are grounded in biological reality?

This confidence can be gained in various ways. Getting the same results as somebody else using the same methods is comforting and a good start. Having someone with more experience reassure you that all is okay is good. Analyzing data in multiple ways and getting answers that lead to the same conclusions is better yet. However, the best solution is benchmarking: doing your analysis with a sample where you know what the results should be. So what do we need? We need more standard samples, more reference data sets, more methods to compare results to reference sets, and better accessibility to all of these resources. We need to use these standards and resources on a continual basis and get a reassuring green check mark, just like we do with software testing. We need feedback not only when things go blatantly wrong, but also when everything is A-OK.
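At its simplest, that green check mark is a comparison of your results against a truth set, turned into a metric. Below is a toy, hedged sketch that compares two VCFs by site and reports precision and recall; real benchmarking relies on curated references such as the Genome in a Bottle samples and on comparison tools like hap.py that normalize variant representation, which this toy version ignores. The file names are hypothetical.

```python
import gzip

def load_sites(vcf_path):
    """Collect (chrom, pos, ref, alt) tuples from a (possibly gzipped) VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    sites = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            sites.add((chrom, pos, ref, alt))
    return sites

def benchmark(calls_vcf, truth_vcf):
    """Return (precision, recall) of the call set against the truth set."""
    calls, truth = load_sites(calls_vcf), load_sites(truth_vcf)
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical file names:
# print(benchmark("my_calls.vcf.gz", "giab_truth.vcf.gz"))
```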

Holdup #4: Access to accessibility

If you work in an organization that has a sufficient budget and teams of bioinformaticians, software developers, and other IT professionals, you might enjoy straightforward access to the computational methods you need on a regular basis. You’d also be part of the minority. As mentioned above, centralized access to production-ready content is one issue, but access to the platforms that host that content, provide features, and make it ready to use is another. As it stands, too many barriers are in the way of accessing the capabilities of platforms. I’m talking about paywalls, account barriers, and anything else that gets in the way of getting straight to content and functionality that is useful. How do you know if something is useful if you can’t get your hands on it and give it a try?

Again, GitHub is a good example. You can see all the public content on GitHub without having an account. You can get to that same content from a Google search and don’t have to go through the GitHub home page. Then, once you know that GitHub is something useful, you can create an account and do things, all without ever having to talk to a support or sales associate. You can get started in minutes. Can a bioinformatics platform do this? Maybe you can’t do a bunch of stuff for free, but certainly many obstacles could be removed. Part of the solution is separating the security requirements of methods from those of data. Some platforms control data access, and a lot of data needs to be locked down tight, but why put non-proprietary methods behind the same walls? Let’s not put methods behind the same barriers as protected data (exceptions for proprietary methods, of course) and instead let people find and use them, which will ultimately lead to vetting and improving them.

Cost and security walls block access to many platforms.

Most commercial platforms focus only on enterprise sales. I get this. Once you have employees to pay, investors to satisfy, and the possibility of not being able to pay your bills in front of you, the focus has to be on generating enough revenue to keep your doors open. I heard a representative of one platform company say “there is no money in academia”. Maybe that’s right, maybe it isn’t, but does that mean that any organization without a large budget for bioinformatics should be excluded from access to commercial platforms? Especially when those platforms are built with open-source software and host open-source methods? Is profit the only useful contribution that institutions can make to a platform company? Obviously, I think the answer to these questions is “no”.

While enterprise sales might be the main revenue driver for the foreseeable future (anyone that has tried otherwise to this point has pivoted or ceased to exist), I think there is great benefit to providing platform access to as many potential users as possible. Firstly, it’s the right thing to do for the advancement of science, but it also introduces the platform to users who might be in a position to influence purchasing decisions later, helps to find areas of the platform that should be improved, and brings in users who can help grow platform content. Companies aren’t in a position to offer platform access completely free (there are compute and storage costs to cover), but they can make it accessible without large contracts. A lot of contracts currently bundle platform access and support. Why not allow anybody to access the platform, try to figure things out on their own, and just pay for the resources they use? Open-source software with enterprise versions and support is a good model for this.

To make bioinformatics truly accessible, it needs to be accessible to anybody at any scale. There are a lot of projects that only need to analyze a single data set or require infrequent analysis. Currently, these types of projects do not get the support they need, but they should.

Holdup #5: Supporting multiple user types

This is the main topic that came up when we asked on Twitter what the holdup was to ‘democratizing’ bioinformatics.

Platforms are multisided systems: buyers and sellers, creators and consumers, and so forth. To be widely adopted, platforms need to be beneficial to multiple user types. For bioinformatics platforms, there are those who develop and maintain methods and those who need to use them. And as stated by “noted curmudgeon” (he’s actually quite pleasant) Jeff Gentry on Twitter, the range of expertise among users is wide and bimodal. These users have different needs and wants for bioinformatics platforms and infrastructure. Advanced users want control over everything: to dig in, try different things, and scale. People with less computational expertise just want to analyze the data without having to get the equivalent of a computer science master’s degree. This is the well-known issue of balancing flexibility and ease of use: as you increase one, you decrease the other. Focusing on ease of use (like what we did for our community-edition benchmarking workflow) makes a lot of sense for methods once they have been thoroughly developed, tested, and are used repeatedly in the same way on similar samples, but it is not appropriate when things are in more of a discovery phase, or when sample types vary.

How is the right balance achieved? Does it make sense for different types of users to have separate platforms, or should single platforms try to meet the needs of multiple user types? It’s hard to accommodate everyone, but whose needs need to be met most urgently? From my experience, it’s the advanced users. Even if the primary end user will be less experienced, it’s still the experts who guide decisions about what to do and what to invest time and money in, either directly or through their opinions to friends and colleagues. So can a platform be both flexible and easy to use and understand? I think so, and there are already some solid strategies out there to facilitate this. Providing both programmatic (API) and GUI access is key. Programmatic access allows uninhibited access to, and control of, everything a platform and infrastructure stack has to offer, so advanced users can pull on all the strings and interact with content to their heart’s content. Options can be limited in the GUI to create a less intimidating experience for less experienced users. And why not make the GUI configurable as well? While developing or testing things out, you might want to select different tools, request different amounts of resources, and have a lot of flexibility. But once things move to a production environment, a lot of the flexibility available in the underlying programs should be taken away, both for ease of use and for consistency of implementation. This really isn’t that hard to do, as we’ve already built out interfaces that vary based on user profiles.
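Here is a small sketch of that pattern in general terms (an illustration of the idea, not our implementation): the full parameter set stays available programmatically, while each user profile may only override a whitelisted subset and everything else stays pinned. The parameter names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class MethodParameters:
    """Every knob the underlying tool supports (names here are illustrative)."""
    threads: int = 4
    min_quality: int = 20
    aligner: str = "bwa-mem"
    extra_args: str = ""

@dataclass
class UserProfile:
    name: str
    editable: set = field(default_factory=set)  # parameters this profile may change

ADVANCED = UserProfile("advanced", {"threads", "min_quality", "aligner", "extra_args"})
PRODUCTION = UserProfile("production", {"threads"})  # everything else stays pinned

def apply_overrides(profile, defaults, overrides):
    """Accept only the overrides a profile is allowed to touch; ignore the rest."""
    allowed = {key: value for key, value in overrides.items() if key in profile.editable}
    return MethodParameters(**{**defaults.__dict__, **allowed})

params = apply_overrides(PRODUCTION, MethodParameters(), {"threads": 16, "aligner": "minimap2"})
print(params)  # the aligner stays pinned to the production default; threads is adjustable
```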

What does solving the bioinformatics bottleneck look like?

Until a researcher with enough expertise to design an experiment and collect data, or a clinical staff member who runs a sequencing gene panel, can confidently get results from their data without putting in months or years of effort setting up compute environments and figuring out how to handle the data and gain insights from it, there’s work to be done.

What does the world look like if bioinformatics can be done by anybody? It allows genomics to continue its expansion into every organization and industry that can benefit from more knowledge about living things and their products. Here are a few of those areas:

Education. Learning about genes and inheritance has been part of a solid K-12 education for a long time. With the advent of lower-cost sequencing methods and access to the right bioinformatics tools, tomorrow’s classroom will go way beyond rolling dice to determine eye color and extracting DNA from strawberries. Instead of just extracting DNA, students could take strawberries with different phenotypes, sequence them, and try to understand the source of the differences. The possibilities are endless. This shift will make genomics a much more familiar concept to members of society.

Research. Modern life-science and biomedical research has become data focused and data driven. This transition brings a lot of power to biology, but also a lot of computational overhead and a need for additional expertise. Reducing these needs and overhead will let researchers harness this power without the additional headaches and focus their time on scientific questions instead of processes. All of this will lead to more organisms studied, more biological insights uncovered, and ultimately more therapies, more effective therapies, and rapid growth of the entire bioeconomy.

Clinical testing. Genomic testing has proven to be a cost-effective means of improving clinical outcomes. However, testing has been underused, and access to bioinformatics is a major barrier to making genomic testing ubiquitous. Uninhibited access will help usher in precision medicine at scale.

Consumer/patient genomics. More people are getting access to their (and their pets’) genomic and other health data than ever before. There is nobody more motivated to learn from genomic data than a family member of, or a person with, a serious health condition. While professionals will be the first to query the data, accessible bioinformatics will allow individuals to dig even deeper: try the newest algorithms, look at different variant types, rerun pipelines whenever references and databases are updated, and more.

Summing up

New computational approaches are always needed and will remain the domain of experts, but applying existing approaches can and should be made much more accessible to anybody who wants to use them. There is a lot to learn by applying existing approaches to different cell lines or species, to more individuals, and to samples perturbed in different ways, and by verifying reported results. In turn, this can free up experts to push the boundaries of bioinformatics rather than support existing methods. People have argued that basic programming and command-line literacy needs to be part of required training for everybody in bio-related fields. I agree that this is part of the solution, but how much expertise should we expect everybody to have in this area? How many advanced fields can we expect people to become experts in, especially as the knowledge and variety of each subject continues to grow? Can there be great careers for people who love biology but don’t care to sit at a computer?

As always, one of my main reasons for writing is to elicit feedback. What am I missing? What did I get wrong? My focus here has been on secondary analysis, since that has been our primary focus at Truwl. Another area that needs considerable attention is integration of secondary analysis with systems to warehouse and interactively explore results, but I’ll leave that for another post.

TLDR: Beyond GUIs and ready-to-use compute environments, there need to be easier ways to see what is available and to determine the right methods for particular use cases; knowing what to do is becoming more important than knowing how to do it. There needs to be a straightforward, universal, platform-independent way to deposit and access the complete inventory of available methods, independent of underlying programming languages and technologies. Methods need to be openly accessible and findable with search engines. There needs to be a balance between ease of use and flexibility when accessing methods. Truly accessible bioinformatics will allow scientists and clinicians to spend more time focusing on important questions.
