IDseq: An Open Source Platform for Infectious Disease Detectives
A few weeks ago, along with the Chan Zuckerberg Biohub, we unveiled IDseq, an Open Source cloud-based platform that has the potential to transform scientists’ ability to detect and respond to infectious disease outbreaks around the world. What is the technology behind it?
Advances in genomic sequencing make it possible to bring infectious pathogens into plain sight and monitor their global spread. DNA and RNA are signatures for all of the different viruses, bacteria, fungi or parasites in the world. They can also reveal which drug resistance genes or mutations a pathogen possesses. Because of this, genomic sequencing is the perfect basis for an infectious disease monitoring system in the modern era.
But one hurdle to genomic sequencing is that many research labs around the world lack the computational expertise required to search through vast public databases of microbial genomes or crunch the statistics to distinguish signal from noise. Other labs have the ability to use and develop bioinformatics pipelines, but may lack the compute capacity to process large data volumes in a reasonable time frame. By deploying software engineering expertise in collaboration with academic partners who have already published effective workflows, IDseq hopes to make these advances accessible around the globe as a scalable, performant, and easy-to-use platform product. The founding team includes engineers who, a decade ago at Facebook, built large-scale performance profiling systems and oversaw growth from a few million users to over a billion.
Optimizing performance of large-scale genomic searches
To build the first version of IDseq, we had to figure out the best ways to efficiently transfer and process terabytes of data per user request. Indeed, current DNA sequencers produce several terabytes of data in a single run. Users may be quiescent for several days, so we scale our elastic compute cloud up and down to reduce costs. But there are also times when IDseq is hit with the output of many sequencing runs all at once, for example, in an outbreak situation requiring rapid decision-making. In such a case we need to quickly spin up a large group of compute nodes and load a genomic search index approaching half a terabyte into random-access memory on all of them, before being able to comb through all of the user-provided data.
To achieve transfer speeds of over 1 gigabyte per second on standard cloud infrastructure, we wrote an Open Source command-line utility called s3mi. The tool downloads parts of a large file from AWS S3 storage to EC2 compute instances in concurrent threads, taking maximal advantage of the available bandwidth. But disk I/O limitations were holding back the transfer speeds we were seeking, so we added a mechanism to receive the file entirely in memory. This is achieved by tweaking virtual memory parameters so that writes to disk are deferred indefinitely and the downloaded file in reality sits in the filesystem cache in RAM. With this trick, we can load a 430 GB genomic index from remote storage in less than 10 minutes, achieving better performance and flexibility than if we attached prepopulated EBS volumes. Interestingly, we had to disable the Linux kernel defragmenter to ensure consistent performance over time while most RAM is occupied by this large index.
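The idea of downloading parts of a file in concurrent threads can be illustrated with a minimal Python sketch. This is not s3mi's actual code: `fetch_range` is a hypothetical stand-in for a ranged read against remote storage, and the part size and worker count are arbitrary assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size, part_size):
    """Return inclusive (start, end) byte ranges that cover the whole file."""
    return [(start, min(start + part_size, total_size) - 1)
            for start in range(0, total_size, part_size)]


def parallel_fetch(fetch_range, total_size, part_size, max_workers=8):
    """Fetch every byte range concurrently and assemble the parts in memory.

    fetch_range(start, end) is a hypothetical callable that returns the bytes
    of the remote file from offset start to offset end, inclusive.
    """
    buf = bytearray(total_size)

    def worker(byte_range):
        start, end = byte_range
        data = fetch_range(start, end)
        if len(data) != end - start + 1:
            raise IOError("short read for range %d-%d" % (start, end))
        # Each worker writes to its own disjoint slice of the buffer.
        buf[start:end + 1] = data

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces iteration so that worker exceptions are re-raised here.
        list(pool.map(worker, split_ranges(total_size, part_size)))
    return bytes(buf)


# Stand-in for remote storage so the sketch runs without network access.
blob = bytes(range(256)) * 40


def fake_fetch(start, end):
    return blob[start:end + 1]


result = parallel_fetch(fake_fetch, len(blob), part_size=1000)
```

In a real deployment the stand-in would be replaced by ranged requests to the object store, and the assembled bytes would be written to a path backed by the filesystem cache rather than held in a Python buffer.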
While 10 minutes is a manageable timeframe for our application, it would be wasteful to postpone data processing until all reference indexes are fetched. That’s why we overlap index transfer with preliminary processing steps such as quality control and cleanup of the user-provided data. To abstract away the complexity of scheduling concurrent processing steps and file fetches, we wrote idseq-dag: a reusable, lightweight, Python-based framework for efficiently running any workflow that can be described as a Directed Acyclic Graph (DAG). If you’re interested, you can take a look at the steps we’ve implemented to remove human sequences and analyze pathogen sequences in IDseq. We hope that eventually, the bioinformatics community will start to openly collaborate on this code. But if you’re not interested in bioinformatics, you can ignore the specific pipeline steps we’ve implemented and fork the DAG engine for your own application. The combination of idseq-dag and s3mi should be useful to developers of intricate workflows that require frequent transfers of large files.
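To make the overlap concrete, here is a small sketch of a DAG runner that starts every step as soon as its dependencies have finished, so an index fetch and a quality-control step can run side by side. It is an illustration under stated assumptions, not the idseq-dag API: the `run_dag` function, the step names, and the dictionary-based graph description are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_dag(steps, deps, max_workers=4):
    """Run steps concurrently while respecting dependencies.

    steps: maps a step name to a callable that takes a dict of its
           dependencies' results and returns its own result.
    deps:  maps a step name to the list of step names it depends on.
    """
    results = {}
    pending = set(steps)
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # A step is ready once all of its dependencies have results.
            ready = [name for name in pending
                     if all(d in results for d in deps.get(name, []))]
            for name in ready:
                pending.remove(name)
                inputs = {d: results[d] for d in deps.get(name, [])}
                running[pool.submit(steps[name], inputs)] = name
            if not running:
                raise ValueError("cycle or unknown dependency in graph")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


# Hypothetical three-step workflow: the index fetch and the quality control
# have no dependencies, so they are submitted together; alignment waits for both.
steps = {
    "fetch_index": lambda inputs: "index",
    "quality_control": lambda inputs: ["ACGT", "GGCC"],
    "align": lambda inputs: (inputs["fetch_index"],
                             len(inputs["quality_control"])),
}
deps = {"align": ["fetch_index", "quality_control"]}
results = run_dag(steps, deps)
```

A production framework would also need retries, logging, and persistence of intermediate files, which this sketch leaves out.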
Version control for bioinformatic reference data
The infectious disease world never stands still. New genomic sequences are discovered and deposited to public databases hosted by the National Center for Biotechnology Information (NCBI) almost daily. To deliver the most cutting-edge insights while preserving the transparency and reproducibility that science demands, IDseq leverages an in-house synchronization service. Built by one of our technology interns (now a full-time team member), this service takes regular snapshots of the NCBI databases and surfaces them through an interface with built-in version control (ncbi-tool-sync, ncbi-tool-server, ncbi-tool-cliclient). Thanks to version control, older analyses can be preserved and reproduced at any time, which makes them suitable for inclusion in scientific journal articles. The same samples can later be viewed through the lens of updated databases, potentially alerting the user to newly discovered pathogens that had been missed before.
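One simple way to think about versioned reference data is a catalog of dated snapshots that can answer the question "which database version was current on this date?". The sketch below is a toy model of that idea only. The `SnapshotCatalog` class, its methods, and the storage paths are hypothetical and do not reflect the interface of the ncbi-tool services.

```python
import bisect
from datetime import date


class SnapshotCatalog:
    """Toy catalog mapping snapshot dates to stored database copies."""

    def __init__(self):
        self._dates = []  # kept sorted in ascending order
        self._paths = []  # self._paths[i] belongs to self._dates[i]

    def add(self, snapshot_date, path):
        """Register a snapshot taken on snapshot_date and stored at path."""
        i = bisect.bisect_left(self._dates, snapshot_date)
        self._dates.insert(i, snapshot_date)
        self._paths.insert(i, path)

    def as_of(self, when):
        """Return the most recent snapshot taken on or before the given date."""
        i = bisect.bisect_right(self._dates, when)
        if i == 0:
            raise LookupError("no snapshot on or before %s" % when)
        return self._paths[i - 1]


catalog = SnapshotCatalog()
# Hypothetical snapshot locations, for illustration only.
catalog.add(date(2018, 4, 1), "snapshots/2018-04-01/nt")
catalog.add(date(2018, 2, 1), "snapshots/2018-02-01/nt")

# An analysis run in mid-March is reproduced against the February snapshot.
original = catalog.as_of(date(2018, 3, 15))
# Re-running the same sample later picks up the newer database.
updated = catalog.as_of(date(2018, 5, 1))
```

Recording the resolved snapshot alongside each analysis is what lets an old result be reproduced exactly and later compared with a run against newer data.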
In order to prevent regressions as we update our code and databases, we continuously track the sensitivity of IDseq’s pathogen detection pipeline using synthetic inputs with known ground truths (idseq-bench). Our command-line interface (idseq-cli) and web application (idseq-web), which are both Open Source as well, help users upload and visualize their data.
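Sensitivity against a known ground truth is the fraction of expected organisms that the pipeline actually reports. The following sketch shows that calculation and a simple regression check. The function names, the tolerance value, and the example organism lists are assumptions made for illustration and are not taken from idseq-bench.

```python
def sensitivity(expected, detected):
    """Fraction of ground-truth organisms that appear in the detected set."""
    expected = set(expected)
    if not expected:
        raise ValueError("ground truth must not be empty")
    true_positives = expected & set(detected)
    return len(true_positives) / len(expected)


def is_regression(baseline, current, tolerance=0.01):
    """Flag a run whose sensitivity dropped by more than the tolerance."""
    return current < baseline - tolerance


# Hypothetical synthetic sample containing four known organisms.
truth = {
    "Klebsiella pneumoniae",
    "Rhinovirus C",
    "Aspergillus fumigatus",
    "Plasmodium falciparum",
}
# Hypothetical pipeline output: three correct calls and one false positive.
# False positives do not affect sensitivity; they would be tracked separately.
detected = {
    "Klebsiella pneumoniae",
    "Rhinovirus C",
    "Plasmodium falciparum",
    "Escherichia coli",
}

score = sensitivity(truth, detected)
```

Running such a check after every code or database update turns the synthetic samples into an automated guard against silent losses in detection.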
We are just getting started on our journey towards a world where infectious disease is discovered faster and outbreaks are squashed before they become pandemics. It will require us to continue our nascent international collaborations with organizations like the Bill and Melinda Gates Foundation, and to work closely with our users to build the best product and technology to address their needs. We are honored to be working with such amazing scientists and hopeful that our work will make a difference. If you’re interested in joining our team, we’re always looking for amazing technologists to jump in and help build.
Charles de Bourcy, Engineering Lead, Chan Zuckerberg Initiative
Charles works on bringing emerging technologies from prototype to production to global scalability. At CZI, he is the engineering manager for the IDseq platform, a tool to help scientists discover infectious disease in samples. He obtained his Ph.D. in Stephen Quake’s lab at Stanford University, where he developed new computational methods to mine unstructured genomics data for signs of immune dysfunction.