A Case for Affordable Next Generation Sequencing

Salyl Bhagwat
5 min read · Aug 12, 2024


The Omniscient Lord of the Universe has blessed our generations with access to knowledge that lets us gain a deeper awareness of subjects in a fraction of the time it once took. It has become astonishingly easy to get into the fundamentals of almost any esoteric, multi-disciplinary subject using the current set of learning tools. Remember the adage: "Know that, by knowing which, all else can be known"? Well, there has never been a better time to actualize this proverb. These tools have empowered us all to be lifelong learners and to do more. My current exploratory project is a testament to this. Let me elaborate.

A few years ago, a family member was diagnosed with an incurable carcinoma. Cancer had chosen a person with an unshakable fighting spirit. The word ‘incurable’ didn’t even register with him, so before we knew it, we were part of the war against Cancer. It is embarrassing that it took the enemy reaching our gates to force our participation. But then, human conditioning is such that we tend to remain oblivious to the urgency of facing common existential threats (such as Cancer) until they become part of our own lived reality. There were enough opportunities over the past several decades to satisfy my curiosity about this enigmatic disease, but I procrastinated, and now dispelling my ignorance of Cancer had become urgent. It is such a complex and “polymorphic” disease that the patient and family members need a good level of awareness to follow the conversations with oncologists, surgeons, and palliative care physicians and to make sound, informed decisions.

Notwithstanding the magnitude of this unknown, I turned to Internet search engines to find the most relevant books for drawing up battle plans against this formidable adversary. Fortunately, searching the fabric of shared, collective human knowledge unearthed two invaluable diamonds: “The Emperor of All Maladies” and “The Gene: An Intimate History” (both written by Siddhartha Mukherjee). These books are seminal works of medical literature and are essential reading for anyone who wants to understand the scope and state of Cancer and modern medicine. It would be ludicrous to oversimplify their lessons into a 3-minute summary, so let me fast forward to the present day.

After a successful surgery and traditional chemotherapy, our protagonist is doing well and remains defiantly vibrant while silently motivating us to do more for this cause. He didn’t get to this point, though, without enduring the excruciating pain that many patients undergoing chemotherapy withstand. That onslaught was the stimulus for my current exploration. The rest of the article will be slightly technical as I describe the path to the project’s motivation.

Modern medicine has progressed by leaps and bounds in finding targeted therapies that successfully treat many forms of Cancer linked to certain gene mutations. It therefore seems imperative to determine which gene mutations exist in a patient’s cancer cells via Next Generation Sequencing (NGS). The oncologist will decide whether it makes sense to order the test depending on factors such as the patient’s condition and the type and stage of the cancer. Once the test reveals clinically validated gene mutations, doctors can look up whether there are any relevant FDA-approved treatments, ongoing clinical trials, or experimental treatments associated with those mutations. We were fortunate to have access to NGS, but these diagnostics are expensive and many patients around the globe cannot afford them. This, in my humble opinion, is a critical gap, especially for those with less treatable cancers, as it can significantly limit access to potentially life-saving clinical trials. As a result, it seemed logical to me to explore a path to low-cost NGS that could potentially play a transformative role for such patients.

A quick analysis of the NGS cost structure reveals that the costs can be split into two main categories: the clinical platform and the bioinformatics software infrastructure.

i) Clinical platform costs include clean rooms, specialized equipment, DNA extraction instruments, library preparation kits, the sequencing machine (e.g. a sequencer from Illumina), etc., all of which are indispensable for deriving the sequence of nucleotides from tumor cells.

ii) Bioinformatics software infrastructure costs include sequence analysis software, virtual machines (with many vCPUs for High Performance Computing), scalable resources (to handle varying NGS workloads), Cloud storage (to save the massive amounts of data produced by the sequencing machine), security, etc.

The clinical platforms are evolving, and costs should improve over time. Sequencer output already seems to be standardized on the FASTQ file format, which potentially decouples the software cost from the hardware. The obvious next step was to look for existing free and open-source bioinformatics tools that process FASTQ files, thereby limiting the software cost to infrastructure. To my amazement, there are several. Below I describe one set of tools that I want to explore to study the feasibility of automation:

1) FastQC: A quality control tool for sequencer output (referred to as reads) that generates an HTML report on the quality of the reads. This report is then used to guide the trimming, filtering and cleaning of the FASTQ file.

2) Trimmomatic: A command-line trimming tool that writes the cleaned reads back out in FASTQ format.

3) Bowtie2: A tool to align sequencing reads to reference sequences. It outputs SAM (Sequence Alignment/Map, a text-based format). Samtools can be used to convert SAM to BAM (Binary Alignment Map, a compressed binary form of SAM that is efficient to process).

4) GATK4: The Genome Analysis Toolkit for variant discovery. The output is generated in VCF (Variant Call Format), which can be used with further annotation and visualization tools. Python scripts (with Python packages such as vcf) can be written to extract and interpret information from the VCF file, identifying the chromosome, position, etc., to ultimately decipher specific gene mutations. APIs or manual lookups can then be used to get clinical trial information from ClinicalTrials.gov.
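To make the VCF step a little more concrete, here is a minimal sketch of pulling chromosome, position and alleles out of VCF text. It deliberately uses plain string parsing instead of the vcf package mentioned above, so that it runs without dependencies. The `Variant` type, the `parse_vcf_lines` helper and the sample records are all made up for illustration; the only thing taken as given is the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    filt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line; VCF header lines start with '#'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER ...
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        yield Variant(chrom, int(pos), ref, alt.split(","), filt)


# Made-up sample content, for illustration only (not real variants).
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr7\t1000\t.\tA\tT\t50\tPASS\t.\n"
    "chr12\t2000\t.\tG\tA,C\t12\tLowQual\t.\n"
)

all_variants = list(parse_vcf_lines(SAMPLE_VCF.splitlines()))
# Keep only records that passed the caller's filters.
passing = [v for v in all_variants if v.filt in ("PASS", ".")]
for v in passing:
    print(f"{v.chrom}:{v.pos} {v.ref}>{','.join(v.alts)}")
```

Mapping a position to a gene name, and from there to a clinical trial, would still need an annotation step that this sketch does not attempt.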

There are some obvious impediments to automating this process:

  1. The labs may not give out FASTQ output without clinical validation, which might itself require iterations with similar tools. This could perhaps diminish the value of using free software. Partnering with one of the labs or government-funded institutes might help streamline this by integrating the pipeline into the clinical validation process.
  2. There also seem to be manual steps when using these tools. For example, FastQC’s output must be interpreted to decide on the optimal command-line options for Trimmomatic. If we can eliminate such manual steps, perhaps by using machine learning (ML) to extract features and predict optimal Trimmomatic parameters via a trained model, that could solve this problem.
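As a starting point for the second impediment, here is a rough sketch that reads FastQC's text report (the `fastqc_data.txt` file inside the FastQC output archive) and turns module statuses into Trimmomatic steps. To be clear, this is a hand-written rule-based baseline, not the trained model imagined above, and the quality thresholds are my own guesses rather than validated values. The Trimmomatic step syntax (`ILLUMINACLIP`, `SLIDINGWINDOW`, `MINLEN`) follows the tool's documented examples; the function names and the abridged sample report are hypothetical.

```python
from typing import Dict, List


def parse_fastqc_modules(text: str) -> Dict[str, dict]:
    """Split fastqc_data.txt content into modules: name -> {"status", "rows"}."""
    modules: Dict[str, dict] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            # Module header looks like ">>Module name<TAB>pass|warn|fail".
            name, status = line[2:].split("\t")
            current = {"status": status, "rows": []}
            modules[name] = current
        elif current is not None and line and not line.startswith("#"):
            current["rows"].append(line.split("\t"))
    return modules


def suggest_trimmomatic_steps(modules: Dict[str, dict],
                              adapters: str = "TruSeq3-PE.fa") -> List[str]:
    """Map module statuses to Trimmomatic steps (heuristic, unvalidated)."""
    steps = []
    if modules.get("Adapter Content", {}).get("status", "pass") != "pass":
        steps.append(f"ILLUMINACLIP:{adapters}:2:30:10")
    quality = modules.get("Per base sequence quality", {}).get("status", "pass")
    # Assumed thresholds: trim harder when FastQC flags per-base quality.
    threshold = {"pass": 15, "warn": 20, "fail": 25}.get(quality, 20)
    steps.append(f"SLIDINGWINDOW:4:{threshold}")
    steps.append("MINLEN:36")
    return steps


# Abridged, made-up report for illustration; real files have more columns.
SAMPLE_REPORT = (
    "##FastQC\t0.12.1\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\tsample.fastq\n"
    ">>END_MODULE\n"
    ">>Per base sequence quality\twarn\n"
    "#Base\tMean\n"
    "1\t32.0\n"
    ">>END_MODULE\n"
    ">>Adapter Content\tfail\n"
    ">>END_MODULE\n"
)

modules = parse_fastqc_modules(SAMPLE_REPORT)
steps = suggest_trimmomatic_steps(modules)
print(" ".join(steps))
```

A learned model could later replace `suggest_trimmomatic_steps` while keeping the same parsed features as input, which is one reason to keep parsing and decision-making separate.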

Concluding remarks

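The goal of my current tinkering, then, is to determine whether the following pipeline is feasible: sequencer FASTQ output, quality control with FastQC, trimming with Trimmomatic, alignment with Bowtie2 and Samtools, variant discovery with GATK4, and finally VCF interpretation and a clinical trial lookup.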
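One way to picture the pipeline end to end is as a list of command lines, one per tool. The sketch below only builds and prints the commands by default (a dry run), since it assumes things I have not verified for any particular setup: that `fastqc`, `trimmomatic`, `bowtie2`, `samtools` and `gatk` are on the PATH under those names, that reads are single-end, that a Bowtie2 index and an indexed reference FASTA already exist, and that Mutect2 in tumor-only mode is the right GATK4 caller for the job. The function names and file paths are hypothetical.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def build_pipeline(fastq: str, bowtie2_index: str, reference_fasta: str,
                   workdir: str = "out") -> List[List[str]]:
    """Return the command lines for QC, trimming, alignment and variant calling."""
    out = Path(workdir)
    sample = Path(fastq).name.split(".")[0]
    trimmed = str(out / f"{sample}.trimmed.fastq")
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    vcf = str(out / f"{sample}.vcf.gz")
    return [
        ["fastqc", fastq, "-o", str(out)],
        # Trimming steps are placeholders; see the FastQC discussion above.
        ["trimmomatic", "SE", "-phred33", fastq, trimmed,
         "SLIDINGWINDOW:4:15", "MINLEN:36"],
        # GATK expects read groups, so set an ID and sample name at alignment.
        ["bowtie2", "-x", bowtie2_index, "-U", trimmed,
         "--rg-id", sample, "--rg", f"SM:{sample}", "-S", sam],
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Assumption: tumor-only somatic calling with Mutect2.
        ["gatk", "Mutect2", "-R", reference_fasta, "-I", bam, "-O", vcf],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


commands = build_pipeline("reads/tumor.fastq.gz", "idx/GRCh38", "ref/GRCh38.fa")
run_pipeline(commands)  # dry run: prints the commands without executing them
```

The output directory has to exist before a real run, and each `check=True` call stops the pipeline at the first failure, which is probably the behavior you want when later steps depend on earlier files.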

I’m very aware that I’m not qualified in the field of medical science (treat that as a disclaimer), but that doesn’t stop me from learning and experimenting on the software side (what can I say, I’m inspired by Mendel!). If this resonates with you and you have the qualifications and/or skills in this field, then reach out to me via the Gammath Works website to collaborate.

The emperor of all maladies has lost a lot of territory, and I think it is only a matter of time before Cancer is defeated comprehensively. The only (rhetorical) question then is, how many human lives will have to be sacrificed before that endgame? Now that I have posed that question, are you going to participate in this war or remain a spectator?
