NGS Data Processing

Ana Fekonja
Published in Inno Bytes
4 min read · May 10, 2024

Next-generation sequencing (NGS) is a powerful technology that generates vast amounts of DNA or RNA sequence data, revolutionizing life-science fields from basic biology to medicine. Life scientists can choose among many experimental setups, kits, sequencing providers, and data analysis methods. Let’s explore the best approaches to NGS data analysis and examine recent trends in this dynamic field.

The complete NGS workflow includes sample preparation, library preparation, sequencing, data analysis, and exploration. For library preparation and sequencing, a wide array of kits and providers are available. However, producing raw NGS data is just the initial phase. This data must be processed accurately to convert it into a human-readable form. For RNA sequencing, the typical data processing involves steps such as read trimming, quality control (QC), alignment, quantification, differential gene expression (DGE) analysis, and presentation, each facilitated by specific bioinformatics tools.
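To make the read-trimming step concrete, here is a minimal sketch of quality trimming a single FASTQ read from the 3' end. The quality threshold and example read are illustrative assumptions, simplified from what tools like Trimmomatic or fastp do in practice:

```python
# Sketch of 3'-end quality trimming for one FASTQ read.
# The threshold (Q20) and example read are illustrative, not a
# recommendation; real trimmers use more sophisticated strategies.

def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(c) - offset for c in quality_string]

def trim_3prime(seq, qual, min_q=20):
    """Trim bases from the 3' end until a base with quality >= min_q is found."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Example read: the last two bases are low quality ('#' = Q2, '+' = Q10)
seq, qual = trim_3prime("ACGTACGT", "IIIIII#+")
# -> ("ACGTAC", "IIIIII")
```

Each downstream step (alignment, quantification, DGE) consumes the output of the previous one, which is why errors introduced early propagate through the whole analysis.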

Due to the complexity of NGS data, the past two decades have seen the development of numerous tools to facilitate its processing. Some tools excel in aligning reads to a reference genome, while others are adept at identifying genetic variants or quantifying gene expression. Most of these tools are open-source and widely accessible, covering various use cases. However, certain specialized or high-performance tools developed by biotech companies may not be open-source.

So, what is the best approach to NGS data analysis?

Should you opt for open-source tools and software packages, create a custom workflow and process it locally, or choose off-the-shelf solutions from licensed software or platform providers? The decision hinges on several key factors:

  • the volume of data to be processed (small scale vs. large scale),
  • the location for data storage (both raw and processed),
  • the type of NGS data, and
  • the level of expertise required to conduct the analysis.

For instance, in internal small-scale data processing, it’s often practical to use individual tools and software packages locally, particularly if you’re comfortable with the command line. This approach allows access to various tools and software packages via GitHub, or the ability to build your own pipelines, eliminating the need for licensed software for analysis.
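As an illustration of such a custom local workflow, the sketch below only assembles the command lines for a simple RNA-seq pipeline; the chosen tools (fastp, STAR, featureCounts) and all file paths are assumptions for the example, not a prescribed setup, and the commands are constructed but not executed:

```python
# Sketch of a local RNA-seq workflow chained from open-source tools.
# Tool choices (fastp, STAR, featureCounts) and file paths are
# illustrative; the commands are only built here, not run.

def build_rnaseq_commands(sample, genome_dir, annotation):
    trimmed = f"{sample}.trimmed.fastq.gz"
    bam_prefix = f"{sample}."
    return [
        # 1. Read trimming with a QC report
        ["fastp", "-i", f"{sample}.fastq.gz", "-o", trimmed,
         "--html", f"{sample}.fastp.html"],
        # 2. Alignment to a reference genome index
        ["STAR", "--genomeDir", genome_dir, "--readFilesIn", trimmed,
         "--readFilesCommand", "zcat", "--outFileNamePrefix", bam_prefix],
        # 3. Gene-level quantification against an annotation
        ["featureCounts", "-a", annotation, "-o", f"{sample}.counts.txt",
         f"{bam_prefix}Aligned.out.sam"],
    ]

commands = build_rnaseq_commands("sample1", "star_index/", "genes.gtf")
for cmd in commands:
    print(" ".join(cmd))
```

Even this toy version shows where the bioinformatics expertise goes: picking tools, setting their parameters, and validating each intermediate output before trusting the final counts.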

However, this requires significant bioinformatics skills, including setting parameters, validating outputs on test samples, extracting information, and using visualization tools. Most life scientists aren’t trained in these areas, leading them to depend on existing software and platform solutions from sequencing providers or biotech companies that offer built-in and validated pipelines for more automated data processing and exploration.

Conversely, software or a platform is nearly essential for large-scale NGS data processing due to several key factors:

  • Processing Speed: Some tools need substantial computational power. Cloud-based solutions leverage distributed computing to expedite data processing, with proprietary technologies like the Illumina DRAGEN Bio-IT Platform designed for high-throughput processing.
  • Data Storage: Data can accumulate quickly into terabytes, challenging local storage, especially when both raw and processed data (e.g., BAM files) are retained. Cloud platforms, such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud, provide scalable storage solutions.
  • Built-in Pipelines: Platforms support NGS data processing with regularly updated, validated pipelines (e.g., RNA-seq, DNA-seq, WES, WGS, ATAC-seq, ChIP-seq) that cover most NGS data types. Advanced users can also tweak parameters to uncover additional insights.
  • Data Exploration: Interactive tools are essential for analyzing data, such as PCA plots, heatmaps, Venn diagrams, differential gene expression, and volcano plots. These platforms often allow access to external databases for further information on specific genes, variants, or ontologies.
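To make the data-exploration point concrete, a volcano plot places each gene by its log2 fold change and -log10(adjusted p-value) and highlights significant hits. The sketch below performs that classification on made-up DGE results; the thresholds and gene values are illustrative assumptions, not output of a real analysis:

```python
import math

# Classify genes for a volcano plot from differential-expression results.
# Thresholds (|log2FC| >= 1, padj < 0.05) and the example data are
# illustrative, not drawn from a real DGE run.

def volcano_class(log2fc, padj, fc_cutoff=1.0, p_cutoff=0.05):
    if padj < p_cutoff and log2fc >= fc_cutoff:
        return "up"
    if padj < p_cutoff and log2fc <= -fc_cutoff:
        return "down"
    return "not significant"

dge_results = {
    "GENE_A": (2.3, 0.001),   # strongly up-regulated
    "GENE_B": (-1.8, 0.01),   # down-regulated
    "GENE_C": (0.4, 0.20),    # no meaningful change
}

for gene, (lfc, padj) in dge_results.items():
    y = -math.log10(padj)  # the volcano plot's y-axis value
    print(f"{gene}: log2FC={lfc:+.1f}, -log10(padj)={y:.2f} "
          f"-> {volcano_class(lfc, padj)}")
```

Platforms wrap exactly this kind of logic in interactive views, so researchers can adjust cutoffs and drill into individual genes without writing code.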

Trends in NGS Data Processing

Current trends in NGS data analysis involve cloud-based platforms, machine learning (ML), and artificial intelligence (AI). Cloud platforms are particularly advantageous for storing and analyzing large datasets without costly hardware. ML and AI can uncover patterns and insights that might elude human analysis.

Why do major companies like Illumina, Qiagen, and Bio-Rad develop proprietary NGS data processing solutions?

Several reasons contribute to this choice. Primarily, these solutions enhance their sequencing products and services. They also help protect intellectual property and ensure product differentiation from competitors. Moreover, these companies often possess specialized expertise in data analysis or machine learning, which they aim to incorporate into their software solutions. Proprietary systems also offer the necessary customization and flexibility that generic software might not provide for specific research needs.

However, the development of custom software entails significant costs, including development, maintenance, and the need for interdisciplinary expertise. Clearly defined use cases (such as internal NGS data processing or project-specific services) and calculated ROI are essential, especially since academia often cannot afford such custom solutions. For NGS data processing software, an interdisciplinary team with software development skills and a deep understanding of biology and bioinformatics is vital to provide life scientists with optimal solutions.

Therefore, if you decide to pursue a custom NGS data processing software solution, select a company experienced in the technology. Understanding your specific requirements will also help your software partner provide more accurate cost and time estimates for the project.

If you are looking for a custom NGS data processing software solution, speak with BioSistemika experts. Their scientists and software engineers have vast domain knowledge and experience in NGS data processing.



Biotech enthusiast with a vision to digitize laboratories. Combined expertise in biotechnology and economics for impactful advancements.