Standard practices for bioinformatics data analysis

You have successfully hurdled the wet lab, generating a treasure trove of information: your sequencing data.

Eric Gathirwa
Jenomu Bioinformatics

You asked for answers from the wet lab, and here they are, though not quite in the form you need them. You have successfully hurdled the wet-lab portion of your experiment, and what lies ahead of you is a treasure trove of information: your sequencing data. However, with great data comes great responsibility to analyze it effectively and unlock its secrets. With various file formats and several gigabytes of data, we know it can get a bit overwhelming. If it does, check out our previous articles on Medium, which offer a stepwise, high-level guide to this otherwise ‘intimidating’ process.

Photo by National Cancer Institute on Unsplash

1. Quality Checks — The Key to Laying a Solid Analysis Foundation

Before diving headfirst into analysis, it’s standard practice to assess the quality of your data. Think of it like bungee jumping: you check the harness and secure every critical anchor point before you commit, because unvalidated equipment can fail you mid-air. Similarly, poor-quality sequencing data leads to shaky results. Look for metrics like sequencing depth (how many times each base is sequenced) and base quality scores (which indicate the accuracy of each base call). Software like FastQC is your companion for generating these quality reports and building your analysis on a solid foundation. With the growing popularity of third-generation sequencing technologies, a wide variety of open-source tools are also available to ensure your data is of the right quality for the next step.
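If you script your pipeline in Python, a quality check can be as simple as calling FastQC on your read files. This is a minimal sketch: the paired-end FASTQ file names and the output folder are placeholders, so adjust them to match your own data.

```python
# Minimal sketch: run FastQC on paired-end reads from Python.
# File names and the output directory are illustrative placeholders.
import subprocess
from pathlib import Path

reads = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]  # hypothetical input files
outdir = Path("qc_reports")
outdir.mkdir(exist_ok=True)

# FastQC writes an HTML report plus a zip of metrics for each input file.
subprocess.run(["fastqc", "-o", str(outdir), *reads], check=True)
```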

2. Data Preprocessing — Only the Highest Grade, Please!

Once your data passes quality checks, you must sort your raw materials before building anything. Preprocessing involves removing adapter sequences (introduced during library preparation), filtering out low-quality reads (sequences with high error rates), and correcting errors to rescue reads you would otherwise discard. Trimmomatic and Cutadapt are popular tools for trimming reads, though plenty of alternatives exist. In our construction analogy, this step ensures that only the highest-grade materials make it onto the building site.
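As a rough illustration, here is how you might call Cutadapt from Python to trim adapters and low-quality ends from paired-end reads. The adapter sequence shown is the common Illumina TruSeq prefix, and the quality and length cut-offs are placeholders; swap in the adapters and thresholds that match your library prep.

```python
# Rough sketch: paired-end adapter and quality trimming with Cutadapt.
# Adapter sequences and cut-offs below are illustrative only.
import subprocess

cmd = [
    "cutadapt",
    "-a", "AGATCGGAAGAGC",        # 3' adapter on read 1 (Illumina TruSeq prefix)
    "-A", "AGATCGGAAGAGC",        # 3' adapter on read 2
    "-q", "20",                   # trim low-quality bases from read ends
    "-m", "30",                   # discard reads shorter than 30 bp after trimming
    "-o", "trimmed_R1.fastq.gz",
    "-p", "trimmed_R2.fastq.gz",
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
]
subprocess.run(cmd, check=True)
```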

3. Alignment — Let’s Find Your Place in the Genome

This crucial step involves aligning your reads to a reference genome, matching each sequenced fragment to its correct position. You can think of it as a contractor following the blueprints and architectural drawings for a construction project. Tools like Bowtie2 and HISAT2 create a map that reveals where each fragment belongs. When a reference genome for the sequenced sample is unknown or does not exist, the alternative is de novo genome assembly, which we will tackle in a subsequent article.
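For a reference-based workflow, a minimal sketch might look like the following: build a Bowtie2 index, align the trimmed reads, and pipe the output through samtools to get a sorted, indexed BAM. File names here are placeholders, and real projects usually add extras such as read-group information and duplicate marking.

```python
# Minimal sketch: Bowtie2 alignment piped into samtools sort, then indexed.
# File names are placeholders for illustration.
import subprocess

# Build the Bowtie2 index once per reference (creates ref_index.*.bt2 files).
subprocess.run(["bowtie2-build", "reference.fa", "ref_index"], check=True)

# Align paired-end reads; Bowtie2 writes SAM to stdout, which samtools sorts
# into a coordinate-sorted BAM without a large intermediate SAM on disk.
align = subprocess.Popen(
    ["bowtie2", "-x", "ref_index",
     "-1", "trimmed_R1.fastq.gz", "-2", "trimmed_R2.fastq.gz"],
    stdout=subprocess.PIPE,
)
subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "-"],
               stdin=align.stdout, check=True)
align.stdout.close()
align.wait()

# Index the BAM so downstream tools can jump to specific regions quickly.
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)
```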

4. Variant Calling — Unveiling the Differences

Once your reads are aligned, you can identify variations in your sequence relative to the reference genome. These variations, such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), can hold valuable information about gene function and disease. Tools like GATK and FreeBayes act like the building inspector, meticulously examining your data to flag every deviation from the reference standard.
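As a simple, hedged example, FreeBayes can call variants directly from the sorted, indexed BAM produced above. GATK’s HaplotypeCaller would need additional preparation (read groups, a sequence dictionary, duplicate marking) that we skip here; file names remain placeholders.

```python
# Hedged sketch: call variants with FreeBayes against the reference genome.
# FreeBayes writes VCF to stdout, so we redirect it into a file.
import subprocess

with open("variants.vcf", "w") as vcf:
    subprocess.run(
        ["freebayes", "-f", "reference.fa", "sample.sorted.bam"],
        stdout=vcf,
        check=True,
    )
```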

5. Downstream Analysis — Making it Make Sense

This step is where the actual exploration begins! Depending on your research question, you might dig deeper into differential gene expression analysis (identifying genes expressed differently between conditions) or functional annotation (understanding the role of identified variants). Tools like DESeq2, edgeR, and limma can be used for differential expression, while resources like Gene Ontology (GO) can help with functional annotation. Imagine this as the exciting part of your investigation, where you connect the dots and make sense of the biological significance of the variations you’ve identified.
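Differential expression itself is usually run in R with DESeq2, edgeR, or limma, but once the results are exported to a CSV you can explore them in Python. The sketch below assumes a hypothetical deseq2_results.csv and uses DESeq2’s standard result columns (log2FoldChange and padj) to pull out significant genes.

```python
# Illustrative sketch: filter an exported DESeq2 results table for significant genes.
# The CSV file name is a placeholder; padj and log2FoldChange are DESeq2's own columns.
import pandas as pd

results = pd.read_csv("deseq2_results.csv", index_col=0)

# Keep genes with adjusted p-value < 0.05 and at least a two-fold expression change.
significant = results[(results["padj"] < 0.05) & (results["log2FoldChange"].abs() >= 1)]
print(f"{len(significant)} significant genes")
print(significant.sort_values("padj").head())
```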

Remember, this is just a general roadmap. As valuable as it is to understand what is happening behind the scenes, bioinformatics is a complex field, and analyses are often carried out by an entire team. The specific tools and analyses will depend on your research question and data type. But with this foundation, you’re well on your way to transforming your raw sequencing data into valuable biological insights! Don’t hesitate to collaborate with bioinformatics experts; their expertise can save you time and ensure your analysis is robust. Additionally, online resources like Galaxy or nf-core offer user-friendly pipelines for various studies.
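For example, launching an nf-core pipeline from Python is little more than a call to Nextflow. The parameters below (sample sheet, Docker profile, genome label) are illustrative and depend on the specific pipeline and release you use.

```python
# Illustrative sketch: launch the community-maintained nf-core RNA-seq pipeline
# via Nextflow. Sample sheet, profile, and genome label are placeholders.
import subprocess

subprocess.run(
    ["nextflow", "run", "nf-core/rnaseq",
     "-profile", "docker",
     "--input", "samplesheet.csv",
     "--outdir", "results",
     "--genome", "GRCh38"],
    check=True,
)
```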

At Jenomu Bioinformatics, we take the guesswork out of your analysis, ensuring that our team of qualified bioinformaticians and genomic data scientists walks with you step by step to get the most out of your data. You can learn more about this at Jenomu.
