Toolbox Tuesdays: Know Your Bioinformatics Tools

Parcelli
Jenomu Bioinformatics
4 min read · Aug 20, 2024

Welcome to the first instalment of our educational series, “Toolbox Tuesdays”! This series is designed to equip you with the knowledge and skills to navigate the vast world of bioinformatics software. Each Tuesday, we’ll delve into a specific tool or concept, starting today with quality control (QC) tools.

Ensuring Quality with Confidence


In bioinformatics, data quality is key, as any errors can lead to misleading results. Imagine building a beautiful house with faulty bricks. That’s what happens when you analyze low-quality biological data. Sequencing errors, adapter contamination, and sample degradation can significantly skew your results. QC tools act as your construction inspectors, identifying any issues before you invest time and resources in downstream analyses.

Quality Check with FastQC

FastQC, a core tool in any data quality control arsenal, checks your data and alerts you to potential issues. Written in Java, it can analyze one or more input files in FASTQ or BAM format and produces an HTML report that summarizes the results. More information on file formats can be found here.

Data

We will download a small microbiome dataset from Jacques et al. 2021, adapted from Galaxy Training.

wget https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz

Inspecting the data

# Unzip the data
gunzip female_oral2.fastq-4143.gz

# Check the first few lines
head female_oral2.fastq-4143
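A FASTQ file stores each read as a record that normally spans four lines: a header starting with @, the sequence, a + separator, and a quality string. As a rough sketch, assuming the file follows that four-line layout, the number of reads can be estimated by dividing the line count by four. The count_reads helper and the two-read demo file below are made up for illustration.

```shell
# Count the reads in an uncompressed FASTQ file,
# assuming every record occupies exactly 4 lines.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# A tiny made-up file with two reads, only for demonstration
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII5!\n' > "$demo"

count_reads "$demo"
```

On the tutorial data the same helper would be called as count_reads female_oral2.fastq-4143.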

Run quality check

You’ll need to install the FastQC tool if you don’t have it already. The basic syntax for running the tool is: fastqc seqfile1 seqfile2 …

fastqc female_oral2.fastq-4143
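If you prefer to keep the report files out of your data directory, FastQC accepts an output directory through its -o option. The sketch below first checks that fastqc is available on your PATH; the directory name fastqc_results is an arbitrary choice for this example, and the directory has to exist before FastQC runs.

```shell
# Output directory name is an arbitrary choice for this sketch
outdir="fastqc_results"
mkdir -p "$outdir"   # FastQC expects the directory to exist already

if command -v fastqc >/dev/null 2>&1; then
    # Write the HTML report and zip archive into $outdir
    fastqc -o "$outdir" female_oral2.fastq-4143
else
    echo "fastqc not found on PATH; install it first" >&2
fi
```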

Visualize the report

When you list the contents of your directory with the ls command, you will find an HTML file that you can view in your browser. Run the pwd command to get the full path of the directory, then paste that path into your browser’s address bar and select the HTML file to open the report.
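As a small convenience, the path from pwd and the report name can be combined into a link that you paste straight into the browser. This is only a sketch: report_urls is a hypothetical helper, and it assumes the report names end in _fastqc.html, which is how FastQC names its HTML output.

```shell
# Print a file:// URL for every FastQC HTML report in a directory
report_urls() {
    dir=$(cd "${1:-.}" && pwd)          # absolute path of the directory
    for report in "$dir"/*_fastqc.html; do
        [ -e "$report" ] || continue    # nothing matched the pattern
        echo "file://$report"
    done
}

# List the reports in the current directory
report_urls .
```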

The Per base sequence quality plot shows the quality of the base calls at each position in the read. The X-axis shows the base position and the Y-axis shows the Phred (quality) score. The background of the graph is divided into three bands:

  • Green: very good quality
  • Orange: reasonable quality
  • Red: poor quality

A boxplot is drawn for each position, showing the median (red line), inter-quartile range (yellow box), 10% and 90% values (whiskers), and the mean quality (blue line). The higher the quality score, the more reliable the base call. Usually a score above 20 is preferred.
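To see why 20 is a common cut-off, recall that a Phred score Q relates to the estimated probability P that a base call is wrong by Q = -10 × log10(P). A score of 20 therefore means roughly a 1 in 100 chance of error, and 30 means 1 in 1000. The sketch below does this conversion with awk; phred_to_error is a hypothetical helper name, not part of FastQC.

```shell
# Convert a Phred quality score to the estimated error probability:
# P = 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.0100
phred_to_error 30   # prints 0.0010
```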

Poor read quality caused by flow cell issues can be assessed from the Per tile sequence quality plot. A good plot should be all blue. Hotter colors indicate poor read quality in a given tile. In this sample, you can see that tile 2105 shows consistently poor quality from ~100bp onwards.
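The tile a read came from is recorded in its header line. As a rough sketch, assuming Illumina-style headers of the form @instrument:run:flowcell:lane:tile:x:y (older headers are laid out differently, so check yours with head first), the reads per tile can be tallied with awk. The reads_per_tile helper and the three-read demo file are made up for illustration.

```shell
# Count reads per tile, assuming the tile is the 5th
# colon-separated field of each header line.
reads_per_tile() {
    awk -F: 'NR % 4 == 1 { n[$5]++ } END { for (t in n) print t, n[t] }' "$1" | sort -n
}

# A made-up file with three reads from two tiles
tile_demo=$(mktemp)
printf '@M1:1:FC:1:2105:10:20\nACGT\n+\nIIII\n' > "$tile_demo"
printf '@M1:1:FC:1:2105:11:21\nACGT\n+\nIIII\n' >> "$tile_demo"
printf '@M1:1:FC:1:1101:5:6\nACGT\n+\nIIII\n' >> "$tile_demo"

reads_per_tile "$tile_demo"
```

Each output line gives a tile number followed by its read count, which helps judge how many reads a problematic tile such as 2105 actually contributes.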

Ideally, we do not expect to find adapters in our sequences. The Adapter content plot shows the cumulative percentage of reads containing adapters. Any adapters need to be removed using tools like Cutadapt. Filtering and trimming of adapters and low-quality reads will be discussed in the next blog post.

QC is a vital step in any bioinformatics analysis. By using tools like FastQC, you can check that your data is of high quality before moving on. There are other quality control tools for both short reads and long reads. See the resources below.

Further Resources:

https://github.com/wdecoster/NanoPlot

https://github.com/OpenGene/fastp

Join us at a Bring Your Own Data (BYOD) webinar, where you can learn how to understand your data quality and be connected with professionals who can help you get the most out of your data. This interactive session is a great opportunity to apply what you’ve learned and get expert guidance. You can learn more about this at Jenomu.
