Fastp: Streamlining Next-Gen Data Quality Control in a Miniconda Environment

Prabhanjan
7 min read · Feb 5, 2024

Unlocking the Power of Clean Data: Introducing Fastp for NGS Preprocessing

Image credits: hsc

Have you ever built a magnificent sandcastle, only to see it crumble under the waves due to weak foundations? In the world of bioinformatics, similar woes can arise from low-quality Next-Generation Sequencing (NGS) data. Just like the foundation determines the strength of your sandcastle, data quality control sets the stage for robust and reliable downstream analyses.

Here’s where Fastp, your friendly neighborhood NGS data-cleaning superhero, swoops in! This powerful tool acts as an all-in-one preprocessor, meticulously trimming adapters, filtering out low-quality sequences, and generally ensuring your data is pristine and ready for further exploration. Think of it as the magic wand that transforms messy scribbles into a clear, readable manuscript.

But before we unleash Fastp’s superpowers, let’s create a dedicated workspace called a Fastp Miniconda environment where it can thrive.

Stay tuned as we explore the simple steps to get your Fastp adventure started!

Ready to Set Up Your Fastp HQ?

Let’s create its dedicated workspace: a Miniconda environment. Think of it as Fastp’s cozy office, where it can focus on cleaning your data without distractions from other tools or projects.

Here’s a quick guide:

1. Build the Office: With Miniconda installed, open your terminal, type this command, and press Enter:

conda create -n fastp

2. Hang the “Open for Business” Sign: Once the office is built, activate it with this command:

conda activate fastp

Fastp’s Grand Opening: Installation Time!

Fastp’s office is ready, but it needs the right tools to do its job. Let’s unpack those tools with a quick installation process:

1. Call the Package Delivery:

Type this command into your terminal (still within the (fastp) environment) and press Enter:

conda install -c bioconda fastp

2. Unpack the Toolbox:

This command fetches Fastp and its essential dependencies from a special warehouse called the bioconda channel, known for its wide selection of bioinformatics tools.

Decoding the Delivery Instructions:

  • conda install: This is like saying "Bring me the package!" in Conda language.
  • -c bioconda: This specifies the warehouse where the package is stored.
  • fastp: This is the name of the package you're requesting.
Images by Author: Installing Fastp

Visualizing the Process:

Sit back and relax while Conda does the heavy lifting. Once the installation is complete, Fastp will be ready to roll up its sleeves and tackle your data-cleaning tasks!

Image by Author: Packages that get installed with Fastp
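
Once Conda finishes, a quick sanity check can confirm that Fastp really is on your PATH. The snippet below is a minimal sketch: check_tool is a hypothetical helper name made up for this post, and it simply reports whether a command can be found before asking Fastp for its version.

```shell
# Report whether a command is on PATH; returns non-zero when it is missing.
# check_tool is a hypothetical helper, not part of conda or fastp.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s found at %s\n' "$1" "$(command -v "$1")"
  else
    printf '%s not found; is the (fastp) environment active?\n' "$1"
    return 1
  fi
}

# Print the installed version only when fastp is actually available.
if check_tool fastp; then
  fastp --version
fi
```

If the helper reports that fastp is missing, the most likely cause is that the (fastp) environment is not active in the current terminal.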

Ready for a Test Drive?

Fastp is now fully equipped and ready to showcase its cleaning skills! Let’s grab some raw sequence files and put them to the test. Here’s a step-by-step guide:

1. Gathering the Test Subjects:

  • Locate your raw sequence files: These files usually have the .fastq or .fq extension and are often gzip-compressed (.fastq.gz or .fq.gz), which Fastp reads directly. They might be stored locally on your computer or accessible on a shared server.
  • Understand your file types: Know whether you’re dealing with single-end or paired-end reads. This information will guide you in choosing the appropriate Fastp commands later.

2. Calling Fastp to the Stage:

  • Open your terminal: Make sure you’re still within the (fastp) environment.
  • Type the Fastp command: For single-end reads, the basic command structure looks like this:
fastp -i <input_file1> -o <output_file1>

Replace <input_file1> with the actual path to your first raw sequence file, and <output_file1> with the desired path for the cleaned output file.
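
If you have many files, typing output names by hand gets tedious. Here is a small sketch of one way to derive the output name from the input name. The clean_name helper, the .clean.fq.gz suffix, and the sample.fastq file name are all assumptions made up for illustration; the fastp call itself only runs when both the tool and the file exist.

```shell
# Derive an output name such as sample.clean.fq.gz from an input path.
# clean_name and the ".clean" suffix are hypothetical conventions.
clean_name() {
  base=$(basename "$1")
  base=${base%.gz}      # drop a trailing .gz, if any
  base=${base%.fastq}   # drop .fastq ...
  base=${base%.fq}      # ... or .fq
  printf '%s.clean.fq.gz\n' "$base"
}

in_file=sample.fastq    # hypothetical input file
if command -v fastp >/dev/null 2>&1 && [ -f "$in_file" ]; then
  fastp -i "$in_file" -o "$(clean_name "$in_file")"
fi
```

Because the output name ends in .gz, Fastp writes the cleaned reads gzip-compressed.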

Showtime for Fastp: Preprocessing a Paired-End Dataset

Fastp is eager to demonstrate its cleaning prowess! Let’s witness it in action with a paired-end dataset, SRR12486989.

1. Fetching the Dataset:

2. Unveiling the Mystery Sample:

  • Before starting, look up the run accession SRR12486989 in the NCBI Sequence Read Archive to uncover fascinating details about the sample we’re preprocessing!

3. Fastp’s Paired-End Performance:

  • Now, let’s witness Fastp’s wizardry with paired-end reads. Type this command into your terminal, replacing file paths if needed:
fastp -i SRR12486989_1.fastq -I SRR12486989_2.fastq -o out.R1.fq.gz --out2 out.R2.fq.gz

Breaking Down the Command:

  • -i SRR12486989_1.fastq: Guides Fastp to the first file containing forward reads.
  • -I SRR12486989_2.fastq: Directs Fastp to the second file containing reverse reads.
  • -o out.R1.fq.gz: Instructs Fastp to store the cleaned forward reads in this compressed output file.
  • --out2 out.R2.fq.gz: Tells Fastp to save the cleaned reverse reads in this compressed output file (the short form of this option is -O).
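
For repeated runs, it can help to build the paired-end command from an accession prefix and to name the reports explicitly. The sketch below prints the command rather than running it; paired_cmd is a hypothetical helper, and the report file names and the thread count of 4 are assumptions. It uses the long option names (--in1, --in2, --out1, --out2, --json, --html, --thread), which correspond to the short forms -i, -I, -o, -O, -j, -h, and -w.

```shell
# Print a paired-end fastp command for files named <prefix>_1.fastq and
# <prefix>_2.fastq. paired_cmd is a hypothetical helper; the output and
# report names and the thread count are illustrative choices.
paired_cmd() {
  p=$1
  printf 'fastp --in1 %s_1.fastq --in2 %s_2.fastq' "$p" "$p"
  printf ' --out1 %s.R1.fq.gz --out2 %s.R2.fq.gz' "$p" "$p"
  printf ' --json %s.fastp.json --html %s.fastp.html --thread 4\n' "$p" "$p"
}

# Show the command for the dataset used in this post.
paired_cmd SRR12486989
```

To execute the printed command, pipe it to the shell, for example `paired_cmd SRR12486989 | sh`, from the directory that holds the two FASTQ files.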

Hold tight as Fastp meticulously processes the files, conducts quality checks, and delivers a pair of pristine output files ready for further analysis!

Fastp’s Transformation: Unveiling the Data Cleaning Magic

Ready to peek behind the curtain and see how Fastp transforms your raw data into pristine reads? Let’s dive into the key steps and metrics involved:

Before Filtering: The Raw State

  • Raw Reads: These are the unedited sequences straight from the sequencing machine, containing both high-quality and potentially problematic reads.
  • Raw Bases: The total number of bases in the raw reads, including those that might be of low quality or contain unwanted artifacts.

After Filtering: The Polished Gem

Fastp meticulously filters and polishes the raw data, resulting in:

  • Filtered Reads: Only the reads that meet Fastp’s quality standards are retained. Reads with too many low-quality bases, too many N bases, or too little sequence left after trimming are discarded; adapter-contaminated reads are trimmed rather than dropped.
  • Filtered Bases: The total number of bases remaining in the filtered reads, representing a cleaner and more reliable dataset.
  • Q20 Bases (After Filtering): The number of bases with a quality score of 20 or higher, meaning an estimated error probability of at most 1% per base.
  • Q30 Bases (After Filtering): The number of bases with a quality score of 30 or above, meaning an estimated error probability of at most 0.1% per base.
Image by Author: Fastp Results within 53 seconds
Image by Author: Fastp output files (JSON and HTML reports)
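
The Q20 and Q30 labels come from the Phred scale, where a score Q corresponds to an error probability P = 10^(-Q/10). As a quick illustration, the sketch below converts a score into that probability with awk; phred_to_error is a hypothetical helper name.

```shell
# Convert a Phred quality score Q into the error probability 10^(-Q/10).
# phred_to_error is a hypothetical helper for illustration only.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.01  (1 error in 100 bases)
phred_to_error 30   # prints 0.001 (1 error in 1000 bases)
```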

Bringing It All Together: A Visual Guide to Fastp’s Filtering Process

Image by Author: Fastp report

Let’s understand Fastp’s cleaning process step by step:

1. Adapter Removal:

  • Fastp scans for adapter sequences (short synthetic DNA segments attached to fragments during library preparation) and trims them from the read ends, like a skilled tailor removing unwanted threads.

2. Quality Filtering:

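  • Fastp scrutinizes each base’s quality score, which reflects its likelihood of being correct. By default, a base counts as unqualified when its score is below Q15, and a read is discarded when more than 40% of its bases are unqualified. Both thresholds are adjustable (-q and -u), and optional sliding-window trimming (for example --cut_tail) can remove low-quality bases from read ends.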

3. Length Filtering:

  • Reads that end up very short after trimming carry little information and can hinder downstream analyses. Fastp removes reads shorter than a set minimum length (15 bases by default, adjustable with -l), maintaining a high-quality dataset.

4. PolyX Trimming:

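  • Long runs of a single nucleotide at the 3' end of a read (poly-X tails, most commonly poly-G on two-color Illumina instruments such as NextSeq and NovaSeq) can cause issues in some analyses. Fastp trims poly-G tails automatically for data from those instruments and trims other poly-X tails when the -x option is set.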
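
To make the quality and length rules concrete, here is a toy re-implementation of the read-level filter in awk. It is a sketch for understanding, not Fastp's actual code: filter_fastq is a hypothetical helper, it assumes four-line FASTQ records with Phred+33 qualities, and its thresholds (Q15, 40%, 15 bases) are set to mirror what are, to my understanding, Fastp's defaults. It does no adapter or poly-X trimming.

```shell
# Keep a read only if it is at least minlen bases long and at most maxpct
# percent of its bases have quality below minq (Phred+33 encoding assumed).
# filter_fastq is a hypothetical, simplified stand-in for fastp's filter.
filter_fastq() {
  awk -v minq="${1:-15}" -v maxpct="${2:-40}" -v minlen="${3:-15}" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    { rec[(NR - 1) % 4] = $0 }          # collect the 4 lines of a record
    NR % 4 == 0 {
      n = length(rec[1]); low = 0
      for (i = 1; i <= n; i++)
        if (ord[substr(rec[3], i, 1)] - 33 < minq) low++
      if (n >= minlen && low * 100 <= maxpct * n)
        printf "%s\n%s\n%s\n%s\n", rec[0], rec[1], rec[2], rec[3]
    }'
}

# Tiny demo: read r1 is all Q40 ("I"), read r2 is all Q2 ("#").
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTACGT\n+\n####################\n' | filter_fastq
```

In the demo only r1 survives, because every base of r2 falls below the Q15 cutoff. For real data, let Fastp do this work; the sketch only shows the logic behind the numbers in the report.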

Interpreting the Metrics: Your Data Quality GPS

Fastp provides a comprehensive quality report, acting as your GPS for navigating data quality:

  • Total Reads and Bases: These indicate the initial amount of data, but don’t guarantee quality on their own.
  • Q20 and Q30 Bases: Higher percentages signify greater accuracy and reliability, reflecting successful sequencing.
  • Filtered Reads and Bases: These represent the high-quality reads ready for downstream analyses, showcasing Fastp’s cleaning effectiveness.
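
A handy way to read these numbers is as percentages: what share of reads survived filtering, and what share of bases reach Q30. The sketch below does that arithmetic with awk. The pct helper is hypothetical, and the counts in the example calls are made-up numbers, not results from SRR12486989.

```shell
# Print part/total as a percentage with two decimals, or NA if total is 0.
# pct is a hypothetical helper; the counts below are illustrative only.
pct() {
  awk -v part="$1" -v total="$2" 'BEGIN {
    if (total > 0) printf "%.2f\n", 100 * part / total
    else print "NA"
  }'
}

pct 950000 1000000      # share of reads kept after filtering: 95.00
pct 135000000 150000000 # share of bases at Q30 or above: 90.00
```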

Output Files: Your Data’s Story in JSON and HTML

Alongside the cleaned reads, Fastp generates two report files to help you visualize and understand your data’s quality (by default fastp.json and fastp.html in the working directory):

  • JSON Files: Detailed quality metrics like read counts, length distributions, and per-base quality scores are stored in JSON format, offering a comprehensive look at data caliber.
  • HTML Reports: Fastp summarizes key quality metrics in visually appealing HTML reports, providing a quick and accessible overview of data quality.
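
Because the JSON report is machine-readable, you can pull out headline numbers without opening the HTML page. The sketch below assumes the jq tool is installed (it is not part of Fastp) and that the report has the summary.before_filtering and summary.after_filtering sections found in recent Fastp versions; summarize_fastp_json is a hypothetical helper name.

```shell
# Print a three-line summary from a fastp JSON report using jq.
# Assumes the keys summary.before_filtering.total_reads,
# summary.after_filtering.total_reads and summary.after_filtering.q30_rate.
summarize_fastp_json() {
  jq -r '"reads before: \(.summary.before_filtering.total_reads)",
         "reads after: \(.summary.after_filtering.total_reads)",
         "Q30 rate after: \(.summary.after_filtering.q30_rate)"' "$1"
}

# Run only when jq is available and a report exists in this directory.
if command -v jq >/dev/null 2>&1 && [ -f fastp.json ]; then
  summarize_fastp_json fastp.json
fi
```

If your Fastp version lays out the report differently, adjust the key paths after inspecting the file with `jq keys fastp.json`.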

Why Quality Matters: The Foundation for Reliable Discoveries

Quality control is crucial in NGS data processing for several reasons:

  • Data Reliability: Ensures accuracy and trustworthiness of results, preventing misleading conclusions based on poor-quality data.
  • Accurate Downstream Analyses: High-quality data leads to more dependable outcomes in analyses like variant calling, assembly, and alignment, building a solid foundation for biological insights.
  • Resource Efficiency: Removing low-quality reads accelerates downstream processing and optimizes computational resources, saving time and computational power.

In conclusion, Fastp in your Miniconda environment empowers your NGS data analysis journey. By ensuring the quality of your data, you pave the way for accurate, reliable, and insightful biological discoveries. Remember, understanding your data from the beginning is the key to unlocking the hidden secrets within the sequences.

Happy sequencing! 🧬🔍✨

Articles and Publications to Refer to for Fastp:

https://academic.oup.com/bioinformatics/article/34/17/i884/5093234

https://github.com/OpenGene/fastp

https://anaconda.org/bioconda/fastp
