How I Built a Bioinformatics Pipeline: [A BASh-Based Variant Calling and Annotation Pipeline]

5 min readNov 21, 2022

In science and life, humans (and maybe other creatures) always find a way to simplify certain activities in their lives. We often do this to reduce repetitive tasks and the risk of monotonous and boring work. Pipelines have a similar function.

A bioinformatics pipeline is a collection of commands that performs a particular function on biological data and generates the desired result. It passes data through a sequence of algorithms that generates an interpretation based on the intended research.

In this article, I’ll share my experience with building bioinformatics pipelines. Specifically, I’ll share the process I used to develop a variant calling and annotation pipeline that works with normal and tumour nucleotide sequences. The link to the pipeline is at the end of this article.

What is a Variant Calling Pipeline?

A variant calling pipeline takes two or more nucleotide sequences as input and computes their differences. These differences are called variations or variants, hence the name “variant-calling.” The results from this kind of analysis usually exist in variant calling format (VCF), which anyone can open with a notepad on a computer.

For instance, if you’re trying to understand the difference between a healthy gene sequence and one affected with cancer, this pipeline can help compute the differences. It’ll make it easy for researchers to know what region of the sequence is involved and should be targeted for therapeutic development.

Things you Need to Build a Bioinformatics Pipeline

A web server or a PC with Linux OS
Large storage space (not less than 50 GB) and high processing speed (8 processing cores)
Nucleotide sequences (test data)
Reference genome (for comparison)
Access to software packages required for the intended analysis
Knowledge of file creation and management using the command line

Steps Involved in Building a BASh-Based Pipeline for Variant Calling and Annotation

The aim of this project was to compile the steps involved in variant calling for genomic analysis such that all that is required is any starting nucleotide sequence file with an extension “.fastq.gz”. This pipeline should work on sequences from any organism as the user gets to insert the link to download a relevant reference genome (not tested).

PS: All software packages are in bold.

1. Quality control

Quality control, as the name implies, means checking the validity and cleanliness of your starting data. Like any data analysis process, the entire analysis and generated results will have flaws if the initial datasets have issues. To handle this stage, I used the fastqc software package to check for data quality.

The output of this stage is an HTML file that provides information about duplicate content, adapters (additional sequences from PCR during the sequencing process), and Guanine-Cytosine, GC content (the higher the GC content, the more stable the sequence data).

2. Adapter trimming

Adapter trimming means using fastp, cutadapt, or trimmomatic to remove all adapter sequences from the input to protect the downstream processes. Adapters are short nucleotide sequences that are used during the PCR stage of sequencing to initiate the amplification and, subsequently, the nucleotide-reading process.

Trimming is a crucial pre-processing step in variant calling analysis to prevent errors and false results. I used fastp for this pipeline because it was more easily accessible and didn’t require any form of file navigation.

Here’s a sample fastp output in HTML format.

3. Reference genome indexing and genome mapping by bwa

Reference mapping and alignment uses `bwa index` and `bwa mem` to index the reference genome and map it against the sample files. The resulting output is a bam file. This step aims to introduce a control sample that tells the pipeline what we consider normal in the sequences. This way, it’s easy to flag and document any observed differences in the sample data.

4. Aligned reads sorting and indexing

Sorting, indexing, and postprocessing of aligned reads are carried out by different samtools, bamleftalign, and bamtools functions, including `samtools view,` `samtools sort,` `samtools index,` `samtools flagstat,` `samtools rmdup,` `bamleftalign` `samtools calmd,` and `bamtools filter.` These functions remove duplicates, perform left alignments to remove indels (insertions and deletions), recalibrate the datasets, and filter the data based on specified criteria like map quality and base quality.

5. Variant calling by samtools and varscan

Variant calling, the last step in this pipeline, is initiated by `samtools pileup,` which creates the input file for variant calling software (varscan). `varscan` takes the input files and produces files in vcf format, which are then merged by the `bcftools` command. At the end of this stage, all things being equal, you should have a viewable vcf file that tells you about the variants in tested datasets.

This pipeline is my first attempt at a bioinformatics pipeline. It can work on four files at once, as long as they have the `.fastq.gz` extensions and are normal and tumour sequences from the same regions. The analysis should start and run in a working directory containing these sample datasets. Also, a link to download the reference genome sequence with a .fa.gz extension should be ready as the script is set to request user input at that stage.

Finally, all input sequences for the normal samples must have the letter “N” and strings “r1” and “r2”, while the tumour should have the letter “T” and similar strings in their file names. The letters should appear before the strings for pattern recognition and to avoid errors, as in the name of the datasets mentioned above.

In the end, I made a variant calling pipeline with relevant lines of code.

What Makes a Good Pipeline?

Ability to run with little to no monitoring
Presence of conditional statements that navigate through empty files, wrong file types, and other potential errors that could affect the analysis
Flexibility to multiple kinds of data

I’ll admit this isn’t industry standard, but I think it’s worth putting out here for anyone to go through, scrutinise, commend, or critique. Either way, it’s something I’m proud I explored.

Have you ever worked with a pipeline? What do you think?

References (for software):

Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data
Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760.
Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5), 589–595.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Derek W. Barnett, Erik K. Garrison, Aaron R. Quinlan, Michael P. Strömberg, Gabor T. Marth, BamTools: a C++ API and toolkit for analyzing and managing BAM files, Bioinformatics, Volume 27, Issue 12, 15 June 2011, Pages 1691–1692, https://doi.org/10.1093/bioinformatics/btr174
Koboldt, D. C., Chen, K., Wylie, T., Larson, D. E., McLellan, M. D., Mardis, E. R., Weinstock, G. M., Wilson, R. K., & Ding, L. (2009). VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25(17), 2283–2285. https://doi.org/10.1093/bioinformatics/btp373