NGS101: How to process NGS data (WES/WGS) from raw reads to annotated variants — a walkthrough

Zeo Choy
Published in Bioinformatics 101 · 6 min read · Sep 3, 2018

I introduced the command line interface and the general workflow in my previous post. (Such a shame that I was overwhelmed by my project during the last few months and didn’t write a single word.) Today, we’re going to get our hands dirty and actually process raw FASTQ files into annotated variants (VCF/MAF).

First of all, let’s download the toy data. As I primarily work on human data, I suggest downloading a sample FASTQ from the 1000 Genomes Project (aka g1k), currently deposited at The International Genome Sample Resource (IGSR). Feel free to browse around the site to get familiar with the data portal.

Screenshot from IGSR Data Portal.

I picked the first sample, HG0019. There are two links concerning Exome:

  1. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099967/SRR099967_1.fastq.gz (7.4Gb)
  2. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR099/SRR099967/SRR099967_2.fastq.gz (7.4Gb)

It’s a convention to suffix paired-end sequencing data with _1 and _2, and both FASTQ files are gzipped (xxxx_x.fastq.gz) to save disk space. Normally, you don’t need to unzip them to work with them; most tools read gzipped FASTQ directly.
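For instance, you can peek inside a gzipped FASTQ without decompressing it to disk (filename taken from the download above; each FASTQ record spans exactly 4 lines):

```shell
# Print the first record: header, sequence, '+' separator, base qualities
zcat SRR099967_1.fastq.gz | head -n 4

# Count reads: total line count divided by 4
echo $(( $(zcat SRR099967_1.fastq.gz | wc -l) / 4 ))
```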

Still, you’ll need a reference genome for alignment. For human, the latest assembly is hg38 (hg19, and a variant of hg19 known as g1k, were commonly used too), available from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/.

Screenshot: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/

Choose either hg38.2bit (797Mb) or hg38.fa.gz (938Mb); they contain the same sequence, except the 2bit format saves a little more disk space but requires an extra program, twoBitToFa, to convert .2bit back into .fa. Please also remember to download md5sum.txt to check that the file downloaded properly and completely.

Note. An md5 checksum is basically a compact fingerprint of a file; the same file shall always yield the exact same md5 checksum. It’s important, and also good practice, to validate every (large) downloaded file.
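As a quick sketch of the checksum workflow (demonstrated on a tiny stand-in file; the same two commands apply to hg38.fa.gz with the md5sum.txt from UCSC, assuming GNU coreutils’ md5sum is installed):

```shell
printf '>chr1\nACGTACGT\n' > demo.fa   # stand-in for a downloaded file
md5sum demo.fa > md5sum.txt            # what the server-side md5sum.txt contains
md5sum -c md5sum.txt                   # prints "demo.fa: OK" if the file is intact
```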

Okay, you may go grab a coffee now… ☕️☕️☕️

Step 0: Install

In the meantime, you can install the necessary tools and maybe skim through those boring docs?

  1. FastQC : FASTQ Quality Control
  2. QualiMap : BAM Quality Control
  3. bwa : sequence aligner
  4. samtools : editing/indexing BAM
  5. Picard : manipulating BAM/VCF
  6. Genome Analysis Toolkit (GATK) : base quality recalibration, variant calling etc.
  7. freebayes : variant caller
  8. ANNOVAR : annotate variants

Getting yourself familiar with the command line can definitely save you tons of time. Here are some tips:

  • cd stands for change directory.
  • prefixing a command with sudo runs it with administrator privileges, which may be required to install some software.
  • some tools are not pre-compiled, and need make and make install to compile and install.
  • export PATH=$PATH:/xxx/xxx/xxx/bin tells the shell to look for executable programs in the /xxx/xxx/xxx/bin directory on top of default paths such as /usr/bin and /bin, so you don’t need to cd to that directory every time to launch the program.
  • unzip zipped files/tarballs before installation.
  • ALWAYS LOOK AT THE README FIRST! It should contain detailed installation instructions.
  • it’s a good idea to keep all your tools in a designated folder.
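To make the PATH tip concrete, here’s a minimal sketch using a tiny demo script in place of a real compiled binary (directory names are just examples; in practice you’d copy the bwa/samtools binaries produced by make into this folder):

```shell
mkdir -p "$HOME/tools/bin"                 # designated tools folder
printf '#!/bin/sh\necho "hello from mytool"\n' > "$HOME/tools/bin/mytool"
chmod +x "$HOME/tools/bin/mytool"          # make it executable
export PATH="$PATH:$HOME/tools/bin"        # shell now searches this folder too
mytool                                     # runs from any directory now
```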

Step 1: Data Cleaning, QC

Adaptor trimming is usually done by the sequencing service provider; if you’re not sure, you can check with FastQC. FastQC will report any adaptor sequence contamination in the FASTQ, together with per-base quality, GC content, etc. It’s crucial to remove adaptor sequences before read alignment, as they will generally introduce mismatches between the reads and the reference genome.

# Adaptor Trimming, Skewer (optional)
skewer -m pe -t 16 $FQ_1 $FQ_2
# FASTQ QC, FASTQC
fastqc $FQ_1 $FQ_2

Step 2: Read Alignment

bwa is pretty much the gold standard aligner for Illumina reads. -t 16 instructs the program to use 16 threads, while $REF, $FQ_1 and $FQ_2 are shell variables whose meanings should be self-explanatory: $REF = reference genome FASTA, $FQ_1 and $FQ_2 = FASTQ files. One may pipe the output of bwa to samtools to sort and index the BAM file.
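One prerequisite the alignment command below takes for granted: bwa cannot align against a plain FASTA until it has built an index of it, and samtools/bcftools want a .fai index as well. A one-time preparation sketch (slow for a whole genome; file names follow the hg38 download above):

```shell
gunzip -k hg38.fa.gz      # -k keeps the .gz; the indexers want plain FASTA
bwa index hg38.fa         # builds the .amb/.ann/.bwt/.pac/.sa files next to the FASTA
samtools faidx hg38.fa    # builds hg38.fa.fai, used by samtools/bcftools
```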

After alignment, it is always good to check the alignment quality and compute coverage and depth with a BAM QC tool, e.g. qualimap. --java-mem-size=32G allocates 32G of RAM for qualimap, which uses -nt (instead of -t as in bwa) to indicate the number of threads.

NGS libraries will unavoidably contain duplicates, i.e. the exact same read recorded more than once by the sequencer. Picard’s MarkDuplicates handles them; by default it only flags each duplicate read rather than removing it, but with REMOVE_DUPLICATES=true it drops them from the BAM.

# Read Alignment, bwa (add -R '@RG\tID:xxx\tSM:xxx\tPL:ILLUMINA' if you plan to run GATK, which requires read groups)
bwa mem -a -M -t 16 $REF_FA $FQ_1 $FQ_2 | samtools view -Sbhu - | samtools sort -@ 16 -o $SORTED_BAM -
samtools index $SORTED_BAM
# BAM QC, qualimap
qualimap bamqc --java-mem-size=32G -bam $BAM -gff $BED -ip -nt 14 -oc $COVERAGE -outdir $OUT_PATH
# Remove Duplicates, Picard
java -Xmx16g -jar $PICARD MarkDuplicates I=$SORTED_BAM O=$SORTED_MARKDUP_BAM M=$OUT_METRICS REMOVE_DUPLICATES=true
# you may want to remove the sorted bam to save disk space
rm $SORTED_BAM

Step 3: Variant Discovery a.k.a. SNP calling

As a trial run, you can choose either tool, FreeBayes or samtools/bcftools, to call SNPs. (In practice, I use at least two methods and intersect the resulting VCFs to minimize calling error.) FreeBayes uses a Bayesian approach to detect genetic variants, and is able to call not only SNPs but also indels and MNPs. bcftools mpileup generates a pileup, a text-based format summarizing the base calls of aligned reads against a reference genome, and pipes it to bcftools call to call variants and output a gzipped VCF.

Recently, Google released a deep neural network based variant caller, DeepVariant, reported to achieve impressive performance. I haven’t had a chance to try it yet, but it’s definitely worth a look. It differs from traditional callers in that it uses a convolutional neural network (CNN) to read the pileup as an image.

# Method 1: freebayes (use bgzip rather than gzip, so the VCF can be tabix-indexed later)
freebayes -f $REF $SORTED_MARKDUP_BAM | bgzip > $VCF_GZ
# Method 2: samtools
bcftools mpileup -Ou -f $REF $SORTED_MARKDUP_BAM | bcftools call -vmO z -o $VCF_GZ

It is also a good idea to apply a vanilla filter to remove low quality SNP calls. -i '%QUAL>20 && DP>5' keeps only calls with quality > 20 and depth > 5; low-quality, low-depth calls often come from error-prone positions, such as the beginning and end of reads, so they may not be informative.

# vanilla filter on vcf
bcftools filter -O z -o $FIL_VCF_GZ -i '%QUAL>20 && DP>5' $VCF_GZ

Step 4: Variant Annotation

Last but not least, you may want to annotate the variants in the VCF, so as to make sense of the tons of variants called, e.g. which SNPs cause amino acid substitutions, or which are reported to be clinically relevant. One of the most commonly used variant annotators is ANNOVAR. It’s a Perl-based annotator bundled with lots of useful features for filtering out common variants that may not be clinically relevant. Don’t panic: you don’t need to know Perl to use it.

Before using ANNOVAR, it is recommended to normalize the VCF first. The command below is the demo from the ANNOVAR startup guide; it annotates the variants with RefGene and CytoBand information, using ExAC version 0.3 (referred to as exac03), dbSNP version 147 with left-normalization (avsnp147), and dbNSFP version 3.0a (dbnsfp30a) as filter databases. ExAC is particularly useful for pinpointing rare exonic variants when a control sample is not available. dbNSFP gathers non-synonymous single-nucleotide variants (nsSNVs) in human, while dbSNP is a generic (multi-species) database of genetic variants. You can finally view the output in Excel.

# normalization
bcftools norm -m-both -o $TMP_NORM_VCF $FIL_VCF_GZ
bcftools norm -f $REF -o $NORM_VCF $TMP_NORM_VCF
# variant annotation, ANNOVAR
table_annovar.pl $NORM_VCF humandb/ -buildver hg38 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -vcfinput
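Before loading the result into Excel, you can take a quick look on the command line (myanno.hg38_multianno.txt is the tab-delimited table table_annovar.pl writes for the -out myanno / -buildver hg38 settings above; the column range is just an example):

```shell
# First few columns of the first few annotated variants
cut -f1-7 myanno.hg38_multianno.txt | head -n 5
```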

Conclusion

This article is not meant to be exhaustive, but I hope it provides a reasonable walkthrough to guide your first bioinformatics analysis.



PhD. Interests in Cancer Biology, Bioinformatics/NGS, Deep Learning.